How should the Machine Learning Specialist transform the dataset to minimize query runtime?
Convert the records to Apache Parquet format.
Convert the records to JSON format.
Convert the records to GZIP CSV format.
Convert the records to XML format.
Explanations:
Apache Parquet is a columnar storage format, so a query reads only the columns it actually references rather than entire rows. This reduces I/O and improves query performance. Parquet also supports compression and schema evolution, making it an efficient choice for analytical queries over large datasets.
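The row-versus-column distinction can be sketched in pure Python. This is a conceptual illustration with a hypothetical three-column dataset, not actual Parquet I/O: in a row layout (CSV-like), projecting one column still forces a scan of every record, while a columnar layout stores each column contiguously so only the requested one is touched.

```python
import csv
import io

# Hypothetical sample records (stand-in for a real dataset).
records = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Row-oriented layout (CSV): every row must be parsed even though
# the "query" only needs the "score" column.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(records)

rows_scanned = 0
scores_row = []
for row in csv.DictReader(io.StringIO(buf.getvalue())):
    rows_scanned += 1  # full scan: all columns of all rows are read
    scores_row.append(int(row["score"]))

# Column-oriented layout: each column is stored contiguously,
# so only the "score" column is accessed.
columns = {key: [r[key] for r in records] for key in records[0]}
scores_col = columns["score"]

print(rows_scanned)  # 3 -- the row layout scanned every record
print(scores_col)    # [10, 20, 30] -- read directly from one column
```

Real Parquet readers take this further with per-column compression and row-group statistics, but the core saving shown here (skipping untouched columns entirely) is what drives the query-runtime difference.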
JSON is a verbose format that does not support efficient columnar access, and it typically increases storage size compared to formats like Parquet. It would lead to slower query performance for this use case.
GZIP CSV is still row-based, meaning that even if only a subset of columns is needed, every row must be read in full. Compression reduces storage space, but it does not enable column-level reads the way Parquet does, so query runtime is not improved.
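A short sketch with Python's standard gzip and csv modules (the toy data below is hypothetical) shows the limitation: to project even a single column, the entire compressed file must be inflated and every row parsed.

```python
import csv
import gzip
import io

# Build a small gzip-compressed CSV in memory (hypothetical data).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name", "score"])
writer.writerows([("1", "a", "10"), ("2", "b", "20")])
compressed = gzip.compress(buf.getvalue().encode("utf-8"))

# To extract just "score", the whole file is decompressed and every
# row is parsed -- compression saved space, not column I/O.
text = gzip.decompress(compressed).decode("utf-8")
reader = csv.reader(io.StringIO(text))
header = next(reader)
score_idx = header.index("score")
scores = [int(row[score_idx]) for row in reader]

print(scores)  # [10, 20]
```

Note that standard gzip is also not splittable, so in distributed engines a gzipped CSV file cannot be processed in parallel by multiple readers, which further hurts query runtime on large files.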
XML is a hierarchical format and is less efficient for querying than columnar formats. It would result in increased parsing time and larger file sizes, making it unsuitable for performance in this case.