How should the Machine Learning Specialist transform the dataset to minimize query runtime?
Convert the records to Apache Parquet format.
Convert the records to JSON format.
Convert the records to GZIP CSV format.
Convert the records to XML format.
Explanations:
Apache Parquet is a columnar storage format, so a query reads only the columns it actually references rather than entire rows. This reduces I/O and improves query performance. Parquet also supports compression and schema evolution, making it an efficient choice for analytical queries over large datasets.
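The row-versus-column distinction can be sketched in pure Python. This is a conceptual illustration with a hypothetical three-column dataset, not actual Parquet I/O: in a row layout (CSV-like), projecting one column still forces a scan of every record, while a columnar layout stores each column contiguously so only the requested one is touched.

```python
import csv
import io

# Hypothetical sample records (stand-in for a real dataset).
records = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Row-oriented layout (CSV): every row must be parsed even though
# the "query" only needs the "score" column.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(records)

rows_scanned = 0
scores_row = []
for row in csv.DictReader(io.StringIO(buf.getvalue())):
    rows_scanned += 1  # full scan: all columns of all rows are read
    scores_row.append(int(row["score"]))

# Column-oriented layout: each column is stored contiguously,
# so only the "score" column is accessed.
columns = {key: [r[key] for r in records] for key in records[0]}
scores_col = columns["score"]

print(rows_scanned)  # 3 -- the row layout scanned every record
print(scores_col)    # [10, 20, 30] -- read directly from one column
```

Real Parquet readers take this further with per-column compression and row-group statistics, but the core saving shown here (skipping untouched columns entirely) is what drives the query-runtime difference.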
JSON is a verbose format that does not support efficient columnar access, and it typically increases storage size compared to formats like Parquet. It would lead to slower query performance for this use case.
GZIP CSV is still row-based, meaning that even if only a subset of columns is needed, every row must be read in full. Compression reduces storage space, but it does not enable column-level reads the way Parquet does, so query runtime is not improved.
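A short sketch with Python's standard gzip and csv modules (the toy data below is hypothetical) shows the limitation: to project even a single column, the entire compressed file must be inflated and every row parsed.

```python
import csv
import gzip
import io

# Build a small gzip-compressed CSV in memory (hypothetical data).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name", "score"])
writer.writerows([("1", "a", "10"), ("2", "b", "20")])
compressed = gzip.compress(buf.getvalue().encode("utf-8"))

# To extract just "score", the whole file is decompressed and every
# row is parsed -- compression saved space, not column I/O.
text = gzip.decompress(compressed).decode("utf-8")
reader = csv.reader(io.StringIO(text))
header = next(reader)
score_idx = header.index("score")
scores = [int(row[score_idx]) for row in reader]

print(scores)  # [10, 20]
```

Note that standard gzip is also not splittable, so in distributed engines a gzipped CSV file cannot be processed in parallel by multiple readers, which further hurts query runtime on large files.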
XML is a hierarchical format and is less efficient for querying than columnar formats. It would result in increased parsing time and larger file sizes, making it unsuitable for performance in this case.