Which option meets these requirements with the LEAST operational overhead?

A data scientist is working on a model to predict a company’s required inventory stock levels. All historical data is stored in .csv files in the company’s data lake on Amazon S3. The dataset consists of approximately 500 GB of data. The data scientist wants to use SQL to explore the data before training the model. The company wants to minimize costs.

Which option meets these requirements with the LEAST operational overhead?

Create an Amazon EMR cluster. Create external tables in the Apache Hive metastore, referencing the data that is stored in the S3 bucket. Explore the data from the Hive console.

Use AWS Glue to crawl the S3 bucket and create tables in the AWS Glue Data Catalog. Use Amazon Athena to explore the data.

Create an Amazon Redshift cluster. Use the COPY command to ingest the data from Amazon S3. Explore the data from the Amazon Redshift query editor GUI.

Create an Amazon Redshift cluster. Create external tables in an external schema, referencing the S3 bucket that contains the data. Explore the data from the Amazon Redshift query editor GUI.

Explanations:

The Amazon EMR option carries significant operational overhead. Although EMR is a managed service, the data scientist would still have to provision, size, and eventually terminate a cluster, and define external tables in the Hive metastore, just to explore the data. The cluster also accrues charges for as long as it runs, so this option is both more complex and more expensive than a serverless alternative.

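The AWS Glue and Amazon Athena option best meets the requirements. A Glue crawler can scan the S3 bucket, infer the schema of the .csv files, and create tables in the AWS Glue Data Catalog. Athena, a serverless SQL query service, can then query those tables in place on S3. There is no infrastructure to provision or manage, which minimizes operational overhead, and costs stay low because Athena charges based on the amount of data each query scans.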
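Because Athena bills by data scanned, a rough cost estimate for this dataset is simple arithmetic. The sketch below assumes an on-demand rate of 5 USD per TB scanned; that rate is an assumption for illustration and varies by Region, so check current Athena pricing before relying on it. The function name `estimate_scan_cost` is hypothetical.

```python
# Assumption: Athena on-demand rate of 5 USD per TB scanned.
# The actual rate depends on the Region and may change over time.
ASSUMED_USD_PER_TB = 5.0
GB_PER_TB = 1024


def estimate_scan_cost(gb_scanned: float, usd_per_tb: float = ASSUMED_USD_PER_TB) -> float:
    """Return the approximate cost in USD of a query that scans `gb_scanned` GB."""
    return gb_scanned / GB_PER_TB * usd_per_tb


# A query that reads the whole 500 GB dataset, and one that reads a tenth of it.
full_scan = estimate_scan_cost(500)
partial_scan = estimate_scan_cost(50)

print(f"Full scan of 500 GB: about ${full_scan:.2f}")
print(f"Query scanning 50 GB: about ${partial_scan:.2f}")
```

Under that assumed rate, even a full scan of the dataset costs a few dollars, and nothing is charged while no queries are running.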

Amazon Redshift with the COPY command is overkill for data exploration. This option requires provisioning a Redshift cluster and ingesting roughly 500 GB of data before the first query can run. That adds cluster and storage costs for a second copy of data that already resides in S3, and it is more cumbersome than querying the files in place with a serverless service such as Athena.

Amazon Redshift with external tables avoids ingesting the data, because the external schema references the files in S3. However, this option still requires provisioning and managing a Redshift cluster, which means more operational overhead and higher costs than Athena, which is serverless.
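The Glue crawler and Athena workflow described in these explanations could be scripted roughly as follows with boto3. This is a minimal sketch, not a definitive implementation: the crawler name, IAM role ARN, database name, bucket paths, and the table name `inventory_history` are all hypothetical placeholders, and running `run_exploration` requires boto3, valid AWS credentials, and an IAM role that Glue can assume.

```python
import time


def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue `create_crawler` call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def query_request(sql, database, output_location):
    """Build the keyword arguments for the Athena `start_query_execution` call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_exploration(crawler, query, poll_seconds=15):
    """Create and run the crawler, wait for it to finish, then start the query."""
    import boto3  # imported here so the request builders work without boto3

    glue = boto3.client("glue")
    athena = boto3.client("athena")

    glue.create_crawler(**crawler)
    glue.start_crawler(Name=crawler["Name"])

    # The crawler runs asynchronously; its state returns to READY when done.
    while glue.get_crawler(Name=crawler["Name"])["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    response = athena.start_query_execution(**query)
    return response["QueryExecutionId"]


# Hypothetical values for illustration only.
CRAWLER = crawler_request(
    name="inventory-csv-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="inventory_lake",
    s3_path="s3://example-data-lake/inventory/",
)
QUERY = query_request(
    sql="SELECT * FROM inventory_history LIMIT 10",
    database="inventory_lake",
    output_location="s3://example-athena-results/",
)

# Uncomment to run against a real AWS account:
# query_id = run_exploration(CRAWLER, QUERY)
```

The same steps can be carried out in the AWS console without any code; the point of the sketch is that nothing in the workflow involves provisioning or sizing a cluster.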
