Which option will meet the company’s requirements?
Launch a transient Amazon EMR cluster daily and develop an Apache Hive script to analyze the files on Amazon S3. Shut down the Amazon EMR cluster when the job is complete. Then use Amazon QuickSight to connect to Amazon EMR and perform the visualization.
Develop a stored procedure invoked from a MySQL database running on Amazon EC2 to analyze the files in Amazon S3. Then use a fast in-memory BI tool running on Amazon EC2 to visualize the data.
Develop a script that uses Amazon Athena to query and analyze the files on Amazon S3. Then use Amazon QuickSight to connect to Athena and perform the visualization.
Use a commercial extract, transform, load (ETL) tool that runs on Amazon EC2 to prepare the data for processing. Then switch to a faster and cheaper BI tool that runs on Amazon EC2 to visualize the data from Amazon S3.
Explanations:
Launching a transient Amazon EMR cluster daily adds management overhead: someone has to develop the Hive script and automate starting and stopping the cluster. Transient clusters can keep EMR costs down, but they do not remove that operational work. In addition, once the cluster is shut down, QuickSight has no running endpoint to connect to for visualization.
Using a MySQL database on Amazon EC2 introduces additional management complexity and cost for maintaining the database instance. This option also requires a separate BI tool, which may not be as cost-effective as using serverless solutions.
Amazon Athena queries data in S3 directly and is serverless, so there is no cluster to provision or manage. It is cost-effective because you pay only for the data each query scans, and Amazon QuickSight can use Athena as a data source for visualization, so this option fulfills the company’s requirements with minimal effort.
Using a commercial ETL tool on Amazon EC2 adds a data preparation step plus the overhead of running and maintaining both the ETL tool and a new BI tool on EC2. The extra processing steps and infrastructure make this option neither cost-effective nor simpler to operate.
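The Athena-based option above can be sketched as a short script. This is a minimal illustration, not a definitive implementation: the function name `run_athena_query`, the polling interval, and all database, table, and bucket names are assumptions made for the example. The client is passed in so the function can be used with a real `boto3.client("athena")` object or with a stand-in during testing.

```python
import time

# States in which Athena has stopped working on a query.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run one SQL statement through Athena and return the first page of rows.

    `client` is expected to behave like boto3.client("athena").
    `database` and `output_location` are hypothetical values supplied by the caller.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Only the first page of results is read here; for SELECT statements
    # the first row holds the column names.
    result = client.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
```

With boto3 installed and AWS credentials configured, a call might look like `run_athena_query(boto3.client("athena"), "SELECT ...", "example_db", "s3://example-bucket/athena-results/")`, where the database and bucket are placeholders. There is no cluster to start or stop around the call, which is the operational difference from the EMR option.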