Which solution will meet these requirements with the LEAST operational overhead?
Create an AWS Glue extract, transform, and load (ETL) job that runs on a schedule. Configure the ETL job to process the .csv files and store the processed data in Amazon Redshift.
Develop a Python script that runs on Amazon EC2 instances to convert the .csv files to .sql files. Invoke the Python script on a cron schedule to store the output files in Amazon S3.
Create an AWS Lambda function and an Amazon DynamoDB table. Use an S3 event to invoke the Lambda function. Configure the Lambda function to perform an extract, transform, and load (ETL) job to process the .csv files and store the processed data in the DynamoDB table.
Use Amazon EventBridge to launch an Amazon EMR cluster on a weekly schedule. Configure the EMR cluster to perform an extract, transform, and load (ETL) job to process the .csv files and store the processed data in an Amazon Redshift table.
Explanations:
Creating a scheduled AWS Glue ETL job is an efficient way to process CSV files stored in Amazon S3 and load the transformed data into Amazon Redshift. AWS Glue is a fully managed, serverless service, so there are no servers or clusters to provision, patch, or monitor. Running the job on a schedule delivers regular updates without manual intervention, which satisfies the requirement for the least operational overhead.
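As an illustrative sketch only (the question does not specify the data layout), the core cleanup step such a Glue job might apply can be expressed as a pure-Python function. The column name `id` and the cleanup rules here are hypothetical; in a real Glue job the input would be read from S3 and the result loaded into Amazon Redshift through a Glue connection:

```python
import csv
import io

def transform_csv(raw_csv: str) -> list[dict]:
    """Parse raw CSV text and apply simple cleanup rules.

    Hypothetical transform: trims whitespace from every value and
    drops rows whose 'id' column is empty. In an actual AWS Glue
    job, this logic would run against objects read from S3, and
    the cleaned rows would be written to Amazon Redshift.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = []
    for row in reader:
        cleaned = {k.strip(): (v or "").strip() for k, v in row.items()}
        if cleaned.get("id"):  # skip rows missing the assumed key column
            rows.append(cleaned)
    return rows
```

For example, `transform_csv("id,name\n 1 , Alice \n,Bob\n")` keeps only the first data row, since the second has no `id` value.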
Developing a Python script on Amazon EC2 introduces more operational overhead: the EC2 instances must be managed and patched, the script must be monitored, and failures must be handled manually. This solution is also less automated than AWS Glue, and it does not load the processed data into Amazon Redshift, which the COTS application requires.
While an AWS Lambda function can process CSV files, using DynamoDB as a storage destination is not suitable since the COTS application requires data in Amazon Redshift or S3. Additionally, managing the transformation logic in a Lambda function can become complex, leading to increased operational overhead compared to using AWS Glue.
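To make the Lambda-based option concrete, the first step of such a function would be parsing the S3 event notification to find which objects to process. The sketch below covers only that parsing step, following the documented S3 event structure; the subsequent download, transform, and `put_item` calls into DynamoDB (via boto3) are summarized in the comment rather than implemented:

```python
def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Pull (bucket, key) pairs from an S3 event notification.

    The event shape follows the standard S3 notification format:
    Records[].s3.bucket.name and Records[].s3.object.key. In the
    full Lambda function, each object would then be fetched,
    transformed, and written to the DynamoDB table with boto3,
    e.g. table.put_item(Item=...).
    """
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            pairs.append((bucket, key))
    return pairs
```

This keeps the event-handling logic testable on its own, but as the explanation notes, the transformation logic itself still has to be written and maintained by hand, unlike with AWS Glue.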
Using Amazon EMR would require setting up and managing an EMR cluster, which increases operational complexity. Although EMR could perform the necessary ETL tasks, the requirement is to minimize operational overhead. Launching a cluster on a weekly schedule could also incur unnecessary cost if data updates are infrequent.