Which solution will meet these requirements with the LEAST development effort?
Create an Amazon EMR cluster with Apache Spark installed. Write a Spark application to transform the data. Use EMR File System (EMRFS) to write files to the transformed data bucket.
Create an AWS Glue crawler to discover the data. Create an AWS Glue extract, transform, and load (ETL) job to transform the data. Specify the transformed data bucket in the output step.
Use AWS Batch to create a job definition with Bash syntax to transform the data and output the data to the transformed data bucket. Use the job definition to submit a job. Specify an array job as the job type.
Create an AWS Lambda function to transform the data and output the data to the transformed data bucket. Configure an event notification for the S3 bucket. Specify the Lambda function as the destination for the event notification.
Explanations:
Although Amazon EMR with Apache Spark can perform the transformation, it requires provisioning and maintaining a cluster and writing a Spark application, which means more setup, maintenance, and development effort than the managed alternatives.
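For comparison, a minimal PySpark sketch of what the EMR option would involve; the bucket names and paths are hypothetical placeholders, not values from the question:

```python
# Minimal Spark application for the EMR option. EMRFS lets Spark
# address S3 directly through s3:// URIs, so no extra copy step is needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the source CSV files (hypothetical source bucket/prefix).
df = spark.read.option("header", "true").csv("s3://source-data-bucket/input/")

# Write Parquet to the transformed data bucket (hypothetical path).
df.write.mode("overwrite").parquet("s3://transformed-data-bucket/output/")

spark.stop()
```

Even this small script still implies cluster provisioning, sizing, and teardown, which is where the extra effort lies.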
AWS Glue provides a managed ETL service: the crawler discovers the schema and registers it in the Data Catalog, and a Glue ETL job can convert the CSV files to Parquet with little or no custom code. Because Glue handles the transformation and writes the output to the target bucket automatically, this option requires the least development effort.
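A sketch of the kind of ETL script such a Glue job runs, assuming a hypothetical Data Catalog database source_db and table raw_csv created by the crawler, and a placeholder output path:

```python
# Glue ETL job sketch: read the crawler-registered table, write Parquet.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Data Catalog
# (database and table names are assumed placeholders).
source = glue_context.create_dynamic_frame.from_catalog(
    database="source_db", table_name="raw_csv"
)

# Output step: write Parquet to the transformed data bucket.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://transformed-data-bucket/output/"},
    format="parquet",
)

job.commit()
```

In practice Glue Studio can generate a script like this visually, so the development effort is close to zero.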
AWS Batch is a robust service for large-scale batch workloads, but it requires you to write the transformation logic yourself and to create a job definition, job queue, and compute environment, which is more complex than using AWS Glue for a simple data transformation.
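A sketch of submitting the array job with boto3, assuming a hypothetical job queue and job definition that would have to be created first, on top of the container or script that performs the actual transformation:

```python
# Submit an AWS Batch array job (all names are assumed placeholders).
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="csv-to-parquet-array",
    jobQueue="transform-queue",        # assumed pre-created job queue
    jobDefinition="csv-transform:1",   # assumed pre-created job definition
    arrayProperties={"size": 100},     # one child job per slice of files
)
print(response["jobId"])
```

Each of those prerequisites is a separate piece of setup, which is what makes this option more effort than Glue.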
While AWS Lambda can perform the transformation, it is constrained by a 15-minute execution timeout and a 10 GB memory cap, which may not be suitable for processing hundreds of files daily. This approach would also require additional handling for larger files and for concurrency, increasing development effort.
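A sketch of the handler this option would need, assuming pandas and pyarrow are packaged as a Lambda layer and a hypothetical TRANSFORMED_BUCKET environment variable names the output bucket:

```python
# Lambda handler sketch: triggered by an S3 event notification,
# converts the uploaded CSV to Parquet and writes it to the target bucket.
import os
import urllib.parse

import boto3
import pandas as pd  # assumed available via a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # S3 event notifications deliver one record per created object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Stage the file in /tmp, Lambda's only writable path
        # (512 MB by default), which is one of the sizing limits at play.
        local_csv = f"/tmp/{os.path.basename(key)}"
        s3.download_file(bucket, key, local_csv)

        df = pd.read_csv(local_csv)
        parquet_path = local_csv.rsplit(".", 1)[0] + ".parquet"
        df.to_parquet(parquet_path)  # requires pyarrow or fastparquet

        out_key = key.rsplit(".", 1)[0] + ".parquet"
        s3.upload_file(parquet_path, os.environ["TRANSFORMED_BUCKET"], out_key)
```

The handler itself is short, but the layer packaging, timeout and memory tuning, and concurrency handling add up to more development effort than the Glue option.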