An application places large (approximately 1 GB) .csv files into an Amazon S3 bucket, and the files must be converted to Apache Parquet format. Which solution will meet these requirements with the LEAST operational overhead?
Create an AWS Lambda function to download the .csv files, convert the files to Parquet format, and place the output files in an S3 bucket. Invoke the Lambda function for each S3 PUT event.
Create an Apache Spark job to read the .csv files, convert the files to Parquet format, and place the output files in an S3 bucket. Create an AWS Lambda function that is invoked for each S3 PUT event and starts the Spark job.
Create an AWS Glue table and an AWS Glue crawler for the S3 bucket where the application places the .csv files. Schedule an AWS Lambda function to periodically use Amazon Athena to query the AWS Glue table, convert the query results into Parquet format, and place the output files into an S3 bucket.
Create an AWS Glue extract, transform, and load (ETL) job to convert the .csv files to Parquet format and place the output files into an S3 bucket. Create an AWS Lambda function that is invoked for each S3 PUT event and starts the ETL job.
Explanations:
AWS Lambda has a 15-minute execution time limit and caps memory at 10,240 MB; the often-cited 6 MB limit applies to invocation payloads, not to objects read from S3, so it is not the binding constraint here. Even so, converting 1 GB .csv files to Parquet inside a single invocation risks timeouts and out-of-memory failures, making this option unreliable at this file size.
Using AWS Lambda to start an Apache Spark job for each S3 PUT event adds unnecessary complexity and operational overhead: the Spark cluster (for example, on Amazon EMR) must be provisioned, tuned, patched, and monitored. Spark also suits large batch workloads better than per-file, event-driven tasks.
Athena is a query service, not a file-conversion pipeline. It can query .csv files in S3 (and even write Parquet via CTAS statements), but the scheduled crawler-plus-Lambda-plus-Athena design runs periodically rather than on each upload, and coordinating the crawler, query execution, and per-run output locations adds operational overhead, as the sketch below illustrates.
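For context, the Athena-based conversion would look roughly like the following: a scheduled Lambda function issuing a CTAS query through boto3. This is a minimal sketch, not the exam's reference design; the database, table, and bucket names are hypothetical. The point is the number of moving parts (crawler schedule, per-run table names, output prefixes, query polling) compared with a single ETL job.

```python
import time
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    """Runs on a schedule (e.g., EventBridge), not per upload: new .csv
    files wait until the next run. All names below are hypothetical."""
    # Each CTAS run needs a fresh table name and output prefix to avoid collisions
    run_id = time.strftime("%Y%m%d%H%M%S")
    query = f"""
        CREATE TABLE converted_{run_id}
        WITH (format = 'PARQUET',
              external_location = 's3://example-output-bucket/parquet/{run_id}/')
        AS SELECT * FROM csv_source_table
    """
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "example_db"},
        ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
    )
    # A real pipeline would also poll get_query_execution for completion
    # and clean up intermediate tables -- additional overhead to operate.
```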
AWS Glue provides a serverless, managed ETL service that is ideal for converting large .csv files in S3 to Parquet format. It integrates natively with S3, and a lightweight Lambda function invoked on each file upload can start the ETL job, giving event-driven conversion with the least operational overhead. This makes it the correct choice.
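By contrast, the event-driven Glue path needs only a small trigger function. A minimal sketch, assuming a pre-created Glue ETL job named csv-to-parquet (hypothetical) that accepts the uploaded object's location and an output path as job arguments:

```python
import boto3
from urllib.parse import unquote_plus

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Configured as the destination of S3 PUT event notifications;
    # each record identifies one newly uploaded .csv object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        glue.start_job_run(
            JobName="csv-to-parquet",  # hypothetical, pre-created Glue ETL job
            Arguments={
                "--source_path": f"s3://{bucket}/{key}",
                "--target_path": "s3://example-output-bucket/parquet/",  # hypothetical
            },
        )
```

The heavy lifting (reading the 1 GB .csv file and writing Parquet) happens inside the serverless Glue job, so the Lambda function stays well within its time and memory limits.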