Which solution will meet these requirements MOST cost-effectively?
A. Use AWS Glue with a Scala job
B. Use Amazon EMR with an Apache Spark script
C. Use AWS Lambda with a Python script
D. Use AWS Glue with a PySpark job
Explanations:
AWS Glue with a Scala job is not ideal because of the overhead that comes with Glue jobs. Glue is designed for larger, more complex ETL workloads, and its per-job startup overhead adds unnecessary complexity and cost for small data volumes like the 2 MB of nightly data from each mattress.
Amazon EMR with Apache Spark is suited to large-scale, distributed data processing. EMR can be cost-effective for big data, but it requires provisioning a cluster, which introduces significant overhead and cost for processing only 2 MB of data, making it inefficient for this use case.
AWS Lambda with a Python script is the most cost-effective and efficient solution. Lambda can handle the small 2 MB payloads easily, with processing time under 30 seconds and memory usage well within the 1 GB requirement. It is serverless, so you pay only for the compute time consumed, minimizing costs.
AWS Glue with a PySpark job is, like the Scala option (A), overkill for this task. Glue with PySpark is suited to larger datasets and more complex ETL jobs, and it would introduce unnecessary complexity and cost for the small data volumes in this scenario.
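To make the Lambda choice concrete, here is a minimal sketch of a Python Lambda handler for a small nightly payload. The event shape (`records`) and the summary fields are hypothetical, since the real payload format depends on how the mattress data is delivered; the point is that a 2 MB payload is trivially handled inside one short invocation.

```python
import json


def lambda_handler(event, context):
    """Sketch of a handler that summarizes a small nightly payload.

    `event["records"]` is an assumed shape for illustration; a real
    deployment would match the actual trigger (e.g. S3 event, Kinesis).
    """
    records = event.get("records", [])

    # A ~2 MB payload fits comfortably in memory, so a simple in-process
    # pass is enough -- no cluster or Spark job required.
    payload_bytes = sum(
        len(json.dumps(r).encode("utf-8")) for r in records
    )

    summary = {
        "record_count": len(records),
        "payload_bytes": payload_bytes,
    }
    return {"statusCode": 200, "body": json.dumps(summary)}
```

Because Lambda bills only for the milliseconds the handler actually runs, a job like this that completes in well under 30 seconds costs a tiny fraction of keeping an EMR cluster or Glue job warm for the same work.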