Which solution will accomplish the necessary transformation to train the Amazon SageMaker model with the LEAST amount of administrative overhead?
A. Launch an Amazon EMR cluster. Create Apache Hive external tables over the DynamoDB table and the S3 data. Join the Hive tables and write the results out to Amazon S3.
B. Crawl the data using AWS Glue crawlers. Write an AWS Glue ETL job that merges the two tables and writes the output to an Amazon Redshift cluster.
C. Enable Amazon DynamoDB Streams on the sensor table. Write an AWS Lambda function that consumes the stream and appends the results to the existing weather files in Amazon S3.
D. Crawl the data using AWS Glue crawlers. Write an AWS Glue ETL job that merges the two tables and writes the output in CSV format to Amazon S3.
Explanations:
Option A (Amazon EMR with Hive): Although Amazon EMR with Hive can achieve the transformation, it adds significant administrative overhead: the cluster must be provisioned, configured, and monitored. This approach is overly complex for a relatively small dataset.
Option B (AWS Glue to Amazon Redshift): AWS Glue and Amazon Redshift can perform the merge and transformation, but creating an Amazon Redshift cluster solely for this task adds unnecessary complexity and cost; the dataset size does not warrant a data warehouse, and SageMaker can read training data directly from S3.
Option C (DynamoDB Streams and Lambda): Using DynamoDB Streams and Lambda to update S3 files directly would be complex and inefficient. DynamoDB Streams captures only incremental changes, so this approach cannot perform the full merge of historical sensor data with the weather files that training requires.
Option D (AWS Glue crawlers and ETL job to S3) is correct: AWS Glue crawlers can catalog both data sources, and a serverless AWS Glue ETL job can perform the necessary join and transformation. This approach requires minimal management and writes the combined dataset in CSV format to S3, where SageMaker can access it directly for training.
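At its core, the Glue ETL job in option D performs a join on a shared key and writes the result as CSV. As a local illustration only (not Glue code, which would use DynamicFrames), here is the equivalent merge in pandas; the column names (`sensor_id`, `timestamp`, `temperature`, `precipitation`) are hypothetical:

```python
# Local sketch of the merge the AWS Glue ETL job would perform.
# All column names and sample values are hypothetical.
import io

import pandas as pd

# Sensor readings (stand-in for the crawled DynamoDB table).
sensors = pd.DataFrame({
    "sensor_id": ["s1", "s1", "s2"],
    "timestamp": ["2024-01-01", "2024-01-02", "2024-01-01"],
    "temperature": [21.5, 19.8, 22.1],
})

# Weather records (stand-in for the crawled S3 files).
weather = pd.DataFrame({
    "timestamp": ["2024-01-01", "2024-01-02"],
    "precipitation": [0.0, 4.2],
})

# Merge the two tables on the shared timestamp key.
merged = sensors.merge(weather, on="timestamp", how="inner")

# Write the combined dataset as CSV; the Glue job would target an
# s3:// path instead of an in-memory buffer.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
print(merged.shape)  # (3, 4)
```

In an actual Glue job, the same logic maps to reading both sources from the Data Catalog, applying a join transform, and writing the output frame to S3 with a CSV format option.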