Which solution meets these requirements?

A data engineer needs to provide a team of data scientists with the appropriate dataset to run machine learning training jobs.The data will be stored in Amazon S3.The data engineer is obtaining the data from an Amazon Redshift database and is using join queries to extract a single tabular dataset.A portion of the schema is as follows:TransactionTimestamp (Timestamp)CardName (Varchar)CardNo (Varchar)The data engineer must provide the data so that any row with a CardNo value of NULL is removed.Also, the TransactionTimestamp column must be separated into a TransactionDate column and a TransactionTime column.Finally, the CardName column must be renamed to NameOnCard.The data will be extracted on a monthly basis and will be loaded into an S3 bucket.The solution must minimize the effort that is needed to set up infrastructure for the ingestion and transformation.The solution also must be automated and must minimize the load on the Amazon Redshift cluster.

Which solution meets these requirements?

Set up an Amazon EMR cluster. Create an Apache Spark job to read the data from the Amazon Redshift cluster and transform the data. Load the data into the S3 bucket. Schedule the job to run monthly.

Set up an Amazon EC2 instance with a SQL client tool, such as SQL Workbench/J, to query the data from the Amazon Redshift cluster directly Export the resulting dataset into a file. Upload the file into the S3 bucket. Perform these tasks monthly.

Set up an AWS Glue job that has the Amazon Redshift cluster as the source and the S3 bucket as the destination. Use the built-in transforms Filter, Map, and RenameField to perform the required transformations. Schedule the job to run monthly.

Use Amazon Redshift Spectrum to run a query that writes the data directly to the S3 bucket. Create an AWS Lambda function to run the query monthly.

Explanations:

While Amazon EMR with Apache Spark can handle large datasets, it requires significant setup and infrastructure management. It is also more complex than necessary, given the simpler alternatives available.

Using an EC2 instance with a SQL client tool is a manual and error-prone process. It requires manual intervention each month, and doesn’t minimize the load on Amazon Redshift, as it involves direct querying and export.

AWS Glue provides a serverless ETL solution that integrates well with both Amazon Redshift and S3. It supports the required transformations, such as filtering NULL values, renaming fields, and splitting columns, and can be scheduled to run monthly with minimal infrastructure setup.

Amazon Redshift Spectrum is designed to query data directly from S3, not for extracting data from Redshift and loading it into S3. Additionally, the process would still require a Lambda function to trigger the query, which adds unnecessary complexity compared to using AWS Glue.

Learn & move to cloud

Which solution meets these requirements?

Explanations:

Leave a Reply Cancel reply