What should the solutions architect do to prevent AWS Glue from reprocessing old data?

A company has an AWS Glue extract, transform, and load (ETL) job that runs every day at the same time. The job processes XML data that is in an Amazon S3 bucket. New data is added to the S3 bucket every day. A solutions architect notices that AWS Glue is processing all the data during each run.

What should the solutions architect do to prevent AWS Glue from reprocessing old data?

Edit the job to use job bookmarks.

Edit the job to delete data after the data is processed.

Edit the job by setting the NumberOfWorkers field to 1.

Use a FindMatches machine learning (ML) transform.

Explanations:

Job bookmarks allow AWS Glue to track which data has been processed, preventing the reprocessing of old data. By enabling job bookmarks, only new data added since the last run will be processed.
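As a rough illustration, bookmarks are switched on through the job parameter `--job-bookmark-option` with the value `job-bookmark-enable`. The sketch below only builds the argument map. The helper name `with_bookmarks_enabled` and the sample bucket path are hypothetical. The job script itself must also call `job.init()` and `job.commit()` and pass a `transformation_ctx` to its sources so that Glue can persist the bookmark state.

```python
# Minimal sketch: build a Glue job's default arguments with bookmarks enabled.
# "--job-bookmark-option" and "job-bookmark-enable" are real Glue job
# parameters; the helper name and sample values are illustrative assumptions.

BOOKMARK_KEY = "--job-bookmark-option"
BOOKMARK_ENABLE = "job-bookmark-enable"


def with_bookmarks_enabled(default_arguments):
    """Return a copy of the arguments with job bookmarks turned on."""
    updated = dict(default_arguments or {})  # do not mutate the caller's dict
    updated[BOOKMARK_KEY] = BOOKMARK_ENABLE
    return updated


if __name__ == "__main__":
    # Hypothetical existing arguments for the daily XML job.
    current = {
        "--job-language": "python",
        "--TempDir": "s3://example-bucket/tmp/",
    }
    print(with_bookmarks_enabled(current))
```

The resulting map would typically be supplied as the job's `DefaultArguments`, for example through the console's job details or the Glue `UpdateJob` API.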

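Deleting data after processing would keep old objects out of later runs, but only by destroying the source data in S3. The job still would not track what it has already processed, and the original files would no longer be available for reruns or audits. Job bookmarks solve the problem without removing anything.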

Setting the NumberOfWorkers field to 1 does not impact the processing of old data; it only affects the parallelism of the job. The issue of reprocessing all data remains unaddressed.

A FindMatches ML transform is used for deduplication and data matching but does not inherently prevent AWS Glue from reprocessing old data. It does not address the core issue of data tracking.
