Which solutions will meet these requirements?
A. Trigger an AWS Lambda function on file delivery that extracts each record and writes it to an Amazon SQS queue. Trigger another Lambda function when new messages arrive in the SQS queue to process the records, writing the results to a temporary location in Amazon S3. Trigger a final Lambda function once the SQS queue is empty to transform the records into JSON format and send the results to another S3 bucket for internal processing.
B. Trigger an AWS Lambda function on file delivery that extracts each record and writes it to an Amazon SQS queue. Configure an AWS Fargate container application to automatically scale to a single instance when the SQS queue contains messages. Have the application process each record, and transform the record into JSON format. When the queue is empty, send the results to another S3 bucket for internal processing and scale down the AWS Fargate instance.
C. Create an AWS Glue crawler and custom classifier based on the data feed formats and build a table definition to match. Trigger an AWS Lambda function on file delivery to start an AWS Glue ETL job to transform the entire record according to the processing and transformation requirements. Define the output format as JSON. Once complete, have the ETL job send the results to another S3 bucket for internal processing.
D. Create an AWS Glue crawler and custom classifier based on the data feed formats and build a table definition to match. Perform an Amazon Athena query on file delivery to start an Amazon EMR ETL job to transform the entire record according to the processing and transformation requirements. Define the output format as JSON. Once complete, send the results to another S3 bucket for internal processing and scale down the EMR cluster.
Explanations:
Option A (chained Lambda functions with SQS): Incorrect. Lambda scales well with SQS, and the volume itself (5,000 records every 15 minutes) is modest, but chaining three functions forces the design to track state across stages. SQS also emits no event when a queue becomes empty, so the final function has no native trigger and would need custom polling. Each new feed would require more custom code, which makes the design hard to extend.
Option B (Lambda, SQS, and AWS Fargate): Incorrect. Fargate suits containerized applications, but running a container to consume the queue adds images, task definitions, and scaling rules to maintain for what is a simple per-record transformation. The design also depends on detecting an empty queue, and capping the application at a single instance removes any parallelism. Each additional feed would need new custom application code.
Option C (AWS Glue crawler, custom classifier, and ETL job): Correct. AWS Glue is a managed ETL service designed for this kind of transformation. The crawler and custom classifier catalog the feed format, and the ETL job masks the PAN, removes and merges fields, and writes JSON to the destination S3 bucket. The design extends easily to future data feeds and integrates well with S3.
Option D (AWS Glue crawler, Amazon Athena, and Amazon EMR): Incorrect. Athena is an interactive query service, not a mechanism for starting EMR jobs, so the orchestration described does not work as stated. EMR can process large datasets, but it is better suited to big data workloads that need distributed processing. For simple transformations such as masking and merging fields, managing and scaling down clusters adds unnecessary complexity compared with AWS Glue.
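To make the transformation requirements concrete, here is a minimal sketch of the per-record logic (mask the PAN, remove and merge fields, emit JSON). It is plain Python for illustration only; in the Glue-based design this logic would live inside the ETL job script. The pipe-delimited layout and the field names (`pan`, `first_name`, `last_name`, `amount`, `internal_code`) are assumptions, since the question does not specify the feed format.

```python
import json

# Hypothetical feed layout; the real feed format is not given in the question.
FIELDS = ("pan", "first_name", "last_name", "amount", "internal_code")


def mask_pan(pan: str, visible: int = 4) -> str:
    """Mask all but the last `visible` digits of a PAN, dropping separators."""
    digits = [c for c in pan if c.isdigit()]
    n_masked = max(len(digits) - visible, 0)
    return "*" * n_masked + "".join(digits[n_masked:])


def transform_record(line: str, delimiter: str = "|") -> str:
    """Turn one plaintext record into a JSON string with the PAN masked."""
    values = [v.strip() for v in line.rstrip("\n").split(delimiter)]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    output = {
        "pan": mask_pan(record["pan"]),
        # Merge two source fields into one (assumed requirement shape).
        "cardholder": f'{record["first_name"]} {record["last_name"]}',
        "amount": record["amount"],
        # "internal_code" is intentionally removed from the output.
    }
    return json.dumps(output)


if __name__ == "__main__":
    print(transform_record("4111 1111 1111 1111|Ada|Lovelace|12.50|X9"))
```

Keeping the field list in one place mirrors the idea behind the custom classifier: supporting a new feed is mostly a matter of describing its layout rather than rewriting the pipeline.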