Which architecture should the Data Scientist use to build this solution?
Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.
Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.
Explanations:
While using Amazon EMR is a valid approach, a persistent EMR cluster must be provisioned, patched, and paid for even when no jobs are running, which is more server management than desired. AWS Lambda can submit Spark steps to EMR, but this setup does not minimize server management, because the long-running cluster still has to be maintained and incurs ongoing costs between runs.
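For reference, a minimal sketch of how that Lambda-to-EMR step submission might look with boto3; the cluster ID, bucket, and script path are hypothetical placeholders:

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Submit a Spark step to an already-running (persistent) EMR cluster.
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical persistent cluster ID
        Steps=[{
            "Name": "scheduled-pyspark-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-bucket/scripts/etl.py",  # hypothetical PySpark script location
                ],
            },
        }],
    )
    return response["StepIds"]
```

Even with the step submission automated this way, the cluster itself remains the team's responsibility, which is why this option does not satisfy the requirement.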
AWS Glue is a serverless ETL service that supports PySpark, allowing the reuse of existing PySpark logic. It can be scheduled easily using triggers, meets the requirement for minimizing server management, and outputs processed data to S3 for downstream use, aligning perfectly with the requirements.
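A minimal sketch of what the Glue job script could look like once the existing PySpark logic is dropped in; the bucket paths and the transformation placeholder are assumptions:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard AWS Glue job boilerplate: resolve arguments and create contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw input from S3 (hypothetical path).
raw = spark.read.parquet("s3://my-raw-bucket/input/")

# Placeholder for the existing PySpark transformation logic,
# which can be reused here largely unchanged.
processed = raw

# Write the results to the "processed" location for downstream use (hypothetical path).
processed.write.mode("overwrite").parquet("s3://my-processed-bucket/processed/")

job.commit()
```

The existing schedule can then be expressed as a scheduled Glue trigger with a cron expression, created through the Glue console, the AWS CLI, or the boto3 `create_trigger` API, so no servers need to be managed for either the compute or the scheduling.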
AWS Lambda is limited to 15 minutes of execution time and 10 GB of memory, which is generally not enough for the large ETL workloads typically handled by PySpark. Additionally, rewriting the ETL logic in plain Python instead of reusing the PySpark code contradicts the requirement to reuse the existing logic. This approach would not effectively handle large data sources.
Amazon Kinesis Data Analytics is designed for real-time processing rather than batch ETL processes. It would not be suitable for combining multiple large data sources on a scheduled basis. This option fails to meet the requirement of reusing existing PySpark logic and processing large datasets in a manner that fits the existing schedule.