A company needs to analyze 30 TB of clickstream data daily. What should a solutions architect do to transmit and process the clickstream data?
Design an AWS Data Pipeline to archive the data to an Amazon S3 bucket and run an Amazon EMR cluster with the data to generate analytics.
Create an Auto Scaling group of Amazon EC2 instances to process the data and send it to an Amazon S3 data lake for Amazon Redshift to use for analysis.
Cache the data to Amazon CloudFront. Store the data in an Amazon S3 bucket. When an object is added to the S3 bucket, run an AWS Lambda function to process the data for analysis.
Collect the data from Amazon Kinesis Data Streams. Use Amazon Kinesis Data Firehose to transmit the data to an Amazon S3 data lake. Load the data in Amazon Redshift for analysis.
Explanations:
AWS Data Pipeline is designed for scheduled batch workflows, not for real-time, high-velocity clickstream ingestion. It does not provide the scalability and low latency needed to analyze 30 TB of clickstream data daily.
An Auto Scaling group of EC2 instances can process large volumes of data, but it carries significant management overhead and does not leverage managed services purpose-built for real-time streaming analytics. This architecture also handles a continuous flow of clickstream data less efficiently than a managed streaming pipeline.
Amazon CloudFront is a content delivery network for serving content to users, not a mechanism for ingesting clickstream data. Although Lambda can be triggered by S3 object-created events, this approach introduces latency and does not scale to 30 TB of data daily; it suits smaller, event-driven tasks rather than large-scale stream processing.
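For reference, here is a minimal sketch of the S3-event-to-Lambda pattern this option describes (the bucket contents and processing step are hypothetical). Lambda's memory ceiling and 15-minute timeout are exactly what make this per-object pattern a poor fit at 30 TB per day:

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each record in an S3 event notification names the bucket and object key.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        # ... parse and analyze the clickstream payload here (hypothetical) ...
        print(f"Processed {len(payload)} bytes from s3://{bucket}/{key}")
```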
Amazon Kinesis Data Streams is purpose-built for real-time ingestion of clickstream data. Amazon Kinesis Data Firehose can then deliver the stream to Amazon S3, establishing a data lake; from S3, the data can be loaded into Amazon Redshift with the COPY command for analysis, meeting the requirement to handle large volumes of clickstream data daily.
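A minimal sketch of the producer side of this architecture, assuming a hypothetical stream named "clickstream" and a hypothetical event shape with a user_id field:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def send_click_event(event: dict) -> None:
    """Publish one clickstream event to the stream in near real time."""
    kinesis.put_record(
        StreamName="clickstream",                     # hypothetical stream name
        Data=(json.dumps(event) + "\n").encode("utf-8"),
        PartitionKey=event["user_id"],                # spreads records across shards
    )

send_click_event({"user_id": "u-123", "page": "/checkout", "ts": "2024-05-01T12:00:00Z"})
```

A Kinesis Data Firehose delivery stream configured with this stream as its source batches and writes the records to S3 with no custom consumer code. From there, a Redshift COPY statement along the lines of COPY clicks FROM 's3://my-data-lake/clickstream/' IAM_ROLE '<role-arn>' FORMAT AS JSON 'auto'; (table, bucket, and role names hypothetical) loads the data for analysis.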