Which solution should the Data Scientist build to satisfy the requirements?
Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Explanations:
The first option (Kinesis Data Firehose with the AWS Glue Data Catalog) is correct. Firehose ingests the streaming data and uses its built-in record format conversion, driven by the schema registered in the Glue Data Catalog, to write query-optimized Apache Parquet or ORC files. Amazon S3 provides durable, highly available storage, Athena enables SQL querying directly over S3, and the Athena JDBC connector provides BI tool connectivity, so the requirements of real-time ingestion, transformation, and BI access are met with a fully managed, serverless pipeline.
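As a rough illustration of this option, here is a minimal boto3 sketch of a Firehose delivery stream with format conversion enabled. All names, ARNs, and the Glue database/table are hypothetical placeholders, not values given in the question:

```python
import boto3

# Sketch: a Firehose delivery stream that converts incoming JSON records to
# Parquet using a schema already registered in the Glue Data Catalog.
firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::analytics-data-lake",
        "Prefix": "events/",
        # Format conversion requires a buffer of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        # Parquet compresses internally, so Firehose-level compression is off.
        "CompressionFormat": "UNCOMPRESSED",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {
                "Deserializer": {"OpenXJsonSerDe": {}}
            },
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {}}
            },
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "analytics",  # Glue Data Catalog database
                "TableName": "events",        # table holding the JSON schema
                "Region": "us-east-1",
            },
        },
    },
)
```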
The second option (a Lambda function triggered by S3 Put events) is serverless, but it is poorly suited to high-velocity streaming data. Invoking a Lambda function per uploaded object produces many small Parquet/ORC files, there is no managed buffering or batching before delivery, and Lambda's memory and execution-time limits make large-scale, real-time transformation fragile compared with Firehose's built-in format conversion; the sketch below illustrates the pattern and its weakness.
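For contrast, a minimal sketch of what that Lambda transform might look like (bucket names and prefixes are assumptions, and pyarrow would have to be packaged as a Lambda layer):

```python
import io
import urllib.parse

import boto3
import pyarrow.json as paj
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def handler(event, context):
    """One S3 Put event -> one small Parquet file, which is the weakness."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Read the staged object (assumed to be newline-delimited JSON records).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    table = paj.read_json(io.BytesIO(body))

    # Write one Parquet file per invocation: with no buffering or batching,
    # a high-velocity stream yields thousands of tiny files in S3.
    out = io.BytesIO()
    pq.write_table(table, out)
    s3.put_object(
        Bucket="processed-data-bucket",  # hypothetical processed location
        Key=key.replace("staging/", "processed/") + ".parquet",
        Body=out.getvalue(),
    )
```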
The third option (Lambda writing into an Amazon RDS PostgreSQL database) is not suitable for high-velocity, real-time processing. Parquet and ORC are columnar file formats, not row data that can be inserted into PostgreSQL, so the option is internally inconsistent; and a single RDS instance does not scale or perform for large analytical workloads the way a distributed, S3-backed approach with Athena does for BI tools.
The fourth option (Kinesis Data Analytics) is well suited to real-time SQL over streams, but its SQL applications cannot buffer records and convert them to Apache Parquet before delivery; that conversion is a Kinesis Data Firehose feature backed by a Glue Data Catalog schema. Without Firehose's format conversion, the data would not land in S3 in a columnar format for Athena to query efficiently, so this option does not satisfy the requirements.
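Finally, to complete the picture for the correct option, here is a sketch of how an Analyst (or a BI tool sitting behind the Athena JDBC connector) could query the converted Parquet data through Athena. The database, table, column, and results bucket are all placeholders:

```python
import time

import boto3

athena = boto3.client("athena")

# Run a simple aggregation over the Parquet files Firehose delivered to S3.
qid = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)["QueryExecutionId"]

# Poll until the query finishes (simplified; real code should back off).
while True:
    state = athena.get_query_execution(
        QueryExecutionId=qid
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```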