Which solution takes the LEAST effort to implement?
Ingest .csv data using Apache Kafka Streams on Amazon EC2 instances and use the Kafka Connect S3 sink to serialize the data as Parquet.
Ingest .csv data from Amazon Kinesis Data Streams and use AWS Glue to convert the data into Parquet.
Ingest .csv data using Apache Spark Structured Streaming on an Amazon EMR cluster and use Apache Spark to convert the data into Parquet.
Ingest .csv data from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose to convert the data into Parquet.
Explanations:
This option relies on Apache Kafka Streams running on self-managed Amazon EC2 instances, so you must provision and operate the Kafka infrastructure yourself, and the Kafka Connect S3 sink needs additional setup (including schema-aware records) before it can serialize data as Parquet. This introduces more complexity and operational overhead than the other options.
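To make that extra setup concrete, here is a minimal, hypothetical sketch of just the Kafka Connect piece: registering Confluent's S3 sink connector (from the separately installed kafka-connect-s3 plugin) over the Connect REST API so that it writes Parquet. The worker URL, topic, and bucket names are assumptions, and the Kafka cluster and EC2 fleet must already be running before this call works, which is effort the other options avoid.

```python
# Hypothetical sketch: registering Confluent's S3 sink connector so it
# writes Parquet, via the Kafka Connect REST API. The worker URL, topic,
# and bucket are placeholders; the kafka-connect-s3 plugin must already
# be installed on the Connect workers.
import json

import requests

connector = {
    "name": "csv-to-parquet-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "csv-events",                   # placeholder topic
        "s3.bucket.name": "my-data-lake-bucket",  # placeholder bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        # Parquet output needs schema-aware records, so the raw .csv
        # stream must first be parsed into a structured format (e.g. Avro).
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "flush.size": "1000",
    },
}

# Kafka Connect workers expose a REST API (default port 8083).
resp = requests.post(
    "http://connect-worker:8083/connectors",  # placeholder worker URL
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```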
While this option uses Amazon Kinesis Data Streams and AWS Glue, it requires authoring, configuring, and scheduling a Glue job for the transformation, which adds complexity. Glue involves more configuration and management than using Kinesis Data Firehose directly for the format conversion.
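To illustrate the authoring work involved, here is a hedged sketch of the kind of Glue ETL script this option implies, written against the awsglue DynamicFrame API. The S3 paths are placeholder assumptions, and the job would still need an IAM role and a trigger or schedule around it.

```python
# Hypothetical sketch of the Glue ETL script this option would require:
# read staged .csv data from S3 and write it back as Parquet.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the .csv records (placeholder path).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-staging-bucket/csv/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the same records out as Parquet (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-bucket/parquet/"},
    format="parquet",
)

job.commit()
```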
Using Apache Spark Structured Streaming on an Amazon EMR cluster requires provisioning, sizing, and managing the cluster itself, which is complex and resource-intensive. Writing and operating the Spark job adds further overhead, making this a higher-effort option.
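As a sketch of what that Spark job entails (the paths and schema are assumed purely for illustration), a minimal Structured Streaming pipeline converting .csv to Parquet might look like this:

```python
# Hypothetical sketch of the Spark Structured Streaming job this option
# implies: watch an S3 prefix for new .csv files and continuously write
# Parquet. The EMR cluster itself must still be provisioned, sized, and
# maintained separately.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# File-based streaming sources require an explicit schema up front.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

stream = (
    spark.readStream
    .schema(schema)
    .option("header", "true")
    .csv("s3://my-staging-bucket/csv/")  # placeholder input prefix
)

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "s3://my-data-lake-bucket/parquet/")  # placeholder
    .option("checkpointLocation", "s3://my-staging-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```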
This option pairs Amazon Kinesis Data Streams with Amazon Kinesis Data Firehose, which ingests the records and converts them to Parquet with minimal setup. Firehose's built-in record format conversion handles the Parquet serialization automatically, making this the least-effort solution to implement.
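A minimal sketch of that Firehose configuration, assuming boto3 and placeholder ARNs, bucket, and Glue Data Catalog names, shows how little code the conversion needs. One hedged caveat: the format conversion feature deserializes records as JSON, so .csv records would be mapped to JSON on the way in (for example with a small Lambda transform), which is still far less setup than the other options.

```python
# Hypothetical sketch of the Firehose setup via boto3: a delivery stream
# that reads from Kinesis Data Streams and writes Parquet to S3 using the
# built-in record format conversion. All ARNs and names are placeholders.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="csv-to-parquet",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/csv-events",
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-data-lake-bucket",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Incoming records are deserialized as JSON...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ...and serialized back out as Parquet.
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Column names/types come from a table registered in the Glue
            # Data Catalog (placeholder database and table).
            "SchemaConfiguration": {
                "DatabaseName": "analytics",
                "TableName": "events",
                "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
            },
        },
    },
)
```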