Which solution will meet these requirements with the LEAST operational overhead?
Create external tables in a Spark catalog. Configure jobs in AWS Glue to query the data.
Configure an AWS Glue crawler to crawl the data. Configure Amazon Athena to query the data.
Create external tables in a Hive metastore. Configure Spark jobs in Amazon EMR to query the data.
Configure an AWS Glue crawler to crawl the data. Configure Amazon Kinesis Data Analytics to use SQL to query the data.
Explanations:
Creating external tables in a Spark catalog makes the data queryable, but the table definitions and the AWS Glue jobs that run the queries must be written and maintained by hand. That is more operational overhead than pairing a Glue crawler with Athena.
An AWS Glue crawler discovers the schema automatically and registers the resulting tables in the AWS Glue Data Catalog, where Amazon Athena can query them directly. Athena is a serverless query service, so there is no infrastructure to provision or manage, which makes this the solution with the least operational overhead.
Creating external tables in a Hive metastore and running Spark jobs on Amazon EMR means managing the EMR cluster itself, including provisioning, scaling, and maintenance. That is more operational overhead than AWS Glue and Athena.
An AWS Glue crawler can catalog the data, but Amazon Kinesis Data Analytics is designed for running SQL over real-time data streams, not for batch analysis of clickstream data already stored in Amazon S3. Using it here adds unnecessary complexity.
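
The Glue crawler plus Athena approach described above can be sketched in Python roughly as follows. This is only an illustration, not a definitive implementation: the crawler name, IAM role ARN, database, bucket paths, and table name are all hypothetical placeholders. The `glue` and `athena` arguments are assumed to be boto3 clients (for example `boto3.client("glue")` and `boto3.client("athena")`), and the calls used are `create_crawler`, `start_crawler`, and `start_query_execution`.

```python
# All names below are hypothetical placeholders, not values from the question.
CRAWLER = {
    "Name": "clickstream-crawler",
    "Role": "arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    "DatabaseName": "clickstream_db",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/clickstream/"}]},
}


def query_request(sql, database, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


QUERY = query_request(
    "SELECT COUNT(*) FROM clickstream",  # table name is an assumption
    CRAWLER["DatabaseName"],
    "s3://example-bucket/athena-results/",
)


def catalog_and_query(glue, athena, crawler_kwargs, query_kwargs):
    """Create and start a crawler, then submit an Athena query.

    Returns the Athena query execution ID. A real script would wait for
    the crawler run to finish before querying; that step is omitted here.
    """
    glue.create_crawler(**crawler_kwargs)
    glue.start_crawler(Name=crawler_kwargs["Name"])
    response = athena.start_query_execution(**query_kwargs)
    return response["QueryExecutionId"]
```

Passing the clients in as arguments keeps the sketch free of any cluster or server setup, which is the point of this option: the only moving parts are a crawler definition and a SQL string.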