What is the MOST cost-effective solution?
Create a new Amazon Redshift cluster. Create an AWS Glue ETL job to copy data from the RDS databases to the Amazon Redshift cluster. Use Amazon Redshift to run the query.
Create an Amazon EMR cluster with enough core nodes. Run an Apache Spark job to copy data from the RDS databases to a Hadoop Distributed File System (HDFS). Use a local Apache Hive metastore to maintain the table definition. Use Spark SQL to run the query.
Use an AWS Glue ETL job to copy all the RDS databases to a single Amazon Aurora PostgreSQL database. Run SQL queries on the Aurora PostgreSQL database.
Use an AWS Glue crawler to crawl all the databases and create tables in the AWS Glue Data Catalog. Use an AWS Glue ETL job to load data from the RDS databases to Amazon S3, and use Amazon Athena to run the queries.
Explanations:
Amazon Redshift is well suited to analytics and complex queries, but a provisioned cluster sized for a dataset of about 100 TB is billed for as long as it runs, whether or not any query is executing. Loading the data and maintaining a separate cluster add to the overall expense.
Amazon EMR is complex and costly for this use case. The cluster has to be sized, managed, and kept running because the HDFS data lives on its core nodes, and Apache Spark plus a local Hive metastore add further operational overhead. That is more configuration and management than ad-hoc SQL queries warrant.
Amazon Aurora is a scalable relational database, but consolidating all the RDS databases into a single Aurora PostgreSQL database is unlikely to be the most cost-effective solution. The migration is complex, and the ongoing cost of running one large database may outweigh the benefits, especially if the teams still require their own database instances.
This option uses AWS Glue for data integration and Amazon Athena for querying, which makes it the most cost-effective solution. Crawling and cataloging the data with AWS Glue and storing it in Amazon S3 gives a flexible architecture with no separate data warehouse or cluster to run. Athena is serverless and bills by the amount of data each query scans, so you pay only for the queries you run, which makes it ideal for ad-hoc querying on large datasets like 100 TB.
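The pay-per-query argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a list price of 5 USD per TB scanned (check current pricing for your Region), and the function names, query count, and scan fraction are hypothetical values chosen for the example, not figures from the question.

```python
# Assumption: 5 USD per TB scanned. Verify against current Athena pricing.
ASSUMED_PRICE_PER_TB_USD = 5.0


def athena_scan_cost(tb_scanned: float,
                     price_per_tb: float = ASSUMED_PRICE_PER_TB_USD) -> float:
    """Estimated cost of scanning `tb_scanned` terabytes with Athena."""
    if tb_scanned < 0:
        raise ValueError("tb_scanned must be non-negative")
    return tb_scanned * price_per_tb


def monthly_query_cost(queries_per_month: int,
                       avg_fraction_scanned: float,
                       dataset_tb: float = 100.0,
                       price_per_tb: float = ASSUMED_PRICE_PER_TB_USD) -> float:
    """Estimated monthly cost when each query scans a fraction of the dataset."""
    if not 0.0 <= avg_fraction_scanned <= 1.0:
        raise ValueError("avg_fraction_scanned must be between 0 and 1")
    tb_per_query = dataset_tb * avg_fraction_scanned
    return queries_per_month * athena_scan_cost(tb_per_query, price_per_tb)


if __name__ == "__main__":
    # Hypothetical workload: 20 ad-hoc queries a month, each scanning 2% of 100 TB.
    print(f"One full scan: {athena_scan_cost(100.0):.2f} USD")
    print(f"Monthly total: {monthly_query_cost(20, 0.02):.2f} USD")
```

Under these assumptions the cost scales with data scanned rather than with hours of cluster uptime, so an idle month costs nothing for queries. Partitioning the S3 data or writing it in a columnar format lowers the fraction each query scans and therefore the bill.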