What would be the MOST cost-effective, high-availability storage solution for this workflow?
Store the data files in Amazon S3 and use a Range GET to read each file's metadata, then index the relevant data.
Store the data files in Amazon EFS mounted by the EC2 fleet and EMR nodes.
Store the data files on Amazon EBS volumes and allow the EC2 fleet and EMR to mount and unmount the volumes where they are needed.
Store the content of the data files in Amazon DynamoDB tables with the metadata, index, and data as their own keys.
Explanations:
Storing the data files in Amazon S3 is both cost-effective and highly available. A Range GET retrieves only a specified byte range of an object, so the EC2 instances can read each file's metadata without downloading the entire file. S3 is designed for 99.999999999% (eleven nines) durability and scales to any number of large data files, making it well suited to this workflow.
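For illustration, here is a minimal boto3 sketch of a ranged GET. It assumes each file stores its metadata in a fixed-size header at the start of the object; the bucket name, object key, and 1 KB header size are hypothetical, not part of the original scenario.

```python
import boto3

s3 = boto3.client("s3")

# Fetch only the first 1,024 bytes of the object instead of the whole file.
# "Range" uses standard HTTP Range header syntax (bytes=start-end, inclusive).
response = s3.get_object(
    Bucket="example-data-bucket",   # hypothetical bucket name
    Key="datasets/file-0001.dat",   # hypothetical object key
    Range="bytes=0-1023",           # assumed 1 KB metadata header
)

metadata_bytes = response["Body"].read()
print(len(metadata_bytes))  # at most 1024 bytes transferred
```

Because only the requested byte range is transferred, the indexing fleet avoids downloading entire large files, which keeps both data-transfer time and request costs low.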
Amazon EFS provides high availability, but it is not the most cost-effective choice: its per-GB storage price is roughly an order of magnitude higher than S3 Standard. For a workflow in which large files are read infrequently, paying for a shared POSIX file system adds cost without a corresponding benefit.
Amazon EBS volumes attach to a single EC2 instance at a time (Multi-Attach is limited to io1/io2 volumes within one Availability Zone), so sharing them across an EC2 fleet and EMR cluster would require complex attach/detach orchestration. EBS is also more expensive than S3 for data at this scale, and because each volume lives in a single Availability Zone it cannot match S3's durability or availability.
Using DynamoDB to store large data files is neither cost-effective nor practical: each DynamoDB item is capped at 400 KB, so large files cannot be stored directly, and storage pricing is far higher than S3. DynamoDB is designed for key-value and document workloads, not large binary objects, making it unsuitable for this scenario.