Which storage scheme is MOST adapted to this scenario?

A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. Because data scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. It must also be possible to explore the data using SQL.

Which storage scheme is MOST adapted to this scenario?

Store datasets as files in Amazon S3.

Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.

Store datasets as tables in a multi-node Amazon Redshift cluster.

Store datasets as global tables in Amazon DynamoDB.

Explanations:

Amazon S3 is a highly scalable and cost-effective object storage service suitable for storing large amounts of training data. It allows for easy data management and retrieval, and supports SQL querying through services like Amazon Athena, making it ideal for exploration and analytics.

Storing datasets in an Amazon EBS volume does not scale automatically: a volume has a fixed provisioned capacity that must be resized manually, and it has to be attached to a running EC2 instance that the team must manage. You pay for the provisioned capacity whether or not it is used, and EBS offers no built-in way to query the files with SQL.

Amazon Redshift is a powerful data warehouse that supports SQL, but a multi-node cluster must be provisioned and resized as data grows, so it does not scale automatically with an arbitrary number of new datasets. Each dataset would also have to be loaded into a table, and paying for a running cluster is not the most cost-effective way to store numerous datasets created daily.

Amazon DynamoDB is a NoSQL key-value and document database. Global tables add multi-Region replication, which this scenario does not require, and the service is built for low-latency item lookups rather than ad hoc SQL exploration of whole datasets. Items are also limited to 400 KB each, which makes it a poor fit for arbitrary training datasets.
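
To make the S3 plus Athena option more concrete, here is a minimal Python sketch of how a dataset stored as CSV files under an S3 prefix could be exposed to SQL. It is an illustration under stated assumptions, not part of the original question: the bucket, database, table, and column names are hypothetical, the files are assumed to be comma-separated with one header row, and the helper functions are invented for this example. Only the second function calls AWS (through boto3's Athena client), and it needs valid credentials and an existing results location.

```python
from typing import Dict


def build_external_table_ddl(database: str, table: str,
                             columns: Dict[str, str], s3_location: str) -> str:
    """Build an Athena CREATE EXTERNAL TABLE statement for CSV files with a header row.

    The table only points at the files; the data itself stays in S3.
    """
    if not s3_location.startswith("s3://") or not s3_location.endswith("/"):
        raise ValueError("s3_location must look like 's3://bucket/prefix/'")
    column_lines = ",\n".join(
        f"  {name} {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def start_athena_query(sql: str, database: str, output_location: str) -> str:
    """Submit a statement to Athena and return its query execution ID.

    Requires AWS credentials; output_location is an S3 URI for query results.
    """
    import boto3  # imported lazily so the DDL helper works without the AWS SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical dataset: one S3 prefix per dataset, registered as one table.
    ddl = build_external_table_ddl(
        database="ml_datasets",
        table="churn_v1",
        columns={"customer_id": "string", "tenure_months": "int", "label": "int"},
        s3_location="s3://example-dataset-bucket/churn_v1/",
    )
    print(ddl)
```

With this layout, adding a dataset means uploading files under a new prefix and registering one more external table. No cluster or volume has to be resized, which is the property the scenario asks for.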
