Which storage strategy is the MOST cost-effective and meets the design requirements?
Design the application to store each incoming record as a single .csv file in an Amazon S3 bucket to allow for indexed retrieval. Configure a lifecycle policy to delete data older than 120 days.
Design the application to store each incoming record in an Amazon DynamoDB table properly configured for the scale. Configure the DynamoDB Time to Live (TTL) feature to delete records older than 120 days.
Design the application to store each incoming record in a single table in an Amazon RDS MySQL database. Run a nightly cron job that executes a query to delete any records older than 120 days.
Design the application to batch incoming records before writing them to an Amazon S3 bucket. Update the metadata for the object to contain the list of records in the batch and use the Amazon S3 metadata search feature to retrieve the data. Configure a lifecycle policy to delete the data after 120 days.
Explanations:
While storing each record as a .csv file in S3 is simple, S3 provides no secondary indexing, so retrieving specific records from among millions of small objects requires listing keys or maintaining an external index. Managing that many small files also adds per-request cost and overhead for high-throughput ingestion. Although a lifecycle policy can delete older data, retrieval speed and overall efficiency remain inadequate.
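For reference, the 120-day S3 lifecycle policy these options mention is a single bucket-level expiration rule. Below is a minimal sketch; the bucket name and rule ID are assumptions, and the payload is built locally with the actual boto3 call shown in a comment so the example runs without AWS credentials.

```python
RETENTION_DAYS = 120

def expiration_rule(days: int = RETENTION_DAYS) -> dict:
    # One rule that applies to every object in the bucket (empty prefix
    # filter) and permanently deletes objects once they are `days` old.
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": days},
            }
        ]
    }

# With boto3, applying the rule would look like (bucket name is hypothetical):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ingest-records", LifecycleConfiguration=expiration_rule())
```

Note that the rule operates on object age, not content, which is why it works equally well for the per-record and batched variants of the S3 options.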
Amazon DynamoDB is the correct choice: it is designed for high-throughput workloads and handles millions of small records with consistent low-latency reads and writes. The Time to Live (TTL) feature deletes items once a per-item expiry timestamp passes, at no additional cost, so records older than 120 days are removed without any scheduled jobs. This combination of low-latency indexed access and automated expiry makes it the most cost-effective option at this scale.
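A minimal sketch of how the TTL expiry described above could be attached to each incoming item. The table name "Records" and attribute name "expires_at" are assumptions, not from the question; the helper builds the item payload locally, and the boto3 calls that would consume it are shown in comments so the sketch stays runnable without AWS credentials.

```python
import time

TTL_DAYS = 120

def ttl_epoch(now=None) -> int:
    """Return the epoch-seconds timestamp DynamoDB TTL compares against."""
    base = time.time() if now is None else now
    return int(base + TTL_DAYS * 24 * 60 * 60)

def build_item(record_id: str, payload: str, now=None) -> dict:
    # TTL deletes items whose numeric expiry attribute ("expires_at" here,
    # an assumed name) is older than the current epoch time.
    return {
        "record_id": {"S": record_id},
        "payload": {"S": payload},
        "expires_at": {"N": str(ttl_epoch(now))},
    }

# With boto3, enabling TTL once and writing items would look like:
# client = boto3.client("dynamodb")
# client.update_time_to_live(
#     TableName="Records",
#     TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"})
# client.put_item(TableName="Records", Item=build_item("r-1", "data"))
```

Because expiry is computed per item at write time, no sweep job is ever needed; DynamoDB removes expired items in the background, typically within a few days of expiry.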
Amazon RDS for MySQL is a poor fit for millions of small records: a relational database adds cost and becomes a performance bottleneck at this ingestion rate. A nightly cron job that bulk-deletes old rows is operationally fragile rather than automated lifecycle management, and large DELETE statements generate heavy I/O that can contend with peak ingestion traffic.
Although batching records before writing them to S3 reduces the number of S3 requests, S3 does not support server-side search over user-defined object metadata, so locating a specific record would require listing objects and inspecting their metadata client-side. This adds complexity on both the write and read paths. While lifecycle policies can still handle the 120-day deletion, the design cannot match DynamoDB's indexed, low-latency retrieval for this use case.