Which approach allows the Specialist to use all the data to train the model?
Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job on the full dataset in the S3 bucket, using Pipe input mode.
Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset.
Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job on the full dataset in the S3 bucket, using Pipe input mode.
Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train on the full dataset.
Explanations:
The first option is correct. It lets the Specialist first validate the training code on a smaller subset of data in the SageMaker notebook without exceeding the notebook instance's volume limits. By using Pipe input mode in a SageMaker training job, the algorithm streams the data directly from Amazon S3 as it trains, so the full dataset never has to be copied onto the training instance's storage. This approach confirms the code works while still using all of the data for training.
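For reference, a minimal sketch of launching such a training job with the SageMaker Python SDK, assuming a built-in XGBoost container and placeholder values for the IAM role, bucket, and prefix (all illustrative, not from the question):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder execution role

# Retrieve a built-in algorithm image for the current region (XGBoost chosen as an example)
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    input_mode="Pipe",          # stream records from S3 instead of downloading them to the instance volume
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# The full dataset stays in S3; Pipe mode streams it to the training container
train_input = TrainingInput(
    s3_data="s3://example-bucket/full-dataset/train/",  # placeholder bucket/prefix
    content_type="text/csv",
)

estimator.fit({"train": train_input})
```

Because the data is streamed rather than staged locally, the size of the training instance's EBS volume no longer limits how much data the job can consume.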
Launching an EC2 instance with a Deep Learning AMI provides more compute and storage, but it bypasses SageMaker's managed training capabilities, and an S3 bucket is object storage that cannot simply be attached to an instance the way a block-storage volume can. Training on a small dataset there and then switching back to SageMaker for the full-dataset run adds unnecessary complexity and may still hit storage or performance limits when accessing the large dataset.
AWS Glue is built for data transformation and ETL, not model training. Even if it could confirm that the data is compatible, it would not validate the training code or hyperparameters and provides no straightforward path to training the model, so this option does not move the Specialist closer to training on the full dataset in SageMaker.
This option suggests training on a small subset in SageMaker before switching to EC2 for the full dataset. While it could work, it adds unnecessary complexity and forgoes SageMaker's support for large datasets. The full dataset would still have to be loaded onto the EC2 instance, which faces storage limitations similar to those of the SageMaker notebook instance.