Which approach allows the Specialist to use all the data to train the model?
Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Initiate a SageMaker training job on the full dataset in the S3 bucket, using Pipe input mode.
Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to the instance. Train on a small amount of the data to verify the training code and hyperparameters. Go back to Amazon SageMaker and train using the full dataset.
Use AWS Glue to train a model using a small subset of the data to confirm that the data will be compatible with Amazon SageMaker. Initiate a SageMaker training job on the full dataset in the S3 bucket, using Pipe input mode.
Load a smaller subset of the data into the SageMaker notebook and train locally. Confirm that the training code is executing and the model parameters seem reasonable. Launch an Amazon EC2 instance with an AWS Deep Learning AMI and attach the S3 bucket to train on the full dataset.
Explanations:
The first option is correct. It lets the Specialist first validate the training code on a smaller subset of data in the SageMaker notebook without exceeding the notebook instance's volume limits. By using Pipe input mode in a SageMaker training job, the algorithm streams the data directly from Amazon S3 as it trains, so the full dataset never has to be copied onto the training instance's storage. This approach confirms the code works while still using all of the data for training.
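For reference, a minimal sketch of launching such a training job with the SageMaker Python SDK, assuming a built-in XGBoost container and placeholder values for the IAM role, bucket, and prefix (all illustrative, not from the question):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder execution role

# Retrieve a built-in algorithm image for the current region (XGBoost chosen as an example)
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    input_mode="Pipe",          # stream records from S3 instead of downloading them to the instance volume
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# The full dataset stays in S3; Pipe mode streams it to the training container
train_input = TrainingInput(
    s3_data="s3://example-bucket/full-dataset/train/",  # placeholder bucket/prefix
    content_type="text/csv",
)

estimator.fit({"train": train_input})
```

Because the data is streamed rather than staged locally, the size of the training instance's EBS volume no longer limits how much data the job can consume.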
Launching an EC2 instance with a Deep Learning AMI provides more compute and storage, but it bypasses SageMaker's managed training capabilities, and an S3 bucket is object storage that cannot simply be attached to an instance the way a block-storage volume can. Training on a small dataset there and then switching back to SageMaker for the full-dataset run adds unnecessary complexity and may still hit storage or performance limits when accessing the large dataset.
AWS Glue is built for data transformation and ETL, not model training. Even if it could confirm that the data is compatible, it would not validate the training code or hyperparameters and provides no straightforward path to training the model, so this option does not move the Specialist closer to training on the full dataset in SageMaker.
This option suggests training on a small subset in SageMaker before switching to EC2 for the full dataset. While it could work, it adds unnecessary complexity and forgoes SageMaker's support for large datasets. The full dataset would still have to be loaded onto the EC2 instance, which faces storage limitations similar to those of the SageMaker notebook instance.