What should the ML specialist do to provide the training data to SageMaker with the LEAST development overhead?
Put the TFRecord data into an Amazon S3 bucket. Use AWS Glue or AWS Lambda to reformat the data to protobuf format and store the data in a second S3 bucket. Point the SageMaker training invocation to the second S3 bucket.
Rewrite the train.py script to add a section that converts TFRecord data to protobuf format. Point the SageMaker training invocation to the local path of the data. Ingest the protobuf data instead of the TFRecord data.
Use SageMaker script mode, and use train.py unchanged. Point the SageMaker training invocation to the local path of the data without reformatting the training data.
Use SageMaker script mode, and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the SageMaker training invocation to the S3 bucket without reformatting the training data.
Explanations:
Using AWS Glue or AWS Lambda to reformat the data to protobuf introduces unnecessary complexity and increases development overhead. SageMaker script mode can consume TFRecord data directly, so no format conversion is needed.
Rewriting the train.py script to convert TFRecord data to protobuf adds development effort for no benefit, since the script can already read TFRecord data. In addition, pointing the training invocation at a local path does not work in this context, because SageMaker training jobs run on remote managed instances.
SageMaker script mode does not require modifying the training script, but the data still has to be somewhere the training job can reach. Because SageMaker training runs on managed instances, the TFRecord data must be accessed from Amazon S3, not from a local path.
Correct. Using SageMaker script mode with the unchanged train.py script and pointing the training invocation at the TFRecord data in an S3 bucket requires the least development overhead: script mode runs the existing TensorFlow script as-is, and SageMaker copies the S3 channel contents onto the training instance, where the script reads the TFRecord files exactly as it did before.
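For reference, a minimal sketch of the correct option using the SageMaker Python SDK (v2, where script mode is the default behavior of the TensorFlow estimator). The S3 URI, instance type, and framework/Python versions below are illustrative assumptions, not values from the question; train.py is the unchanged script.

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

# Assumes this runs inside a SageMaker notebook/Studio environment,
# where get_execution_role() resolves the attached IAM role.
role = sagemaker.get_execution_role()

estimator = TensorFlow(
    entry_point="train.py",        # the existing, unmodified training script
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",  # placeholder instance type
    framework_version="2.11",      # placeholder TF version
    py_version="py39",
)

# SageMaker downloads the channel contents from S3 to
# /opt/ml/input/data/training on the training instance
# (exposed via the SM_CHANNEL_TRAINING environment variable);
# train.py reads the TFRecord files from there as-is.
estimator.fit({"training": "s3://my-bucket/tfrecord-data/"})
```

Note that no reformatting step appears anywhere: the TFRecord files are uploaded to S3 once and consumed as-is, which is why this option carries the least development overhead.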