Which method of providing training data to Amazon SageMaker would meet the business requirements with the LEAST development overhead?
Use Amazon SageMaker script mode and use train.py unchanged. Point the Amazon SageMaker training invocation to the local path of the data without reformatting the training data.
Use Amazon SageMaker script mode and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the Amazon SageMaker training invocation to the S3 bucket without reformatting the training data.
Rewrite the train.py script to add a section that converts TFRecords to protobuf and ingests the protobuf data instead of TFRecords.
Prepare the data in the format accepted by Amazon SageMaker. Use AWS Glue or AWS Lambda to reformat and store the data in an Amazon S3 bucket.
Explanations:
While using the local path of the data without reformatting may seem convenient, SageMaker training jobs run in managed containers that read input from durable, network-accessible storage, typically Amazon S3. A local path on the developer's machine is not visible to the training instances, so this option would not work in a cloud-based training environment.
This option meets the business requirement with the least development overhead by allowing the existing train.py script to remain unchanged. By placing the TFRecords in an S3 bucket and pointing the training job to that bucket, the model can use the existing data format directly, without additional conversion or modification.
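For concreteness, here is a minimal sketch of what this option looks like with the SageMaker Python SDK (v2). The bucket name, IAM role ARN, instance type, and framework versions are illustrative assumptions, not values from the question:

```python
# Minimal sketch: launch a script mode training job that reads
# existing TFRecords from S3, with train.py unchanged.
from sagemaker.tensorflow import TensorFlow

# Hypothetical execution role ARN
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

estimator = TensorFlow(
    entry_point="train.py",        # the existing script, unchanged
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge", # illustrative instance type
    framework_version="2.11",      # assumed TF version
    py_version="py39",
)

# Point the training job at the S3 prefix holding the TFRecord files;
# no reformatting of the training data is needed.
estimator.fit({"training": "s3://my-training-bucket/tfrecords/"})
```

Inside the container, SageMaker copies the channel's objects to /opt/ml/input/data/training and exposes that path through the SM_CHANNEL_TRAINING environment variable, so the unchanged train.py can read the TFRecord files from that directory (for example, with tf.data.TFRecordDataset) just as it would locally.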
This option involves significant development overhead because it requires rewriting the train.py script to convert the TFRecords to protobuf. Script mode imposes no such format requirement, so this added complexity and development effort are unnecessary; the existing TFRecord format can be used directly.
This option entails additional overhead by requiring the data to be reformatted with AWS Glue or AWS Lambda before training. Because script mode accepts the existing TFRecords directly, this extra ETL step adds unnecessary complexity and development effort.