Which additional data preparation steps should the company take before uploading the files to Amazon S3?
A. Generate two Apache Parquet files, training.parquet and validation.parquet, by reading the images into a pandas DataFrame and storing each DataFrame as a Parquet file. Upload the Parquet files to the training S3 bucket.
B. Compress the training and validation directories by using the Snappy compression library. Upload the manifest and compressed files to the training S3 bucket.
C. Compress the training and validation directories by using the gzip compression library. Upload the manifest and compressed files to the training S3 bucket.
D. Generate two RecordIO files, training.rec and validation.rec, from the manifest files by using the im2rec Apache MXNet utility tool. Upload the RecordIO files to the training S3 bucket.
Explanations:
Option A: While Parquet files can improve data loading performance for tabular data, the SageMaker image classification algorithm does not accept Parquet as an input format. It requires image data either in RecordIO format or as individual JPEG/PNG images referenced by the manifest files.
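For illustration only, here is a minimal sketch of how a training job's input channels declare the RecordIO content type using the SageMaker Python SDK; the bucket name, IAM role ARN, and hyperparameter values below are assumptions, not values from the question:

```python
import sagemaker
from sagemaker.inputs import TrainingInput

# Hypothetical bucket and IAM role; substitute your own.
bucket = "s3://my-training-bucket"
role = "arn:aws:iam::123456789012:role/SageMakerRole"

session = sagemaker.Session()
region = session.boto_region_name

# Resolve the container for the built-in image classification algorithm.
image_uri = sagemaker.image_uris.retrieve("image-classification", region)

estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    sagemaker_session=session,
)

# Illustrative hyperparameter values only.
estimator.set_hyperparameters(num_classes=2, num_training_samples=1000, num_layers=18)

# Each channel declares RecordIO as its content type; a Parquet file
# uploaded here would not be readable by the algorithm.
estimator.fit({
    "train": TrainingInput(f"{bucket}/train/", content_type="application/x-recordio"),
    "validation": TrainingInput(f"{bucket}/validation/", content_type="application/x-recordio"),
})
```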
Option B: Snappy-compressed archives are not a supported input format for SageMaker. The image classification algorithm expects images in specific formats for training, and supplying unsupported compressed directories would cause errors during data loading or training.
Option C: As with option B, gzip-compressed directories are not compatible with the SageMaker image classification algorithm. The algorithm expects images in supported formats, not compressed archives.
Option D: Generating RecordIO files from the manifest files with the im2rec utility is the appropriate preparation step. RecordIO is a preferred input format for SageMaker image classification because it packages many small image files into a few large, sequentially readable records, which speeds up data loading and improves throughput during model training.
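A minimal sketch of this step, assuming im2rec.py is available from an Apache MXNet checkout (with mxnet and opencv-python installed), that training.lst and validation.lst list files sit alongside the image directories, and that the bucket name is hypothetical:

```python
import subprocess
import boto3

# Hypothetical path to the im2rec tool, which ships with MXNet under tools/.
IM2REC = "mxnet/tools/im2rec.py"

for split in ("training", "validation"):
    # Pack the images referenced in <split>.lst into a single <split>.rec file.
    # --resize bounds the shorter edge at 256 px; --quality sets JPEG quality.
    subprocess.run(
        ["python", IM2REC, "--resize", "256", "--quality", "95",
         "--num-thread", "4", split, f"data/{split}"],
        check=True,
    )

# Upload the packed RecordIO files to the training bucket (hypothetical name).
s3 = boto3.client("s3")
for split in ("training", "validation"):
    s3.upload_file(f"{split}.rec", "my-training-bucket", f"{split}/{split}.rec")
```

The resulting training.rec and validation.rec objects can then be passed to the training job as the train and validation channels with content type application/x-recordio, as sketched under option A.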