What should the Machine Learning Specialist do to the training solution to allow it to scale for future demand?
Do not change the TensorFlow code. Change the machine to one with a more powerful GPU to speed up the training.
Change the TensorFlow code to implement a Horovod distributed framework supported by Amazon SageMaker. Parallelize the training to as many machines as needed to achieve the business goals.
Switch to using the built-in Amazon SageMaker DeepAR model. Parallelize the training to as many machines as needed to achieve the business goals.
Move the training to Amazon EMR and distribute the workload to as many machines as needed to achieve the business goals.
Explanations:
Upgrading to a more powerful GPU may improve training speed, but it does not address scalability as the dataset grows. Because training still runs on a single machine, it also cannot support the hourly updates the business requires.
Implementing Horovod for distributed training lets the model scale across multiple GPUs and machines, significantly speeding up training and accommodating larger datasets while requiring only small changes to the existing TensorFlow code (see the sketch below). This approach meets the requirement for more frequent updates with minimal infrastructure changes, making it the correct choice.
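To illustrate how small the code change is, here is a minimal sketch (not the team's actual script) of the modifications Horovod requires in a TensorFlow/Keras training script; the toy model and random data stand in for the real network and dataset.

```python
# Minimal Horovod sketch; the toy model and random data are placeholders
# for the existing TensorFlow training code.
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU; initializes communication between workers

# Pin each worker process to a single GPU so they do not contend for devices.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and data standing in for the existing code.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
x, y = np.random.rand(1024, 10), np.random.rand(1024, 1)

# Scale the learning rate by the worker count, then wrap the optimizer so
# gradients are averaged across all workers via allreduce.
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size())
)
model.compile(optimizer=opt, loss="mse")

model.fit(
    x, y,
    batch_size=32,
    epochs=5,
    # Broadcast rank 0's initial weights so every worker starts identically.
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    verbose=1 if hvd.rank() == 0 else 0,  # only rank 0 logs progress
)
```

On SageMaker, the launch side changes just as little: the SageMaker Python SDK runs a Horovod script when the `sagemaker.tensorflow.TensorFlow` estimator is given `distribution={"mpi": {"enabled": True}}`, so scaling out is mostly a matter of raising `instance_count`.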
While switching to Amazon SageMaker's built-in DeepAR model may simplify implementation and leverage built-in optimizations, it requires abandoning the current TensorFlow code for a different model architecture and input format, and it does not offer the same level of customization as the existing model.
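For comparison, here is a hedged sketch of what adopting the built-in DeepAR algorithm entails; the IAM role, S3 paths, and hyperparameter values are placeholders, and note that none of the existing TensorFlow code carries over.

```python
# DeepAR replaces the custom model entirely; role, buckets, and
# hyperparameter values below are placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,               # DeepAR can train across multiple instances
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/deepar-output",           # placeholder bucket
    sagemaker_session=session,
)

# DeepAR's hyperparameters describe the time series, not a network the
# team designed, so existing architecture choices are lost.
estimator.set_hyperparameters(
    time_freq="H",          # hourly observations (assumed for this scenario)
    context_length=24,
    prediction_length=24,
    epochs=100,
)

# Training data must first be converted into DeepAR's JSON Lines format.
estimator.fit({"train": "s3://my-bucket/deepar-train"})   # placeholder path
```

The data-conversion step and the loss of the custom architecture are exactly why this option, despite parallelizing well, is a poorer fit than Horovod for a team with working TensorFlow code.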
Moving to Amazon EMR can distribute workloads, but EMR is better suited to big data processing than to deep learning model training. Adapting the existing TensorFlow model to run on EMR would demand substantial additional coding, which contradicts the goal of minimizing coding effort and infrastructure changes.