Which solution will meet these requirements?
Direct the requests from the API to a Network Load Balancer (NLB). Deploy the ML models as AWS Lambda functions that the NLB will invoke. Use auto scaling to scale the Lambda functions based on the traffic that the NLB receives.
Direct the requests from the API to an Application Load Balancer (ALB). Deploy the ML models as Amazon Elastic Container Service (Amazon ECS) services that the ALB will invoke. Use auto scaling to scale the ECS cluster instances based on the traffic that the ALB receives.
Direct the requests from the API into an Amazon Simple Queue Service (Amazon SQS) queue. Deploy the ML models as AWS Lambda functions that SQS events will invoke. Use auto scaling to increase the number of vCPUs for the Lambda functions based on the size of the SQS queue.
Direct the requests from the API into an Amazon Simple Queue Service (Amazon SQS) queue. Deploy the ML models as Amazon Elastic Container Service (Amazon ECS) services that read from the queue. Use auto scaling for Amazon ECS to scale both the cluster capacity and number of the services based on the size of the SQS queue.
Explanations:
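A Network Load Balancer cannot invoke AWS Lambda functions; Lambda is a supported target type only for Application Load Balancers. Lambda also scales by concurrency on its own, not through auto scaling policies driven by load balancer traffic. Even setting that aside, Lambda execution time is capped at 15 minutes, and every cold start would have to load 1 GB of model data before serving a request, which adds latency. Lambda execution environments are also not guaranteed to persist between invocations, so the loaded model cannot be relied on to stay in memory across a batch.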
Amazon ECS behind an ALB can host the models, but the ALB forwards each request to a container synchronously and does not buffer anything. A sudden burst therefore reaches the tasks before new capacity has started and loaded its model data, so requests can time out or fail unless instances are kept pre-warmed. This option also scales only the cluster instances, not the number of service tasks, so added instances do not by themselves run more model containers.
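Using SQS with AWS Lambda decouples the requests, but the scaling mechanism described does not exist: Lambda CPU is allocated in proportion to the configured memory and cannot be changed by auto scaling based on queue size. Model loading on cold starts would still add latency, and Lambda’s memory and execution time limits could restrict the model’s ability to handle large datasets.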
This option uses SQS to decouple request handling from model processing, so bursts of requests wait in the queue instead of overwhelming the models. ECS tasks can run long-lived processes with large memory allocations, which suits models that load their data once and keep it in memory. Scaling both the cluster capacity and the number of service tasks on queue size lets the platform grow during bursts and shrink when the queue is empty.
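To make the queue-based scaling idea concrete, here is a minimal sketch of how a desired ECS task count could be derived from the queue depth. It is an illustration under stated assumptions, not the platform's actual policy: the function name `desired_task_count`, the `backlog_per_task` target, and the task limits are all hypothetical. In practice the queue depth would typically come from the SQS `ApproximateNumberOfMessagesVisible` CloudWatch metric, and a scaling policy would apply the result rather than hand-written code.

```python
import math


def desired_task_count(queue_depth, backlog_per_task, min_tasks=0, max_tasks=50):
    """Return how many tasks to run for the current queue depth.

    backlog_per_task is the (assumed) number of queued requests that one
    task can work through within an acceptable delay.
    """
    if backlog_per_task <= 0:
        raise ValueError("backlog_per_task must be positive")
    # Enough tasks that each one owns at most backlog_per_task messages.
    needed = math.ceil(queue_depth / backlog_per_task)
    # Clamp to the configured floor and ceiling.
    return max(min_tasks, min(max_tasks, needed))


if __name__ == "__main__":
    # Hypothetical depths: idle, light use, a batch of thousands, a flood.
    for depth in (0, 40, 3000, 100000):
        print(depth, desired_task_count(depth, backlog_per_task=100))
```

With `min_tasks=0`, an empty queue yields zero tasks, and a large backlog is capped at `max_tasks` so a burst cannot scale the service without bound.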