Which design should a solutions architect recommend to meet these requirements?
Direct the requests from the API to a Network Load Balancer (NLB). Deploy the models as AWS Lambda functions that are invoked by the NLB.
Direct the requests from the API to an Application Load Balancer (ALB). Deploy the models as Amazon Elastic Container Service (Amazon ECS) services that read from an Amazon Simple Queue Service (Amazon SQS) queue. Use AWS App Mesh to scale the instances of the ECS cluster based on the SQS queue size.
Direct the requests from the API into an Amazon Simple Queue Service (Amazon SQS) queue. Deploy the models as AWS Lambda functions that are invoked by SQS events. Use AWS Auto Scaling to increase the number of vCPUs for the Lambda functions based on the SQS queue size.
Direct the requests from the API into an Amazon Simple Queue Service (Amazon SQS) queue. Deploy the models as Amazon Elastic Container Service (Amazon ECS) services that read from the queue. Enable AWS Auto Scaling on Amazon ECS for both the cluster and copies of the service based on the queue size.
Explanations:
Directing the requests to a Network Load Balancer (NLB) that invokes AWS Lambda functions is not viable: an NLB cannot use Lambda functions as targets (only an Application Load Balancer supports the Lambda target type). Even setting that aside, Lambda is a poor fit here because of cold-start latency and the time needed to load 1 GB of model data on each cold invocation, especially given the irregular usage patterns.
Using an Application Load Balancer (ALB) with Amazon ECS services is closer, but the design is internally inconsistent: AWS App Mesh is a service mesh for service-to-service communication and provides no scaling capability, so it cannot scale the ECS cluster based on SQS queue size. The synchronous ALB path also fails to exploit the asynchronous nature of the workload, so tasks could sit idle for long periods, incurring unnecessary cost.
Directing requests into SQS and processing them with AWS Lambda looks efficient, and Lambda's 10 GB memory ceiling can accommodate the 1 GB models, but the option's scaling mechanism is invalid: Lambda does not expose a vCPU setting to adjust (vCPU allocation scales automatically with the configured memory), and AWS Auto Scaling does not manage Lambda in this way. Reloading the 1 GB of model data on every cold start would also add significant latency under bursty load.
This option effectively addresses the irregular usage patterns of the models. Queuing requests in Amazon SQS decouples the API from the model workers and absorbs burst traffic. Deploying the models as long-running ECS services means each task loads the 1 GB of model data once and keeps it in memory, avoiding repeated initialization. AWS Auto Scaling can then adjust both the cluster capacity and the number of copies of the service based on queue size, scaling out during heavy usage and scaling in during periods of inactivity, which optimizes cost and resource utilization.
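The queue-size-based scaling described above can be sketched as a small helper. This is an illustrative "backlog per task" calculation, not an AWS API: the function name and the per-task throughput figure are assumptions. In practice, an Application Auto Scaling policy on the ECS service, driven by the SQS `ApproximateNumberOfMessagesVisible` CloudWatch metric, performs an equivalent computation for you.

```python
import math

def desired_task_count(queue_depth: int,
                       msgs_per_task: int,
                       min_tasks: int,
                       max_tasks: int) -> int:
    """Compute a target ECS task count from the SQS backlog.

    Each running task is assumed to work through `msgs_per_task`
    queued messages; the result is clamped to the configured
    minimum/maximum copies of the service.
    """
    if msgs_per_task <= 0:
        raise ValueError("msgs_per_task must be positive")
    # Scale out proportionally to the backlog, then clamp.
    needed = math.ceil(queue_depth / msgs_per_task)
    return max(min_tasks, min(max_tasks, needed))

# A burst of 1,200 queued requests, with each task expected to
# handle 100 messages, scales the service out to 12 tasks; an
# empty queue scales it back in to the minimum of 1 task.
print(desired_task_count(1200, 100, min_tasks=1, max_tasks=20))  # 12
print(desired_task_count(0, 100, min_tasks=1, max_tasks=20))     # 1
```

Clamping to `max_tasks` keeps a traffic spike from exhausting cluster capacity, while the `min_tasks` floor keeps at least one warm task holding the model in memory.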