What might cause this problem?
Lambda has very low memory assigned, which causes the function to fail at peak load.
Lambda is in a subnet that uses a NAT gateway to reach the internet, and the function instance does not have sufficient Amazon EC2 resources in the VPC to scale with the load.
The throttle limit set on API Gateway is very low. During peak load, the additional requests are not making their way through to Lambda.
DynamoDB is set up in an auto scaling mode. During peak load, DynamoDB adjusts capacity and throughput behind the scenes, which is causing the temporary downtime. Once the scaling completes, the retries go through successfully.
Explanations:
While low memory could cause Lambda functions to fail, the scenario describes successful invocations after retries, indicating that the function is executing but experiencing delays, not outright failures.
Using a NAT gateway can introduce latency, but it would not cause failures. This option also implies that Lambda is unable to scale, which is not typically the case unless specific resource limits apply in the VPC, and the scenario does not indicate any.
API Gateway has a default throttle limit. If this limit is reached during peak load, subsequent requests are rejected with an HTTP 429 (Too Many Requests) response until capacity becomes available again. This would explain why user requests fail multiple times before succeeding: each retry is throttled until the request rate falls back within the limit.
Although DynamoDB can auto scale, a correctly configured table should handle increased load without temporary downtime. The delays in this case are more likely linked to throttling at the API Gateway level than to DynamoDB scaling issues.
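
The "fails several times, then succeeds" pattern attributed to throttling can be illustrated with a small simulation. API Gateway documents its throttling in terms of a token bucket with a steady-state rate and a burst limit; the sketch below is a generic token bucket, not AWS's implementation. The class name `TokenBucket`, the helper `attempts_until_accepted`, the fixed one-second retry delay, and the limits (2 requests per second, burst of 2) are all assumptions chosen for illustration.

```python
class TokenBucket:
    """Generic token-bucket limiter (illustrative, not API Gateway's code)."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate      # tokens added per second (steady-state limit)
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the previous refill

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True       # request passes through to the backend
        return False          # request is throttled (a 429 in API Gateway)


def attempts_until_accepted(bucket, n_requests, retry_delay=1.0, max_attempts=10):
    """Send n_requests at time 0 and retry rejected ones every retry_delay
    seconds. Returns how many attempts each request has made."""
    attempts = [0] * n_requests
    pending = list(range(n_requests))
    now = 0.0
    for _ in range(max_attempts):
        still_pending = []
        for req in pending:
            attempts[req] += 1
            if not bucket.allow(now):
                still_pending.append(req)
        pending = still_pending
        if not pending:
            break
        now += retry_delay
    return attempts


if __name__ == "__main__":
    # Five simultaneous requests against a limit of 2 req/s with a burst of 2.
    print(attempts_until_accepted(TokenBucket(rate=2, burst=2), 5))
    # prints [1, 1, 2, 2, 3]
```

With these assumed numbers, two requests succeed immediately, two succeed on their first retry, and one needs a second retry. Every request eventually gets through, which matches the symptom in the scenario: nothing is broken downstream, but a limit set too low for peak traffic forces clients to retry.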