What is the MOST likely reason for this failure and how can it be mitigated in the future?
The network ACL for one subnet is blocking outbound web traffic. Open the network ACL and use IAM to prevent administrators from making future changes.
The fault is in the third-party environment. Contact the third party that provides the maps and request a fix that will provide better uptime.
One NAT instance has become overloaded. Replace both EC2 NAT instances with larger instances and make sure to account for growth when choosing the new instance size.
One of the NAT instances failed. Recommend replacing the EC2 NAT instances with a NAT gateway.
Explanations:
While a network ACL could potentially block outbound traffic, the problem is not consistent enough to suggest a configuration issue with the network ACL. Users can log in and reach the site, which indicates that the overall network configuration is likely fine. Additionally, resolving an ACL issue would not require using IAM to prevent administrators from making changes.
Although the third-party API could be a source of the issue, the problem is specifically related to the NAT instances and their configuration in the architecture, not necessarily a fault in the third-party service. If the API were down or unreliable, it would likely affect all calls, not just 50% of them after refreshing the page.
While it’s possible that one NAT instance is overloaded, simply replacing both with larger instances without addressing the underlying architecture (e.g., failover) would not guarantee that this issue won’t recur. Each NAT instance would remain a single point of failure, so the change does not improve high availability or resilience.
The most likely cause of the intermittent failures is that one of the NAT instances has failed or is not functioning properly, so outbound API calls succeed only when they are routed through the healthy instance. Replacing the EC2 NAT instances with a NAT gateway provides a more reliable solution: NAT gateways are managed by AWS, which gives better availability and scaling without the operational overhead of managing EC2 instances.
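
The 50% figure in the explanations can be illustrated with a small simulation. This is only a sketch under stated assumptions: two web servers in two private subnets, each subnet routing outbound traffic through its own NAT instance, and page loads distributed round-robin across the servers. The names `NatDevice`, `AppServer`, `failure_rate`, `web-a`, and `nat-a` are hypothetical and do not come from the question or from any AWS API.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class NatDevice:
    name: str
    healthy: bool = True


@dataclass
class AppServer:
    name: str
    nat: NatDevice  # assumed default route of this server's private subnet

    def call_maps_api(self) -> bool:
        # The outbound call succeeds only if this subnet's NAT device is up.
        return self.nat.healthy


def failure_rate(servers, requests: int) -> float:
    """Send `requests` page loads round-robin across servers; return the failed fraction."""
    targets = cycle(servers)
    failures = sum(0 if next(targets).call_maps_api() else 1 for _ in range(requests))
    return failures / requests


nat_a = NatDevice("nat-a")
nat_b = NatDevice("nat-b", healthy=False)  # the failed NAT instance
servers = [AppServer("web-a", nat_a), AppServer("web-b", nat_b)]

print(failure_rate(servers, 1000))  # 0.5
```

Under these assumptions, every request that lands on the server behind the failed NAT instance loses its outbound call, while login and page delivery still work because they do not depend on NAT. A third-party outage would instead fail the call from both servers.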