How can I troubleshoot packet loss for my Direct Connect connection?

I'm using AWS Direct Connect to transfer data. I'm experiencing packet loss when I transfer data to my Amazon Elastic Compute Cloud (Amazon EC2) instance. I need to isolate the packet loss.

Resolution

Packet loss occurs when transmitted data packets fail to arrive at their destination, which degrades network performance. Common causes include low signal strength at the destination, excessive system utilization, network congestion, and network route misconfigurations.

Run the following checks for your network devices and Direct Connect connection.

Check the AWS Personal Health Dashboard for scheduled maintenance or events

The AWS Personal Health Dashboard shows which of your resources are under maintenance and provides notifications for scheduled activities. For more information, see How can I get notifications for Direct Connect scheduled maintenance or events?

Check metrics for the Direct Connect endpoint, customer gateway, and intermediate device (layer 1)

With customer gateway and intermediate devices, the issue might be local to the on-premises network or the transit path towards AWS. Check the following on the on-premises node and intermediate devices:

  • The customer gateway logs, for interface flaps
  • CPU utilization on the customer gateway when the issue occurred
  • The light signal reading on the device where the Direct Connect connection terminates
  • The device where the Direct Connect connection terminates, for input errors, incrementing framing errors, cyclic redundancy check (CRC) errors, runts, giants, or throttles

Check Direct Connect connection metrics (layer 1)

Check for the following Direct Connect metrics:

  • ConnectionErrorCount: Apply the sum statistic for this metric. Note that non-zero values indicate MAC level errors in the AWS device.
  • ConnectionLightLevelTx and ConnectionLightLevelRx: Check the light signal recorded on the Direct Connect connection when the issue occurred. The acceptable range is between -14.4 and 2.50 dBm.
  • ConnectionBpsEgress and ConnectionBpsIngress: Check the amount of traffic on the Direct Connect connection when the packet loss occurred to see whether the link was congested. If you use 100% of the interface capacity, then you might experience packet loss because of excess traffic.

For more information, see Direct Connect Connection metrics.
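
As a rough aid for reading these metrics, the following Python sketch applies the light-level range quoted above and converts a bits-per-second reading into a percentage of port speed. The function names and sample values are illustrative assumptions, not part of any AWS SDK. Pass in the numbers that you read from the connection metrics.

```python
# Illustrative helpers for interpreting Direct Connect metric readings.
# All names and sample values here are hypothetical, not an AWS API.

LIGHT_MIN_DBM = -14.4  # lower bound of the acceptable range quoted in this article
LIGHT_MAX_DBM = 2.50   # upper bound of the acceptable range quoted in this article


def light_level_ok(dbm: float) -> bool:
    """Return True if a TX or RX light-level reading is within the range."""
    return LIGHT_MIN_DBM <= dbm <= LIGHT_MAX_DBM


def utilization_percent(bps: float, port_speed_gbps: float) -> float:
    """Convert an egress or ingress bits-per-second reading to percent of port speed."""
    return 100.0 * bps / (port_speed_gbps * 1_000_000_000)


if __name__ == "__main__":
    # Sample readings (made up for illustration).
    print(light_level_ok(-3.1))
    print(light_level_ok(-20.0))
    print(utilization_percent(950_000_000, 1))
```

A utilization value that approaches 100 at the time of the packet loss suggests congestion on the link. A light-level reading outside the range points to the physical layer instead.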

Check for asymmetric or suboptimal routing (layer 3)

Asymmetric routing occurs when network traffic enters through one connection and exits through another connection. This routing might cause packet loss if the on-premises firewall performs unicast reverse path forwarding (uRPF) checks, because the firewall can drop traffic that arrives on an interface other than the one it uses to reach the source.

  • If you have a backup redundant Direct Connect connection or a backup AWS Site-to-Site VPN connection, then check for any asymmetric routing that might be happening.
  • Suppose that you have a backup Site-to-Site VPN connection and advertise similar prefixes over both the Direct Connect and VPN connections. In this case, traffic from AWS to on-premises is routed through Direct Connect. To avoid asymmetric routing, make sure that traffic from on-premises to AWS is also sent over only Direct Connect.
  • If you have a backup Direct Connect connection, then asymmetric routing might happen depending on how you advertise your prefixes over both Direct Connect connections.
  • Suboptimal routing with the on-premises network might cause packet loss.

For more information, see How can I resolve asymmetric routing issues when I create a VPN as a backup to Direct Connect in a transit gateway?
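
To illustrate why asymmetric routing can lead to drops, the following Python sketch models a strict reverse path check: a packet is accepted only if the best route back to its source points out of the interface that the packet arrived on. This is a simplified model under stated assumptions (longest-prefix match only, no equal-cost paths, no loose mode). The interface names and prefixes are hypothetical, and real firewalls might behave differently.

```python
import ipaddress

# Simplified model of a strict reverse path forwarding check.
# Routes are (prefix, interface) pairs; all names below are hypothetical.


def best_route_interface(routes, address):
    """Longest-prefix match: return the interface of the most specific route."""
    addr = ipaddress.ip_address(address)
    matches = []
    for prefix, interface in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, interface))
    if not matches:
        return None  # no route back to the source
    return max(matches, key=lambda m: m[0])[1]


def strict_rpf_accepts(routes, source_ip, arrival_interface):
    """Accept only if the route back to the source uses the arrival interface."""
    return best_route_interface(routes, source_ip) == arrival_interface


if __name__ == "__main__":
    # Hypothetical firewall routes: the VPC prefix is reachable through "dx".
    routes = [("10.0.0.0/16", "dx"), ("0.0.0.0/0", "internet")]
    print(strict_rpf_accepts(routes, "10.0.1.5", "dx"))   # symmetric path
    print(strict_rpf_accepts(routes, "10.0.1.5", "vpn"))  # asymmetric path
```

In this model, traffic from the VPC that arrives over the VPN interface is rejected because the route back to the VPC points to the Direct Connect interface.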

End-to-end bidirectional traceroute between the on-premises host and the AWS host (layer 3)

Running traceroute between the hosts shows the network path taken in both directions. The trace results also show whether the routing is asymmetric, load balanced, and so on.

1.    Run the following command to install traceroute:

Amazon Linux/RHEL:

sudo yum install traceroute

Ubuntu:

sudo apt-get install traceroute

2.    Run a command similar to the following for the TCP traceroute:

sudo traceroute -T -p <destination Port> <IP of destination host>

Windows OS:

  1. Download WinPcap and tracetcp.
  2. Extract the tracetcp ZIP file.
  3. Copy tracetcp.exe to your C drive.
  4. Install WinPcap.
  5. Open the command prompt and change to the root of your C drive using the C:\Users\username>cd \ command.
  6. Run tracetcp using the following commands: tracetcp.exe hostname:port or tracetcp.exe ip:port.

End-to-end bidirectional MTR test between the on-premises host and the AWS host (layer 3)

Like traceroute, MTR discovers each router in the network path between the hosts. MTR also reports statistics for each node in the path, such as packet loss.

Check the MTR results for packet loss and network latency. A loss percentage at a hop can indicate an issue with that router. However, some service providers limit the ICMP traffic that MTR uses. To determine whether the packet loss is due to rate limits, review the subsequent hops. If a subsequent hop shows a loss of 0.0%, then the loss at the earlier hop is likely caused by ICMP rate limiting and not by actual packet loss.

1.    Run the following command to install MTR:

Amazon Linux/RHEL:

$ sudo yum install mtr -y

Ubuntu:

sudo apt install mtr -y

Windows OS:

Download and install WinMTR.

Note: For Windows OS, WinMTR doesn't support TCP-based MTR.

2.    For the on-premises to AWS direction, run MTR on the on-premises host (ICMP and TCP based):

$ mtr -n -c 100 <private IP of EC2> --report
$ mtr -n -T -P <EC2 instance open TCP port> -c 100 <private IP of EC2> --report

3.    For the AWS to on-premises direction, run MTR on the EC2 instance (ICMP and TCP based):

$ mtr -n -c 100 <private IP of the local host> --report
$ mtr -n -T -P <local host open TCP port> -c 100 <private IP of the local host> --report
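
The rate-limiting rule described earlier in this section can be sketched in Python as follows. The sketch assumes the usual text layout of mtr --report output, where each hop line starts with the hop number followed by ".|--", the host, and the loss percentage. The sample report is made up for illustration, and the function names are hypothetical.

```python
import re

# Matches hop lines such as "  2.|-- 169.254.92.181   20.0%   100   1.9 ..."
# (assumed layout of "mtr --report" text output).
HOP_LINE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?")


def parse_mtr_report(text):
    """Return a list of (hop_number, host, loss_percent) tuples."""
    hops = []
    for line in text.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return hops


def classify_loss(hops):
    """Label each lossy hop using the later hops as evidence."""
    findings = []
    for index, (number, host, loss) in enumerate(hops):
        if loss == 0.0:
            continue
        later_losses = [l for _, _, l in hops[index + 1:]]
        if later_losses and min(later_losses) == 0.0:
            # A later hop answered every probe, so traffic passed through this hop.
            findings.append((number, host, "likely ICMP rate limiting"))
        else:
            findings.append((number, host, "possible real loss"))
    return findings


if __name__ == "__main__":
    # Made-up sample report for illustration only.
    sample = """\
HOST: onprem-host                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.0.1                0.0%   100    0.4   0.5   0.3   1.1   0.1
  2.|-- 169.254.92.181            20.0%   100    1.9   2.0   1.8   3.0   0.2
  3.|-- 10.0.1.25                  0.0%   100    2.1   2.2   2.0   3.4   0.2
"""
    for finding in classify_loss(parse_mtr_report(sample)):
        print(finding)
```

Loss that continues through to the final hop is the pattern that deserves further investigation. Loss at a single intermediate hop usually isn't.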

Review the path MTU between the on-premises host and AWS host (layer 3)

The maximum transmission unit (MTU) is the size of the largest packet that can be passed over the network connection. A packet that's larger than the MTU is either fragmented or, if the Don't Fragment flag is set, dropped on the interface. Therefore, packet loss can occur if the packet is too large.

Path MTU Discovery (PMTUD) determines the path MTU, which is the smallest MTU along the path between two hosts. For more information, see Path MTU Discovery.

You can check the path MTU between two hosts using tracepath.

1.    For the on-premises to AWS direction, run tracepath on port 80 from the on-premises host:

$ tracepath -n -p 80 <EC2 private instance IP>

2.    For the AWS to on-premises direction, run tracepath on port 80 from the EC2 instance:

$ tracepath -n -p 80 <private IP of local host>
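
After tracepath reports a path MTU, the arithmetic for what fits in a single unfragmented IPv4 packet can be sketched as follows. The sketch assumes minimum-size headers with no IP or TCP options, and the MTU value of 1500 is only an example input. The function names are hypothetical.

```python
# Header sizes assume IPv4 with no options and TCP with no options.
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8
TCP_HEADER_BYTES = 20


def max_icmp_payload(path_mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return path_mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def max_tcp_segment(path_mtu: int) -> int:
    """Largest TCP payload (MSS) that fits in one unfragmented packet."""
    return path_mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def fits_without_fragmentation(packet_bytes: int, path_mtu: int) -> bool:
    """Return True if a packet of this total size fits within the path MTU."""
    return packet_bytes <= path_mtu


if __name__ == "__main__":
    path_mtu = 1500  # example value; use the pmtu that tracepath reports
    print(max_icmp_payload(path_mtu))
    print(max_tcp_segment(path_mtu))
    print(fits_without_fragmentation(1501, path_mtu))
```

On Linux hosts with the iputils version of ping, one way to confirm the result is a probe such as ping -M do -s 1472 <destination>, which sets the Don't Fragment flag and sends the payload size computed above for a 1500-byte path MTU.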

Check for possible routing issues with BGP

The Direct Connect connection uses the dynamic routing protocol Border Gateway Protocol (BGP) for routing and communication between AWS and on-premises.

Check for any regular flaps in BGP that might be causing intermittent packet loss.

Check the age of the routes that the customer gateway device learned from AWS. When a route is refreshed on the customer gateway device, its age is reset in the BGP route table. Review this information to check whether the packet loss coincided with a route refresh.

To check the route age on a Cisco router, run the following command:

Router#sh ip bgp 1.1.1.1       
BGP routing table entry for 1.1.1.1/32, version 3
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Refresh Epoch 1
  64512, (received & used)
    169.254.92.181 from 169.254.92.181 (169.254.92.181)
      Origin IGP, metric 100, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0
      Updated on Mar 31 2023 08:08:00 UTC    >> Last time that the route was updated

-or-

Router#sh ip route | in 1.1.1.1
B    1.1.1.1 [20/100] via 169.254.92.181, 01:37:46   >> You can see the route age or when the route was last refreshed
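
To compare the route age with the time of the packet loss, the following Python sketch extracts the age from a route line like the one above and converts it to seconds. It assumes that the age appears either as hh:mm:ss or in a compact weeks/days/hours form such as 1d02h. Other formats raise an error and aren't guessed. The function names are hypothetical.

```python
import re

UNIT_SECONDS = {"w": 7 * 86400, "d": 86400, "h": 3600}


def age_from_route_line(line: str) -> str:
    """Extract the age token that follows 'via <next hop>,' in a route line."""
    match = re.search(r"via\s+\S+,\s+(\S+)", line)
    if not match:
        raise ValueError("no route age found in line")
    return match.group(1)


def route_age_seconds(age: str) -> int:
    """Convert 'hh:mm:ss' or a compact form such as '1d02h' to seconds."""
    clock = re.fullmatch(r"(\d+):(\d{2}):(\d{2})", age)
    if clock:
        hours, minutes, seconds = (int(part) for part in clock.groups())
        return hours * 3600 + minutes * 60 + seconds
    parts = re.findall(r"(\d+)([wdh])", age)
    if parts and "".join(number + unit for number, unit in parts) == age:
        return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)
    raise ValueError(f"unrecognized route age format: {age!r}")


def refreshed_since(age_seconds: int, seconds_since_loss: int) -> bool:
    """Return True if the route was refreshed at or after the loss began."""
    return age_seconds <= seconds_since_loss


if __name__ == "__main__":
    line = "B    1.1.1.1 [20/100] via 169.254.92.181, 01:37:46"
    age = route_age_seconds(age_from_route_line(line))
    print(age)  # 5866 seconds
    print(refreshed_since(age, 2 * 3600))
```

If the route was refreshed around the time that the loss began, then a BGP flap or route update is a plausible cause to investigate.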

If you use a hosted connection, check with your partner or service provider to find out if maintenance on their end is causing the packet loss.

Related information

Best practices for configuring network interfaces

How can I monitor packet loss and latency from AWS to an on-premises network over an internet gateway or NAT gateway?

Troubleshooting AWS Direct Connect

How can I troubleshoot Direct Connect network performance issues?

AWS OFFICIAL
Updated a year ago