An issue I’ve encountered fairly often is complaints of slow network performance, especially when transferring large files. Although many things can affect network throughput, the most common culprit I see is Large Send Offload.

Large Send Offload (also known as Large Segmentation Offload, or LSO for short) is a feature that allows the operating system’s TCP/IP network stack to build a large TCP message of up to 64KB in length before sending it to the Ethernet adapter. The hardware on the Ethernet adapter — what I’ll call the LSO engine — then segments it into smaller data packets (known as “frames” in Ethernet terminology) that can be sent over the wire: up to 1500 bytes for standard Ethernet frames and up to 9000 bytes for jumbo Ethernet frames. (The actual sizes are a bit larger to accommodate the overhead – header and frame check sequence – in each frame.) This is designed to free the server’s CPU from having to segment large TCP messages into the smaller packets required by the frame size. Sounds like a good deal. What could possibly go wrong?
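To get a feel for the numbers, here’s a back-of-the-envelope sketch of how many frames one of those 64KB messages becomes. It assumes IPv4 and TCP headers with no options (40 bytes of overhead per frame), so each frame carries MTU − 40 bytes of payload:

```shell
# A 64 KB TCP message, segmented to fit the MTU.
# Payload per frame = MTU - 40 bytes (20-byte IPv4 header + 20-byte TCP
# header, assuming no options). The +N-1 makes integer division round up.
MESSAGE=65536
echo $(( (MESSAGE + 1459) / 1460 ))   # standard frames (MSS 1460) -> 45
echo $(( (MESSAGE + 8959) / 8960 ))   # jumbo frames   (MSS 8960) -> 8
```

So a single offloaded message turns into dozens of frames on a standard-MTU network — which matters later, because every one of them has to make it through the switch.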

Quite a lot, as it turns out.  In order for this to work, the other network devices — the Ethernet switches through which all traffic flows — have to agree on the frame size. The server cannot send frames that are larger than the Maximum Transmission Unit (MTU) supported by the switches. And this is where everything can, and often does, fall apart.

The server can discover the MTU by asking the switch for the frame size, but there is no way for the server to pass this information down to the Ethernet adapter. The LSO engine has no ability to use a dynamic frame size; it simply uses its configured default, generally 1500 bytes for standard frames or 9000 bytes for jumbo frames. And if the Ethernet adapter sends a frame that is larger than the switch supports, the switch silently drops the frame. This is where network performance can drop off a cliff.
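If you suspect this kind of MTU mismatch, you can compare what the interface believes its MTU is against what the path will actually carry. A quick sketch using standard Linux tools — the interface name `eth0` and the address `192.0.2.10` are placeholders for your own:

```shell
# What MTU does the local interface believe it has?
ip link show dev eth0

# Probe the path with a full-size packet and the "don't fragment" bit set.
# 8972 bytes of ICMP payload + 8-byte ICMP header + 20-byte IP header = 9000.
# If any device along the way can't carry 9000-byte frames, this ping fails
# instead of being fragmented.
ping -c 3 -M do -s 8972 192.0.2.10
```

Repeating the ping with smaller `-s` values narrows down the largest frame the path will actually pass.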

To understand why this hits network performance so hard, let’s follow a typical large TCP message as it traverses the network between two systems.

  1. With LSO enabled, the TCP/IP network stack on the server builds a large TCP message.
  2. The server sends the large TCP message to the Ethernet adapter to be segmented by its LSO engine. Because the LSO engine cannot discover the actual MTU supported by the switch, it uses its default value.
  3. The LSO engine sends each of the frame segments that make up the large TCP message to the switch.
  4. The switch receives the frame segments, but because the LSO engine sent frames larger than the MTU, the switch silently discards them.
  5. On the remote system – the one waiting to receive the TCP message – the timeout clock reaches zero when the expected message is not received, and it sends a request for the message to be retransmitted. Although this timeout is very short in human terms, it is rather long in computer terms.
  6. The sending server receives the retransmission request and rebuilds the TCP message. But because this is a retransmission, the server does not use Large Send Offload; instead, it handles the segmentation itself. This appears to be designed to work around segmentation failures on the Ethernet adapter.
  7. The switch receives the retransmitted frames from the server – which are now the proper size, because the server can discover the correct MTU – and forwards them on toward the remote system.
  8. The remote system finally receives the TCP message intact.

This can basically be summed up as offload data, segment data, discard data, wait for timeout, request retransmission, segment retransmitted data, resend data. The big delay is waiting for the timeout on the remote system to reach zero. And the whole process is repeated the very next time a large TCP message is sent. It’s no wonder that this can cause severe network performance issues.
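To put the cost of that timeout in perspective, compare the wire time of the message with a typical retransmission timeout. Rough numbers: 10 Gb/s is just an example link speed, and 200 ms is Linux’s default minimum TCP retransmission timeout (other stacks differ, but the order of magnitude is similar):

```shell
# Time to put a 64 KB message on a 10 Gb/s wire, in microseconds:
# 65536 bytes * 8 bits, at 10,000 bits per microsecond.
echo $(( 65536 * 8 / 10000 ))   # ~52 microseconds

# A 200 ms minimum retransmission timeout is 200,000 microseconds, so
# one timeout costs thousands of times the transfer itself:
echo $(( 200000 / 52 ))
```

That ratio is why a transfer that should saturate the link instead limps along at a tiny fraction of it.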

Fortunately, both Linux and Windows have settings to disable Large Send Offload. Although that means a bit more work for the operating system and CPU, it’s far faster than waiting on a timeout and a retransmission every time a large message is sent.
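On Linux, the feature is exposed through ethtool as tcp-segmentation-offload (TSO). A sketch — the interface name `eth0` is a placeholder, and exactly which offloads your driver exposes will vary; on Windows, the equivalent knob is the Disable-NetAdapterLso PowerShell cmdlet or the Large Send Offload entries in the adapter’s driver properties:

```shell
# Show the current offload settings for the interface:
ethtool -k eth0 | grep -i offload

# Turn off TCP segmentation offload in the adapter hardware:
ethtool -K eth0 tso off

# Many systems also do generic segmentation offload (GSO) in software;
# disable it too if you want the stack to segment strictly to the MTU:
ethtool -K eth0 gso off
```

Note that `ethtool -K` changes are not persistent across reboots, so you’ll want to apply them in your distribution’s network configuration as well.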