It was a timely article for me to find because I had been going around in circles with a few different user groups who were sure that jumbo frames are the solution to all of their problems, but who are unwilling to do either the administrative work required for implementation or the analytic work to see whether there's anything to be gained.
The gist of Denton's argument is that jumbo frames are just one way of reducing the amount of work required for a server to send a large volume of data. Modern NICs and drivers have brought us easier-to-support ways of accomplishing the same result. The new techniques work even when the system we're talking to doesn't support jumbos, and they even work across intermediate links with a small MTU.
Jumbo frames reduce the server workload because larger frames mean fewer per-frame operations need to be performed. TCP offload tricks reduce workload by eliminating per-frame operations altogether. The only remaining advantage for jumbo frames is the minuscule amount of bandwidth saved from sending fewer headers in an end-to-end jumbo-enabled environment.
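To put a rough number on the header savings, here is a minimal sketch that compares the share of wire time carrying TCP payload at two IP MTUs. It assumes untagged Ethernet, IPv4 and TCP headers without options, and full-size frames; the function name `tcp_payload_efficiency` is just an illustrative choice.

```python
# Fixed per-frame costs on the wire, in bytes.
# Assumptions: untagged Ethernet, IPv4 without options, TCP without options.
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType
FCS = 4               # frame check sequence
INTERFRAME_GAP = 12   # mandatory idle time between frames
IP_HEADER = 20
TCP_HEADER = 20


def tcp_payload_efficiency(ip_mtu: int) -> float:
    """Fraction of wire bytes that carry TCP payload for full-size frames."""
    payload = ip_mtu - IP_HEADER - TCP_HEADER
    wire_bytes = ip_mtu + ETH_HEADER + FCS + PREAMBLE_SFD + INTERFRAME_GAP
    return payload / wire_bytes


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"IP MTU {mtu}: {tcp_payload_efficiency(mtu):.1%} payload")
```

Under these assumptions the gap between the two MTUs is a few percentage points of link capacity, which is the bandwidth saving that remains once offloads have removed the per-frame CPU cost.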
There's a small facet to this discussion that's been niggling at me, but I've been hesitant to bring it up because I'm not sure how significant it is.
Why not just use Jumbos?
I'm always hesitant to enable jumbo frames for a customer because it tends to be a difficult configuration to support. Sure, typing in the configuration is easy, but that configuration takes us down a non-standard rabbit hole where too few people understand the rules.
Every customer I've worked with has made mistakes in this regard. It's a support nightmare that leads to lots of trouble tickets because somebody always forgets to enable jumbos when they deploy a new server/router/switch/coffeepot.
The Rules
- All IP hosts sharing an IP subnet need to agree on the IP MTU in use on that subnet.
- All L2 gear supporting that subnet must be able to handle the largest frame one of those hosts might generate.
Rule 1 means that if you're going to enable jumbo frames on a subnet, you need to configure all systems at the same time. All servers, desktops, appliances, routers on the segment need to agree. This point is not negotiable. PMTUD won't fix things if they don't agree. Nor will TCP's MSS negotiation mechanism. Just make them match.
Rule 2 means that all switches and bridges have to support at least the largest frame. Larger is okay, smaller is not. The maximum frame size value will not be the same as the IP MTU, because it needs to take into account the L2 header.
For extra amusement, different products (even within a single vendor's product lineup) don't agree about how the MTU configuration directives are supposed to be interpreted, making the rules tough to follow.
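The two rules above can be sketched as a simple audit. This is only an illustration: the `Host` and `Switch` types and the `check_subnet` function are hypothetical, and `max_frame` is assumed to count the L2 header and FCS, which real products may or may not do.

```python
from dataclasses import dataclass

ETH_HEADER = 14  # destination MAC, source MAC, EtherType
VLAN_TAG = 4     # one 802.1Q tag, if the link is tagged
FCS = 4          # frame check sequence


@dataclass
class Host:
    name: str
    ip_mtu: int      # IP MTU configured on the host's interface


@dataclass
class Switch:
    name: str
    max_frame: int   # assumed to include L2 header and FCS


def check_subnet(hosts, switches, tagged=False):
    """Return a list of human-readable rule violations for one subnet."""
    problems = []
    if not hosts:
        return problems

    # Rule 1: every IP host on the subnet must use the same IP MTU.
    mtus = {h.ip_mtu for h in hosts}
    if len(mtus) > 1:
        problems.append(f"Rule 1: hosts disagree on IP MTU: {sorted(mtus)}")

    # Rule 2: every L2 device must carry the largest frame any host can send.
    largest_frame = max(mtus) + ETH_HEADER + FCS + (VLAN_TAG if tagged else 0)
    for s in switches:
        if s.max_frame < largest_frame:
            problems.append(
                f"Rule 2: {s.name} max frame {s.max_frame} < {largest_frame}"
            )
    return problems


if __name__ == "__main__":
    hosts = [Host("db1", 9000), Host("newserver", 1500)]
    switches = [Switch("core", 9216), Switch("closet", 1518)]
    for line in check_subnet(hosts, switches):
        print(line)
```

The forgotten new server trips Rule 1 and the closet switch trips Rule 2, which is the usual shape of the trouble tickets.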
So, what's been niggling at me?
In a modern (Nexus) Cisco data center, we push servers towards using LACP instead of active/standby redundancy. There are various reasons for this preference relating to orphan ports, optimal switching path, the relative high cost of an East/West trip across the vPC peer-link, being confident that the "standby" NIC and switchport are configured correctly, etc... LACP to the server is good news for all these reasons.
But it's bad news for another reason. While aggregate links on a switch are free because they're implemented in hardware, aggregation at the server is another story. Generally speaking, servers instantiate a virtual NIC, and apply their IP configuration to it. The virtual NIC is a software wedge between the upper stack layers and the hardware. It's not free, and it is required to process every frame/PDU/write/whatever handed down from software to hardware, and vice versa.
So, when we turn on LACP on the server, we add per-PDU software processing that wasn't there before, re-kindling the notion that larger PDUs are better. The various TCP offload features can probably be retained, and the performance of aggregate links is generally good. YMMV, check with your server OS vendor.
I'm not sure that we're forcing the server folks to take a step backwards in terms of performance, but I'm afraid that we're supplying a foothold for the pro-jumbo argument which should have ended years ago.
Switch LACP links choose the specific physical link to use by hashing packet headers. Packets on the same flow will hash to the same value and choose the same link, to keep them in order.
Server LACP links do the same thing: they hash the outgoing frames to choose a NIC. Packets from the same flow will choose the same NIC, and stay in order. That hashing can be done on a large segment just as effectively as an MTU-sized packet. For example, I find references to LSO support in Broadcom's LACP implementation so long as all of the underlying NICs support LSO.
Though LACP adds a small amount of overhead to calculate its hash function and deal with LACP protocol frames, I don't think it rekindles the need for jumbo frames. LSO appears to work with LACP.
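As a rough illustration of why segment size doesn't matter to link selection, here is a minimal sketch of header-based hashing. The CRC32-over-5-tuple hash and the `pick_link` name are assumptions made for the example; real switches and bonding drivers use their own hash inputs and functions.

```python
import zlib


def pick_link(src_ip, dst_ip, src_port, dst_port, proto, n_links):
    """Choose a member link from flow headers only (illustrative hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    # The payload length never enters the key, so a large LSO segment and
    # an MTU-sized packet of the same flow select the same link.
    return zlib.crc32(key) % n_links


if __name__ == "__main__":
    flow = ("10.0.0.5", "10.0.0.9", 49152, 443, "tcp")
    print("flow uses link", pick_link(*flow, n_links=2))
```

Because the choice depends only on the headers, the hash is computed once per unit handed down the stack, whatever its size.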
Hey Denton,
Both the Broadcom aggregate driver for Windows and the Linux bonding driver appear to support the lowest common denominator of all the links in the bundle. This could be important because (for example) HP servers sometimes come with a blend of onboard Ethernet chipsets with varying capabilities.
I never did get to the bottom of LSO support in the project CrossBow (Solaris) context. And then there's a whole universe of OSes that I don't care about.
I'm with ya on the hashing for link selection, though I'm not sure where it fits in with my per-PDU operations paranoia. IP fragments are an interesting corner case when it comes to intra-flow ordering.
I agree that the bonding driver overhead is probably small, and not worth mentioning. That's why I only bring it up here, among network friends, and not with the server guys :-)
"relative high cost of an East/West trip across the vPC peer-link"
Chris, can you explain what that means?
Will, frames which traverse the vPC peer link incur an extra header. DCE header, I think?
The extra header solves problems like "don't forward a broadcast frame towards where it came from."
In the vPC world, the receiving switch needs to know whether a frame arriving on the peer link originated from a vPC.
The header isn't all that big (16 or 20 bytes, I think), but if you're dealing with a lot of small frames, this header can represent a significant portion of the available bandwidth.
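A back-of-the-envelope sketch of that last point, taking the 16-byte and 20-byte guesses above at face value (they are unconfirmed) and ignoring preamble and interframe gap. The `extra_header_share` name is purely illustrative.

```python
def extra_header_share(frame_bytes: int, extra_bytes: int) -> float:
    """Fraction of bytes on the link taken up by the added header."""
    return extra_bytes / (frame_bytes + extra_bytes)


if __name__ == "__main__":
    # 64 and 1518 are the minimum and maximum standard Ethernet frame sizes.
    for frame in (64, 1518):
        for extra in (16, 20):  # unconfirmed header sizes from the comment
            share = extra_header_share(frame, extra)
            print(f"{frame}-byte frame, {extra}-byte header: {share:.1%}")
```

With minimum-size frames the added header is a large fraction of what crosses the link, while with full-size frames it is close to noise.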
Chris,
Thanks for the explanation. I never considered that.