Network Fabric? Photo and weave by Travis Meinolf
The Cliffs Notes version of the Brocade presentation is that they make data center network switches with Ethernet ports, and because of Brocade's storied history with SAN fabrics, doing multipath bridged Ethernet fabrics is second nature for them.
Three things have stood out about the Brocade presentations that I've seen:
- Brocade is the only vendor I've seen who makes a point of talking about power consumed per unit of bandwidth. I presume that the numbers must be compelling, or else they wouldn't bring it up, but I have not done a comparison on this point.
- Per-packet load balancing on aggregate links. This is really cool; see below.
- MLAG attachment of non-fabric switches to arbitrary fabric nodes. Also really cool; maybe I'll get around to covering it one day.
Per-packet Load Balancing on Aggregate Links
We all know that the Link Selection Algorithms (LSAs) used by aggregate links (LACP, EtherChannel, bonded interfaces... some vendors even call them trunks) choose the egress link by hashing the frame/packet header.
LSAs work this way in order to maintain ordered delivery of frames: putting every frame belonging to a particular flow onto the same egress interface ensures that the frames won't get mixed up on their way to their destination. Ordered delivery is critical, but strict flow -> link mapping means that loads rarely get balanced evenly (a sketch of the hashing scheme follows the list below). It also means that:
- You may have to play funny games with the number of links you're aggregating.
- Each flow is limited to the maximum bandwidth of a single link member.
- Fragments of too-big IP packets might get mis-ordered if your LSA uses protocol header data.
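To make the flow -> link mapping concrete, here's a minimal Python sketch of the general idea. The 5-tuple fields, hash function, and modulo scheme are my illustrative assumptions; real switch ASICs use their own proprietary hashes:

```python
import hashlib

# Hypothetical member links, named like switch ports for illustration.
LINKS = ["1/1", "1/2", "1/3", "1/4"]

def select_link(src_ip, dst_ip, src_port, dst_port, proto):
    """Flow-hash link selection: every frame of a given flow hashes
    to the same member link, which preserves per-flow ordering but
    pins each flow to a single link's worth of bandwidth."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return LINKS[int.from_bytes(digest[:4], "big") % len(LINKS)]

# Every frame of this flow lands on the same link, busy or not:
print(select_link("10.0.0.1", "10.0.0.2", 49152, 80, "tcp"))
```

Note that the modulo at the end is where the "funny games with the number of links" comes from: depending on the hash, some member counts distribute flows less evenly than others.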
Brocade's got a completely different take on this problem, and it kind of blew my mind: They do per-packet load balancing!
The following animation illustrates why per-packet load balancing is helpful:
Pay particular attention to the two frames belonging to the green flow. Don't pay attention to the aggregation's oval icon moving around and alternating transparency. I'm a better network admin than I am an animator :)
When the first green frame arrives at the left switch, only the lower link is free, so the frame is forwarded there.
When the second green frame arrives at the left switch, both transmit interfaces are busy, so it sits in a buffer for a short time, until the upper link finishes transmitting the blue frame. As soon as the upper link becomes available, the second green frame makes use of it, even though the earlier green frame used the lower link.
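Here's a toy Python model of that dispatch logic (my own reconstruction for illustration, not Brocade's implementation; the class and method names are hypothetical): each frame simply takes whichever member link becomes free first.

```python
import heapq

class PerPacketAggregate:
    """Toy model of per-packet balancing: each frame goes out on
    whichever member link frees up first, regardless of its flow."""

    def __init__(self, num_links, link_rate_bps):
        self.rate = link_rate_bps
        # Min-heap of (time the link becomes free, link index).
        self.free_at = [(0.0, i) for i in range(num_links)]
        heapq.heapify(self.free_at)

    def send(self, arrival, frame_bytes):
        free, idx = heapq.heappop(self.free_at)
        start = max(arrival, free)                 # buffer if all links busy
        done = start + frame_bytes * 8 / self.rate
        heapq.heappush(self.free_at, (done, idx))
        return idx, start

# Two green frames arriving back to back can use *different* links:
agg = PerPacketAggregate(num_links=2, link_rate_bps=1e9)
print(agg.send(0.0, 1500))   # first green frame -> one link
print(agg.send(0.0, 1500))   # second green frame -> the other link
```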
This is way cool stuff. It allows for better utilization of resources, and lower congestion-induced jitter than you'd get with other vendors' implementations.
In the example above, there's little possibility that frames belonging to any given flow will get mis-ordered because the packets are queued in order, and the various links in the aggregation have equal latency.
But what if the latency on the links isn't equal?
Now the two links in our aggregation are of different lengths, so they exhibit different latencies.
Just like in the previous example, the first green frame uses the lower link (now with extra latency) and the second green frame is queued due to congestion. But when the blue frame clears the upper interface, the green frame doesn't follow directly on its heels. Instead, the green frame sits in queue a bit longer (note the long gap between blue and green on the upper link) to ensure that it doesn't arrive at the right switch until after the earlier green frame has completely arrived.
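If I had to guess at the arithmetic behind that compensating delay (purely my reconstruction; Brocade doesn't publish the algorithm, and the function below is hypothetical), it might look something like this:

```python
def departure_time(ready, link_latency, prev_arrival):
    """Hold a frame in queue just long enough that it cannot arrive
    before the previous frame of the same flow, even when the member
    links have unequal latencies."""
    send = max(ready, prev_arrival - link_latency)  # extra queueing delay
    return send, send + link_latency                # (tx time, arrival time)

# Frame 1 of the green flow took the long lower link (10 us latency):
arrival_1 = 0.0 + 10e-6
# Frame 2 is ready at t = 2 us and gets the short upper link (2 us):
send_2, arrival_2 = departure_time(2e-6, 2e-6, arrival_1)
print(send_2, arrival_2)  # held until t = 8 us, arriving at t = 10 us
```

That held interval is the long gap between the blue and green frames on the upper link in the animation.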
Neato.
Now, does the Brocade implementation really work this way? I have no idea :) Heck, I don't even know if these are cut-through switches, but that's how I've drawn them. Even if this isn't exactly how it works, the per-packet load balancing scheme and the extra delay to compensate for mismatched latency are real, and they're really cool.
The gotcha about this stuff? All member links need to terminate on a single ASIC within each switch. You're not going to be spreading these aggregate links across line cards, so this sort of aggregation is strictly for scaling bandwidth, and not useful for guarding against failure scenarios where a whole ASIC is at risk.
Can a 4-port bundle have two ports on one ASIC and the other two ports on another ASIC?
-Farick
Hey Farick,
Yes, but the magic load balancing stuff goes out the window in that case.
Hi Chris,
In a fabric network, best practice is to connect an end device to two different fabric devices, so in most cases you will have different latencies.
Do you know if it is possible to change the per-packet LSA to a flow-based LSA on a Brocade fabric?
Brocade has huge experience with FC SANs; do you know what LSA is used in an FC fabric?
Thanks, Fabian.
Hey Fabian,
100% agree: multiple links to a single neighbor device isn't going to cut it. This per-packet balancing stuff is appropriate only where performance requires you to establish multiple parallel links to each neighbor device. In those cases, 1+1 now equals 2, where it didn't before:
http://packetpushers.net/the-scaling-limitations-of-etherchannel-or-why-11-does-not-equal-2/
Brocade aggregate links which terminate on different ASICs (cards, chassis, etc...) on either end will do their "normal" link selection / flow hashing. I don't know the specifics about the algorithm, but I'd wager that L2-L4 information would be considered.