802.11 driver/stack TX path notes
The 802.11 TX path is quirky. Here are some notes about it.
TX path
There are a few TX path entry points:
Normal data
- if_start / if_transmit - goes in via a vap, gets queued as appropriate, 802.11 state gets handled (eg sequence number assignment, padding, etc)
- .. then frames go into the driver via its transmit method;
- the driver then finishes driver-specific encapsulation and adds whatever crypto state is needed - this includes the CCMP IV.
- then the frame gets pushed onto the driver hardware or software queue as appropriate (sketched below).
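To make that concrete, here is a rough sketch of the driver-side half of that flow. All mydrv_* names are hypothetical; the one real net80211 call is ieee80211_crypto_encap(), which attaches the key and fills in the CCMP IV. This is a sketch of the pattern, not any particular driver.

    static int
    mydrv_transmit(struct ifnet *ifp, struct mbuf *m)
    {
            struct mydrv_softc *sc = ifp->if_softc;
            /* net80211 historically stashes the node in the pkthdr */
            struct ieee80211_node *ni =
                (struct ieee80211_node *)m->m_pkthdr.rcvif;
            struct ieee80211_frame *wh;
            struct ieee80211_key *k;

            /* Finish driver-specific encapsulation (descriptors, etc). */
            if (mydrv_encap(sc, ni, m) != 0) {
                    m_freem(m);
                    return (EIO);
            }

            /*
             * If the frame is protected, attach crypto state; this is
             * where the CCMP IV gets assigned.
             */
            wh = mtod(m, struct ieee80211_frame *);
            if (wh->i_fc[1] & IEEE80211_FC1_PROTECTED) {
                    k = ieee80211_crypto_encap(ni, m);
                    if (k == NULL) {
                            m_freem(m);
                            return (ENOBUFS);
                    }
            }

            /* Hardware or software queue, as appropriate. */
            return (mydrv_tx_enqueue(sc, m));
    }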
Fragments
This is the dirty bit. It occurs via ieee80211_fragment(). This breaks the mbuf into a chain of mbufs, linked via m->m_nextpkt.
However, the moment one calls if_start or if_transmit with this chain of mbuf fragments, the m_nextpkt pointer gets cleared and only the first fragment of the frame is transmitted. The rest of the fragments are never transmitted; worse, they are silently leaked (since m_nextpkt is cleared rather than the mbufs being freed).
In the past, the driver would get the 802.3 frame and it would then do the net80211 encapsulation, fragmenting and encryption. So it could loop over a list of fragments and transmit those.
This changed as part of the work to enable VAP transmission, where frames go into the VAP first and encapsulation happens at this layer. This includes fast-frames handling and fragmentation.
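For illustration, here is roughly what a fragment-aware hand-off has to do: walk the m_nextpkt chain that ieee80211_fragment() built, hand the fragments down one at a time, and free the remainder on error rather than leaking it. mydrv_tx_one() is a hypothetical per-frame transmit helper.

    static int
    mydrv_transmit_chain(struct mydrv_softc *sc, struct mbuf *m)
    {
            struct mbuf *next;
            int error;

            while (m != NULL) {
                    next = m->m_nextpkt;
                    m->m_nextpkt = NULL;    /* detach before handing down */
                    error = mydrv_tx_one(sc, m);
                    if (error != 0) {
                            /* Free the rest of the chain; don't leak it. */
                            while (next != NULL) {
                                    m = next;
                                    next = m->m_nextpkt;
                                    m_freem(m);
                            }
                            return (error);
                    }
                    m = next;
            }
            return (0);
    }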
Raw (management) frames
This path is rather annoying - if it's a "raw" frame, it goes straight to the driver. Otherwise it goes via if_transmit() to get encapsulated.
There's also an ic_raw_xmit() function; it gets used directly by some management frame transmission routines.
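For reference, the hook hangs off the ieee80211com and looks roughly like this (modulo FreeBSD version differences); a driver fills it in, and net80211's management transmit routines invoke it directly, bypassing if_transmit():

    int (*ic_raw_xmit)(struct ieee80211_node *ni, struct mbuf *m,
        const struct ieee80211_bpf_params *params);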
Power save queue
Packets get pushed onto the power save queue from the net80211 TX path. The queue gets drained (normally) by the receive path - either on a ps-poll, or on a data frame with the power management bit cleared.
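Here's a simplified sketch of that behaviour, with hypothetical names (net80211's real power save queue is per-node and has its own locking): the TX path buffers while the station sleeps; the RX path drains one frame for a ps-poll, or everything once the PM bit clears.

    struct my_psq {
            struct mbuf *head;              /* chained via m_nextpkt */
            struct mbuf *tail;
    };

    /* TX path: the station is asleep, so buffer instead of sending. */
    static void
    my_psq_enqueue(struct my_psq *q, struct mbuf *m)
    {
            m->m_nextpkt = NULL;
            if (q->tail == NULL)
                    q->head = m;
            else
                    q->tail->m_nextpkt = m;
            q->tail = m;
    }

    static struct mbuf *
    my_psq_dequeue(struct my_psq *q)
    {
            struct mbuf *m = q->head;

            if (m != NULL) {
                    q->head = m->m_nextpkt;
                    if (q->head == NULL)
                            q->tail = NULL;
                    m->m_nextpkt = NULL;
            }
            return (m);
    }

    /* RX side: one frame for a ps-poll, everything otherwise. */
    static void
    my_psq_drain(struct my_psq *q, int pspoll)
    {
            struct mbuf *m;

            while ((m = my_psq_dequeue(q)) != NULL) {
                    my_tx_one(m);           /* hypothetical driver hand-off */
                    if (pspoll)
                            break;
            }
    }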
Fast-frames staging queue
This happens via ff_transmit(), which can be called by ieee80211_ff_check() and ff_flush().
This gets called from a variety of contexts: the TX path; the TX completion path in ath(4), which calls the FF flush routine; and the ath(4) RX path, which also does this.
Serialisation issues
The big thing to realise here is that if_transmit() and if_start() do not have any kind of serialisation guarantees. Even if_start(), which uses the ifnet send queue, can and will have multiple threads calling the stack and driver start method at the same time. Overlapping _and_ preempting. Keep that in mind.
The first problem here is the lack of TX serialisation. Specifically, TX can occur from multiple thread contexts and can be preempted. This includes multiple TX sending threads (eg multiple iperf processes), the RX context (eg an RX of an ACK kick-starts the TX of the next frame), various timers/callouts (eg BAR retransmission, etc.)
There's also the raw versus non-raw TX path. Some raw frames (eg EAPOL frames, as part of WPA/WPA2 handling) have sequence numbers assigned to them and are encrypted. These need to be pushed into the TX path in the "right" order; any out-of-order handling here will cause the receiver to drop frames.
The next problem is the disconnect between net80211 TX handling and encryption. The encryption is done in the driver, not in the net80211 TX path. Thus the frames pushed into the driver have to be handled in order by the driver, otherwise the 802.11 sequence numbers and the CCMP IV sequence numbers (the PNs) can end up out of step.
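A sketch of the disconnect, with hypothetical names; the comment shows the interleaving that breaks:

    /*
     * Each counter is individually consistent, but nothing orders
     * the two steps across threads:
     *
     *   thread A: assigns seqno 10
     *   thread B: assigns seqno 11
     *   thread B: encrypts with CCMP PN 100, queues it
     *   thread A: encrypts with CCMP PN 101, queues it
     *
     * Seqnos and PNs now increase in opposite orders, so whichever
     * order the frames hit the air, one receiver-side check (PN
     * replay window, or seqno/reorder handling) sees a regression
     * and drops a frame.
     */
    static void
    tx_one_frame(struct my_tx_ctx *ctx, struct mbuf *m)
    {
            my_set_seqno(m, ctx->seqno++);     /* net80211's half */
            /* <-- unserialised window: another thread can run here */
            my_ccmp_encap(m, ctx->ccmp_pn++);  /* driver's half (IV) */
            my_hw_enqueue(ctx, m);
    }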
Solutions?
The Linux solution is to hold a lock in the TX code, so that frames get handled in order from the point where mac80211 grabs them and processes them (sequence number assignment, 802.11 state) all the way through the driver, including encryption handling.
The "better" solution is to serialise both the net80211 TX path and the driver TX path so there's a TX queue that frames get pushed into, then either:
- A single thread is allowed to continue transmitting, the others just queue and quit (sketched below);
- There's a TX taskqueue that is woken up whenever TX needs to occur.
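Here's a minimal sketch of the first option, using a plain mutex and mbufq (all my_* names are hypothetical). Because the queue-empty check and the active-flag clear both happen under the lock, this locked variant avoids the stall race described further down:

    struct my_txq {
            struct mtx      lock;
            struct mbufq    q;          /* pending frames */
            int             active;     /* a thread is transmitting */
    };

    static int
    my_tx(struct my_txq *txq, struct mbuf *m)
    {
            mtx_lock(&txq->lock);
            mbufq_enqueue(&txq->q, m);
            if (txq->active) {
                    /* Someone else is draining; they'll see our frame. */
                    mtx_unlock(&txq->lock);
                    return (0);
            }
            txq->active = 1;
            while ((m = mbufq_dequeue(&txq->q)) != NULL) {
                    mtx_unlock(&txq->lock);
                    my_tx_one(txq, m);  /* encap, crypto, hardware queue */
                    mtx_lock(&txq->lock);
            }
            txq->active = 0;
            mtx_unlock(&txq->lock);
            return (0);
    }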
The side-issues:
- Fragments still need to be handled somehow;
- The power save code can call the driver vap method directly (if the frames are already encapsulated);
- The raw / management frame API pushes frames in via the side, which should also queue frames.
There are significant issues that creep up when you do this:
- A single queue may cause issues with higher priority traffic (eg management frames, voice frames, etc.)
- When trying to implement a "single thread only can progress" model, there's a subtle race condition where the other threads think that it can't continue so it just queues frames, and the sending thread is in the process of actively wrapping up and quitting, so the work stalls until the next TX occurs. (This happens with buf_ring and some of the 10GE drivers, fwiw.)
- When trying to implement a "TX taskqueue entry" model, you have issues with increased latency decreasing TX throughput.
The problem with taskqueues
I've tried the TX taskqueue thing with ath(4), out of curiosity. (The real solution involves doing it with net80211 too, and either having a separate taskqueue for ath(4), or having the net80211 TX taskqueue call the driver directly and have the driver TX _only_ ever called from that context.)
The initial problem was a dramatic TX throughput reduction.
After tracing things a bit, I discovered the cause: when TX isn't direct dispatch, there are significant delays between TXing a frame and the taskqueue being scheduled and run. Yes, even if it's a separate taskqueue (rather than a task in the same taskqueue as the RX task).
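For context, the shape of the experiment looked something like this, using the taskqueue(9) API (the mydrv_* names are hypothetical, and sc_txtask is assumed to have been TASK_INIT'd to mydrv_tx_task at attach): if_transmit() no longer starts the hardware, it just queues the frame and pokes a task, which runs... eventually.

    /* Task handler: drain the software TX queue, serialised. */
    static void
    mydrv_tx_task(void *arg, int pending)
    {
            struct mydrv_softc *sc = arg;
            struct mbuf *m;

            while ((m = mydrv_txq_dequeue(sc)) != NULL)
                    mydrv_tx_one(sc, m);
    }

    static int
    mydrv_transmit(struct ifnet *ifp, struct mbuf *m)
    {
            struct mydrv_softc *sc = ifp->if_softc;

            mydrv_txq_enqueue(sc, m);
            /* Defer: nothing hits the hardware until the task runs. */
            taskqueue_enqueue(sc->sc_tq, &sc->sc_txtask);
            return (0);
    }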
The normal case:
- you RX a frame
- it's an ACK and it causes a TX to occur;
- the network layer calls TX, and that pushes a frame down through the stack to be transmitted;
- it will call if_transmit / if_start, and that'll cause ath_start() to run and start transmitting;
- .. if more RX frames arrive, they get processed, and further TX frames are buffered if the TX queue is full;
- TX completion gets scheduled;
- RX frame tasklet finishes and it yields to the TX completion thread;
- that completes frames, which calls ath_start() and/or the TX software queue task again to transmit more frames.
The shared RX/TX/TX completion taskqueue case:
- You RX a frame
- it's an ACK and it causes a TX to occur;
- a frame is pushed onto the queue and the TX task is woken up;
- .. but it can't run until the RX task has completed;
- So more frames are received and processed; the TX task stays there waiting to be run;
- RX completes, the TX task wakes up and bursts out frames;
- TX completion tasklet runs and completes frames;
- etc, etc.
Now, there's added latency between the TX occurring and the TX tasklet running, because they run in the same shared taskqueue context. That extra latency causes a noticeable performance dip. Very noticeable.
The shared RX/TX taskqueue case:
- You RX a frame;
- it's an ACK and causes a TX to occur;
- a frame is pushed onto the queue and the TX task is woken up;
- .. since it's the same priority as the RX task (say), it won't run until the RX task finishes.
- if the TX task is higher priority then sure, it'll preempt the RX taskqueue and run until the TX task has completed.
This is a similar problem to the shared RX/TX/TX completion taskqueue case; raising the TX task's priority just means the RX tasklet (and TX completion tasklet) get preempted instead.
However, it's not a magic bullet. Specifically, the TX taskqueue will get woken up on every packet, and if it can't do anything (eg the hardware TXQ is full) it'll just software-queue the frame and go back to sleep. That's a lot of scheduler thrashing.
Now, my ideal solution could be:
- The driver TX path would decide which node and TID queue the frame needs to be pushed onto;
- The driver TX path would just grab that lock, queue the frame and unlock;
- Then if the underlying hardware queue wasn't busy, it would schedule the TX taskqueue, otherwise it would not;
- The TX taskqueue would then be utterly responsible for the encapsulation, encryption stuff, etc, which it would do serialised (see the sketch after this list).
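Here's a sketch of that flow, all names hypothetical. Note the task has to discover which node/TID queues need servicing when it runs, and that the "is the hardware queue busy?" test is exactly where the race discussed below lives:

    static int
    mydrv_transmit(struct ifnet *ifp, struct mbuf *m)
    {
            struct mydrv_softc *sc = ifp->if_softc;
            struct my_tid *tid;

            /* Decide which node/TID software queue this frame joins. */
            tid = my_classify(sc, m);

            mtx_lock(&tid->lock);
            my_tid_enqueue(tid, m);
            mtx_unlock(&tid->lock);

            /* Only poke the task if the hardware queue isn't busy. */
            if (!my_hwq_busy(sc, tid->hwq))
                    taskqueue_enqueue(sc->sc_tq, &sc->sc_txtask);
            return (0);
    }

    /* Runs serialised: seqno assignment, encap, CCMP IV, hw queue. */
    static void
    mydrv_tx_task(void *arg, int pending)
    {
            struct mydrv_softc *sc = arg;
            struct my_tid *tid;
            struct mbuf *m;

            /* Side-issue: must discover which queues need service. */
            while ((tid = my_next_sched_tid(sc)) != NULL) {
                    mtx_lock(&tid->lock);
                    while (!my_hwq_full(sc, tid->hwq) &&
                        (m = my_tid_dequeue(tid)) != NULL) {
                            my_encap_encrypt(sc, tid, m);
                            my_hwq_push(sc, tid->hwq, m);
                    }
                    mtx_unlock(&tid->lock);
            }
    }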
This is still suboptimal:
- The TX taskqueue would need to know which software queues need to be handled when it ran;
- There's still a race window where the TX taskqueue is scheduled by various driver threads because they don't (yet) see a busy hardware queue.
The last point could be addressed by tracking how deep the hardware queue is about to be, rather than how deep the hardware queue _is_. That's just playing with fire, though: if you get any of that accounting wrong, you can screw up the queue depth figure and never handle things.
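A sketch of that "about to be" accounting, with hypothetical names: senders atomically reserve a slot before deciding whether to wake the task, so concurrent threads see the queue as it will be, not as it is. TX completion must release exactly one slot per completed frame; miss a release on any error or reset path and the counter is wrong forever, which is the playing-with-fire part.

    struct my_hwq {
            volatile u_int  pending;    /* slots claimed or in flight */
            u_int           depth_max;  /* hardware queue depth */
    };

    /* Returns 1 if a slot was reserved, 0 if the queue is (about
     * to be) full. */
    static int
    my_hwq_reserve(struct my_hwq *hwq)
    {
            if (atomic_fetchadd_int(&hwq->pending, 1) >= hwq->depth_max) {
                    /* Overshot; undo the claim. */
                    atomic_subtract_int(&hwq->pending, 1);
                    return (0);
            }
            return (1);
    }

    /* TX completion: release exactly one slot per completed frame. */
    static void
    my_hwq_complete(struct my_hwq *hwq)
    {
            atomic_subtract_int(&hwq->pending, 1);
    }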
.. And this is all just for the ath(4) TX serialisation. This doesn't at all address the net80211 TX serialisation, which includes many of the same issues here.