I ran into a problem with libnl and sequence numbers while working on my wifi scanner application. My app would subscript to the NL80211_CMD_NEW_SCAN_RESULTS and NL80211_CMD_SCHED_SCAN_RESULTS events. On receiving those events, I would send a NL80211_CMD_GET_SCAN.
Sometimes, I would start getting a -NLE_SEQ_MISMATCH on receiving the scan survey data. Once that occurred, I would receive no more scan data.
The libnl sequence numbers are described briefly here: https://www.infradead.org/~tgr/libnl/doc/core.html#core_seq_num They’re a simple way to associate a request with a response. They’re not bulletproof and do not claim to be.
The sequence numbers are tracked per socket in the private ‘struct nl_sock’. Both are initialized to time(0) in __alloc_socket().
struct nl_sock
{
...
unsigned int s_seq_next;
unsigned int s_seq_expect;
...
};
The nlmsghdr contains the sequence number. Note the same sequence number can be used multiple times. When there is more data that can fit in a single nl_msg, the data is broken across multiple nl_msg, indicated by flag NLM_F_MULTI (“Multipart message, terminated by NLMSG_DONE”).
struct nlmsghdr {
__u32 nlmsg_len; /* Length of message including header */
__u16 nlmsg_type; /* Message content */
__u16 nlmsg_flags; /* Additional flags */
__u32 nlmsg_seq; /* Sequence number */
__u32 nlmsg_pid; /* Sending process port ID */
};
The nlmsghdr->nlmsg_seq is assigned in nl_complete_msg() which is called before the nl_msg is sent to the nl_sock. The socket ‘next’ is incremented at this time.
if (nlh->nlmsg_seq == NL_AUTO_SEQ) {
nlh->nlmsg_seq = sk->s_seq_next++;
NL_DBG(3, "nl_complete_msg(%p): Increased next " \
"sequence number to %d\n",
sk, sk->s_seq_next);
}
The sequence number is checked in recvmsgs(), which is the core of libnl’s nl_msg receive handling. The recvmsgs() is responsible for calling several callbacks and for checking sequence numbers.
if (hdr->nlmsg_type == NLMSG_DONE ||
hdr->nlmsg_type == NLMSG_ERROR ||
hdr->nlmsg_type == NLMSG_NOOP ||
hdr->nlmsg_type == NLMSG_OVERRUN) {
/* We can't check for !NLM_F_MULTI since some netlink
* users in the kernel are broken. */
sk->s_seq_expect++;
NL_DBG(3, "recvmsgs(%p): Increased expected " \
"sequence number to %d\n",
sk, sk->s_seq_expect);
}
When the NLMSG_DONE is received, the expected sequence number is increased. If that DONE or ERROR aren’t received, the expected sequence number is never incremented.
The sequence number seq_next is advanced when a new nl_msg is created. The sequence number seq_expect is advanced when an incoming nl_msg is DONE or ERROR (or NOOP or OVERRUN, which I haven’t encountered yet).
The sequence number checking occurs also in recvmsgs(). In my code, I’m not using the NL_CB_SEQ_CHECK callback and leaving auto-ack mode enabled, so the sequence number checking in recvmsgs() is enforced. (As I’m still learning libnl, I was using the same pattern as the i
w
library.) Note this check happens before the DONE+ERROR check which increments the seq_expect.
/* Sequence number checking. The check may be done by
* the user, otherwise a very simple check is applied
* enforcing strict ordering */
if (cb->cb_set[NL_CB_SEQ_CHECK]) {
NL_CB_CALL(cb, NL_CB_SEQ_CHECK, msg);
/* Only do sequence checking if auto-ack mode is enabled */
} else if (!(sk->s_flags & NL_NO_AUTO_ACK)) {
NL_DBG(3, "recvmsgs(%p) : nlmsg_seq=%d s_seq_expect=%d\n",
sk, hdr->nlmsg_seq, sk->s_seq_expect);
if (hdr->nlmsg_seq != sk->s_seq_expect) {
if (cb->cb_set[NL_CB_INVALID])
NL_CB_CALL(cb, NL_CB_INVALID, msg);
else {
err = -NLE_SEQ_MISMATCH;
nl_msg_dump(msg, stdout);
goto out;
}
}
}
A simple transaction could look like:
sk->seq_next | sk->seq_expect | hdr->seq |
---|---|---|
72968 | 72968 | new nl_msg; assigned 72968; seq_next++ |
72969 | 72968 | 72968 (MULTI) |
72969 | 72968 | 72968 (MULTI) |
72969 | 72968 | 72968 (MULTI+DONE); seq match! seq_expect++ |
72969 | 72969 |
Moving on to the problem I encountered. The trigger of the problem is a 2nd CMD_NEW_SCAN_RESULTS or CMD_SCHED_SCAN_RESULTS received while already reading a CMD_GET_SCAN. The kernel doesn’t like interleaving the get-scan-results apparently so refuses with an -EBUSY error. The request increments the sk->seq_next but the error response nl_msg->seq doesn’t match the sk->seq_expect (which is tracking the previous request) and so the nl_msg is dropped before hitting the DONE check that would increment sk->seq_expect.

Once the sequence numbers get into this state, there is no exit. The nl_socket is perpetually at the wrong sequence number. The only solution is to close/re-open the socket on receiving a -NLE_SEQ_MISMATCH.
A better solution would be to avoid getting into this state in the first place. Perhaps not send a new CMD_GET_SCAN while a previous fetch is already running. I’m still tinkering with solutions.