Message sequence number mismatch in libnl.

I ran into a problem with libnl and sequence numbers while working on my wifi scanner application. My app would subscribe to the NL80211_CMD_NEW_SCAN_RESULTS and NL80211_CMD_SCHED_SCAN_RESULTS events. On receiving either event, I would send an NL80211_CMD_GET_SCAN.

Sometimes, I would start getting a -NLE_SEQ_MISMATCH on receiving the scan survey data. Once that occurred, I would receive no more scan data.

The libnl sequence numbers are described briefly here: https://www.infradead.org/~tgr/libnl/doc/core.html#core_seq_num They’re a simple way to associate a request with a response. They’re not bulletproof and do not claim to be.

The sequence numbers are tracked per socket in the private ‘struct nl_sock’, in the fields s_seq_next and s_seq_expect. Both are initialized to time(0) in __alloc_socket().

struct nl_sock
{
...
     unsigned int s_seq_next;
     unsigned int s_seq_expect;
...
};

The nlmsghdr contains the sequence number. Note the same sequence number can be used multiple times. When there is more data than can fit in a single nl_msg, the data is split across multiple nl_msgs that all carry the same sequence number. Each one is marked with the flag NLM_F_MULTI (“Multipart message, terminated by NLMSG_DONE”).

struct nlmsghdr {
        __u32           nlmsg_len;      /* Length of message including header */
        __u16           nlmsg_type;     /* Message content */
        __u16           nlmsg_flags;    /* Additional flags */
        __u32           nlmsg_seq;      /* Sequence number */
        __u32           nlmsg_pid;      /* Sending process port ID */
};

The nlmsghdr->nlmsg_seq is assigned in nl_complete_msg(), which is called before the nl_msg is sent on the nl_sock. The socket’s s_seq_next is incremented at the same time.

        if (nlh->nlmsg_seq == NL_AUTO_SEQ) { 
                nlh->nlmsg_seq = sk->s_seq_next++;
                NL_DBG(3, "nl_complete_msg(%p): Increased next " \
                           "sequence number to %d\n",
                           sk, sk->s_seq_next);
        }

The sequence number is checked in recvmsgs(), which is the core of libnl’s nl_msg receive handling. The recvmsgs() is responsible for calling several callbacks and for checking sequence numbers.


                if (hdr->nlmsg_type == NLMSG_DONE ||
                    hdr->nlmsg_type == NLMSG_ERROR ||
                    hdr->nlmsg_type == NLMSG_NOOP ||
                    hdr->nlmsg_type == NLMSG_OVERRUN) {
                        /* We can't check for !NLM_F_MULTI since some netlink
                         * users in the kernel are broken. */
                        sk->s_seq_expect++;
                        NL_DBG(3, "recvmsgs(%p): Increased expected " \
                               "sequence number to %d\n",
                               sk, sk->s_seq_expect);
                }

When the NLMSG_DONE is received, the expected sequence number is incremented. If neither a DONE nor an ERROR reaches this code, the expected sequence number is never incremented.

To summarize: seq_next is advanced when an outgoing nl_msg is completed for sending. seq_expect is advanced when an incoming nl_msg is DONE or ERROR (or NOOP or OVERRUN, which I haven’t encountered yet).

The sequence number check itself also happens in recvmsgs(). In my code, I have not set an NL_CB_SEQ_CHECK callback and I leave auto-ack mode enabled, so the default check in recvmsgs() is enforced. (As I’m still learning libnl, I was using the same pattern as the iw tool.) Note this check happens before the DONE/ERROR check that increments seq_expect.

                /* Sequence number checking. The check may be done by
                 * the user, otherwise a very simple check is applied
                 * enforcing strict ordering */
                if (cb->cb_set[NL_CB_SEQ_CHECK]) {
                        NL_CB_CALL(cb, NL_CB_SEQ_CHECK, msg);

                /* Only do sequence checking if auto-ack mode is enabled */
                } else if (!(sk->s_flags & NL_NO_AUTO_ACK)) {
                        NL_DBG(3, "recvmsgs(%p) : nlmsg_seq=%d s_seq_expect=%d\n", 
                                        sk, hdr->nlmsg_seq, sk->s_seq_expect);
                        if (hdr->nlmsg_seq != sk->s_seq_expect) {
                                if (cb->cb_set[NL_CB_INVALID])
                                        NL_CB_CALL(cb, NL_CB_INVALID, msg);
                                else {
                                        err = -NLE_SEQ_MISMATCH;
                                        nl_msg_dump(msg, stdout);
                                        goto out;
                                }
                        }
                }

A simple transaction could look like:

sk->seq_next   sk->seq_expect   hdr->seq
72968          72968            new nl_msg; assigned 72968; seq_next++
72969          72968            72968 (MULTI)
72969          72968            72968 (MULTI)
72969          72968            72968 (MULTI+DONE); seq match! seq_expect++
72969          72969

Moving on to the problem I encountered. The trigger is a second CMD_NEW_SCAN_RESULTS or CMD_SCHED_SCAN_RESULTS event received while I am still reading the results of a CMD_GET_SCAN. My handler sends another CMD_GET_SCAN, and the kernel apparently doesn’t like interleaving get-scan requests, so it refuses with an -EBUSY error. The new request increments sk->seq_next, but the sequence number of the error response doesn’t match sk->seq_expect (which is still tracking the previous request), so the nl_msg is dropped before it reaches the DONE/ERROR check that would increment sk->seq_expect.
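A toy model of the two counters may make the failure easier to follow. This is only a sketch and not libnl code: struct toy_sock, toy_send(), toy_recv() and the TOY_* message types are hypothetical stand-ins for struct nl_sock, nl_complete_msg() and the default check in recvmsgs(). The message order in toy_demo_stuck() is my assumption about what arrives when the second request overlaps the first.

```c
/* Toy stand-ins for libnl internals; every name here is hypothetical. */
enum toy_type { TOY_DATA, TOY_DONE, TOY_ERROR };

struct toy_sock {
    unsigned int seq_next;   /* models sk->s_seq_next */
    unsigned int seq_expect; /* models sk->s_seq_expect */
};

/* Models nl_complete_msg(): assign a sequence number, advance seq_next. */
static unsigned int toy_send(struct toy_sock *sk)
{
    return sk->seq_next++;
}

/* Models the default check in recvmsgs(). Returns 0 if the message is
 * accepted, -1 on a mismatch. The mismatch path returns before the
 * DONE/ERROR increment, just like the "goto out" in the real code. */
static int toy_recv(struct toy_sock *sk, unsigned int seq, enum toy_type type)
{
    if (seq != sk->seq_expect)
        return -1;
    if (type == TOY_DONE || type == TOY_ERROR)
        sk->seq_expect++;
    return 0;
}

/* One request answered by a three-part dump. Returns 1 if every part was
 * accepted and the counters are level again afterwards. */
static int toy_demo_normal(void)
{
    struct toy_sock sk = { 72968, 72968 };
    unsigned int seq = toy_send(&sk);
    int rc = 0;

    rc |= toy_recv(&sk, seq, TOY_DATA);
    rc |= toy_recv(&sk, seq, TOY_DATA);
    rc |= toy_recv(&sk, seq, TOY_DONE);
    return rc == 0 && sk.seq_next == sk.seq_expect;
}

/* A second request is sent while the first dump is still being read.
 * Returns the number of rejected messages. */
static int toy_demo_stuck(void)
{
    struct toy_sock sk = { 72968, 72968 };
    int rejected = 0;

    unsigned int first = toy_send(&sk);   /* GET_SCAN #1 */
    toy_recv(&sk, first, TOY_DATA);       /* first part of the dump */

    unsigned int second = toy_send(&sk);  /* GET_SCAN #2, sent too early */
    if (toy_recv(&sk, second, TOY_ERROR) < 0)
        rejected++;                       /* the -EBUSY reply is dropped */

    toy_recv(&sk, first, TOY_DONE);       /* dump #1 ends, seq_expect++ */

    /* seq_expect is now one behind every sequence number still to come. */
    for (int i = 0; i < 3; i++) {
        unsigned int seq = toy_send(&sk);
        if (toy_recv(&sk, seq, TOY_DONE) < 0)
            rejected++;
    }
    return rejected;
}
```

In this model toy_demo_normal() returns 1, while toy_demo_stuck() rejects the -EBUSY reply and then every one of the three later replies. Nothing in the receive path can close the gap, because seq_expect only advances after a successful match.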

Once the sequence numbers get into this state, there is no exit. The nl_socket is perpetually at the wrong sequence number. The only solution is to close/re-open the socket on receiving a -NLE_SEQ_MISMATCH.

A better solution would be to avoid getting into this state in the first place, perhaps by not sending a new CMD_GET_SCAN while a previous fetch is still running. I’m still tinkering with solutions.
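One way that idea could look is a small guard around the request. This is a sketch under assumptions, not something I have verified against the kernel: struct scan_fetch_state and all three functions are hypothetical names, and send_get_scan() only counts calls where real code would build and send the NL80211_CMD_GET_SCAN message. In a libnl program, on_fetch_finished() would presumably be called from the NL_CB_FINISH callback and from the error handler.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping; none of these names come from libnl or nl80211. */
struct scan_fetch_state {
    bool fetch_running;   /* a GET_SCAN dump is currently in flight */
    bool refetch_pending; /* a results event arrived while fetching */
    int  requests_sent;   /* stand-in counter for real sends */
};

/* Stand-in for building and sending NL80211_CMD_GET_SCAN. */
static void send_get_scan(struct scan_fetch_state *st)
{
    st->requests_sent++;
    st->fetch_running = true;
}

/* Call on NEW_SCAN_RESULTS or SCHED_SCAN_RESULTS. If a fetch is already
 * running, only remember that another one is wanted. */
static void on_scan_results_event(struct scan_fetch_state *st)
{
    if (st->fetch_running) {
        st->refetch_pending = true;
        return;
    }
    send_get_scan(st);
}

/* Call when the current dump ends, whether by DONE or by ERROR. Starts the
 * deferred fetch, if any, only after the previous one has finished. */
static void on_fetch_finished(struct scan_fetch_state *st)
{
    st->fetch_running = false;
    if (st->refetch_pending) {
        st->refetch_pending = false;
        send_get_scan(st);
    }
}
```

With this guard, any number of events arriving during a fetch collapse into a single follow-up request, so at most one CMD_GET_SCAN is outstanding at a time and the two counters should never drift apart.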
