Category Archives: Uncategorized

Message sequence number mismatch in libnl.

I ran into a problem with libnl and sequence numbers while working on my wifi scanner application. My app would subscript to the NL80211_CMD_NEW_SCAN_RESULTS and NL80211_CMD_SCHED_SCAN_RESULTS events. On receiving those events, I would send a NL80211_CMD_GET_SCAN.

Sometimes, I would start getting a -NLE_SEQ_MISMATCH on receiving the scan survey data. Once that occurred, I would receive no more scan data.

The libnl sequence numbers are described briefly here: https://www.infradead.org/~tgr/libnl/doc/core.html#core_seq_num They’re a simple way to associate a request with a response. They’re not bulletproof and do not claim to be.

The sequence numbers are tracked per socket in the private ‘struct nl_sock’. Both are initialized to time(0) in __alloc_socket().

struct nl_sock
{
...
     unsigned int s_seq_next;
     unsigned int s_seq_expect;
...
};

The nlmsghdr contains the sequence number. Note the same sequence number can be used multiple times. When there is more data that can fit in a single nl_msg, the data is broken across multiple nl_msg, indicated by flag NLM_F_MULTI (“Multipart message, terminated by NLMSG_DONE”).

struct nlmsghdr {
        __u32           nlmsg_len;      /* Length of message including header */
        __u16           nlmsg_type;     /* Message content */
        __u16           nlmsg_flags;    /* Additional flags */
        __u32           nlmsg_seq;      /* Sequence number */
        __u32           nlmsg_pid;      /* Sending process port ID */
};

The nlmsghdr->nlmsg_seq is assigned in nl_complete_msg() which is called before the nl_msg is sent to the nl_sock. The socket ‘next’ is incremented at this time.

        if (nlh->nlmsg_seq == NL_AUTO_SEQ) { 
                nlh->nlmsg_seq = sk->s_seq_next++;
                NL_DBG(3, "nl_complete_msg(%p): Increased next " \
                           "sequence number to %d\n",
                           sk, sk->s_seq_next);
        }

The sequence number is checked in recvmsgs(), which is the core of libnl’s nl_msg receive handling. The recvmsgs() is responsible for calling several callbacks and for checking sequence numbers.


                if (hdr->nlmsg_type == NLMSG_DONE ||
                    hdr->nlmsg_type == NLMSG_ERROR ||
                    hdr->nlmsg_type == NLMSG_NOOP ||
                    hdr->nlmsg_type == NLMSG_OVERRUN) {
                        /* We can't check for !NLM_F_MULTI since some netlink
                         * users in the kernel are broken. */
                        sk->s_seq_expect++;
                        NL_DBG(3, "recvmsgs(%p): Increased expected " \
                               "sequence number to %d\n",
                               sk, sk->s_seq_expect);
                }

When the NLMSG_DONE is received, the expected sequence number is increased. If that DONE or ERROR aren’t received, the expected sequence number is never incremented.

The sequence number seq_next is advanced when a new nl_msg is created. The sequence number seq_expect is advanced when an incoming nl_msg is DONE or ERROR (or NOOP or OVERRUN, which I haven’t encountered yet).

The sequence number checking occurs also in recvmsgs(). In my code, I’m not using the NL_CB_SEQ_CHECK callback and leaving auto-ack mode enabled, so the sequence number checking in recvmsgs() is enforced. (As I’m still learning libnl, I was using the same pattern as the iw library.) Note this check happens before the DONE+ERROR check which increments the seq_expect.

                /* Sequence number checking. The check may be done by
                 * the user, otherwise a very simple check is applied
                 * enforcing strict ordering */
                if (cb->cb_set[NL_CB_SEQ_CHECK]) {
                        NL_CB_CALL(cb, NL_CB_SEQ_CHECK, msg);

                /* Only do sequence checking if auto-ack mode is enabled */
                } else if (!(sk->s_flags & NL_NO_AUTO_ACK)) {
                        NL_DBG(3, "recvmsgs(%p) : nlmsg_seq=%d s_seq_expect=%d\n", 
                                        sk, hdr->nlmsg_seq, sk->s_seq_expect);
                        if (hdr->nlmsg_seq != sk->s_seq_expect) {
                                if (cb->cb_set[NL_CB_INVALID])
                                        NL_CB_CALL(cb, NL_CB_INVALID, msg);
                                else {
                                        err = -NLE_SEQ_MISMATCH;
                                        nl_msg_dump(msg, stdout);
                                        goto out;
                                }
                        }
                }

A simple transaction could look like:

sk->seq_nextsk->seq_expecthdr->seq
7296872968new nl_msg; assigned 72968; seq_next++
729697296872968 (MULTI)
729697296872968 (MULTI)
729697296872968 (MULTI+DONE); seq match! seq_expect++
7296972969

Moving on to the problem I encountered. The trigger of the problem is a 2nd CMD_NEW_SCAN_RESULTS or CMD_SCHED_SCAN_RESULTS received while already reading a CMD_GET_SCAN. The kernel doesn’t like interleaving the get-scan-results apparently so refuses with an -EBUSY error. The request increments the sk->seq_next but the error response nl_msg->seq doesn’t match the sk->seq_expect (which is tracking the previous request) and so the nl_msg is dropped before hitting the DONE check that would increment sk->seq_expect.

Once the sequence numbers get into this state, there is no exit. The nl_socket is perpetually at the wrong sequence number. The only solution is to close/re-open the socket on receiving a -NLE_SEQ_MISMATCH.

A better solution would be to avoid getting into this state in the first place. Perhaps not send a new CMD_GET_SCAN while a previous fetch is already running. I’m still tinkering with solutions.

Building Kismet under Fedora

I’m learning Kismet https://www.kismetwireless.comĀ  I need a way to do remote wireless surveys. If I can send non-technical $customer a Magic Box that will send me a survey. WiFi problems are very environmental so having a second opinion of the environment will help me get my head around the issues.

I’m building Kismet from their git repo. The newer versions have a web interface that seems very enticing. With an internet connected device and a nice VPN overlay, I think I could get the remote information I need.

Kismet has instructions to build under Ubuntu. But for $reasons I’m running Fedora. I’ve been tinkering around with the list of required packages and I think I’ve hit the magic list:

sudo dnf install make automake gcc gcc-c++ kernel-devel git libmicrohttpd-devel pkg-config zlib-devel libnl3-devel libcap-devel libpcap-devel ncurses-devel NetworkManager-libnm-devel libdwarf libdwarf-devel elfutils-devel libsqlite3x-devel protobuf-devel protobuf-c-devel protobuf-compiler protobuf-c-compiler lm_sensors-devel libusb-devel fftw-devel

with some help from https://unix.stackexchange.com/questions/1338/what-is-the-fedora-equivalent-of-the-debian-build-essential-package

 

The authenticity of host ‘X’ can’t be established.

The authenticity of host 'nn.nn.nn.nn (nn.nn.nn.nn)' can't be established.
ECDSA key fingerprint is SHA256:wls31bhGHFgGLYT403xsznbNS53Gzjlwrda5v7OUEZ4.
Are you sure you want to continue connecting (yes/no)?

Um. Wut?

This message always annoys me. I don’t like blindly accepting unknown connections.

From the root shell of my linode instance:

[root@lixxxx-xxx ssh]# ssh-keygen -l -f ssh_host_ecdsa_key
256 SHA256:wls31bhGHFgGLYT403xsznbNS53Gzjlwrda5v7OUEZ4 ssh_host_ecdsa_key.pub ()

Matches!

Now I’ll write this down so I can remember how to verify this next time.