Fun with Python Regex

I’m a computer language nerd. I like programming languages. I’m in no way an expert. I do enjoy digging into new languages and even digging into the low level jiggery-pokery of languages I use day-to-day. Like C. But US$200 for the C standard language spec? Are you kidding me? No. Digging around on the committee website I found a draft: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf

My first pass at the C enum parser is incredibly simple. I don’t want to write a full language parser because (a) that’s a lot of work and (b) it’s already been done before. I want something that would take a couple days tops to create and let me get back to fiddling with nl80211.

I’m using Python for the parser because I know Python pretty well. I’ve tinkered with the the C++ std::regex library as well but not for this project.

C comments can be either block comments surrounded by /*  */  or line comments starting with //.  For my toy parser, I’m handling comments very naively.

I’m mostly interested in the nl80211.h so I’m making some assumptions about the code format as I’m puttering with regex.

A C enum BNF-ish is (from the above pdf):

Screen Shot 2018-12-09 at 9.05.57 AM

I’m amusing myself with WordPress’ text formatting. I get distracted way too easily. (I dug further into the doc to find the definition of enumeration-constant.)

enum-specifier:

enum identifieropt { enumerator-list }
enum identifieropt { enumerator-list , }
enum identifier

enumerator-list:

enumerator
enumerator-list , enumerator

enumerator:

enumeration-constant
enumeration-constant = constant-expression

enumeration-constant:

identifier

Focus, Dave. Focus. HTML is just another language and it’s easy to get distracted. The break between LHS and RHS above is driving me nuts but I will move on. Focus!

A C identifier can be described by the Python regex “[a-zA-Z_][a-zA-Z_0-9]*”  Python regex whitespace is “\s” Required whitespace would be “\s+”. My first naive regex that searched for a starting enum: “enum\s+([a-zA-Z_][a-zA-Z_0-9]*)\s+{”  I’m assuming the enum+identifier+openbrace is on the same line.

The enumerator-list is another regex but more complicated because of the optional expression. I started with: “([a-zA-Z_][a-zA-Z_0-9]*)\s+,”  for a simple match. The constant-expression match would be “([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*([a-zA-Z_0-9]*)?,” and the copy paste started getting on my nerves.

I started fiddling with a Python printf-y .format() and a stumbled across a brain blast. Python f-strings are amazing when used with regex. Instead of trying to build a .format() or a %s block, I can assign my regex to a var. And I have a very readable regex. I can build up my regex piece by piece (for greater or for ill).

# C-style variable name
identifier = "[a-zA-Z_][a-zA-Z_0-9]*"

# using f strings to save myself some confusion 
open_brace = "\{"
close_brace = "\}"
whitespace = "\s+"
number = "-?[0-9]+"
operator = "(?P<operator>\+|-|<<)" # XXX subset of actual C operators

# I'm very sure this is not the proper use of the term 'atom'
# atom := number | identifier 
atom = f"(?:{identifier}|{number})"

# expression := atom
# := atom operator atom 
expression = f"({atom}({whitespace}{operator}{whitespace}{atom})?)"

# enum member regex
symbol_matcher = re.compile(f"(?P<identifier>{identifier})({whitespace}={whitespace}(?P<expression>{expression}))?")

# start of an emum declaration (XXX assumes open brace on same line as the
# 'enum' keywoard
enum_matcher = re.compile(f"enum{whitespace}({identifier}){whitespace}{open_brace}")

 

The f-string uses variables from Python’s context. So f”{identifier}{whitespace}{operator}{whitespace}” will expand to “[a-zA-Z_][a-zA-Z_0-9]*\s+(\+|-|<<)\s+” The f-string is much easier to read. The ?P<name> is a Python regex feature that stores the grouped regex expression into a key “name”.

s = "NL80211_NAN_FUNC_ATTR_MAX = NUM_NL80211_NAN_FUNC_ATTR - 1"
robj = symbol_matcher.search(s)
print(robj)
print(robj.groups())
print(robj.groupdict())

The code snippet gives me the following. Definitely need a lot of testing.

<_sre.SRE_Match object; span=(0, 57), match='NL80211_NAN_FUNC_ATTR_MAX = NUM_NL80211_NAN_FUNC_>
('NL80211_NAN_FUNC_ATTR_MAX', ' = NUM_NL80211_NAN_FUNC_ATTR - 1', 'NUM_NL80211_NAN_FUNC_ATTR - 1', 'NUM_NL80211_NAN_FUNC_ATTR - 1', ' - 1', '-')
{'identifier': 'NL80211_NAN_FUNC_ATTR_MAX', 'expression': 'NUM_NL80211_NAN_FUNC_ATTR - 1', 'operator': '-'}

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s