I’m a computer language nerd. I like programming languages. I’m in no way an expert. I do enjoy digging into new languages and even digging into the low level jiggery-pokery of languages I use day-to-day. Like C. But US$200 for the C standard language spec? Are you kidding me? No. Digging around on the committee website I found a draft: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
My first pass at the C enum parser is incredibly simple. I don’t want to write a full language parser because (a) that’s a lot of work and (b) it’s already been done before. I want something that would take a couple days tops to create and let me get back to fiddling with nl80211.
I’m using Python for the parser because I know Python pretty well. I’ve tinkered with the C++ std::regex library as well, but not for this project.
C comments can be either block comments surrounded by /* */ or line comments starting with //. For my toy parser, I’m handling comments very naively.
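For illustration, here is a minimal sketch of what "very naive" comment handling could look like. It is not the code from this project; the names comment_matcher, strip_comments, and sample are made up for the example. It assumes no string literal in the header contains "/*" or "//", which is a fair bet for a header full of enums but wrong for C in general.

```python
import re

# Block comments: /* ... */, non-greedy, allowed to span lines (re.DOTALL).
# Line comments: // up to the end of the line.
# Naive on purpose: comment markers inside string literals are stripped too.
comment_matcher = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def strip_comments(source):
    """Replace every C comment in source with a single space."""
    return comment_matcher.sub(" ", source)


sample = """enum demo {
    DEMO_A, /* first
               member */
    DEMO_B, // second
};
"""

print(strip_comments(sample))
```

Replacing each comment with a space (rather than nothing) keeps tokens on either side of a comment from being glued together.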
I’m mostly interested in nl80211.h, so I’m making some assumptions about the code format as I putter with regex.
A C enum BNF-ish is (from the above pdf):
I’m amusing myself with WordPress’ text formatting. I get distracted way too easily. (I dug further into the doc to find the definition of enumeration-constant.)
enum-specifier:
    enum identifier_opt { enumerator-list }
    enum identifier_opt { enumerator-list , }
    enum identifier
enumerator-list:
    enumerator
    enumerator-list , enumerator
enumerator:
    enumeration-constant
    enumeration-constant = constant-expression
enumeration-constant:
    identifier
Focus, Dave. Focus. HTML is just another language and it’s easy to get distracted. The break between LHS and RHS above is driving me nuts but I will move on. Focus!
A C identifier can be described by the Python regex “[a-zA-Z_][a-zA-Z_0-9]*”. Python regex whitespace is “\s”; required whitespace would be “\s+”. My first naive regex to find the start of an enum: “enum\s+([a-zA-Z_][a-zA-Z_0-9]*)\s+{”. I’m assuming the enum keyword, the identifier, and the open brace are all on the same line.
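A quick sanity check of that opener regex, as a sketch. The variable names enum_start, line, and match are mine, and I've escaped the brace to be explicit about it being a literal. The second search shows one thing the naive pattern gives up: an anonymous enum has no identifier, so it never matches.

```python
import re

# C-style identifier: a letter or underscore, then letters, digits, underscores.
identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"

# 'enum', required whitespace, a captured identifier, required whitespace, '{'.
enum_start = re.compile(r"enum\s+(" + identifier + r")\s+\{")

line = "enum nl80211_commands {"
match = enum_start.search(line)
print(match.group(1))

# Anonymous enum: no identifier between 'enum' and '{', so no match.
print(enum_start.search("enum {"))
```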
The enumerator-list is another regex but more complicated because of the optional expression. I started with: “([a-zA-Z_][a-zA-Z_0-9]*)\s+,” for a simple match. The constant-expression match would be “([a-zA-Z_][a-zA-Z_0-9]*)\s*=\s*([a-zA-Z_0-9]*)?,” and the copy paste started getting on my nerves.
I started fiddling with a Python printf-y .format() and stumbled across a brain blast. Python f-strings are amazing when used with regex. Instead of trying to build a .format() or a %s block, I can assign each piece of regex to a variable and end up with a very readable regex. I can build up my regex piece by piece (for greater or for ill).
import re

# C-style variable name
identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"

# using f strings to save myself some confusion
# (raw strings so the backslashes reach the regex engine untouched)
open_brace = r"\{"
close_brace = r"\}"
whitespace = r"\s+"
number = r"-?[0-9]+"
operator = r"(?P<operator>\+|-|<<)"  # XXX subset of actual C operators

# I'm very sure this is not the proper use of the term 'atom'
# atom := number | identifier
atom = f"(?:{identifier}|{number})"

# expression := atom
#            := atom operator atom
expression = f"({atom}({whitespace}{operator}{whitespace}{atom})?)"

# enum member regex
symbol_matcher = re.compile(f"(?P<identifier>{identifier})({whitespace}={whitespace}(?P<expression>{expression}))?")

# start of an enum declaration (XXX assumes open brace on same line as the
# 'enum' keyword)
enum_matcher = re.compile(f"enum{whitespace}({identifier}){whitespace}{open_brace}")
The f-string uses variables from Python’s context. So f“{identifier}{whitespace}{operator}{whitespace}” will expand to “[a-zA-Z_][a-zA-Z_0-9]*\s+(?P<operator>\+|-|<<)\s+”. The f-string is much easier to read. The (?P<name>...) syntax is a Python regex feature, a named group: whatever the group matches is stored under the key “name” and can be read back with group(“name”) or groupdict().
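To see the expansion for yourself, here is a small sketch that builds the same fragment and prints it. The variable names pattern and match are just for this example; the test string is the right-hand side of a real nl80211.h enumerator.

```python
import re

identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"
whitespace = r"\s+"
operator = r"(?P<operator>\+|-|<<)"

# The f-string substitutes each variable's text into the final pattern.
pattern = f"{identifier}{whitespace}{operator}{whitespace}"
print(pattern)

# The named group makes the matched operator easy to pull back out.
match = re.search(pattern, "NUM_NL80211_NAN_FUNC_ATTR - 1")
print(match.group("operator"))
print(match.groupdict())
```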
s = "NL80211_NAN_FUNC_ATTR_MAX = NUM_NL80211_NAN_FUNC_ATTR - 1"
robj = symbol_matcher.search(s)
print(robj)
print(robj.groups())
print(robj.groupdict())
The code snippet gives me the following. Definitely need a lot of testing.
<_sre.SRE_Match object; span=(0, 57), match='NL80211_NAN_FUNC_ATTR_MAX = NUM_NL80211_NAN_FUNC_>
('NL80211_NAN_FUNC_ATTR_MAX', ' = NUM_NL80211_NAN_FUNC_ATTR - 1', 'NUM_NL80211_NAN_FUNC_ATTR - 1', 'NUM_NL80211_NAN_FUNC_ATTR - 1', ' - 1', '-')
{'identifier': 'NL80211_NAN_FUNC_ATTR_MAX', 'expression': 'NUM_NL80211_NAN_FUNC_ATTR - 1', 'operator': '-'}
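As a rough sketch of where this could go next, the two matchers can be strung together into a line-by-line scan. This is my guess at a next step, not the finished parser: parse_enums, close_matcher, and the demo_attrs sample are hypothetical, and the loop assumes comments are already stripped, one enumerator per line, and the closing brace on its own line.

```python
import re

identifier = r"[a-zA-Z_][a-zA-Z_0-9]*"
open_brace = r"\{"
close_brace = r"\}"
whitespace = r"\s+"
number = r"-?[0-9]+"
operator = r"(?P<operator>\+|-|<<)"
atom = f"(?:{identifier}|{number})"
expression = f"({atom}({whitespace}{operator}{whitespace}{atom})?)"
symbol_matcher = re.compile(
    f"(?P<identifier>{identifier})"
    f"({whitespace}={whitespace}(?P<expression>{expression}))?"
)
enum_matcher = re.compile(f"enum{whitespace}({identifier}){whitespace}{open_brace}")
close_matcher = re.compile(close_brace)  # hypothetical helper for this sketch


def parse_enums(source):
    """Return {enum name: [(member name, expression text or None), ...]}."""
    enums = {}
    current = None  # member list of the enum being read, or None if outside one
    for line in source.splitlines():
        if current is None:
            start = enum_matcher.search(line)
            if start:
                current = enums.setdefault(start.group(1), [])
            continue
        if close_matcher.search(line):
            current = None
            continue
        member = symbol_matcher.search(line)
        if member:
            current.append((member.group("identifier"), member.group("expression")))
    return enums


sample = """
enum demo_attrs {
    DEMO_ATTR_UNSPEC,
    DEMO_ATTR_FLAGS = 1 << 4,
    NUM_DEMO_ATTR,
    DEMO_ATTR_MAX = NUM_DEMO_ATTR - 1
};
"""

for name, members in parse_enums(sample).items():
    print(name, members)
```

The expressions are kept as text; evaluating them (so DEMO_ATTR_MAX gets a number) is a separate job.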