
Using Regular Expressions In Python To Determine C++ Functions And Their Parameters

So I'm doing something wrong in this python script, but it's becoming convoluted and I'm losing sight of what I'm doing wrong. I want a script to go through a file, find all the fu

Solution 1:

The grammar of C++ is far too complex to be handled by simple regular expressions. You'll need at least a minimal parser. I've found that for restricted cases, where I'm not concerned with C++ in general but only with my own style, I can often get away with a flex-based tokenizer and a simple state machine. This will fail in many cases of legal C++: for starters, of course, if someone uses the preprocessor to modify the syntax, but also because < can have different meanings depending on whether what precedes it names a template or not. But it's often adequate for a specific job.
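As a rough illustration of the "tokenizer plus simple state machine" idea, here is a minimal Python sketch (not flex, and not the answerer's actual tool). The token pattern and the helper name split_declaration are assumptions made up for this example. It only handles a single, plainly written declaration, and it treats every < as opening a template argument list, which is exactly the kind of shortcut that breaks on general C++.

```python
import re

# Hypothetical token pattern: a (possibly qualified) identifier,
# or any single non-space character such as ( ) < > , & *
TOKEN = re.compile(r"[A-Za-z_][\w:]*|\S")


def split_declaration(decl):
    """Return (return_type, name, [parameter strings]) for one simple declaration."""
    tokens = TOKEN.findall(decl)
    open_paren = tokens.index("(")           # first "(" starts the parameter list
    name = tokens[open_paren - 1]            # token just before it is the function name
    return_type = " ".join(tokens[:open_paren - 1])

    params, current, angle_depth = [], [], 0
    for tok in tokens[open_paren + 1:]:
        # State: how many template argument lists we are currently inside.
        if tok == "<":
            angle_depth += 1
        elif tok == ">":
            angle_depth -= 1
        if tok == ")" and angle_depth == 0:
            break                             # end of the parameter list
        if tok == "," and angle_depth == 0:
            params.append(" ".join(current))  # a top-level comma ends a parameter
            current = []
        else:
            current.append(tok)
    if current:
        params.append(" ".join(current))
    return return_type, name, params


decl = ("std::string foobar(std::vector<int> &A, "
        "std::map<std::string, std::vector<std::string> > &B)")
return_type, name, params = split_declaration(decl)
print(return_type, name, params)
```

Because each > is its own token, a closing >> is counted as two closers. Default arguments containing comparisons, operator< declarations, and macros would all confuse this sketch.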

Solution 2:

I've used a PEG parser with great success when trying to do simple format parsing. pyPEG is a very simple implementation of such a parser written in Python.

Example Python code for C++ function parser:

EDIT: Address template parameters. Tested with input from SK-logic and output is correct.

import pyPEG
from pyPEG import parseLine
import re

def symbol(): return re.compile(r"[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ&*][\w:]+")
def type(): return symbol
def functionName(): return symbol
def templatedType(): return symbol, "<", -1, [templatedType, symbol, ","], ">"
def parameter(): return [templatedType, type], symbol
def template(): return "<", -1, [symbol, template], ">"
def function(): return [type, templatedType], functionName, -1, template, "(", -1, [",", parameter], ")"
# -1 -> zero or more repetitions.


sourceCode = "std::string foobar(std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B)"
results = parseLine(sourceCode, function(), [], packrat=True)

When this is executed, results is:

([(u'type', [(u'symbol', 'std::string')]), (u'functionName', [(u'symbol', 'foobar')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'int')]), (u'symbol', '&A')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::map'), (u'symbol', 'std::string'), (u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'std::string')])]), (u'symbol', '&B')])], '')
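The nested tuples above are easier to use once flattened. The following sketch walks that structure, copied from the output above with the Python 2 u prefixes dropped, and rebuilds readable strings. The helper names render and summarize are hypothetical and not part of pyPEG. It assumes Python 3 and that every node is a (name, value) pair whose value is either a string or a list of child nodes, as in the output shown.

```python
# The parse result shown above, reformatted across lines.
results = (
    [
        ("type", [("symbol", "std::string")]),
        ("functionName", [("symbol", "foobar")]),
        ("parameter", [
            ("templatedType", [("symbol", "std::vector"), ("symbol", "int")]),
            ("symbol", "&A"),
        ]),
        ("parameter", [
            ("templatedType", [
                ("symbol", "std::map"),
                ("symbol", "std::string"),
                ("templatedType", [("symbol", "std::vector"), ("symbol", "std::string")]),
            ]),
            ("symbol", "&B"),
        ]),
    ],
    "",
)


def render(node):
    """Turn one (name, value) node back into C++-like text."""
    name, value = node
    if isinstance(value, str):
        return value                      # leaf: the matched symbol text
    parts = [render(child) for child in value]
    if name == "templatedType":
        # First child is the template name, the rest are its arguments.
        return parts[0] + "<" + ", ".join(parts[1:]) + ">"
    return " ".join(parts)


def summarize(tree):
    """Collect return type, function name, and parameters from the top-level nodes."""
    out = {"return_type": None, "name": None, "parameters": []}
    for node in tree:
        kind = node[0]
        if kind in ("type", "templatedType"):
            out["return_type"] = render(node)
        elif kind == "functionName":
            out["name"] = render(node)
        elif kind == "parameter":
            out["parameters"].append(render(node))
    return out


tree, rest = results
summary = summarize(tree)
print(summary)
```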

Solution 3:

C++ cannot really be parsed by a (sane) regular expression: regular expressions become a nightmare as soon as nesting is involved.
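A small Python demonstration of the nesting problem, using a parameter list with nested templates. Both "naive" approaches here are deliberately simplistic examples written for this illustration, not anything from the question's script.

```python
import re

params = "std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B"

# Naive approach 1: split on every comma.
# The comma inside std::map<...> is treated as a parameter separator.
naive_split = [p.strip() for p in params.split(",")]
print(len(naive_split))   # 3 pieces, although there are only 2 parameters

# Naive approach 2: capture template arguments up to the first '>'.
# The match stops at the '>' that closes the inner std::vector, not std::map.
naive_template = re.search(r"std::map<([^>]*)>", params).group(1)
print(naive_template)     # std::string, std::vector<std::string
```

Getting this right requires counting bracket depth, which is the point at which a tokenizer or a real parser becomes the simpler tool.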

There is another concern too, determining when to parse and when not to. A function may be declared:

  • at file scope
  • in a namespace
  • in a class

The last two can be nested to arbitrary depth.

I would propose using Clang here. It's a real C++ front-end with a full-featured parser, and there are:

  • a C API, with (notably) an API to the Indexing Library
  • Python bindings on top of the C API

The C API and Python bindings are far from fully exposing the underlying C++ model, but for a task as simple as listing functions it should be enough.
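A minimal sketch of what listing functions with the Python bindings might look like, assuming the clang.cindex module and a matching libclang are installed. The helper names iter_functions and list_functions, the compiler flags, and the choice of cursor kinds are assumptions made for this example. The recursion into namespaces and classes covers the three declaration scopes listed earlier.

```python
def iter_functions(cursor, scope=()):
    """Yield (qualified_name, return_type, [(param_type, param_name), ...])."""
    for child in cursor.get_children():
        kind = child.kind.name   # CursorKind name, e.g. "FUNCTION_DECL"
        if kind in ("FUNCTION_DECL", "CXX_METHOD"):
            params = [(arg.type.spelling, arg.spelling)
                      for arg in child.get_arguments()]
            qualified = "::".join(scope + (child.spelling,))
            yield qualified, child.result_type.spelling, params
        elif kind in ("NAMESPACE", "CLASS_DECL", "STRUCT_DECL"):
            # Namespaces and classes can nest arbitrarily, so recurse.
            yield from iter_functions(child, scope + (child.spelling,))


def list_functions(path, clang_args=("-x", "c++", "-std=c++17")):
    """Parse one file with libclang and list the functions found in it."""
    import clang.cindex   # imported lazily; needs the libclang Python bindings
    index = clang.cindex.Index.create()
    translation_unit = index.parse(path, args=list(clang_args))
    return list(iter_functions(translation_unit.cursor))


# Usage (hypothetical file name):
#     for name, result, params in list_functions("example.cpp"):
#         print(result, name, params)
```

As written, this also reports declarations pulled in through #include, so a real script would probably filter on each cursor's location. Constructors, function templates, and out-of-line method definitions are not handled here.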


That said, I would question the usefulness of the project: if the documentation can be generated by a simple parser, then it is redundant with the code. Redundancy is at best useless and at worst dangerous: it introduces the potential threat of desynchronization...

If the function is tricky enough that its use requires documentation, then a developer who knows its limitations and so on has to write this documentation.
