Annotated Python – Header Checker

This series of articles is aimed at readers who don't know much about Python but are curious: curious about what Python can do, whether it is as easy to read as Python enthusiasts claim, and whether it is worth the effort of learning.

I have been wanting to write an article about Python for a while but, as is often the case, I found myself thinking "what could I possibly write about?" I wasn't really interested in writing a syntax guide; there are many of those around already. The solution to the quandary was to ask a muse (Paul Grenyer).

[13:37] tim: I'm wanting to write a python article for cvu
[13:37] tim: in the comming edition
[13:37] Paul Grenyer: ok
[13:37] tim: but not sure what to cover
[13:37] tim: ideas?
<boring – non relevant bit snipped>
[13:38] Paul Grenyer: Well, the last script I was thinking of writing....
[13:39] Paul Grenyer: was to go through source files and checking to see 
   if the GNU license was at the top and if it wasn't, adding it.

I thought that this could be interesting, mildly useful, and cover a number of things without being too complex. So here we go.

The first thing to do is to decide exactly what the script has to do.

Getting Aeryn

Next, we need some files to crawl over. Given that it was Paul I was talking to, he suggested Aeryn, an excellent C++ unit testing framework he has written.

The interactive interpreter

Python is an interpreted language. This means that there is no explicit compile step to turn source code into something that can be executed. Python also provides an interactive interpreter (often just referred to as 'the interpreter'). Once started, the interpreter presents a prompt at which you can type code directly. Python statements are executed as they are entered, which makes this a great place to test out code without even committing ideas to a source file.

On Linux machines Python is normally in the default path, and the interpreter is started by calling the Python executable from the shell prompt:

tim@spike:~$ python
Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
tim@spike:~$

If you are on Windows with a default Python installation, the location of the Python executable isn't in your PATH, so you need to give the full path to it:

C:\Documents and Settings\Tim>c:\Python24\python
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> ^Z

C:\Documents and Settings\Tim>

To exit the Python interpreter, use Ctrl-D on Linux, or Ctrl-Z followed by Enter on Windows. Even though the development of this script was done on both Linux and Windows, all the path names in the examples are given as Linux paths for consistency.

Which files?

The first piece of code we need is something that will crawl a directory recursively and easily.

The first place to look for code that you suspect already exists is the Python Library Reference. Anything that is operating system specific, such as traversing a directory, is in the os module. A quick look through the documentation leads us to the Files and Directories section, where there is an excellent example of how to use the os.walk function.

>>> import os 1
>>> for root, dirs, files in os.walk('/home/tim/accu/aeryn'): 2
...     print root, dirs, files 3
...
[result snipped due to way too much stuff printed out]
  1. Python comes with an extensive list of standard modules – os contains operating system specific commands. The import statement finds the module (by searching the directories listed in sys.path, which include those named in the PYTHONPATH environment variable), initialises it, and defines one or more names at local scope. In this case the local variable os is bound to a module object.
  2. The walk function provides iteration over the directories. Each iteration yields a tuple of three values – the directory name, a list of the directories in that directory, and a list of the files in that directory.
  3. The print statement takes an arbitrary number of parameters, and by default prints the string representation of each, separated by spaces. Also by default, the print statement adds a newline at the end.

A quick look at the results from this shows that the Subversion directories are also being crawled here – and we don't want that. So let's add a quick check to skip them:

>>> for root, dirs, files in os.walk('/home/tim/accu/aeryn'):
...    if '.svn' not in root: 1
...      print root, dirs, files
...
/home/tim/accu/aeryn ['.svn', 'corelib', 'examples', 'include', 'make', 'src', \
  'testrunner', 'testrunner2', 'tests', 'www'] ['aeryn2.sln', 'Doxyfile', \
  'lgpl_aeryn.txt', 'license.txt', 'Makefile', 'SConstruct', 'VERSION']
/home/tim/accu/aeryn/corelib ['.svn'] ['corelib.vcproj', 'Makefile']
    <<-- snipped results -->>
/home/tim/accu/aeryn/examples/customreport1 ['.svn'] ['main.cpp', 'customreport1.vcproj']
/home/tim/accu/aeryn/examples/mockfiletests ['.svn'] ['main.cpp', 'mockfiletests.vcproj']
  1. 'str' in var returns True if the string 'str' is found in the string var. a not in b is the same as not (a in b), but easier to read.

This is a bit messy, though. A re-read of the documentation shows that the list of directories that is returned from the walk function is checked for the next iteration. Removing an entry from the directory list tells the function not to traverse it. This is a much tidier way of avoiding the subversion directories.

>>> for root, dirs, files in os.walk('/home/tim/accu/aeryn'):
...     if '.svn' in dirs: dirs.remove('.svn')
...     print root, dirs, files
... 
/home/tim/accu/aeryn ['corelib', 'include', 'www', 'src', 'testrunner2', 'tests', \
  'testrunner', 'make', 'examples'] ['aeryn2.sln', 'Doxyfile', 'VERSION', \
  'lgpl_aeryn.txt', 'license.txt', 'SConstruct', 'Makefile']
    <<-- snipped results -->>
/home/tim/accu/aeryn/examples/customreport1 [] ['main.cpp', 'customreport1.vcproj']
/home/tim/accu/aeryn/examples/mockfiletests [] ['main.cpp', 'mockfiletests.vcproj']
>>> 
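
The pruning can be checked without a real Subversion checkout. The following is a minimal sketch, not part of checker.py: it builds a throwaway tree with the standard tempfile module (the directory names are made up) and records which directories os.walk visits. It assumes Python 2.6 or later, since it uses os.path.relpath.

```python
import os
import shutil
import tempfile

# Build a small, made-up directory tree in a temporary location.
base = tempfile.mkdtemp()
try:
    os.makedirs(os.path.join(base, 'src', '.svn', 'text-base'))
    os.makedirs(os.path.join(base, 'tests', '.svn'))

    visited = []
    for root, dirs, files in os.walk(base):
        if '.svn' in dirs:
            dirs.remove('.svn')      # changed in place, so walk skips it
        visited.append(os.path.relpath(root, base))
finally:
    shutil.rmtree(base)              # always clean up the temporary tree
```

The visited list ends up holding only '.', 'src' and 'tests', with the last two in whatever order the operating system lists them. Note that the list has to be changed in place, as remove does; rebinding the name with dirs = [...] would leave the list that walk is holding untouched.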

Once the complexity of the code being written goes over a couple of lines, I end up writing a script to contain the code. A script can be imported into the interpreter as a module. For example, here is the file checker.py:

import os  1

def files(basedir, extensions): 2
    result = [] 3
    for root, dirs, files in os.walk(basedir):
        if '.svn' in dirs:
            dirs.remove('.svn')
        for file in files:
            for ext in extensions:
                if file.endswith(ext):
                    result.append(os.path.join(root, file)) 4
    return result 5
  1. every script has to stand alone, so a script must import all modules it uses
  2. the def statement is used to create a function
  3. no need to declare variables, just assign to them. In this case result is set to be an empty list
  4. os.path.join joins together two or more path elements with the appropriate path separator for the platform
  5. return the list of files – if the return statement is omitted, then None is returned

And the script is imported into the interpreter like this:

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import checker
>>> result = checker.files('/home/tim/accu/aeryn', ('.cpp','.hpp'))
>>> print result[:5] 1
['/home/tim/accu/aeryn/include/aeryn/platform_report_output.hpp', 
 '/home/tim/accu/aeryn/include/aeryn/test_name_not_found.hpp', 
 '/home/tim/accu/aeryn/include/aeryn/use_name.hpp', 
 '/home/tim/accu/aeryn/include/aeryn/namespace.hpp',
 '/home/tim/accu/aeryn/include/aeryn/xcode_report.hpp']
>>> 
  1. [a:b] is the slice operator, where a and b are optional. result[:5] says return a list that has the elements from result starting at the start and up to but not including element at index 5 (the sixth element – indices start from zero).
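
A few more slices, as a small sketch. The list names is made up for illustration and does not come from the Aeryn tree.

```python
# Hypothetical sample data, used only to illustrate slicing.
names = ['a.hpp', 'b.hpp', 'c.hpp', 'd.hpp', 'e.hpp', 'f.hpp']

first_two = names[:2]   # from the start, up to but not including index 2
from_four = names[4:]   # from index 4 to the end
middle = names[1:3]     # indices 1 and 2
everything = names[:]   # both ends omitted: a copy of the whole list
```

Slicing never changes the original list; each slice is a new list.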

Working with command line options

Now that there is a way to get the names of all the files that we are interested in, we need to do something with them. In this case, look at the start of each file to see if there is a matching header. A simple way to define the header that we are looking for is to put the header in a file, and pass the name of that file to the script.

Command line parsing is a problem that all but the most trivial scripts need to handle. Luckily Python has an outstanding module for parsing command line arguments – optparse.

from optparse import OptionParser  1

def parse(args):   2
    parser = OptionParser()
    parser.add_option('-d', '--dir', default='.',
                      help='The base directory to start from') 3
    parser.add_option('-e', '--ensure-header',
                      dest='header', metavar='FILE',  4
                      help='Ensure that the header in FILE is'\
                           ' at the start of all the files')
    parser.add_option('-r', '--remove-header',
                      dest='remove', metavar='FILE',
                      help='Remove the specified header first'\
                           ' if it is there')
    parser.add_option('-x', '--ext', action='append',  5
                      help='Look at files with this extension')
    parser.add_option('-t', '--test', action='store_true', 6
                      default=False,
                      help="Test run, doesn't actually change the files") 7
    return parser.parse_args(args)   8
  1. Instead of importing an entire module and prefixing the use of all functions, you can also import individual functions, classes or variables from modules using the from statement. When imported this way, the imported entity does not have to be prefixed with the module name.
  2. Command line arguments are available as a list. An advantage of having a stand alone function for parsing the arguments is that it can be tested in isolation using the interactive interpreter.
  3. The first parameter is the short option name, and the second is the long option name. If a default is not specified None is used. The help parameter is printed out if the -h or --help option is set. The variable name that is used to store the option is, by default, the same as the long option name with the prefix '--' removed, so in this case 'dir'.
  4. Here the variable name to store the option is being overridden to a shorter name using the 'dest' parameter. The metavar parameter is used only when printing the help. Start of help text without metavar:
      -e HEADER, --ensure-header=HEADER
    Start of help text with metavar:
      -e FILE, --ensure-header=FILE
  5. The append action allows the option to be specified multiple times. The variable containing the option is returned as a list.
  6. The store_true action specifies that the option takes no value; the variable is simply set to True if the option is present.
  7. Notice that the help string here uses double quotes not single quotes. Python strings can be defined using either single or double quotes, although normal usage is to use single. Here I am using double quotes as it avoids having to escape the single quote in doesn't. Alternatively it could have been written:
    help='Test run, doesn\'t actually change the files'
  8. The parse_args function returns a tuple of (options, args) where the options object contains the parameter values, and the args parameter is a list containing the arguments that did not match any of the options.

The options object that is returned has members defined according to the arguments that were parsed, shown here:

>>> reload(checker)  1
<module 'checker' from 'checker.py'>
>>> args = ['-d', '/home/tim/accu/aeryn',
... '--ensure-header=gnu.txt',
... '-x.cpp', '-x', '.hpp', '-t']  2
>>> options, args = checker.parse(args)
>>> options.dir
'/home/tim/accu/aeryn'
>>> options.header
'gnu.txt'
>>> options.ext
['.cpp', '.hpp']
>>> options.test
True
>>> 
  1. If a module has been imported and then changed, the updated code can be loaded by using the reload function.
  2. This gives a way of testing the equivalent of passing '-d /home/tim/accu/aeryn --ensure-header=gnu.txt -x.cpp -x .hpp -t' on the command line.
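
Passing an explicit list to parse_args also makes it easy to see what the options object holds when options are left out. This is a small sketch using a cut-down parser rather than the full one from checker.py; the argument lists are made up.

```python
from optparse import OptionParser

# A cut-down version of the parser, with three of the options.
parser = OptionParser()
parser.add_option('-d', '--dir', default='.')
parser.add_option('-x', '--ext', action='append')
parser.add_option('-t', '--test', action='store_true', default=False)

# No arguments at all: every option falls back to its default.
options, leftover = parser.parse_args([])

# One extension plus a word that matches no option.
options2, leftover2 = parser.parse_args(['-x', '.cpp', 'stray'])
```

With no arguments, options.dir is '.', options.test is False and options.ext is None. An append option that is never given stays None rather than becoming an empty list, so code that uses it should allow for that. In the second call options2.ext is ['.cpp'] and the unmatched word ends up in leftover2.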

Looking for headers

In order to create the file gnu.txt I cut the top off one of the files. To load the contents of the file into a variable you can do this:

>>> header = open(options.header).read()

This is a little sloppy though as it relies on the garbage collector to close the file handle for the associated file object. Python 2.5 (which is currently in beta) is adding a with statement which is similar to the using statement in C#. Until then, the clean way is like this:

>>> f = open(options.header)  1
>>> header = f.read()    2
>>> f.close()   3
  1. The open function returns a file object, and by default opens a file read only.
  2. The read method returns the entire contents of the file as a string.
  3. Close the file.

Since this functionality is going to be needed in a few places, put it in a function:

def readfile(filename):
    '''readfile(filename):
    returns the contents of the file 'filename' '''  1
    f = open(filename)
    contents = f.read()
    f.close()
    return contents
  1. This is called a documentation string or docstring and is used to document functions, classes or modules. If the first statement is a string literal it is bound to the attribute __doc__ (and func_doc). Strings that span multiple lines can be specified by using three single or double quotes – called triple quoted strings.
>>> reload(checker)
<module 'checker' from 'checker.py'>
>>> print checker.readfile.__doc__
readfile(filename):
    returns the contents of the file 'filename'
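
For comparison, here is a sketch of how the same function might look once the with statement mentioned earlier is available. The name readfile_with is made up for this example, and the code assumes Python 2.6 or later (under 2.5 the statement has to be enabled with from __future__ import with_statement). The temporary file is only there to give the function something to read.

```python
import os
import tempfile

def readfile_with(filename):
    '''Hypothetical variant of readfile: the file is closed when
    the with block is left, even if read raises an exception.'''
    with open(filename) as f:
        return f.read()

# Write a small throwaway file, read it back, then remove it.
fd, path = tempfile.mkstemp()
os.close(fd)
out = open(path, 'w')
out.write('header text\n')
out.close()

contents = readfile_with(path)
os.remove(path)
```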

Next we need to write the function that will actually check the headers of the source files. Python has many convenient string handling functions, a few of which will be used here.

def process(filename, ensure_header, remove_header, options):
    '''process(filename, ensure_header, remove_header, options):
       filename: a string
       ensure_header: the header to be added if missing
       remove_header: the header to be removed if found
       options: options object from command line parsing
    '''
    contents = original = readfile(filename)
    actions = []
    if remove_header and contents.startswith(remove_header):  1
        actions.append('removed header')
        contents = contents[len(remove_header):]   2
    if ensure_header and not contents.startswith(ensure_header):
        actions.append('added header')
        contents = ensure_header + contents
    if contents != original:
        print filename, ' and '.join(actions) 3
        if not options.test:
            f = open(filename, 'w')  4
            f.write(contents)
            f.close()
  1. The following values evaluate to False: None, zero, the empty string, and empty containers (list, tuple, set, dict). Python uses short circuit boolean evaluation, so in this case if remove_header is None the interpreter never tries to evaluate the startswith method.
  2. The len function is the standard way to get the length of strings and containers. Here we take the substring of contents from the end of the header to the end of the string.
  3. String literals in source are treated as string objects, so methods can be called on them directly. The join method takes something that can be iterated over as a parameter, and creates a string by placing itself between each pair of adjacent items in the parameter.
  4. Open the file in write mode.
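
Two of these points are easy to try in isolation. A small sketch, with made-up values standing in for what process works with:

```python
# Made-up values, standing in for what process works with.
actions = ['removed header', 'added header']
summary = ' and '.join(actions)          # separator goes between the items
single = ' and '.join(['added header'])  # one item, so no separator appears

remove_header = None
contents = 'int main() {}\n'
# remove_header is None, so the right-hand side is never evaluated;
# contents.startswith(None) on its own would raise a TypeError.
matched = remove_header and contents.startswith(remove_header)
```

Here summary is 'removed header and added header', single is just 'added header', and matched is None, which counts as false in the if test.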

I created a simple file called test.txt containing a few lines of code, then went back to the interpreter to test the process function.

>>> reload(checker)
<module 'checker' from 'checker.py'>
>>> checker.process('test.txt', header, None, options)
test.txt added header

I checked the file – no change. Hmm... hang on a sec, was test set to True or False?

>>> options.test
True
>>> options.test = False
>>> checker.process('test.txt', header, None, options)
test.txt added header
>>> checker.process('test.txt', header, None, options)
>>> checker.process('test.txt', None, header, options)
test.txt removed header
>>> checker.process('test.txt', None, header, options)
>>> 

Bringing it together

Now it is behaving as expected. The last few bits that are needed to tie the functions together in a script are:

def main(args):  1
    (options, args) = parse(args)
    if not options.header and not options.remove:
        print 'Nothing to do, neither --ensure-header'\
              ' nor --remove-header set'
        return
    if not options.ext:  # ext is None when no -x option was given
        print 'Nothing to do, no extensions specified'
        return
    ensure_header = options.header and readfile(options.header) 2
    remove_header = options.remove and readfile(options.remove)
    filenames = files(options.dir, options.ext)
    for filename in filenames:
        process(filename, ensure_header, remove_header, options)
   
if __name__ == '__main__': 3
    import sys  4
    main(sys.argv)
  1. While not absolutely necessary, I like to have main as a specified function so it can be called in the interactive interpreter.
  2. The logical operators and and or do not necessarily return boolean values; they return the last operand that had to be evaluated to determine the result.
    >>> 'hello' or 42
    'hello'
    >>> 'hello' and 42
    42
    
  3. The module level variable __name__ is set when the script is executed or imported. When the script is imported as a module __name__ is set to the name of the script (in this case 'checker'). When executed from the command line __name__ is set to '__main__'.
  4. Import statements can appear anywhere, and since the only place the sys module is needed is when the script is run as a script, we can load it there. The command line arguments are found in the list member argv.

And there you have it. The next thing to do is to see if it actually works. In order to get some form of meaningful results over the Aeryn codebase, I decided to update the copyright date in the GNU licence header. So I copied the header from a file and named it gnu2005.txt, then edited it to have the year 2006 and saved it as gnu2006.txt. Executing the script in test mode with the following parameters gave some surprising results:

./checker.py -d ~/accu/aeryn/src -x .cpp -x .hpp -r gnu2005.txt -e gnu2006.txt -t

Code rarely works the first time

There were several files where it wasn't removing the header, but it was adding one. Looking at these files, it became apparent that the matching algorithm was less than entirely sufficient. The headers matched in all places except for one space. Because of that the old header was not matched, and the file would have ended up with two headers, which is not ideal. There must be a better way. One solution is to match against strings that have all the whitespace stripped. However, once you have found a match, you then need to work out which part of the file contains the header so that it can be removed.

Stripping whitespace from a string is relatively simple.

''.join(content.split()) 1
  1. The split method creates a list of words from a string broken on whitespace. In this case the string that is the delimiter between each of the words is the empty string, so we end up with a string stripped of all whitespace.
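
As a quick check of the idea, here are two header lines that differ only in their whitespace. Both strings are made up for this sketch.

```python
# Two made-up header lines that differ only in whitespace.
a = '// Free software: see  license.txt\n'
b = '//  Free software: see license.txt \n'

a_stripped = ''.join(a.split())
b_stripped = ''.join(b.split())

same_raw = (a == b)                         # the raw strings differ
same_stripped = (a_stripped == b_stripped)  # the stripped ones do not
```

Both stripped strings come out as '//Freesoftware:seelicense.txt', so a comparison that failed on the raw text now succeeds.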

Finding the appropriate position in the non-stripped string for cutting is not much more difficult. In order to deal with stripped strings, the functions process and main needed some modification, and a new function strip_header was added, as follows:

def strip_header(contents, header):
    i = 0
    for x in xrange(len(header)):  1
        while contents[i] != header[x]:
            i += 1  2
        i += 1
    return contents[i:]

def process(filename, ensure_header, ensure_stripped,
            remove_stripped, options):
    
    contents = original = readfile(filename)
    contents_stripped = ''.join(contents.split())
    actions = []
    if remove_stripped and \
       contents_stripped.startswith(remove_stripped):
        actions.append('removed header')
        contents = strip_header(contents, remove_stripped)
        contents_stripped = contents_stripped[len(remove_stripped):]
    if ensure_stripped and not contents.startswith(ensure_header):
        if contents_stripped.startswith(ensure_stripped):
            contents = strip_header(contents, ensure_stripped)
            actions.append('updated header')
        else:
            actions.append('added header')
        contents = ensure_header + contents
    if contents != original:
        print filename, ' and '.join(actions)
        if not options.test:
            f = open(filename, 'w')
            f.write(contents)
            f.close()

def main(args):
    (options, args) = parse(args)
    if not options.header and not options.remove:
        print 'Nothing to do, neither --ensure-header'\
              ' nor --remove-header set'
        return
    if not options.ext:  # ext is None when no -x option was given
        print 'Nothing to do, no extensions specified'
        return
    ensure_header = options.header and readfile(options.header)
    ensure_stripped = ensure_header and ''.join(ensure_header.split())
    remove_header = options.remove and readfile(options.remove)
    remove_stripped = remove_header and ''.join(remove_header.split())
    filenames = files(options.dir, options.ext)
    for filename in filenames:
        process(filename, ensure_header, ensure_stripped,
                remove_stripped, options)
  1. The xrange function generates the values from zero up to but not including the parameter value, so xrange(5) will generate the values 0, 1, 2, 3, 4.
  2. Python does not have either postfix or prefix increment (or decrement) operators. It does however have increment and assign (+=). In fact, most binary operators have an assignment form.
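
To see what strip_header does with a header that differs only in whitespace, here is a small worked example. The function is repeated from above so that the sketch stands alone, with range in place of xrange so that it also runs on Python 3; the header and file contents are made up.

```python
def strip_header(contents, header):
    # header must already have had all of its whitespace removed
    i = 0
    for x in range(len(header)):
        # skip characters in contents until the next header character
        while contents[i] != header[x]:
            i += 1
        i += 1
    return contents[i:]

# Made-up header and file contents; note the different spacing.
header = '/* old  licence */\n'
stripped = ''.join(header.split())
contents = '/* old licence   */\nint main() {}\n'

rest = strip_header(contents, stripped)
```

Here stripped is '/*oldlicence*/' and rest is '\nint main() {}\n': everything up to and including the last non-whitespace character of the header has gone, but the newline that followed it is still there.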

Executing the updated script over the Aeryn source no longer gave any surprises. The full Python script and GNU headers used for these tests can be found on my website.

I sincerely hope that this article has enlightened you to some of the power and simplicity of Python.

Tim Penhey