My First Python Program - Part 2

This is a continuation of yesterday's post.

Earlier I said that I would show some of the code that went into the system along with explanations of how I solved some problems. To start things off, I will explain how data is filtered, possibly the most import functionality in the system. Lets dive in.

Filtering Data

def filter(cls, **filter_fns):
    """Filter data from the storage.

    x.filer() -> Definition.RecordSet
    x.filter(col=lambda v: v is 6) <==> x.filter(col=6)

    Filter starts by getting all available primary keys in the data set using
    D.__getPrimaryKeys(). If there are any filters in **filter_fns, it will go
    through those, make them into lambdas and apply them to each row in the
    data set corresponding with a valid primary key."""

    pks = cls.__getPrimaryKeys()

    if len(filter_fns) > 0:
        # make sure we're not trying to filter for fields that aren't in this
        # data definition
        for field in filter_fns:
            if field not in cls.fields:
                raise NameError, "Non existant field [%s] cannot be filtered." % field    

            # if supplied a value and not a lambda then create a filter lambda
            if type(filter_fns[field]) is not types.LambdaType:
                v = filter_fns[field]
                filter_fns[field] = (lambda val: val == v)

        # function that will do the filtering
        def filter_row(row_id):
            """Filter the list items in a data set row against the filter_fns.

            This function takes a list of lambda functions, pops one off of
            the stack, tests it, if true, it continues, else it breaks out of
            the loop."""

            for field in filter_fns:
                index = cls.fields[field].index
                if not filter_fns[field]([row_id][index]):
                    return False
            return True

        # filter, filter_fields needs to be copied for each row found because
        # the columns to filter are popped off of it. It lets the function always
        # use the same lambda filters
        pks = [i for i in pks if filter_row(i)]

    return c

The code starts off by getting a list of the primary keys for every row with the exception of deleted rows (pks = cls.__getPrimaryKeys()). This ends up being a list of primary keys (integers) which will eventually be filtered through or passed straight to the Definition.RecordSet class. The functions to filter each row are passed in using the cool ** notation, which gives you the named arguments passed to a function. This lets one use the function as: Definition().filter(col=1) to get all rows in the Definition data set whose column 'col' is equal to one. The function will return all filtered primary keys by using another list comprehension on the primary keys of all non-deleted rows. The final list of primary keys is then passed to the Definition.RecordSet class.

Iterating over Record Sets

def __iter__(self):
    """Generator that reads the data into a dict from storage.

    for row in RS(): ..."""
    self.allow_filter = False
    for i in range(self.__len__()):

def __getslice__(self, i=None, j=None):
    """Limit (slice) the data set.

    !!! This cannot be called after any data has been derived from the
    recordset using or by looping over the record set.

    for row in RS()[:1]: This limits the record set to the first row."""
    if not self.allow_filter:
        raise RecordSetModifiedError, "Can't slice record set after data has been read."

    self.pks = self.pks[i:j]
    return self

def __getitem__(self, n):
    """Return the nth record from the record set.

    RS.__getitem__(0) <==> RS[0] -> Definition.Record"""

    if type(n) is not types.IntType:    
        raise TypeError, "Definition.RecordSet.__getitem__ expects an integer 
                            , %s given" % str(type(n))
    elif n not in range(self.__len__()):
        raise RecordDoesNotExistError, "Row [%r] does not exist in this record set." % n

The RecordSet class mimics the list class (but doesn't extend it) by supporting iteration, slicing, and getting individual rows using numeric indices. I mentioned before that the record set uses Python's generators and you can see that in the __iter__ function with the yield keyword. That means that __iter__ is essentially a restartable function. Each time it is called, it will start from where it left off. In this case, each time it is called, it will advance the for loop that it is in. This is an important optimization because it means that not ever row is read (and thus translated) and makes it so that if the programmer wants to limit or slice the result set (equivalent of SQL's LIMIT start, offset) then that operation is applied to the primary keys.


def sort(self, **filters):
    """Sort the rows given filters."""
    if not self.allow_filter:
        raise RecordSetModifiedError, "Can't sort record set after data has been read."              
    # validate the sorting filters
    for field in filters:
        if field not in self.definition.fields:
            raise NameError, "Cannot sort recordset by non-existant field [%s]." % field
        if type(filters[field]) is types.IntType:
            d = filters[field] >= 0 and 1 or -1 # asc or desc
            filters[field] = lambda a,b: d*cmp(a,b)
        if type(filters[field]) is not types.LambdaType:
            raise Exception, "Expected lambda or integer in Definition.RecordSet.sort() 
                                [%s] given." % str(type(filters[field]))
    def sort_row(field, cmp_filter, a, b):
        index = self.definition.fields[field].index
        return cmp_filter([a][index],[b][index])
    # if the filters are okay, do the sort. The sort needs to happen for each
    # filter and not for each row, eg: we can't look up a row and then apply all
    # sort filters to it because that would be unpredictable
    for field in filters:
        self.pks.sort(lambda a,b: sort_row(field, filters[field], a, b))
    return self

SQL also has an ORDER BY clause which is tremendously useful and this function attempts to mimic its functionality—if not surpass it. This function is very similar to Definition.filter() in that it can take either a value or a lambda. One thing to note is that Definition.RecordSet.sort() expects either integers or lambdas as the values to the named arguments. If one wanted to sort a column in ascending or descending order, they would simply pass a 1 or a -1. However, if more a more complicated sorting scheme is required, a lambda can just as easily be passed.

Another difference between filter and sort is the way that the filter functions are applied. With filter, the functions are applied to the appropriate datum in the row, and if any one of the filters returns False then that row is excluded. With sort, the filters are applied one by one meaning that the rows will progressively by sorted further and further until all sorting functions have been applied. This is done because of the nature of list.sort(). It expects a function with a return value of -1, 0, or 1. This means that if we were to apply more than one sorting filter to a single row and each filter returned a different value then there would be no way to accurately determine if one row should go before or after another.

Instance Persistance

Hehe, I just had to put that as a header! What I mean is that all instances of the same definition classes should share the same data and a few other configuration variables. In order to get this to work with instance methods (as opposed to class methods) I used the following hack..

def __getattr__(self, attr):
    """Get an attribute.

    D.test <==> D.__getattr__('test')
    !!! If attr is in the D.global_attrs tuple then:
    D.moo <==> D.__class__.moo"""      
    if attr in self.class_attrs:
        return getattr(self.__class__, attr)
        return object.__getattr__(self, attr)

def __setattr__(self, attr, val):
    """Set an attribute.
    D.moo <==> D.__setattr__('moo', val)"""

    if attr in self.class_attrs:
        setattr(self.__class__, attr, val)
        object.__setattr__(self, attr, val)

If there is anything I don't like about the program I have made then these two functions are at the top of the list. By the way, "class_attrs" is a tuple with a list of variable names that should be common between all instances of a particular class.

This isn't usually an issue when using the @classmethod decorator; however, there is value in using instance methods (such as __init__, __getitem__, and __setitem__) and to make those work as expected with the class methods, this hack was necessary.

If you're coming from PHP and don't quite understand what this is doing, think of it this way: we have a top class with several static variables and a few classes that extend that top class. Using this, all instances of a particular class (be it the top class or a child class) will share the values of those static variables, but no two instances of different classes will share those values. If you can visualize this then you should immediately realize the usefulness of this.


I love Python. Yes, seriously. Otherwise, I think there was a lot of value in doing this exercise. I use a relational database every day and normally don't think about how it stores or retrieves rows. In this sense, I haven't actually learnt anything about a lot of the insanely cool algorithms rdbms apply to do what they do but it has made me think differently about how I handle data. When I use—for example: MySQL—through the functions PHP supplies me I am normally constrained to working with all rows at once. In this exercise, I constrained myself to a similar view of data as with databases: tables, columns, relations. Beyond that it was up to me to implement how everything worked under the hood and I found interesting solutions (such as working with primary keys only) that would not had occurred to me otherwise.


There are no comments, but you look like you have something on your mind.