Stackless Operator Precedence

I got operator precedence for my parser working properly yesterday and am well into the interpreter (it can interpret expressions!). I thought that the way I solved the problem of generating a tree from some infix expressions was both clever and hackish. The normal solution involves creating a stack, pushing a frame in, parsing any operands/operators we find and putting them into a parse tree on that stack frame. If a '(' is found then a new frame is pushed onto the stack and all operations within the sub-expression are put on the parse tree of the newer stack frame. When a ')' is found, the topmost frame is popped off the stack and its parse tree is merged into the new top frame's parse tree. Interestingly, I managed to achieve the same effect without a stack at all.

The key to this is the tokenizer. My tokenizer is iterative, so I only parse the input text for a token when I ask for a token. For a simple calculator, a tokenizer would not be necessary; however, it was for the language. I have an enumeration of all the possible tokens for my language. I conveniently set it up so that the first token is an unknown token (commonly used as the right link in a cons cell), and then the first twenty tokens are all infix operators. I have a function to register operators along with their left and right precedences, and so they are all included in a convenient operators array. So far it seems as though the tokens are irrelevant. I will explain.

My parse tree is a binary tree laid out using cons cells. Eg: 1 + 2 * 3 becomes:

       (+)
       / \
     (1)  ()
         /  \
        (*)
       /   \
     (2)   ()
          /  \
        (3)

Each empty () is a node with a token of SW_TOKEN_UNKNOWN (token: 0), both + and * are nodes on the tree with token symbols whose values are less than twenty. So, this just shows how I have the parse trees made. Each node knows its parent node, and each node has a left and right child (one or both of which might be null), and whenever a node is added as a child to another, a new leaf is created and the parser is aware of this leaf[1]. Given this information, we can start onto a stackless model.

Imagine that the parser has just accepted the infix operator *. A useful assumption to make is that the parser works properly and that any prior infix operators--in this case adding 1 + 2--have been parsed into the tree correctly. We find the * operator, and given the handy operators table that I mentioned above, we know its left and right binding powers. Since we know what leaf we're on, we can move upward in the tree looking for nodes whose token symbols are <= 20 (unknown nodes and infix operator nodes). If we find a node that has a token symbol greater than twenty, we stop. If we find a node whose right binding power is greater[2] than the left binding power of the infix node that we have parser (the *) then we will record it, overwriting any previously recorded nodes of this sort. If we find a node whose right binding power is less than or equal to the left binding power of the parsed infix operator then we break out of the loop and hope that we've recorded an infix operator in a previous iteration of the loop.

Okay, so that seems like an odd process. Here is another way of looking at this. We have found an infix operator. If no prior infix operators have been parsed then our job is easy: it becomes the root to whatever was on the left of it and whatever is to the right of it. If we have found prior operators then our job is tougher. We need to find exactly which previous operator to bind this with. Consider this: 2^3^4+5. We need to turn this into (2^(3^4))+5. Since the operators effectively become the roots to the expressions on either side of them, then we can say that the + needs to bind with the first ^. An exponent operator has a higher binding power than an addition operator, and it also binds to the expressions to the right of it more strongly then it binds to the expressions to the left of it--hence why we get 2^(3^4) and not (2^3)^4.

When the + is found, we move up in the tree. We find the second ^, and realize that it has a higher binding power than the + operator. This is good because the + might bind to it. We record the second ^ and keep going up the tree. We find the first ^, it has a higher binding power, so we record it. We try to move up the tree--however there is no parent node so the loop is stopped. The node to bind with was the last one recorded--the first ^. Because the ^ has a higher binding power than the +, it means every exponent operation needs to happen before any addition operation. So, we make the tree of 2^(3^4) the child of our + operator. The case is reversed when an operator is found with the same or a lower binding power. The + operator and its left/right expressions become the children of that operator.

Now, on to the key to stackless sub-expressions. This whole scheme works because infix operators' token symbols are represented by a known range of numbers (1-20, including 0 as empty nodes). The above algorithm works by looking for branches with token symbols in that range. If it stumbles upon a branch with any symbol outside of that range then it will simply stop. So, to deal with sub-expressions, all that is needed is to attach another node to the last known leaf of the tree! This way no stack is needed and no merging back onto the tree is needed because the sub-expression is already part of the tree! It works because the sub-expression node, with a token symbol out of the infix operator's range, represents a road block to any infix operations going on within it (acting as the root of the sub-expression).


1. At first, the choice to use cons cells was random. It has, however, paid off significantly. Since every node is a cons cell, every branch has the same format, and all sub-nodes of the branch also follow this format. When a node is added, a cons cell is made. That means that a blank node added as the right child to the parent node. The node we want to add is set as the left child of the cons cell and the right side of the cons cell is set to null.

2. This may seem backward depending on how you have implemented operator precedence in the past. My precedence--or rather: priority-- goes from weak (1) to strong (10).

Christmas Parse Tree

First of all, I hope everyone is having a great holiday. I've been busy coding and have also been skiing/relaxing. My last post outlined my ideas for a language I wanted to play around with. I've been making progress on making it into a workable language! To show you my progress, and to celebrate the holidays, I thought I would put up a christmas parse tree for the following code block.

[1 to 10]()

p(n) => 11

identifier(arg1, arg2, ...) => do:

	z = {a: '10', b: '20', c: '30'}
	y = [1, 2, "3"]
	while:
		return()
    ;
;
a = -b
test = 1 + 2 + 3 + !x[1:5]
moo => 1 * (7/8)
  • (998)
    • (59)
      • 1
      • (0)
        • (18)
        • (0)
          • 10
          • (0)
    • (0)
      • (100)
        • (108)
          • (109)
          • (0)
        • (0)
      • (0)
        • p
          • (108)
            • n
            • (0)
          • (0)
            • (101)
              • 11
              • (0)
            • (0)
        • (0)
          • identifier
            • (108)
              • arg1
              • (0)
                • arg2
                • (0)
            • (0)
              • (101)
                • do
                  • (108)
                  • (0)
                    • (101)
                      • (24)
                        • z
                        • (107)
                          • (58)
                            • (0)
                              • a
                                • (0)
                                  • 10
                                  • (0)
                              • (0)
                                • b
                                  • (0)
                                    • 20
                                    • (0)
                                • (0)
                                  • c
                                    • (0)
                                      • 30
                                      • (0)
                                  • (0)
                          • (0)
                      • (0)
                        • (24)
                          • y
                          • (107)
                            • (59)
                              • 1
                              • (0)
                                • 2
                                • (0)
                                  • 3
                                  • (0)
                            • (0)
                        • (0)
                          • while
                            • (108)
                            • (0)
                              • (101)
                                • return
                                  • (108)
                                    • (109)
                                    • (0)
                                  • (0)
                                • (0)
                              • (0)
                          • (0)
                    • (0)
                • (0)
              • (0)
          • (0)
            • (24)
              • a
              • (107)
                • (41)
                  • b
                  • (0)
                • (0)
            • (0)
              • (24)
                • test
                • (107)
                  • 1
                  • (0)
                    • 2
                    • (0)
                      • 3
                      • (0)
                        • (26)
                          • x
                          • (0)
                            • (106)
                              • 1
                              • (0)
                                • 5
                                • (0)
                            • (0)
                        • (0)
              • (0)
                • (23)
                  • moo
                  • (0)
                    • (102)
                      • 1
                      • (0)
                        • 7
                        • (0)
                          • 8
                          • (0)
                    • (0)
                • (999)

Happy Holidays!

Almost finished exams, and...

I'm almost finished exams (for this semester, one to go) and I have been mulling over the idea of making my own programming language. I've always liked this idea; however, I have never been able to do it, nor did I know any languages that would have made the task of interpreting it fast.

I took one computer science course this semester; I got a decent overview and some useful experience working with C. Believe it or not, I actually like C! Unfortunately, I do everything by the command line and don't have any cool debugger for it so when bus errors or segmentation faults happen they are a pain in the ass to find. Oh well. On to what I really want to talk about: the idea behind the language I currently want to make.

First, it is not object-oriented. It is a lot like javascript in terms of javascript's functional nature, but then it still has a bunch of procedural language constructs built in. Second, you can make object-like entities in it, and I will show how. Third, the core concept in the language is procedures; encapsulated code that is run every time it is called on. A function is simply a procedure with arguments. Here is a steady build up of the syntax. You will notice little things that I've taken from the languages I love.

a = 10 // simple variable definition
a => 10 // simple procedure definition

add_two(x) => x+2 // simple function definition

// more complex function
factorial(x) => do:
    if x > 1:
        return x * factorial(x-1)
    ;
    return 1
;

Okay, so what is the point of the "=>" symbol? That symbol signifies a implicitly returning procedure. Any code in a procedure is executed every time it is called on. If we had a variable that was a procedure to a function call then every time you use the variable that function would be called. Contrast this with having a variable equal a function call; every time you use that variable you will be using the result of the original function call. This distinction is amazingly important and lets one transparently do some really cool things.

Knowing that the procedure is an implicit return, the factorial function above makes sense. The implicit return is a bit disingenuous, it might be better put that a procedure can only have one expression in it. An expression could be doing some math, which has a return value (and so it implicitly returns) or it could be a language construct such as "do" which allows us to encapsulate further expressions. Here are four ways to write the same function:

// Syntax for quickly typed simple functions and more involved functions by
// returning a language construct 
add_two(x) => x+2  // simple named function, implicit return


// Named function that uses the "do" language construct to allow for more 
// expressions/statements.
add_two(x) => do:
    x = x+2
    return x
;


// example of lambda (anonymous/unnamed) functions
add_two = (x) => x+2


// example of a lambda within a procedure
add_two => (x) => x+2

Okay, so I am making slow progress showing why the procedure approach is so cool. The last example of the add_two function above hints at it though. When add_two is a procedure to a function then every time add_two is used it will return a function. When using functions only by their name, we are using the function directly. So, in one hand we have something that returns a function, and in the other, something that is a function. How could this be useful? Well, lets say that we wanted to make an object system. My current plans for this language don't plan for that (even though I recognize its benefits) so the most we can do is approximate one. Associative arrays (or dictionaries in Python) will be how we approach this. We know that every time we instantiate a class into an object it needs to have its own state. I will start with a simple Object function.

/* Object from dictionary. Do not confuse this with real objects in OO languages! */
Object(d:dict={}) => do:
    this = d
    
    if !("construct" in this):
        this["construct"] = () => null
    ;
    
    return (...) => do:
        apply(this["construct"], args, kargs)
        return this
    ;
;

What can be said about the above function? It accepts a dictionary, if that dictionary doesn't have a "construct" key in it then it will create one pointing to an anonymous function. Finally, a function will be returned that accepts any number of arguments, and when that function is called those arguments will be applied to the "construct" function. The idea of "this" is faked by making the "this" variable equal to the dictionary that we are passing off as an object. Here is how we would use this function...

SomeClass = Object({
     /* ... */
});

Unfortunately, there is a very subtle problem with SubClass and it is because of the nature of the scoping rules in Object. Every time SomeClass is "instantiated" it will return the same dictionary! More meaningfully, every instance of SomeClass will act on the same dictionary, meaning SomeClass is a singleton of sorts. This is clearly a big problem: instances of classes should have their own state but also have the same state structure as other instances of the same class. Here is where the power of procedures comes in.

Class(d) => (...) => apply(Object(d), args, kargs)
Singleton = Object

// How to use:

Stateful = Class({
    /* ... */
})

Registry = Singleton({
    /* ... */
})

// note: assuming the syntax allows it, the following should also work:
Class(d) => => Object(d)

There you have it. Using the Object function we now have two simple object types: classes and singletons! The singleton works because all uses of Singleton act on the same result of calling the Object function. The class works because it curries the Object function. Every time Class is called it acts as if Object was called directly: it accepts the same arguments and returns a function that accepts any number of arguments (the constructor). The key is that Object is called only when this mock constructor is called. At that point, a singleton class is built, instantiated, and returned all at once. Pretty amazing!

There are a few other things that I would like to say about the things I want to have working in this language, but that's just it--this language is still only an idea. This is a project that I am extremely interested in pursuing and hopefully I will be able to get something along the above lines working eventually.

Busy Getting Ready For the Frosh

So, I have been training for the last week and a half as a residence advisor at the university that I go to. I have been really busy and thus have not posted. Unfortunately I will be really busy for the next two weeks so don't expect anything from me in terms of blog posts.

12 Things You Should Dislike About PHP

In my last post I outlined my 15 favorite features of PHP that many people tend not to know about. To contrast my previous post I will detail several things I dislike most in PHP.

Note: the intention is not to start a flame war between PHP and any other language and I expect that the comments will reflect this.

1. Naming and Return Value Conventions
Naming Conventions

The first part of this--naming conventions--is obvious to anyone who has used PHP. PHP lacks a specific naming convention. This can easily be demonstrated with many of the string functions. For example, why is the function to find the first position of one string in another called strpos() when the function to repeat a string is called str_repeat()?

Another clear example of this is with PHP's built in classes. Why is the SQLite class called "SQLiteDatabase" whereas the MySQLi class is called "mysqli"? Anothing minor example with PHP's magic functions: __toString() and __set_state().

This might seem like excessive nitpicking; I think it is a serious issue that affects expectations. When programming, productivity is increased when ones expectations about a function or class name are fulfilled given the prior experience of using similar functions or classes.

Return Value Conventions

This was something that was annoying me the other day: parse_str(). The parse string function takes in a URL encoded string and then extracts the variables within into the current scope. As an option, it can take a reference to an array and put the variables into that by indexing them associatively.

Why would this function ever extract these variables? In fact why does it even take an array by reference? I knew of this function but hadn't read the documentation because I assumed it worked in a similar way to parse_url(). It doesn't. It doesn't even return an array.

This might seem like more of a rant than anything, but then lets get back to expectations. Consider that GET variables can have periods, spaces, hyphens, and more in them. PHP doesn't support any of these characters in their variable names so how is the function acting like extract() useful for these cases? Wouldn't it be more intuitive if it acted more like the similarly named parse_url()?

2. Functions

PHP's functions are not objects. More to the point, PHP has no first class functions. Functions cannot be passed by reference and they cannot be self-calling (this is different than a recursive function, which calls itself from within, whereas self-calling is from the outside).

Functions also don't seem to obey their scope in a predictable way. For example, functions can be nested indefinitely in PHP but that doesn't mean that they act like sub functions! Instead it means that any sub functions become accessible in the global scope once the parent function is called. This does have a practical application where different implementations of functions is needed for different PHP versions; however, that is what the if() statement is for. Simply put: nesting functions does not encapsulate them.

PHP has make-believe lambdas using create_function(). Once created, these functions are available in the global namespace and cannot be self-calling. For example: create_function(...)(); does not work because create_function returns the string name of the function it just created.

Consider the implication of returning a string function name from create_function() when it comes to meta-programming in PHP:

class TestLambda {
	public $test;
	
	public function __construct() {
		$this->test = create_function('$str', 'echo $str;');
	}
}
$l = new TestLambda;

// if functions were objects this would work; unfortunately that is not the case.
$l->test("hello world");

// instead this ugliness needs to be done:
call_user_func_array(array($l, $l->test), array("hello world"));

3. Method chaining doesn't work as expected

In my previous post I championed the use of method chaining--not necessarily with the best example--but showed how to do it nonetheless. Unfortunately there is a critical element of method chaining missing from PHP. It doesn't affect the methods that return $this but the constructor and class instantiation itself. The following code is from my previous post...

class Foo {
	public function bar() {
		return $this;
	}
	public function baz() {
		return $this;
	}
}
As is obvious, bar() and baz() can be chained indefinitely. PHP can't do the following even though a class constructor implicitly returns the instance of an object.
(new Foo)->bar()->baz()->...

4. Magic function __toString() doesn't work as expected

Note: This is a solved bug on php 5.2; however, I am stuck using php 5.1 on MAMP and this is an annoyance for me.

The __toString() method provides a nifty alternative to the cryptic "Object id #1". One would expect that coercing an instance of an object into a string would thus call __toString()--it doesn't.

class Foo {
	public function __toString() {
		return "instance of class Foo";
	}
}

$foo = new Foo;

echo $foo; // output: instance of class Foo

$str = (string)$foo;
echo $str; // output: Object id #1 WTF!?

5. Static Methods and Scope Resolution

This was a problem that bit the Zend team in the ass early on in their development of the ORM in the Zend Framework. The problem happens when there is a static public method in a parent class that is called by an extending class. One would expect that the function would know it is being called from within the child class (as is the case with instance methods); however, that is not the case. Consider the following example:

class StaticTest {
	static public function moo() {
		echo __CLASS__; // also redundant given the get_class() function, but whatever.
		echo get_class();
	}
}

class ChildStaticTest extends StaticTest {
	// ...
}

ChildStaticTest::moo();

// output: StaticTest StaticTest < -- wrong class names!

As a bug in PHP this has many annoying implications. It means that--for example--in the Zend Framework they could not do ModelName::find() for quick access to model functions without a monstrous hack involving debug_backtrace(). They ended up changing their finder syntax.

6. __autoload()

I mentioned this function as being a useful in my previous post. I also said not to use it and instead use spl_autoload_register(). To reiterate the point: __autoload() belongs to the global namespace. This means if two or more scripts share files and define an __autoload() function of their own then there will be a huge problem. This problem also sheds light on the fact that PHP does not support redefining; however, that is a limitation and not a problem of PHP.

7. Magic Quotes and Register Globals

To be honest I don't even want to talk about these. It's been rehashed so many times that I think most PHPers generally get the point.

8. Cleaning up after Exceptions

PHP5 introduces Exceptions. This is one of my favorite improvements present in PHP5. Unfortunately there is one gotcha that I hacked a solution for in a previous post. Exceptions can be thrown and caught. As with all classes in PHP5, destructors are generally called when an object leaves scope. Unfortunately, an uncaught exception will never be destroyed! Why is this an issue? PHP has no finally clause for exceptions! The following is not possible in PHP:

try {
	throw new Exception;
} catch(Exception $e) {
	echo 'caught';
} finally { // whoops, parse error here, missing feature
	echo 'clean up';
}

That means any dependable clean up / tear down action must happen in the exception destructor, which won't be called unless the exception is caught! (Note: a hack around this is to explicitly call the __destruct() method of an exception from within its __toString() method.)

Note: if you didn't catch it above, PHP DOES NOT have a finally statement. This point is to show that because of this, destructors are the only way to clean up after an exception and even they are not 100% dependable.

9. Return values in write context

This is a huge WTF. There might be some logic behind it but for everyday programming it can be incredibly annoying. The following code will cause a fatal error:

function return_empty_string() {
	return '';
}

empty(return_empty_string()); // PHP Fatal error:  Can't use function return value in write context

10. Safe Mode

No. Just No. Safe mode is not safe. Please read the talks by Ilia Alshanetsky. Every hosting company should stop using PHP in safe mode and simply use open_basedir.

11. func_get_args()

This one has always been confusing to me and I will show why with some example code. PHP's func_get_args() cannot be passed as an argument to a function because of scoping issues. Consider that the following in PHP is possible:

strtoupper($string = "hello world");
echo $string; // output: hello world

Notice that $string is available in this scope and hasn't been affected by strtoupper() in any way. Unfortunately the following is not possible:

function test() {
	return implode("", func_get_args()); // func_get_args can't be passed as an argument
}
test("hello", "world");

Does this restriction make any sense?

12. Return values again

PHP acts inconsistently with return values of functions. If a function were to return a string then any string functions could easily be applied to it. If a function returns an object then any of the objects instance methods can be called on it. However, if a function returns an array, then array item syntax (using square brackets) cannot be applied directly to it!

function test_with_array() {
	return array('a', 'b', 'c');
}

function test_with_array_object() {
	return new ArrayObject(array('a', 'b', 'c'));
}

// both fail with an unexpected parse error
echo test_with_array()[0];
echo test_with_array_object()[0];

// works
echo test_with_array_object()->offsetGet(0); // output: a

Although the tone of this post was negative it must be taken with a grain of salt. As a language PHP is constantly improving and I'm eagerly looking forward to PHP6. In its current state I think PHP is an amazing language--despite the above flaws. One might say its biggest flaw is that its too easy to learn. That argument puts the blame in the wrong hands. Many PHP programmers are unaware of security issues and best practices and as a result the language as a whole is tarnished.

This post was meant to contrast my previous one. Looking at what I think are PHP's unknown heroes and what I see as its flaws is not an accurate way of weighting strengths against weaknesses. There is much more to PHP (such as its libraries, community, etc) than what's on these two lists and I encourage people to discover those things for themselves.

15 Cool Things About PHP That Most People Overlook

Note: I have followed this post up with 12 Things You Should Dislike About PHP.

I saw this article linked to from Digg and was disappointed by the list. Here's what I think are some of the more interesting features of PHP.

1. Reflection API

PHP5's Reflection API can introspect PHP functions and classes and even allow them to be rebuilt (see this demo)! For example, PHP provides a call_user_func[_array]() function to call functions/methods given a string representation of their name; however, PHP doesn't provide such a function for classes.

class TestCallFunc {
	public function __construct($a, $b) {
		echo $a .' '. $b;
	}
}

//call_user_func_array(array('TestCallFunc', '__construct'), array('hello', 'sir')); // will not work
$class = new ReflectionClass('TestCallFunc');
$class->newInstanceArgs(array('hello', 'sir')); // works

// the above can come in handy for certain cases where you want to unpack an array of values,
// without knowing how many elements are in that array to a classes constructor.

// If one knows the number of arguments of a constructor, or the constructor does not take a
// variable number of arguments, then this will suffice:
$class = 'TestCallFunc';
new $class('hello', 'sir');

Unfortunately the above example hardly does the Reflection API justice. Consider the following use case. Since the Reflection API can figure out almost everything about PHP code it means that it could be used as a strategy for dependency injection. By using smart type hinting within function definitions (see below) and the reflection classes, function dependencies could be figured out on the fly and be satisfied automatically.

2. Ticks

PHP has an unusual statement called declare. At the moment the statement only supports ticks (functions that are executed after every nth process/statement. The implication of tick handlers for timing and testing applications is huge. Imagine being able to chart how long specific processes take as they happen without the burden of having to manually place timing functions within the code! Another possibility would be to pass state to a tick function for analysis. In some cases this would be immensely useful for tracking down bugs or for simply seeing how data is changed throughout the different processes within a program.

3. list(), extract(), and compact()

list() is a function that PHPers learn early and forget about almost as quickly. list() allows for easy unpacking of numeric arrays, eg: array('a', 'b', 'c', ...). List can be used cunningly for simple tasks such as switching the values of two variables:

$a = 'a';
$b = 'b';

list($a, $b) = array($b, $a);

echo $a; // output: b
echo $b; // output: a

List also has a practical application for any function that returns an array. Instead of having to deal with the results as $result[0], $result[1], etc one can use the list() construct to set the values of the array to variables without the hassle of having to do each variable setting manually. Essentially, list() is PHP's way of doing multiple variable assignment.

Unpacking associative arrays is also easily accomplished with extract(). Consider the following simple example:

extract(array('a' => 'a', 'b' => 'b'));
echo $a; // a
echo $b; // b

extract() comes in handy when transferring scope variables around from one place to another is needed. By packing scope variables into an associative array using get_defined_vars() or compact() and then unpacking them somewhere else using extract(), scope variables can be transparently transported. This experiment of mine examples this use of extract().

Another use for extract() is for template engines. Many simple template engines allow their variables to be set in one place using an associative array and accessed as normal variables within the templates. This works simply by extracting that associative array from within the templates.

Special note should be given to compact(), which was mentioned in the comments by Tarique Sani and does the opposite of extract(). It acts in a similar way to get_defined_vars(); however, it gives the programmer more control by allowing them to specify which local variables to pack into an associative array.

4. PHP5 SPL

The PHP5 Standard PHP Library is full of gems. Two of my favorites include the Iterator and IteratorAggregate interfaces.

PHP5's Iterators work transparently with the foreach() construct as if they were arrays. Iterators come in handy when--for example--formatting needs to be applied to a set of data. Be it a MySQL result set or a multidimensional array, Iterators allow for the accessing, formatting it, and return of data all at the same time! The implication is that data is only fetched when it is needed and that any intermediate steps between retrieval and return of data happen transparently.

Sometimes making a class implement Iterator to allow a class to be iterated is bulky. This problem is solved by implementing the IteratorAggregate interface. It works is by defining a getIterator() function within a class that returns an iterator. If you're a Python user, getIterator() is tantamount __iter__().

Two other nifty interfaces in the SPL are ArrayAccess and Countable. Consider the following code:

class Foo implements ArrayAccess, Countable {
	private $vars = array();
	
	// ArrayAccess
	public function offsetGet($key) {
		$ret = NULL;
		if($this->offsetExists($key))
			$ret = $this->vars[$key];
		return $ret;
	}
	public function offsetSet($key, $val) {
		$this->vars[$key] = $val;
	}
	public function offsetUnset($key) {
		unset($this->vars[$key]);
	}
	public function offsetExists($key) {
		return isset($this->vars[$key]);
	}
	
	// Countable
	public function count() {
		return count($this->vars);
	}
}

$foo = new Foo;
$foo['bar'] = 'baz'; // calls Foo::offsetSet()
echo $foo['bar']; // output: baz, calls Foo::offsetGet()

echo count($foo); // output: 1, calls Foo::count()

As is clear, the Foo class acts in many respects like an array. This functionality is desirable when the convenience of array-like access to a class and the power of class methods are used to enhance productivity and functionality. (As a side note, check out ArrayObject.)

When do any of the SPL classes come in handy? Consider the following object-oriented interface to a MySQL database: iterators are used to fetch the results only when they are needed; Countable allows calling of the native count() function on that query result instead of custom and unintuitive numRows() function; and ArrayAccess is used to represent each row from the database where row objects act like arrays for convenience but can implement more advanced business logic through extension.

The SPL classes allow programmers to interact with their classes more natively and intuitively than ever before. Implementing the different interfaces and classes from the SPL into code allows for transparent interaction with different data types in the same way.

5. __autoload()

__autload is called transparently by PHP when a class that doesn't exist is instantiated. Autoload will take in a class name and then go and try and find the file that the class is in, include it, and let the program continue running its course.

Although this is a handy function I suggest that it not be implemented directly. The reason why I say don't use __autoload() is because it has no namespace. Simply put, if two (or more) applications are running at the same time and share common files that both happen to define the __autload() function then there will be errors. Instead, please check out spl_autoload() and spl_autoload_register().

6. Type Hinting

Type hinting is used to make sure that the information being passed into specific functions complies with an expected data type. Unfortunately type hinting only works for objects and arrays, but it's still extremely nifty! Note: as the following example shows, PHP will error when an unexpected data format is passed to the function.

Here is an example of type hinting in action for objects and for arrays:

function test(array $a, stdclass $b) {
	echo 'yay';
}

test(array(), new stdclass); // works
test('a', 'b'); // results in an error

7. Abstract Classes and Iterfaces

When classes should follow specific contracts interfaces and abstract classes should be used. Interfaces define the required functions and function arguments for all implementing classes. Abstract classes are similar in that respect; however, they also allow for functions to be regularly defined (with bodies).

Note: interfaces nor abstract classes can be instantiated directly. Interfaces need to be implemented (PHP supports multiple interfaces for classes) and abstract classes must be extended.

Obvious use cases for interfaces and abstract classes are for abstraction layers where its useful to define a single interface in one place and let extending classes implement the data-specific logic on their own. Simple examples to this would the database abstraction layers, such as ADOdb.

8. "static" keyword

This is more for php4 people and is really a hack. The most common use of the static keyword from within a function is to implement singleton factories for classes. With improvements in PHPs object-oriented model in PHP5, the static keyword is not as useful because classes allow for static methods and proper function visibilities.

function increment() {
	static $count;
	if($count === NULL) {
		$count = 0;
	}
	return ++$count;
}
echo increment(); // 1
echo increment(); // 2
echo increment(); // 3
// ...

9. === and !==

These are extremely useful and most PHPers can't tell the difference between == and ===. Essentially, == looks for equality, and by that PHP will generally try to coerce data into similar formats, eg: 1 == '1' (true), whereas === looks for identity: 1 === '1' (false). The usefulness of these operators should be immediately recognized for common functions such as strpos(). Since zero in PHP is analogous to FALSE it means that without this operator there would be no way to tell from the result of strpos() if something is at the beginning of a string or if strpos() failed to find anything. Obviously this has many applications elsewhere where returning zero is not equivalent to FALSE.

10. Variable Assignment from within Conditional Statements

This is another fun trick that can also be a pain in the ass. Mistakingly doing if($result = '1') instead of if($result == '1') often happens and is a difficult bug to trace. Interestingly, this annoyance can also come in handy! Try this: check if a strpos() failed or not and get the proper result of it all from within an if() statement. If you don't know how it would work then take a look at the following code:

if(FALSE !== ($pos = strpos('aaaa', 'a'))) {
	echo $pos; // echo's 0
}

11. PHP's Magic Functions

They definitely don't compare to what Python offers but they can be pretty nifty. Couple these functions with the SPL and some pretty cool classes can be made.

__call()

The __call() magic function allows for a class to call functions that it doesn't necessarily have. For example, if there are a lot of redundant functions in a class that work in very similar ways but still require a slight amount of custom logic then maybe __call() would allow for a better implementation. Consider an associative array where the keys represent a white list of functions that __call should work with and their values would somehow allow __call to implement the slight differences between the functions.

If that example seems unlikely then let me detail two ways that I've used __call.

The first way was implementing filters for different MySQL data types. Data for VARCHAR, INT, CHAR, TEXT, et al. can all be handled in very similar ways but require slightly different type casts and regular expressions checks. Well, storing the unique checks in an array with function names for each data type dramatically reduced the amount of redundant code I had to write and made the system more maintainable for when I needed to add more data types to the list.

The second way that I implemented __call() was for credit card processing APIs. I recently made a library to interact with the Authorize.net API and it was expected of me that in the event that the processor was changed that the only thing that would need to be changed is the call to instantiate the appropriate class. If anyone has worked with payment processors before then they will know that the data they accept can be crytpic (eg: x_login, x_customer_cc). To make things simple, I used call to map an associative array of keys (function names) to the processor-specific variables that they modified. Instead of setting an x_login field, I could do: $AIM->login().

__get(), __set(), __isset(), __unset()

These act similarly to ArrayAccesses offsetGet, offsetSet, etc etc, except that these functions allow you to access non-existent instance variables.

__toString()

What happens when you try to echo a class instance? Normally the output will be the unhelpful Object id #1 (or another number). This can be changed though! Changing what is returned from an echo is as simple as defining a custom __toString() method and giving it a custom return value.

12. __halt_compiler(): Halt the Compiler!

Ever wanted to make the installer for a program and the entire program to be installed fit into one file? This can actually be done by using __halt_compiler()! Check it out and make sure to look up how applications such as Phar and FUDForum take advantage of it.

I was asked how this was different from exit() so I will explain. Imagine having a PHP installer script at the top of a file and then having a compressed version of, for example, a forum (in the case of FUDForum) in the rest of the file. When __halt_compiler() is called, PHP literally stops parsing and compiling past that point. That means that the installer can open itself up, look for everything after the __halt_compiler() call, unzip it, and then store it on the server.

13. Variable Composition

Just see this experiment for more interesting implementations, but otherwise the following is possible:

${'a' . 'b'} = 'c';
echo $ab; // c

14. Chaining Method Calls

If you're a fan of jQuery then you will love Javascript's ability to chain methods. PHP5 can also do this! For example: (take note of return $this)

class Foo {
	public function bar() {
		return $this;
	}
	public function baz() {
		return $this;
	}
}

$foo = new Foo;
$foo->bar()->baz()->bar()->.... // these two functions can continue to be chained

The above application doesn't show any practical use for chaining so I encourage people to look at how the CodeIgniter PHP framework uses method chaining for database query building.

15. preg_split

Ever use explode() and then end up looping through the results and picking out the empty ones? Isn't that annoying? By using preg_split this tedious process can be avoided:

// instead of...
$str = "a,b,,c,d"
explode(",", $str); // array('a','b','','c','d')

// do...
preg_split("~,~", $str, -1, PREG_SPLIT_NO_EMPTY); // array('a','b','c','d')

The above example might not seem all that useful so consider bb code parsers. One of the major problems with parsing bb code is that most of the bbcode to html translators don't fix improper closing tags, or even check for them! Generally speaking, most bb code parsers will to a flat preg_replace for each of the tags.

A better approach to parsing bb code is to split the text up into parts that are text and parts that are bbcode-tag-like, eg: ~\[([^\]]*)\]~. By doing this one can go through the split up results and parse the bbcode using a stack. This allows for both proper checking of bbcode opening and closing tags along with the ability to implement a generalized solution so that adding bbcodes isn't about making a new preg_replace but rather making a class/function that knows how to correctly handle the opening and closing tags.

In one of my previous posts I use the above mentioned technique to parse HTML for special XML-like template tags.

I think that's a decent enough list for now, how you enjoyed it.

Download PyrofORM

If you're interested in seeing the inner workings of PyrofORM then you can download it here.

PyrofORM - Python Object Relational Mapping (ORM)

Over the past week I've been working on an ORM and have been talking about it in its various stages of development on this blog. In this post I will detail its API and follow that with some sexy code examples of the ORM in action.

If you're wondering about the name—PyrofORM—I was trying to think of how to use "Py" and "ORM" together. My thought process followed this rationale: styrofoam, pyrofoam, pyroform, PyrofORM.

Definitions

A definition is everything. It describes how specific data is stored, it deals with the basic read/write/delete/etc functions, it has subclasses to deal with individual rows of data (records), sets of data (record sets), and fields, and it deals with loading and storing the pickled data. The following is a description of the API to the ORM.

  • Definition
      Hooks
    • install() - If the cache file is empty, this installs default data.
    • setUp() - This tells the definition which fields to support, their types and their relations.
    • tearDown() - This is a hook that is called in Definition.__del__()
    • Methods
    • read(row_id) - Reads a single row into a Record object, which is similar to a dictionary.
    • write(data={}) - Inserts/updates a row in the cache.
    • delete(row_ids=[]) - Deletes one or more rows from the cache.
    • filter(**filter_fns) - Filters data out of the data set given either values or lambda functions. Returns a new Definition.RecordSet object.
    • __iter__() - Lets the programmer put the definition instance directly into a loop. __iter__ calls Definition.filter() with no filtering functiosn—thus filtering all rows.
    • store() - Stores the data in the cache file.
    • create() - Returns a new empty Definition.Record object.
    • Classes
    • Field - Represents a single allowed field in the data (like a SQL table column)
        Methods
      • default(val) - Allows the programmer to change the default value of a field. Otherwise, the default value of a field is None
      • hasOne(cls, alias, **filter_fns)
      • ownsOne(cls, alias=None, **filter_fns) - Allows one to specify a one-to-one relationship with the definition whose class is cls. If the relation is not aliases then it will be called by cls.__name__. Relations are accessible through Definition.Record.__getattr__ (see below). Ownership means that if a row in the owner definition is deleted, then all related rows in owned definition will also be deleted (cascade delete). The filters are optional to automatically apply filtering functions in the same way that Definition.filter() works.
      • hasMany(cls, alias, **filter_fns)
      • ownsMany(cls, alias, **filter_fns) - Allows one to specify a one-to-many relationship with another definition. See Field.ownsOne for more details.
    • Record - Represents a single row from the data set (like a dictionary / associative array)
        Methods
      • __getitem__(item) - Allows one to access the fields as record["field"] or also access them in numeric order, eg: record[0].
      • __getattr__(attr) - Allows one to access any table relations as record.attr(). See Definition.Field.ownsOne.
      • save() - Saves any data in this record to the definition's data set.
      • delete() - Deletes any data related to this record from the definition's data set.
      • id() - Returns the value of the record's implied primary key if it is saved. Otherwise it raises an IndexError.
    • RecordSet - Represents one or more rows from the data set.
        Methods
      • __len__() - Returns the number of records in the record set, for example: len(recordset)
      • __getslice__(start, end) - Allows the programmer to slice a record set as if it were a list. This can be used to limit the number of results, for example: record_set[:limit].
      • __getitem__(item) - Allows the programmer to access an individual row by its numeric position in the record set. Position 0 will be the first row, etc.
      • next() - Returns the current record in the record set and moves the internal pointer to the next one.
      • delete() - Deletes all records in the record set from the definition's data set.
      • sort(**filter_fns) - Sort the rows in the record set by specific filters. To sort by a single field in ascending or descending order, do the following: recordset.sort(field=-1) (descending order).

Example - Basic Usage

Extending the Definition class is easy. In general all one would overwrite are the hook functions; however, anything can be overwritten. For example, I wrote a nested sets (modified pre-order tree traversal algorithm) implementation using this ORM that required overwriting several of the non-hook methods such as write, save, and delete.

Here is a simple demonstration of how to extend the definition class. A People class will be defined. When that class is instantiated People.setUp() will be called. It describes how data is stored in the rows of the data set of that definition. It also happens to be the place where relations to other definitions are defined. After that, if a cache file doesn't exist for it then the People.install() method will be called to create the initial data set.

#!/usr/local/bin/python

import PyrofORM
import types

class People(PyrofORM.Definition):

	def setUp(self):
		"""Set up the fields that this definition recognizes
		and the types that it expects their data to be in."""

		self["name"] = types.StringType
		self["age"] = types.IntType

	def install(self):
		"""Install the default data if the cache is empty."""

		peter = self.create()
		peter["name"] = "Peter Goodman"

		peter["age"] = 19
		peter.save()

# now lets instantiate the people class and filter out the people!
for person in People():
	print person["name"], "is", person["age"], "years old"

# output:
# Peter Goodman is 19 years old

As you can see it's very straightforward. One thing that might throw you off is how fields work. Above, we set fields by doing self["field"] = and then we tell it the type it expects by telling it what it equals. Behind the scenes this actually sets self["field"] to be a new Definition.Field() object. To use the Definition.Field.default() function one would do: self["field"].default(val).

Example - Relations

There are several ways that definition relations be accomplished. The first was is by using the built in functions that were described under Definition.Field in the above API (hasOne, ownsOne, hasMany, ownsMany). Another way to describe definition relations would be to make custom filtering functions in the Definiton.Record class (or if in the above case, define a People.Record class which extends Definition.Record) and use those.

It is important to get one point across before describing the relations system: all rows in all data sets have implied primary keys. This means that all rows will automatically have a primary key set to them. There are two ways to access the value of the primary key: record.id() and record["pk"]. By design, the relations system will match a field in the current definition to the primary key field in another definition. However, this behavior can be easily changed.

#!/usr/local/bin/python

import PyrofORM
import types

class People(PyrofORM.Definition):

	def setUp(self):
		"""Set up the fields that this definition recognizes
		and the types that it expects their data to be in."""

		self["name"] = types.StringType
		self["age"] = types.IntType
		self["job_id"] = types.IntType

		self["job_id"].hasOne(Jobs, alias="job")

	def install(self):
		"""Install the default data if the cache is empty."""

		jobs = Jobs()
		job = jobs.create()
		job["name"] = "Programmer"
		job.save()
		jobs.store() # __del__ is finicky so we will make sure the job is stored

		peter = self.create()
		peter["name"] = "Peter Goodman"
		peter["age"] = 19
		peter["job_id"] = job.id()
		peter.save()

class Jobs(PyrofORM.Definition):
	def setUp(self):
		self["name"] = types.StringType


# now lets instantiate the people class and filter out the people!
for person in People():
	print person["name"], "is a", person.job()["name"]

# output:
# Peter Goodman is a Programmer

In the above example I have added a job_id field onto the People definition and also gave it a relation. The relation maps People.job_id to Jobs.pk (primary key). However, assuming we wanted to change the default behavior of mapping one field to the primary key of another definition then we could use one of the following:

# related self.key to ForeignDefinition.pk
self["key"].hasOne(ForeignDefinition, alias="foreign")

# implied other primary key, equivalent to the above
self["key":].hasOne(ForeignDefinition, alias="foreign")

# map self.key to ForeignDefinition.foreign_key
self["key":"foreign_key"].hasOne(ForeignDefinition, alias="foreign")

# implied self primary key, this table. Maps self.pk to ForeignDefinition.foreign_key
self[:"foreign_key"].hasOne(ForeignDefinition, alias="foreign")

Another thing to keep in mind is that named filters can be passed as a third parameter to the has/ownsOne/Many functions.

Conclusions

With nifty slicing functions, lambdas, and method chaining, Python has allowed me to make a pretty spiffy system with some neat syntax. In my next post I will show the code implementation of nested sets for a category system and release the code. Until then, tell me what you think of the API to the ORM.

My First Python Program - Part 2

This is a continuation of yesterday's post.

So, I said that I would show some of the code that went into the system along with explanations of how I solved some problems. To start things off, I will explain how data is filtered, possibly the most import functionality in the system. Lets dive in.

Filtering Data

@classmethod
def filter(cls, **filter_fns):
	"""Filter data from the storage.

	x.filer() -> Definition.RecordSet
	x.filter(col=lambda v: v is 6) <==> x.filter(col=6)

	Filter starts by getting all available primary keys in the data set using
	D.__getPrimaryKeys(). If there are any filters in **filter_fns, it will go
	through those, make them into lambdas and apply them to each row in the
	data set corresponding with a valid primary key."""

	pks = cls.__getPrimaryKeys()

	if len(filter_fns) > 0:
		# make sure we're not trying to filter for fields that aren't in this
		# data definition
		for field in filter_fns:
			if field not in cls.fields:
				raise NameError, "Non existant field [%s] cannot be filtered." % field	

			# if supplied a value and not a lambda then create a filter lambda
			if type(filter_fns[field]) is not types.LambdaType:
				v = filter_fns[field]
				filter_fns[field] = (lambda val: val == v)

		# function that will do the filtering
		def filter_row(row_id):
			"""Filter the list items in a data set row against the filter_fns.

			This function takes a list of lambda functions, pops one off of
			the stack, tests it, if true, it continues, else it breaks out of
			the loop."""

			for field in filter_fns:
				index = cls.fields[field].index
				if not filter_fns[field](cls.data[row_id][index]):
					return False
			return True

		# filter, filter_fields needs to be copied for each row found because
		# the columns to filter are popped off of it. It lets the function always
		# use the same lambda filters
		pks = [i for i in pks if filter_row(i)]

	return c

The code starts off by getting a list of the primary keys for every row with the exception of deleted rows (pks = cls.__getPrimaryKeys()). This ends up being a list of primary keys (integers) which will eventually be filtered through or passed straight to the Definition.RecordSet class. The functions to filter each row are passed in using the cool ** notation, which gives you the named arguments passed to a function. This lets one use the function as: Definition().filter(col=1) to get all rows in the Definition data set whose column 'col' is equal to one. The function will return all filtered primary keys by using another list comprehension on the primary keys of all non-deleted rows. The final list of primary keys is then passed to the Definition.RecordSet class.

Iterating over Record Sets

def __iter__(self):
	"""Generator that reads the data into a dict from storage.

	for row in RS(): ..."""
	self.allow_filter = False
	for i in range(self.__len__()):
		yield self.definition.read(self.pks[i])

def __getslice__(self, i=None, j=None):
	"""Limit (slice) the data set.

	!!! This cannot be called after any data has been derived from the
	recordset using RS.next() or by looping over the record set.

	for row in RS()[:1]: This limits the record set to the first row."""
	if not self.allow_filter:
		raise RecordSetModifiedError, "Can't slice record set after data has been read."

	self.pks = self.pks[i:j]
	return self

def __getitem__(self, n):
	"""Return the nth record from the record set.

	RS.__getitem__(0) <==> RS[0] -> Definition.Record"""

	if type(n) is not types.IntType:	
		raise TypeError, "Definition.RecordSet.__getitem__ expects an integer \
							, %s given" % str(type(n))
	elif n not in range(self.__len__()):
		raise RecordDoesNotExistError, "Row [%r] does not exist in this record set." % n
	return self.definition.read(self.pks[n])

The RecordSet class mimics the list class (but doesn't extend it) by supporting iteration, slicing, and getting individual rows using numeric indices. I mentioned before that the record set uses Python's generators and you can see that in the __iter__ function with the yield keyword. That means that __iter__ is essentially a restartable function. Each time it is called, it will start from where it left off. In this case, each time it is called, it will advance the for loop that it is in. This is an important optimization because it means that not ever row is read (and thus translated) and makes it so that if the programmer wants to limit or slice the result set (equivalent of SQL's LIMIT start, offset) then that operation is applied to the primary keys.

Sorting

def sort(self, **filters):
	"""Sort the rows given filters."""
	if not self.allow_filter:
		raise RecordSetModifiedError, "Can't sort record set after data has been read."				
	
	# validate the sorting filters
	for field in filters:
		if field not in self.definition.fields:
			raise NameError, "Cannot sort recordset by non-existant field [%s]." % field
		
		if type(filters[field]) is types.IntType:
			d = filters[field] >= 0 and 1 or -1 # asc or desc
			filters[field] = lambda a,b: d*cmp(a,b)
		
		if type(filters[field]) is not types.LambdaType:
			raise Exception, "Expected lambda or integer in Definition.RecordSet.sort() \
								[%s] given." % str(type(filters[field]))
	
	def sort_row(field, cmp_filter, a, b):
		index = self.definition.fields[field].index
		return cmp_filter(self.definition.data[a][index], self.definition.data[b][index])
	
	# if the filters are okay, do the sort. The sort needs to happen for each
	# filter and not for each row, eg: we can't look up a row and then apply all
	# sort filters to it because that would be unpredictable
	for field in filters:
		self.pks.sort(lambda a,b: sort_row(field, filters[field], a, b))
	
	return self

SQL also has an ORDER BY clause which is tremendously useful and this function attempts to mimic its functionality—if not surpass it. This function is very similar to Definition.filter() in that it can take either a value or a lambda. One thing to note is that Definition.RecordSet.sort() expects either integers or lambdas as the values to the named arguments. If one wanted to sort a column in ascending or descending order, they would simply pass a 1 or a -1. However, if more a more complicated sorting scheme is required, a lambda can just as easily be passed.

Another difference between filter and sort is the way that the filter functions are applied. With filter, the functions are applied to the appropriate datum in the row, and if any one of the filters returns False then that row is excluded. With sort, the filters are applied one by one meaning that the rows will progressively by sorted further and further until all sorting functions have been applied. This is done because of the nature of list.sort(). It expects a function with a return value of -1, 0, or 1. This means that if we were to apply more than one sorting filter to a single row and each filter returned a different value then there would be no way to accurately determine if one row should go before or after another.

Instance Persistance

Hehe, I just had to put that as a header! What I mean is that all instances of the same definition classes should share the same data and a few other configuration variables. In order to get this to work with instance methods (as opposed to class methods) I used the following hack..

def __getattr__(self, attr):
	"""Get an attribute.

	D.test <==> D.__getattr__('test')
	!!! If attr is in the D.global_attrs tuple then:
	D.moo <==> D.__class__.moo"""		
	if attr in self.class_attrs:
		return getattr(self.__class__, attr)
	else:
		return object.__getattr__(self, attr)

def __setattr__(self, attr, val):
	"""Set an attribute.
	
	D.moo <==> D.__setattr__('moo', val)"""

	if attr in self.class_attrs:
		setattr(self.__class__, attr, val)
	else:
		object.__setattr__(self, attr, val)

If there is anything I don't like about the program I have made then these two functions are at the top of the list. By the way, "class_attrs" is a tuple with a list of variable names that should be common between all instances of a particular class.

This isn't usually an issue when using the @classmethod decorator; however, there is value in using instance methods (such as __init__, __getitem__, and __setitem__) and to make those work as expected with the class methods, this hack was necessary.

If you're coming from PHP and don't quite understand what this is doing, think of it this way: we have a top class with several static variables and a few classes that extend that top class. Using this, all instances of a particular class (be it the top class or a child class) will share the values of those static variables, but no two instances of different classes will share those values. If you can visualize this then you should immediately realize the usefulness of this.

Conclusions

I love Python. Yes, seriously. Otherwise, I think there was a lot of value in doing this exercise. I use a relational database every day and normally don't think about how it stores or retrieves rows. In this sense, I haven't actually learnt anything about a lot of the insanely cool algorithms rdbms apply to do what they do but it has made me think differently about how I handle data. When I use—for example: MySQL—through the functions PHP supplies me I am normally constrained to working with all rows at once. In this exercise, I constrained myself to a similar view of data as with databases: tables, columns, relations. Beyond that it was up to me to implement how everything worked under the hood and I found interesting solutions (such as working with primary keys only) that would not had occurred to me otherwise.

Next Time on I/O Reader

I will implement the modified preorder tree traversal algorithm—also known as nested sets— with my simple relational storage and release all of the code. If you want to get a sneak peak of what the functionality of the system will be then take a look at this unfinished PHP MPTT class I wrote.

My First Python Program

For my first Python program I got the idea of making an ORM. I haven't learnt how to use most libraries yet and so it would need to be an ORM to Python's built-in data structures. To make it persistent, I used the cPickle library to serialize and load data to/from a file. Right now you might be thinking: "What problem does this solve?" or "It would never be able to handle concurrent data saving predictably." Well, it solves no existing problem and you're right, I wouldn't want more than one person modifying the data at the same time! This was an exercise in learning how to program with Python.

Off the bat I wanted to be able to store and access the data as if it were a SQL table. This meant that Python's dictionary object (similar to PHP's associative arrays) would be involved. I decided early on that storing individual dictionaries for each row of data in the data sets would not only be overkill but that it wouldn't be flexible. Just imagine trying to change a column name if the column names are hard-coded into each row! That leaves me with either a tuple (an unmutable list) or a list (similar to PHP's numeric arrays). Tuples are faster, but would have been a pita to work with because I need to deal with setting default values for columns with no values specified, and I also need to be able to set the values of each row without necessarily knowing the order but knowing where the values fit in. Lists it was.

Another interesting thing I did which has turned out to be instrumental in getting everything working in a nice way is the idea of primary keys. Each row in the data set has a primary key, which just happens to also be the value of the index of the row in the list. Reading this, you might realize that if this is the case, then I can't actually delete any rows from the data set or else the primary keys stored in the individual rows of data won't correspond to the list indices. Well, you're right. I solved this problem by simply setting the row to None.

Filtering data from the data sets was an interesting challenge, but Python ended up making the task easy. My requirement was that the programmer could simply pass the argument name and either the value it should equal or a lambda to return true/false given a value. Remember, rows in the data set are not dictionaries, and so given the column names, the filter function needed to be able to figure out which data corresponded to the proper columns. This was generally easy.

Remember how I said that each row has a primary key and that key is also the list index in the data set? Well, this makes some things wildly easy. For example, if no arguments are passed to the filter function, then presumably we should return all rows in the data set--excluding deleted rows. Using Python's list comprehension, this task is made easy:

cls.pks = [i for i in range(len(cls.data)) if cls.data[i] is not None]

Notice how this comprehension doesn't actually return any data (if it did, instead of [i for ..., it would be: [cls.data[i] for ...). This is the best idea behind my whole system: data isn't fetched unless it is explicitly asked for. You might wonder why and the the main reason is because I want to minimize how many times the data needs to be copied. As I said, the data is stored in lists but is represented as dictionaries. That means that at some point there is a translation from a list to a dictionary (a copy) that needs to take place and by using Python's generators I am able to do this only when the data is asked for.

This has a huge implication: the majority of the time the data itself is not passed around, only the primary keys to that data. This means that to limit or slice the data set, all one is actually doing is slicing the list of primary keys! So, instead of having a slice return a (posssibly) large list of dictionaries its returning a large list of integers.

I feel like I've written quite a bit, so the next article will probably go more into the code side of the ORM, how I solved the problems that came up, and examples of syntax for all I/O operations.