RSS Feed

Peter Goodman's blog about PHP, Parsing Theory, C++, Functional Programming, Applications,

Selecting from more than one database table: How to deal with column name conflicts in PINQ

Obligatory Plug

The short answer is that you don't need to (in PINQ, using PQL). You can if you want to but PINQ will automatically do it for you.

The Problem

Alright, that's enough plugging of PINQ. I'm very excited to say that I just solved this problem and I'm going to explain how I did it. But first, I will set up an example if you're not quite clear of the problem I'm talking about. Consider the following two simplified SQL table defininitions:

CREATE TABLE posts (
    id      INT,
    user_id INT,
    name    VARCHAR,
    body    TEXT
)

CREATE TABLE users (
    id      INT,
    name    VARCHAR
)

If I do a database query and select from both the posts and users tables at the same time (linking the two, of course), then their 'id' and 'name' columns will conflict. PHP's mysql_fetch_assoc will return roughly the following array:

array(
    [id] => 1,
    [user_id] => 1,
    [name] => 'user name',
    [body] => 'the body text of the post'
)

Those are terrible results! We've ended up losing lots of valuable post information because user information has overwritten it. We could solve the problem by returning a numerically indexed array, but then we would lose the convenience of indexing into the result by column names (or aliases for that matter). Alright, so how can we solve this problem?

Solving the Problem

The Normal Way

The obvious and normal way of doing this is manually aliasing the conflicting columns of one of the tables in the SELECT statement. This is a fine solution and isn't too bad. But, it takes the convenience out of doing a simple SELECT posts.*, users.* FROM ... which is honestly, super convenient.

How PINQ does it, and how it benefits you

The PINQ solution is to actually do roughly the same thing. First, PINQ figures out what the conflicting columns are going to be. This can be done with a series of calls to php's array_intersect_key and array_merge, that is, assuming we know all of the columns for a given table (which we assume and which we do in the models). The conflicts array ends up being this:

array(
    [id] => 1,
    [name] => 1,
)

We start by expanding any SELECT *'s into all the columns that need to be selected from and how they are going to be aliased. We then go over all of the column name to alias mappings and check if any of the aliases exist in the conflicts array using a simple isset(). If they do, we prefix the aliases with the model name. Here's what we end up with:

array(
    [posts_id] => 5,
    [user_id] => 1,
    [posts_name] => 'my first post!!!1!',
    [body] => 'the body text of the post',
    [users_id] => 1,
    [users_name] => 'user name',
)

That's all good and fine, but it's not a complete solution. If we were to stop here then we would need to know when columns conflict and index them with their new prefixes. That is undesirable behavior. So lets look at the result and see what we notice. Immediately we can see that the results come out in the order that we selected them (remember posts.* then users.*). If we assume that when we select the columns that we group them by their tables then we can count on there being this nice order to the result array (a fair assumption using PQL in PINQ). We've observerd that there is this order to the results, but that still doesn't get us any closer to getting rid of those prefixes we added to those columns.

What if we injected some dummy columns into the query to tell us where the columns for one table end and another begin? Further, we need to know when NOT to do this whole separating process. For example, if in PINQ we use straight SQL in a query then none of our assumptions will hold water and so we might end up mangling the results in an unfortunate way. Remember, the goal is to lose no data!

The solution is clear: we need sentinel columns in there to tell us:

  • When to perform the splitting up and processing of a query
  • Where to split the row data into different models
  • How to identify the prefixes on the unambiguous columns so that we can remove them
We can do step 1 with one sentinel, and steps 2 and 3 at the same time by including the model name (which is the prefix on ambiguous columns) in the sentinel to show where a table begins/ends. Here is our desired result array:

array(
    [__pql__] = 1,
    [__posts] = 1,
    [posts_id] => 5,
    [user_id] => 1,
    [posts_name] => 'my first post!!!1!',
    [body] => 'the body text of the post',
    [__users] = 1
    [users_id] => 1,
    [users_name] => 'user name',
)

Hey, we can actually do that in SQL in a very simple way.

SELECT 1 AS __pql__, 1 AS __posts, posts.*, 1 AS __users, users.* 
FROM posts, users 
WHERE posts.user_id=users.id

That is in essence what is done in PQL (except that the posts.* and users.* are expanded to be unambiguous and prefixed). Now that we have the desired row data, all that is left is to separate it into the different contained models and remove any unwanted prefixes.

What are the benefits and Why is this interesting?

This seems like a feat of over-engineering. We've solved a minor problem with complicated solution that requires extra processing of records from the database. What are the advantages? First, lets look at how the traditional ORM works. We would first get a row from the posts model, and then call a relationship on it, such as $post->users. Calling that relationship on each record means one extra query per record. One of the goals of PQL is to shift the work of calling these relationships on objects (and thus doing queries) to constructing the relationships in the database queries then sorting out the data when its needed (note: my intention is to also support the traditional ORM style of calling on relationships). Still, the question "why?" remains. I was prompted to this idea by seeing these relationships being called in large loops. The number of database queries piles up like crazy and then things start slowing down.

An Example of PQL

So, here is an example of a pql query that does all of the above (from within a controller method):

$db = $this->import('db.my_db');
$posts = $db->findAll(
    from('posts')->select(ALL)->
    from('users)->select(ALL)->link('posts', 'users')
);

foreach($posts as $post) {
    echo $post['id']; // this is the ambiguous id which resolves to the user id
    echo $post->users['id']; // this is the user id from within the columns
    echo $post->posts['id']; // this is the post id
}

As you can see here, my earlier assumption that the select columns would be in order comes true with PQL. The reason is because PQL removes ambiguity from queries by having the programmer select columns from a specific model.

EDIT: I have already come up with a simpler method that I didn't recognize earlier. Instead of calculating the conflicts I prefix every column and then chop off the prefixes later. Regardless, most of the above article still holds and I feel as though others might find the process interesting.


Comments


Comment