?

Log in

No account? Create an account

Postgresql grrrrrr

« previous entry | next entry »
Apr. 6th, 2006 | 02:38 pm
mood: bitchybitchy
music: Auf der Maur

Postgresql is a database system. While it has many good points, it has one exceedingly bad point - the query planner is hard to control. For some unknown reason, the developers has refused to provide explicit control over which index is used to do a query. Instead, they always advise to adjust the query planner's analysis statistics settings until it starts choosing the index you want. It's like saying you have to twiddle a bunch of knobs and levers until the light goes on, instead of putting a switch right next to the light itself.

This is stupid. No query planner can possibly handle every situation that comes up in every database in the world. And indeed, one has come up at my work for which the postgresql query planner is inadequate.

The situation is this: There are two columns, let's call them c1 and c2. In a table with 9,000,000 rows, there are around 150,000 unique values in each of these columns. One might make a naive guess that there are 9,000,000 / 150,000 = 60 rows containing each possible value. Due to other known statistics, the query planner derives a value of around 33 rather than 60. But this is not the value I'm complaining about. In fact, this estimate leads to good query planning for a single column lookup.

Let's say a lookup for value X in column c1 would return 60 rows. And a lookup for value Y in column c2 would return 60 rows. How many rows would a lookup which specifies values for BOTH columns return? The odds of a hit in each column are 60 / 9,000,000. The odds of getting a hit in both columns would then be 60^2 / 9,000,000^2, right? That gives us an estimate of .00004 rows returned for each query.

The problem is, the data in these columns is correlated! But there is no way to inform the query planner of this. So it goes ahead and assumes that it'll be lucky to find even one row with the specified values for both columns, whereas in reality it is likely to find anywhere between 0 and 500.

So, why is it a problem that the query planner gets the number of rows we will find wildly wrong? Because, based on its guess, it decides that fetching all rows matching c1 and c2 is equivalent to fetching all rows matching c1, c2 and any other columns! If specifying c1 and c2 did indeed reduce it to one row only, then this would be true. Unfortunately, it is not, and specifying other columns DOES increase the selectivity of the query. As a result, the query planner chooses an index only in c1 and c2, rather than including the other specified columns.

If only they had a FORCE INDEX option, like MySQL, there would not be any problems. I would just say "I want to use this index, not that OTHER index that your crazy query planner thinks will work better." And everyone would be happy. Instead, I am going to have to go and manually edit the statistics for the table to con the query planner into using the index I want. Which may screw things up elsewhere. But I don't see what choice I have.

Update

I have kludged my way around the problem. I changed the index on (c1, c2) to be an index on (c1 || '/' || c2). Due to the content of these columns, this result is unique. "||" stands for concatenation.

As a result, the query "optimizer" is unable to use this index, and instead chooses the correct index.

This could have been fixed so easily is Postgresql supported a "USE/FORCE/IGNORE INDEX" style optimizer hint.

Update 2

Here is a response from a developer to me asking of optimizer hints (such as "use this index") would be implemented:

Brian Herlihy <btherl@yahoo.com.au> writes:
> Before I go, I have a question - From discussions on the Postgresql irc
> channel, and from reading the TODO list on the website, I am under the
> impression that there are no plans to allow optimizer hints, such as "use index
> table_pkey". Is this really true?

I personally don't think it's a good idea: the time spent in designing,
implementing, and maintaining a usable hint system would be significant,
and IMHO the effort is better spent on *fixing* the optimizer problems
than working around them. There are also good arguments about how hints
wouldn't be future-proof --- for instance, the recent addition of bitmap
indexscan capability would've obsoleted an awful lot of hints, had
anyone had any on their queries. We'd then be faced with either turning
off the hints or being forced by them to adopt inferior plans.

The direction I'd like to see us go to solve your problem is maintaining
cross-column statistics. It's not practical to store joint stats for
every set of columns, but the existence of an index on (p2,p3) ought to
cue us that p2/p3 stats would be worth having.

regards, tom lane

Link | Leave a comment | Share

Comments {0}