SQL Performance of Join and Where Exists
Sometimes we need to identify record sets with at-least-one relationships. Find customers who’ve placed an order, products categorized as books, or cities that have airports. In this post we’ll compare the performance and execution paths of inner join to where exists in PostgreSQL 9.5.
Let’s settle with finding all products that have been ordered.
Create and populate two tables, products and orders, with 1,000,000 rows each of random data.
CREATE TABLE p ( p_id serial PRIMARY KEY, p_name character varying UNIQUE ); INSERT INTO p (p_name) SELECT substr(gen_salt('md5'), 4) FROM generate_series(1, 1000000);
| p_id | p_name | -------------------------- | 1 | 5aHN4w0f | | 2 | 29LKwrBw | | 3 | 9cIR4iXE | | 4 | 5P9aTUQN | | ... | ... | | 1000000 | 6cpGNL18 |
Pretty random looking. Now for the orders table.
CREATE TABLE o ( o_id serial PRIMARY KEY, p_id integer REFERENCS p (p_id) ); INSERT INTO o (p_id) SELECT rnd_val FROM (SELECT trunc(random() * 249999 + 1)::int AS rnd_val FROM generate_series(1, 1000000)) as gen;
| o_id | p_id | ------------------------ | 1 | 1 | | 2 | 525 | | 3 | 759 | | 4 | 1 | | ... | ... | | 1000000 | 225001 |
Of course in production these tables would have more columns, but we’re focused on their intersection and execution paths.
I also set up a btree index on
orders p_id foreign key to avoid costly Sequential Scans.
In our randomly generated data, the orders table has 1,000,000 foreign keys matching 245,538 unique products. Duplicate foreign keys show products ordered multiple times.
Let’s start with the familiar inner join.
SELECT p_name FROM p JOIN o USING(p_id);
EXPLAIN ANALYZE shows the following execution path.
Merge Join (rows=1000000) Merge Cond: (p.p_id = o.p_id) -> Index Scan using p_pkey on p (rows=1000000) -> Index Only Scan using p_id on o (rows=1000000) Heap Fetches: 1000000 Planning time: 0.305 ms Execution time: 1323.388 ms
Good so far. Next up.
SELECT p_name FROM p WHERE EXISTS ( SELECT 1 FROM o WHERE p.p_id = o.p_id );
Our execution path looks like.
Merge Semi Join (rows=192999) Merge Cond: (p.p_id = o.p_id) -> Index Scan using p_pkey on p (rows=1000000) -> Index Only Scan using p_id on o (rows=1000000) Heap Fetches: 1000000 Planning time: 0.297 ms Execution time: 1224.178 ms
Very similar. Where Exists performs a Semi Join that returns each parent table row on first match in the subquery. The parent and sub-query’s ids are compared in the WHERE clause with a boolean short circuit early exit.
Comparison of 5 runs in ms.
Where exists is 9% faster. Even though it uses a correlated sub-query, it’s short circuit algorithm out-performs joins.
But this is a best case scenario for inner joins. Do you see a potential problem with the query?
With the inner join, any record with more than one foreign key in orders referring to a primary key in products creates a undesired duplicate in the result set.
We’ll try both DISTINCT and GROUP BY to de-dupe the result set.
SELECT DISTINCT p_name FROM p JOIN o USING(p_id);
Execution path is
Unique (rows=1000000) -> Sort (rows=1000000) Sort Key: p.p_category Sort Method: external merge Disk: 41944kB -> Merge Join (rows=1000000) Merge Cond: (p.p_id = o.p_id) -> Index Scan using p_pkey on p (rows=1000000) -> Index Only Scan using p_id on o (rows=1000000) Heap Fetches: 1000000 Planning time: 0.318 ms Execution time: 29189.051 ms
Ouch. The external merge socked our performance. How about GROUP BY?
SELECT p_name FROM p JOIN p USING(p_id) GROUP BY p_name
In modern Postgres it won’t help us at all, having the same execution path.
Comparison of 5 runs of Join with Distinct to Where Exists.
|Join (Distinct)||Where Exists|
In the likely case of wanting de-duped records, Where Exists performs 20.5 times faster than inner joins. Beyond performance, I see Where Exists as more readable, having only a subset of join functionality. I hope this post was useful. Ran Postgres on 2.7GHz Intel i5, 8GB DDR3 iMac.