February 29, 2016

SQL Performance of Join and Where Exists

Sometimes we need to identify record sets with at-least-one relationships. Find customers who’ve placed an order, products categorized as books, or cities that have airports. In this post we’ll compare the performance and execution paths of inner join to where exists in PostgreSQL 9.5.

Let’s settle with finding all products that have been ordered.

Setup #

Create and populate two tables, products and orders, with 1,000,000 rows each of random data.

CREATE TABLE p
(
    p_id serial PRIMARY KEY,
    p_name character varying UNIQUE
);

INSERT INTO p (p_name)
    SELECT substr(gen_salt('md5'), 4)
    FROM generate_series(1, 1000000);

Output

|  p_id     |  p_name    |
--------------------------
|  1        |  5aHN4w0f  |
|  2        |  29LKwrBw  |
|  3        |  9cIR4iXE  |
|  4        |  5P9aTUQN  |
|  ...      |  ...       |
|  1000000  |  6cpGNL18  |

Pretty random looking. Now for the orders table.

CREATE TABLE o
(
    o_id serial PRIMARY KEY,
    p_id integer REFERENCS p (p_id)
);

INSERT INTO o (p_id)
SELECT rnd_val
FROM (SELECT trunc(random() * 249999 + 1)::int AS rnd_val
        FROM generate_series(1, 1000000)) as gen;

Output

|  o_id     |  p_id    |
------------------------
|  1        |  1       |
|  2        |  525     |
|  3        |  759     |
|  4        |  1       |
|  ...      |  ...     |
|  1000000  |  225001  |

Of course in production these tables would have more columns, but we’re focused on their intersection and execution paths.

I also set up a btree index on orders p_id foreign key to avoid costly Sequential Scans.

In our randomly generated data, the orders table has 1,000,000 foreign keys matching 245,538 unique products. Duplicate foreign keys show products ordered multiple times.

Pretty simple.

Let’s start with the familiar inner join.

SELECT p_name FROM p JOIN o USING(p_id);

EXPLAIN ANALYZE shows the following execution path.

Merge Join  (rows=1000000)
  Merge Cond: (p.p_id = o.p_id)
  ->  Index Scan using p_pkey on p  (rows=1000000)
  ->  Index Only Scan using p_id on o  (rows=1000000)
        Heap Fetches: 1000000
Planning time: 0.305 ms
Execution time: 1323.388 ms

Good so far. Next up.

SELECT p_name
FROM p
WHERE EXISTS (
    SELECT 1 FROM o WHERE p.p_id = o.p_id 
);

Our execution path looks like.

Merge Semi Join  (rows=192999)
  Merge Cond: (p.p_id = o.p_id)
  ->  Index Scan using p_pkey on p  (rows=1000000)
  ->  Index Only Scan using p_id on o  (rows=1000000)
        Heap Fetches: 1000000
Planning time: 0.297 ms
Execution time: 1224.178 ms

Very similar. Where Exists performs a Semi Join that returns each parent table row on first match in the subquery. The parent and sub-query’s ids are compared in the WHERE clause with a boolean short circuit early exit.

Comparison of 5 runs in ms.

Joins	Where Exists
1302.4	1183.0
1300.1	1213.5
1309.6	1191.4
1304.2	1209.4
1294.2	1224.2

Where exists is 9% faster. Even though it uses a correlated sub-query, it’s short circuit algorithm out-performs joins.

But this is a best case scenario for inner joins. Do you see a potential problem with the query?

With the inner join, any record with more than one foreign key in orders referring to a primary key in products creates a undesired duplicate in the result set.

We’ll try both DISTINCT and GROUP BY to de-dupe the result set.

SELECT DISTINCT p_name FROM p JOIN o USING(p_id);

Execution path is

Unique  (rows=1000000)
  ->  Sort  (rows=1000000)
        Sort Key: p.p_category
        Sort Method: external merge  Disk: 41944kB
        ->  Merge Join  (rows=1000000)
              Merge Cond: (p.p_id = o.p_id)
              ->  Index Scan using p_pkey on p  (rows=1000000)
              ->  Index Only Scan using p_id on o  (rows=1000000) 
                    Heap Fetches: 1000000
Planning time: 0.318 ms
Execution time: 29189.051 ms

Ouch. The external merge socked our performance. How about GROUP BY?

SELECT p_name 
FROM p JOIN p USING(p_id)
GROUP BY p_name

In modern Postgres it won’t help us at all, having the same execution path.

Comparison of 5 runs of Join with Distinct to Where Exists.

Join (Distinct)	Where Exists
24489.2	1183.0
24189.1	1213.5
23752.2	1191.4
24539.5	1209.4
24210.0	1224.2

In the likely case of wanting de-duped records, Where Exists performs 20.5 times faster than inner joins. Beyond performance, I see Where Exists as more readable, having only a subset of join functionality. I hope this post was useful. Ran Postgres on 2.7GHz Intel i5, 8GB DDR3 iMac.

403

Kudos

403

Kudos

SQL Performance of Join and Where Exists

Setup #

Now read this

Events, Concurrency and JavaScript