Given a directed acyclic graph shaped like this:
branch-a-head-id branch-b-head-id | | id-Z id-C | | id-Y id-B | | id-X id-A | | branch-a-first-id branch-b-first-id | | \ / \ / \ / common-ancestor-id | id-2 | id-1 | initial-id
And modeled in a PostgreSQL table structure like:
select * from graphtable; id | parent -------------------+-------------- initial-id | (NULL) id-1 | initial-id id-2 | id-1 common-ancestor-id | id-2 branch-a-first-id | common-ancestor-id id-X | branch-a-first-id id-Y | id-X id-Z | id-Y branch-a-head-id | id-Z branch-b-first-id | common-ancestor-id id-A | branch-b-first-id id-B | id-A id-C | id-B branch-b-head-id | id-C
I need a query to find the common ancestor. Here is a query that works but is inefficient:
WITH RECURSIVE linked_list AS(
SELECT id, parent FROM graphtable WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
UNION ALL
SELECT g.id, g.parent FROM graphtable g INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent FROM linked_list GROUP BY parent HAVING COUNT(DISTINCT id) > 1;
Which for the above structure returns:
ids | parent -------------------------------------+-------------------- branch-a-first-id,branch-b-first-id | common-ancestor-id
This is inefficient because it always runs through all rows until it finds one with a NULL parent. That means that if you have a structure where the A and B branches are both small (100 rows each) but the common portion is large (millions of rows) it will iterate over millions of rows instead of hundreds.
Here’s a benchmark with 100 nodes in A and B branches each and variable number of common nodes:
# nodes | Query time (seconds) |
---|---|
16 | 0.011 |
32 | 0.008 |
64 | 0.011 |
128 | 0.014 |
256 | 0.026 |
512 | 0.058 |
1024 | 0.146 |
2048 | 0.44 |
4096 | 0.028 |
8192 | 0.049 |
16384 | 0.104 |
32768 | 0.209 |
65536 | 0.423 |
131072 | 0.859 |
262144 | 1.781 |
524288 | 3.914 |
The question: how to make this query efficient? Holding to 100 nodes per branch and only varying number of common nodes, this appears to be O(N). Hopefully O(1) is possible.
I’m specifically interested in Postgres. Thanks!
Even though PL/pgSQL is quite slow compared to SQL, I still recommend writing a database function. The algorithm is procedural, so that is a natural choice. And that way it is easy to avoid fetching all the ancestors of either argument:
CREATE OR REPLACE FUNCTION first_common_ancestor(a text, b text) RETURNS text
LANGUAGE plpgsql AS
$$DECLARE
iter integer := 1;
a_ancestors text[] := ARRAY[a];
b_ancestors text[] := ARRAY[b];
a_next text;
b_next text;
a_done boolean := FALSE;
b_done boolean := FALSE;
BEGIN
WHILE NOT a_done AND NOT b_done LOOP
/* test if we have found a common ancestor */
IF ARRAY[a_ancestors[iter]] && b_ancestors THEN
RETURN a_ancestors[iter];
END IF;
IF ARRAY[b_ancestors[iter]] && a_ancestors THEN
RETURN b_ancestors[iter];
END IF;
/* get the next ancestors */
SELECT parent INTO a_next
FROM graphtable
WHERE id = a_ancestors[iter];
IF FOUND THEN
a_ancestors := a_ancestors || a_next;
ELSE
a_done := TRUE;
END IF;
SELECT parent INTO b_next
FROM graphtable
WHERE id = b_ancestors[iter];
IF FOUND THEN
b_ancestors := b_ancestors || b_next;
ELSE
b_done := TRUE;
END IF;
iter := iter + 1;
END LOOP;
/* no common ancestor */
RETURN NULL;
END;$$;
Show an
EXPLAIN (ANALYZE, BUFFERS)
for the query when run with enough nodes that performance is a problem.Try reducing the search space. Any node that has a single child node cannot be the nearest common ancestor to any other pair of nodes. Also, you didn’t specify whether or not a node could have more than one parent. On a stylistic note, although trees in nature grow from the bottom up, the general convention in computer science is to branch downward. There is no O(1) solution to this search problem. What do you mean by “100 nodes per branch and only varying number of common nodes?” Please update the post with the table DDL and code to populate the tables to reproduce your tests cases.
@JohnH I have two branches in the diagram at the top, one with the head node
branch-a-head-id
the other with headbranch-b-head-id
. So “A” and “B”, respectively. Sorry that wasn’t clearer. I’m saying imagine a larger graph with 100 nodes in each of those independent branches, and then a “common” portion of the graph that both A and B belong to (i.e. nodes id-1 and id-2) which I’ve varied from 16 to 500k in my table. Concur with the recommendation of reducing search space. I’m working on a solution with tracking depth and doing something likeWHERE depth < 10000
.I would write a database function for that.
Recursion performance suffers as depth increases. That is just how it is. Have you considered using ltree so that each row contains its path to the root? postgresql.org/docs/current/ltree.html The ltree extension includes an
lca(<ltree>, <ltree>)
function that makes your query a no-brainer.Show 1 more comment