Efficiently find common ancestor in postgresql directed acyclic graph

Question 1

Given a directed acyclic graph shaped like this:

branch-a-head-id        branch-b-head-id
       |                        |
     id-Z                     id-C
       |                        |
     id-Y                     id-B
       |                        |
     id-X                     id-A
       |                        |
branch-a-first-id       branch-b-first-id
       |                        |
        \                      /
         \                    /
          \                  /
           common-ancestor-id
                   |
                 id-2
                   |
                 id-1
                   |
               initial-id

And modeled in a PostgreSQL table structure like:

select * from graphtable;
   id              |     parent
-------------------+--------------
initial-id         | (NULL)
id-1               | initial-id
id-2               | id-1
common-ancestor-id | id-2
branch-a-first-id  | common-ancestor-id
id-X               | branch-a-first-id
id-Y               | id-X
id-Z               | id-Y
branch-a-head-id   | id-Z
branch-b-first-id  | common-ancestor-id
id-A               | branch-b-first-id
id-B               | id-A
id-C               | id-B
branch-b-head-id   | id-C

I need a query to find the common ancestor. Here is a query that works but is inefficient:

WITH RECURSIVE linked_list AS(
    SELECT id, parent FROM graphtable WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
    UNION ALL
    SELECT g.id, g.parent FROM graphtable g INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent FROM linked_list GROUP BY parent HAVING COUNT(DISTINCT id) > 1;

Which for the above structure returns:

                 ids                 |       parent       
-------------------------------------+--------------------
 branch-a-first-id,branch-b-first-id | common-ancestor-id

This is inefficient because it always runs through all rows until it finds one with a NULL parent. That means that if you have a structure where the A and B branches are both small (100 rows each) but the common portion is large (millions of rows) it will iterate over millions of rows instead of hundreds.

Here’s a benchmark with 100 nodes in A and B branches each and variable number of common nodes:

# nodes	Query time (seconds)
16	0.011
32	0.008
64	0.011
128	0.014
256	0.026
512	0.058
1024	0.146
2048	0.44
4096	0.028
8192	0.049
16384	0.104
32768	0.209
65536	0.423
131072	0.859
262144	1.781
524288	3.914

The question: how to make this query efficient? Holding to 100 nodes per branch and only varying number of common nodes, this appears to be O(N). Hopefully O(1) is possible.

I’m specifically interested in Postgres. Thanks!

Question 2

Even though PL/pgSQL is quite slow compared to SQL, I still recommend writing a database function. The algorithm is procedural, so that is a natural choice. And that way it is easy to avoid fetching all the ancestors of either argument:

CREATE OR REPLACE FUNCTION first_common_ancestor(a text, b text) RETURNS text
   LANGUAGE plpgsql AS
$$DECLARE
   iter integer := 1;
   a_ancestors text[] := ARRAY[a];
   b_ancestors text[] := ARRAY[b];
   a_next text;
   b_next text;
   a_done boolean := FALSE;
   b_done boolean := FALSE;
BEGIN
   WHILE NOT a_done AND NOT b_done LOOP
      /* test if we have found a common ancestor */
      IF ARRAY[a_ancestors[iter]] && b_ancestors THEN
         RETURN a_ancestors[iter];
      END IF;
      IF ARRAY[b_ancestors[iter]] && a_ancestors THEN
         RETURN b_ancestors[iter];
      END IF;

      /* get the next ancestors */
      SELECT parent INTO a_next
      FROM graphtable
      WHERE id = a_ancestors[iter];

      IF FOUND THEN
         a_ancestors := a_ancestors || a_next;
      ELSE
         a_done := TRUE;
      END IF;

      SELECT parent INTO b_next
      FROM graphtable
      WHERE id = b_ancestors[iter];

      IF FOUND THEN
         b_ancestors := b_ancestors || b_next;
      ELSE
         b_done := TRUE;
      END IF;

      iter := iter + 1;
   END LOOP;

   /* no common ancestor */
   RETURN NULL;
END;$$;

Leave a Comment Cancel reply