Efficiently find common ancestor in postgresql directed acyclic graph

Given a directed acyclic graph shaped like this:

branch-a-head-id        branch-b-head-id
       |                        |
     id-Z                     id-C
       |                        |
     id-Y                     id-B
       |                        |
     id-X                     id-A
       |                        |
branch-a-first-id       branch-b-first-id
       |                        |
        \                      /
         \                    /
          \                  /
           common-ancestor-id
                   |
                 id-2
                   |
                 id-1
                   |
               initial-id

And modeled in a PostgreSQL table structure like:

select * from graphtable;
   id              |     parent
-------------------+--------------
initial-id         | (NULL)
id-1               | initial-id
id-2               | id-1
common-ancestor-id | id-2
branch-a-first-id  | common-ancestor-id
id-X               | branch-a-first-id
id-Y               | id-X
id-Z               | id-Y
branch-a-head-id   | id-Z
branch-b-first-id  | common-ancestor-id
id-A               | branch-b-first-id
id-B               | id-A
id-C               | id-B
branch-b-head-id   | id-C

I need a query to find the common ancestor. Here is a query that works but is inefficient:

WITH RECURSIVE linked_list AS(
    SELECT id, parent FROM graphtable WHERE id = 'branch-a-head-id' OR id = 'branch-b-head-id'
    UNION ALL
    SELECT g.id, g.parent FROM graphtable g INNER JOIN linked_list ll ON ll.parent = g.id
)
SELECT string_agg(id, ',') AS ids, parent FROM linked_list GROUP BY parent HAVING COUNT(DISTINCT id) > 1;

Which for the above structure returns:

                 ids                 |       parent       
-------------------------------------+--------------------
 branch-a-first-id,branch-b-first-id | common-ancestor-id

This is inefficient because it always runs through all rows until it finds one with a NULL parent. That means that if you have a structure where the A and B branches are both small (100 rows each) but the common portion is large (millions of rows) it will iterate over millions of rows instead of hundreds.

Here’s a benchmark with 100 nodes in A and B branches each and variable number of common nodes:

# nodes Query time (seconds)
16 0.011
32 0.008
64 0.011
128 0.014
256 0.026
512 0.058
1024 0.146
2048 0.44
4096 0.028
8192 0.049
16384 0.104
32768 0.209
65536 0.423
131072 0.859
262144 1.781
524288 3.914

The question: how to make this query efficient? Holding to 100 nodes per branch and only varying number of common nodes, this appears to be O(N). Hopefully O(1) is possible.

I’m specifically interested in Postgres. Thanks!

  • Show an EXPLAIN (ANALYZE, BUFFERS) for the query when run with enough nodes that performance is a problem.

    – 

  • Try reducing the search space. Any node that has a single child node cannot be the nearest common ancestor to any other pair of nodes. Also, you didn’t specify whether or not a node could have more than one parent. On a stylistic note, although trees in nature grow from the bottom up, the general convention in computer science is to branch downward. There is no O(1) solution to this search problem. What do you mean by “100 nodes per branch and only varying number of common nodes?” Please update the post with the table DDL and code to populate the tables to reproduce your tests cases.

    – 

  • @JohnH I have two branches in the diagram at the top, one with the head node branch-a-head-id the other with head branch-b-head-id. So “A” and “B”, respectively. Sorry that wasn’t clearer. I’m saying imagine a larger graph with 100 nodes in each of those independent branches, and then a “common” portion of the graph that both A and B belong to (i.e. nodes id-1 and id-2) which I’ve varied from 16 to 500k in my table. Concur with the recommendation of reducing search space. I’m working on a solution with tracking depth and doing something like WHERE depth < 10000.

    – 

  • 1

    I would write a database function for that.

    – 

  • Recursion performance suffers as depth increases. That is just how it is. Have you considered using ltree so that each row contains its path to the root? postgresql.org/docs/current/ltree.html The ltree extension includes an lca(<ltree>, <ltree>) function that makes your query a no-brainer.

    – 

Even though PL/pgSQL is quite slow compared to SQL, I still recommend writing a database function. The algorithm is procedural, so that is a natural choice. And that way it is easy to avoid fetching all the ancestors of either argument:

CREATE OR REPLACE FUNCTION first_common_ancestor(a text, b text) RETURNS text
   LANGUAGE plpgsql AS
$$DECLARE
   iter integer := 1;
   a_ancestors text[] := ARRAY[a];
   b_ancestors text[] := ARRAY[b];
   a_next text;
   b_next text;
   a_done boolean := FALSE;
   b_done boolean := FALSE;
BEGIN
   WHILE NOT a_done AND NOT b_done LOOP
      /* test if we have found a common ancestor */
      IF ARRAY[a_ancestors[iter]] && b_ancestors THEN
         RETURN a_ancestors[iter];
      END IF;
      IF ARRAY[b_ancestors[iter]] && a_ancestors THEN
         RETURN b_ancestors[iter];
      END IF;

      /* get the next ancestors */
      SELECT parent INTO a_next
      FROM graphtable
      WHERE id = a_ancestors[iter];

      IF FOUND THEN
         a_ancestors := a_ancestors || a_next;
      ELSE
         a_done := TRUE;
      END IF;

      SELECT parent INTO b_next
      FROM graphtable
      WHERE id = b_ancestors[iter];

      IF FOUND THEN
         b_ancestors := b_ancestors || b_next;
      ELSE
         b_done := TRUE;
      END IF;

      iter := iter + 1;
   END LOOP;

   /* no common ancestor */
   RETURN NULL;
END;$$;

Leave a Comment