Trees in SQL with Prime Numbers
This article presents an original way to implement trees in SQL. The technique works well, but I would not advise anyone to use it in production. It is intended as a fun way to do things!
Suppose we want to implement a commenting system. Each comment can
have one reply (or more), which is also a comment. An reply being a
comment, it can have replies as well.
In other words: we have a tree of comments.
A possible approach is to keep a reference of the direct parent. This approach is straightforward. However, traversals are problematic (at least, without recursive CTEs):
- Retrieving all the parents requires as many self-
JOIN
s as there are parents. - We cannot
COUNT()
the parents of a comment in a comfortable manner.
This article presents an arithmetical approach that is reference-free, thus making traversal operations much smoother.
Here are sample data for our nested comments:
id | parents | text
---+----------+----------------------------------------
2 | 1 |
3 | 2 | First comment
5 | 2 | Second comment
7 | 2 | Third comment
11 | 10 | First comment on the second comment
13 | 6 | First comment on the first comment
17 | 6 | Second comment on the first comment
19 | 14 | Comment on third comment
23 | 266 | Comment on comment on third comment
29 | 10 | Second comment on the second comment
31 | 290 | Comment on the second comment on the second comment
37 | 8990 | Comment on comment on the second comment on the second comment
41 | 332630 | Second comment on comment on comment on the second comment on the second comment
43 | 13637830 | Comment on second comment on comment on comment on the second comment on the second comment
47 | 332630 | First comment on comment on comment on the second comment on the second comment
Each node has a pair of two integers: id
and parents
.
The root has the pair (2, 1)
and is empty.
Any child with a pair (id, parents)
must follow two rules:
id
is a prime number, and it is unique.parents
is the productid × parents
of its direct parent.
The critical thing to note is that each parents
value is a product of
a prime number and previous parents
value. The previous parents
value is also
a product of a prime number and a previous parents
value until
the root (2, 1)
is reached.
It is essential because it means that any parents
has a prime decomposition
by which we can retrieve all the parents' id
s.
Here is the representation of the tree we have modelized in our comment
table:
Retrieving the parents of a node
Let's suppose we want comment #47 and its parents.
Comment #47 has a parents
= 332630.
332630 = 2 × 5 × 29 × 31 × 37. That is, its parents are
the nodes with respectively IDs 37, 31, 29, 5 and 2. We can now retrieve the
parents with a simple SQL query:
SELECT * FROM comment WHERE id IN (2, 5, 29, 31, 37);
id | parents | text
---+---------+----------------------------------------------------------------
2 | 1 |
5 | 2 | Second comment
29 | 10 | Second comment on the second comment
31 | 290 | Comment on the second comment on the second comment
37 | 8990 | Comment on comment on the second comment on the second comment
(Typically, the factorization method is implemented in the language from which we call the database.
But, of course, we could have implemented a factorization routine right into the database and done
something like this: SELECT * FROM comment WHERE id IN PrimeFactors(332630);
.)
Counting parents is pretty easy too, and we don't even need to ask the database. We simply count how many prime factors we have. Here, we have 4 factors: 5, 29, 31 and 37 (we don't count 2 because it is the root).
Retrieving the children of a node
Children of a given node with the pair (id, parents)
are all nodes in the form
(_,id × parents)
.
Again, the SQL query to retrieve them is pretty simple:
SELECT * from comment WHERE parents = 5 * 2;
id | parents | text
---+---------+--------------------------------------
11 | 10 | First comment on the second comment
29 | 10 | Second comment on the second comment
Counting:
SELECT count(*) FROM comment WHERE parents = 5 * 2;
count
-------
2
Ensuring Consistency
To insert a new node, we need the parent's ID and its parents
value.
We multiply them so we can set our :parents
.
We also need a new unique and prime ID. Here is a safe manner to insert a new node:
We have a table containing as many prime numbers as needed (here, ~10k).
Then, we can insert a new node in a single transaction like this:
WITH newid AS (
SELECT n
FROM primes
WHERE n > (SELECT max(id) FROM comment)
LIMIT 1;
)
INSERT INTO comment
(id, parents, text) VALUES
((SELECT n FROM newid), :parents, :text);
Assuming we have a primes
table looking like this:
SELECT * FROM primes;
n
----
2
3
5
7
11
13
17
...
99929
99961
99971
99989
99991
(9592 rows)
Last words
One of the greatest strength of an RDBMS is its consistency model. By implementing the method presented
in this article, we lost the ensured consistency of our "references" because we can put any number
in the row parents
and the DB won't check if it corresponds to something. That check would be
granted with a reference-based approach.
It is possible to implement triggers on INSERT
s, DELETE
s, and UPDATE
s to overcome this problem, by
programmatically checking consistency properties. However, this would bring a pretty significant amount
of complexity.
References
Serhiy Morozov, Hossein Saiedian and Hanzhang Wang (January 2014). Reusable Prime Number Labeling Scheme for Hierarchical Data Representation in Relational Database. Journal of Computing and Information Technology. p.3, sect. 2. (DOI: 10.2498/cit.1002310)