I want to explain the "table closure" data structure that I want to use for the storage of the tree of tags used by the civic-tech-taxonomy project.
I am explaining it here with a relational table. For the civic-tech-taxonomy project, the table would be an array of maps and the query would be done with a filter() method call.
Normally, trying to store an arbitrary tree in a relational table leads to an anti-pattern. The uninformed way to set up storage for a tree is to have one or more columns for the node information (eg name) and a foreign key pointing to the "parent" node's row. Say that you have a tree:
A -------> B ------> C --------> D \ \ \ \------> E \ \-----> F --------> GThis would, in the naive way, be stored as:
| pk | parent_pk | name ----------------------- | 1 | NULL | A | 2 | 1 | B | 3 | 2 | C | 4 | 2 | F | 5 | 3 | D | 6 | 3 | E | 7 | 4 | GIf you want to make a tree-oriented query, this is a problem. If you want to ask for all the children of a node, you will have to make one query to get the node's children, a second query to get the node's 2nd generation children, a third query to get the node's third generation children, and so on. There are workarounds, but there is no way to avoid the problem that a fetch of a arbitrarily high tree will take an indeterminable amount of time.
One common workaround is to limit the height of the tree. If the tree has a maximum height of n nodes, one can do a tree-oriented query by creating a query that a union of (n-1) queries, one for each generation. This is lame. Requiring users to change their business logic because you have to set a max height to the tree should not be an option.
A table closure allows one to store the same tree. It takes more space and there is more work to add nodes, but an optimal query is always determinedly findable. Modern databases are not often constrained by size and most of these tree structure are write-seldom (or write-once), read-many.
The table above would be stored as:
| pk | ancestor_pk | distance | name ------------------------------------------------ | 1 | NULL | 1 | A | 2 | 1 | 1 | B | 3 | 2 | 1 | C | 4 | 1 | 2 | C | 5 | 2 | 1 | F | 6 | 1 | 2 | F | 7 | 3 | 1 | D | 8 | 2 | 2 | D | 9 | 1 | 3 | D | 10 | 3 | 1 | E | 11 | 2 | 2 | E | 12 | 1 | 3 | E | 13 | 5 | 1 | F | 14 | 2 | 2 | F | 15 | 1 | 3 | FThe problem of having duplicate node entries for each name (in the table above) is usually solved by having the name column actually be a foreign key to a table of unique nodes.
You can probably see that if you ask for all the children of a node or all of the descendants of a node, that is one only one query. If you want all the descendants of B, ask for all nodes whose ancestor is B. Done. If you want all the ancestors of F, ask for all the nodes with the name F. Done. Yes, adding data is harder. Adding a node with n ancestors involves adding n rows, one for each ancestor. But again, we are often looking at write-once, read many. So, slower adds are less important than slower reads.