
SQL storage: experiment with flattened layouts for directory nodes
Closed, Migrated

Description

The current SQL storage represents directories with two groups of tables: one table for directory nodes (linking them to their directory entries) and three tables for directory entries and their attributes.

The join between the directory table and the directory entry ones is a massive pain, due to the size of the involved tables (see: P766).

We would like to experiment with more "flattened" layouts for the SQL representation of directories, layouts that reduce or eliminate the need for such a join.

Experimentation should benchmark storage size and access speed, for both one-off and batch access.

Event Timeline

zack triaged this task as Normal priority. Sep 15 2020, 12:53 PM
zack created this task.

We considered three possibilities for the schema (assuming that we want to get rid of the three separate tables for dir_entries, rev_entries and file_entries -- otherwise, there are 6 possibilities).

  1. A single flat table representing the directories. It has several caveats: notably, it cannot tell whether the empty directory is present or not (an empty directory has no entries, hence no rows), and we would never be able to add metadata to directories. Its only advantage is that it is the flattest representation possible.

  2. Two tables, one for the directory and one for the entries (the "through" table of the many-to-many relationship), using the directory hash as the foreign key.

  3. Two tables, exactly like 2., but using an integer as the foreign key. It will probably be faster, it takes less space, and it is consistent with how we handle snapshots. The drawbacks are that it is impossible to read the edges without doing a join (which is trivial with option 2) and that using sequential integers makes it hard to shard (but we already have that issue for snapshots and origins, which is why the data model in Cassandra is like 2.).
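
To make the trade-offs between the three options concrete, here is a minimal sketch of the layouts. It is only an illustration under assumptions: it uses Python's built-in `sqlite3` as a stand-in for the actual SQL storage, and every table, column and function name (`flat_directory`, `directory_h`, `directory_i`, `known_flat`, ...) is hypothetical rather than taken from the real schema.

```python
import sqlite3

# Hypothetical schemas for the three options; names are illustrative only.
SCHEMA = """
-- Option 1: a single flat table, one row per (directory, entry).
CREATE TABLE flat_directory (
    dir_id BLOB NOT NULL,
    name   BLOB NOT NULL,
    type   TEXT NOT NULL CHECK (type IN ('dir', 'file', 'rev')),
    target BLOB NOT NULL,
    perms  INTEGER NOT NULL,
    PRIMARY KEY (dir_id, name)
);

-- Option 2: directory table + entry table, keyed by the directory hash.
CREATE TABLE directory_h (id BLOB PRIMARY KEY);
CREATE TABLE directory_entry_h (
    dir_id BLOB NOT NULL REFERENCES directory_h (id),
    name   BLOB NOT NULL,
    type   TEXT NOT NULL,
    target BLOB NOT NULL,
    perms  INTEGER NOT NULL,
    PRIMARY KEY (dir_id, name)
);

-- Option 3: same as option 2, but keyed by a sequential integer.
CREATE TABLE directory_i (
    object_id INTEGER PRIMARY KEY,
    id        BLOB NOT NULL UNIQUE
);
CREATE TABLE directory_entry_i (
    dir_object_id INTEGER NOT NULL REFERENCES directory_i (object_id),
    name   BLOB NOT NULL,
    type   TEXT NOT NULL,
    target BLOB NOT NULL,
    perms  INTEGER NOT NULL,
    PRIMARY KEY (dir_object_id, name)
);
"""

# Made-up 20-byte identifiers for the demo.
EMPTY = b"\x00" * 20
ROOT = b"\x01" * 20
README = b"\x02" * 20


def add_directory(conn, dir_id, entries):
    """Store one directory in all three layouts.

    entries is a list of (name, type, target, perms) tuples.
    """
    conn.executemany(
        "INSERT INTO flat_directory VALUES (?, ?, ?, ?, ?)",
        [(dir_id, *e) for e in entries],
    )
    conn.execute("INSERT INTO directory_h VALUES (?)", (dir_id,))
    conn.executemany(
        "INSERT INTO directory_entry_h VALUES (?, ?, ?, ?, ?)",
        [(dir_id, *e) for e in entries],
    )
    cur = conn.execute("INSERT INTO directory_i (id) VALUES (?)", (dir_id,))
    conn.executemany(
        "INSERT INTO directory_entry_i VALUES (?, ?, ?, ?, ?)",
        [(cur.lastrowid, *e) for e in entries],
    )


def known_flat(conn, dir_id):
    # Option 1: a directory only "exists" if it has at least one entry row.
    row = conn.execute(
        "SELECT 1 FROM flat_directory WHERE dir_id = ? LIMIT 1", (dir_id,)
    ).fetchone()
    return row is not None


def known_h(conn, dir_id):
    # Option 2: the directory table records existence independently of entries.
    row = conn.execute(
        "SELECT 1 FROM directory_h WHERE id = ?", (dir_id,)
    ).fetchone()
    return row is not None


def entries_h(conn, dir_id):
    # Option 2: entries are read by hash from a single table, no join.
    return conn.execute(
        "SELECT name, type, target, perms FROM directory_entry_h "
        "WHERE dir_id = ? ORDER BY name",
        (dir_id,),
    ).fetchall()


def entries_i(conn, dir_id):
    # Option 3: going from a hash to its entries requires a join.
    return conn.execute(
        "SELECT e.name, e.type, e.target, e.perms "
        "FROM directory_entry_i e "
        "JOIN directory_i d ON d.object_id = e.dir_object_id "
        "WHERE d.id = ? ORDER BY e.name",
        (dir_id,),
    ).fetchall()


def build_demo():
    """Create an in-memory database holding an empty and a non-empty directory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    add_directory(conn, EMPTY, [])
    add_directory(conn, ROOT, [
        (b"README", "file", README, 0o100644),
        (b"empty", "dir", EMPTY, 0o040000),
    ])
    return conn
```

With this sketch, `known_flat(conn, EMPTY)` is false even though the empty directory was inserted, which is the caveat of option 1, while `known_h(conn, EMPTY)` is true. `entries_h` and `entries_i` return the same rows, but only the latter needs a join. Benchmarking storage size and access speed would still have to be done on the real database with realistic volumes.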