SQL storage: experiment with flattened layouts for directory nodes
Closed, Migrated, Edits Locked

Authored By

	zack
	Sep 15 2020, 12:53 PM

Description

The current SQL storage represents directories with two groups of tables: one table for directory nodes (linking them to directory entries) and three tables for directory entries and their attributes.

The join between the directory table and the directory entry ones is a massive pain, due to the size of the involved tables (see: P766).

We would like to experiment with more "flattened" layouts for the SQL representation of directories that reduce or eliminate the need for such a join.

Experimentation should benchmark storage size and access speed, for both one-off and batch access.
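
As a rough illustration of what such a benchmark could measure, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table `flat_directory`, its columns, the synthetic data and the helper names are all hypothetical and are not taken from the actual storage schema; real measurements would need to run against PostgreSQL with production-sized data.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical flattened layout: one row per (directory, entry) pair.
conn.execute(
    "CREATE TABLE flat_directory ("
    " dir_id INTEGER NOT NULL,"
    " name TEXT NOT NULL,"
    " target INTEGER NOT NULL,"
    " PRIMARY KEY (dir_id, name))"
)
# Synthetic data: 1000 directories with 10 entries each.
rows = [(d, f"entry-{e}", d * 10 + e) for d in range(1000) for e in range(10)]
conn.executemany("INSERT INTO flat_directory VALUES (?, ?, ?)", rows)
conn.commit()


def storage_bytes(conn):
    """Database size in bytes, computed from SQLite's page accounting."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return page_count * page_size


def one_off(conn, dir_ids):
    """One query per directory (one-off access pattern)."""
    return {
        d: conn.execute(
            "SELECT name, target FROM flat_directory WHERE dir_id = ?", (d,)
        ).fetchall()
        for d in dir_ids
    }


def batch(conn, dir_ids):
    """A single query for all requested directories (batch access pattern)."""
    placeholders = ",".join("?" * len(dir_ids))
    result = {d: [] for d in dir_ids}
    query = (
        "SELECT dir_id, name, target FROM flat_directory"
        f" WHERE dir_id IN ({placeholders})"
    )
    for dir_id, name, target in conn.execute(query, dir_ids):
        result[dir_id].append((name, target))
    return result


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


dir_ids = list(range(0, 1000, 10))
one_off_result, one_off_seconds = timed(one_off, conn, dir_ids)
batch_result, batch_seconds = timed(batch, conn, dir_ids)
print(f"size={storage_bytes(conn)}B one-off={one_off_seconds:.6f}s batch={batch_seconds:.6f}s")
```

Running the same harness once per candidate layout would give comparable size and timing figures, although a single timed call is noisy and a real benchmark should repeat each measurement.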

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2600 SQL storage: experiment with flattened layouts for directory nodes
Migrated	gitlab-migration	T2601 create a scratch/temporary postgres DB to experiment with flattened directories

Event Timeline

zack triaged this task as Normal priority. Sep 15 2020, 12:53 PM

zack created this task.

We considered three possibilities for the schema (assuming that we want to get rid of the three separate tables for dir_entries, rev_entries and file_entries -- otherwise, there are six possibilities).

1. A single flat table representing the directories. It has several caveats: notably, it cannot distinguish between an empty directory being present and that directory being absent, and we would never be able to add metadata to directories. Its only advantage is that it is the flattest representation possible.

2. Two tables, one for the directories and one for the entries (the "through" table of the many-to-many relationship), using the directory hash as the foreign key.

3. Two tables, exactly like 2., but using an integer as the foreign key. It will probably be faster, it takes less space, and it is consistent with how we handle snapshots. The drawbacks are that it is impossible to read the edges without doing a join (which is trivial with the previous option) and that using sequential integers makes it hard to shard (but we already have that issue for snapshots and origins -- that's why the data model is like 2. in Cassandra).
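
To make the trade-offs between the three candidate layouts described above more concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. All table names, column names and hash values are hypothetical placeholders chosen for illustration; they are not the actual storage schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat layout: a single table, one row per (directory, entry) pair.
CREATE TABLE flat_directory (
    dir_id BLOB NOT NULL,
    name   BLOB NOT NULL,
    type   TEXT NOT NULL,
    target BLOB NOT NULL,
    perms  INTEGER NOT NULL,
    PRIMARY KEY (dir_id, name)
);

-- Two tables, with the directory hash as the foreign key.
CREATE TABLE directory_h (id BLOB PRIMARY KEY);
CREATE TABLE directory_entry_h (
    dir_id BLOB NOT NULL REFERENCES directory_h(id),
    name   BLOB NOT NULL,
    type   TEXT NOT NULL,
    target BLOB NOT NULL,
    perms  INTEGER NOT NULL,
    PRIMARY KEY (dir_id, name)
);

-- Two tables, with a sequential integer as the foreign key.
CREATE TABLE directory_i (
    object_id INTEGER PRIMARY KEY,
    id        BLOB NOT NULL UNIQUE
);
CREATE TABLE directory_entry_i (
    dir_object_id INTEGER NOT NULL REFERENCES directory_i(object_id),
    name   BLOB NOT NULL,
    type   TEXT NOT NULL,
    target BLOB NOT NULL,
    perms  INTEGER NOT NULL,
    PRIMARY KEY (dir_object_id, name)
);
""")

# Placeholder 20-byte identifiers, not real hashes.
ROOT = bytes.fromhex("11" * 20)   # a directory with one entry
EMPTY = bytes.fromhex("00" * 20)  # a directory with no entries
FILE = bytes.fromhex("22" * 20)   # the target of the single entry
entry = (b"README", "file", FILE, 0o100644)

# Flat layout: the empty directory has no row to insert at all.
conn.execute("INSERT INTO flat_directory VALUES (?, ?, ?, ?, ?)", (ROOT, *entry))

# Hash foreign key: both directories are recorded, entries point at the hash.
conn.executemany("INSERT INTO directory_h VALUES (?)", [(ROOT,), (EMPTY,)])
conn.execute("INSERT INTO directory_entry_h VALUES (?, ?, ?, ?, ?)", (ROOT, *entry))

# Integer foreign key: entries point at the surrogate object_id.
conn.executemany("INSERT INTO directory_i (id) VALUES (?)", [(ROOT,), (EMPTY,)])
root_oid = conn.execute(
    "SELECT object_id FROM directory_i WHERE id = ?", (ROOT,)
).fetchone()[0]
conn.execute(
    "INSERT INTO directory_entry_i VALUES (?, ?, ?, ?, ?)", (root_oid, *entry)
)


def known_in_flat(dir_id):
    """In the flat layout a directory is only visible through its entries."""
    row = conn.execute(
        "SELECT 1 FROM flat_directory WHERE dir_id = ? LIMIT 1", (dir_id,)
    ).fetchone()
    return row is not None


def known_in_hash_layout(dir_id):
    """With a separate directory table, empty directories are still recorded."""
    row = conn.execute(
        "SELECT 1 FROM directory_h WHERE id = ?", (dir_id,)
    ).fetchone()
    return row is not None


# Edges (directory hash -> target hash) come straight from the entry table
# when the foreign key is the hash...
edges_h = conn.execute("SELECT dir_id, target FROM directory_entry_h").fetchall()

# ...but need a join to recover the hash when the foreign key is an integer.
edges_i = conn.execute(
    "SELECT d.id, e.target FROM directory_entry_i AS e"
    " JOIN directory_i AS d ON d.object_id = e.dir_object_id"
).fetchall()
```

In this sketch `known_in_flat(EMPTY)` cannot tell an empty directory apart from an unknown one, whereas `known_in_hash_layout(EMPTY)` can, and the integer-keyed layout only yields hash-to-hash edges through the join.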

vsellier added a subscriber: vsellier. Sep 16 2020, 9:21 AM

zack mentioned this in T2607: git loader OOM when loading the linux kernel repo. Sep 16 2020, 8:26 PM

gitlab-migration closed subtask T2601: create a scratch/temporary postgres DB to experiment with flattened directories as Migrated. Oct 19 2022, 5:58 PM

This task has been migrated to GitLab.
