git loader: enable "partial" global deduplication of revisions via the extid mapping table
Closed, Migrated

Authored By

	ardumont
	Oct 8 2021, 2:12 PM

Description

We discussed with @olasd the possibility of leveraging the extid mapping table to make
the git loader do even less work. That would definitely help compute fewer ids,
especially when ingesting forks.

Currently, without this, we only reuse the previous snapshot's references to avoid
ingesting known references again. So when we ingest another fork of something already
known, we actually do the work again.

All in all, our deduplication exists at the storage level (db or objstorage) but not
completely at the computation level.

For this to happen, we need to start using the extid table to map the snapshot
branch references to their corresponding sha1_git ids.

So the loader git adaptations would be to:

at the end of the loading, after the snapshot of the visit has been created: at this point, we know we don't have dangling references, so it's fine to rely on this to skip some work.

retrieve the branch references of that snapshot (filtering out aliases) and, for each reference, push the mapping (version 0, sha1-git of the commit/tag, revision/release id) into the extid table

at the beginning of the loading, adapt the reading of unknown references (for the origin) to filter out references that are actually known through the extid table. If they are present, we know them and can skip the work. Nonetheless, those references should still end up in the final snapshot.

That's actually what's been done recently with the mercurial loader.
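
To make the adaptations above concrete, here is a minimal sketch of the two steps. It is not the loader's actual code: the extid table is replaced by a plain in-memory dict, and `Branch`, `record_snapshot_branches` and `split_remote_refs` are hypothetical names introduced only for illustration. The sha1 values are placeholders.

```python
from typing import Dict, NamedTuple, Tuple

EXTID_VERSION = 0  # the "version 0" of the mapping described above


class Branch(NamedTuple):
    """Hypothetical view of a snapshot branch."""
    target_type: str  # "revision", "release" or "alias"
    git_sha1: str     # sha1-git of the commit/tag ("" for aliases)
    swh_id: str       # revision/release id in the archive ("" for aliases)


# Stand-in for the extid table: (version, sha1-git) -> revision/release id
ExtidTable = Dict[Tuple[int, str], str]


def record_snapshot_branches(table: ExtidTable, branches: Dict[str, Branch]) -> None:
    """End of loading: push one mapping per non-alias branch of the snapshot."""
    for branch in branches.values():
        if branch.target_type == "alias":
            continue  # aliases point at another ref name, not at an object
        table[(EXTID_VERSION, branch.git_sha1)] = branch.swh_id


def split_remote_refs(
    table: ExtidTable, remote_refs: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Beginning of loading: separate refs already known from refs to fetch.

    Known refs are returned too, since they must still end up in the
    final snapshot even though no work is done for them.
    """
    known: Dict[str, str] = {}
    to_fetch: Dict[str, str] = {}
    for name, sha in remote_refs.items():
        swh_id = table.get((EXTID_VERSION, sha))
        if swh_id is None:
            to_fetch[name] = sha
        else:
            known[name] = swh_id
    return known, to_fetch


# A first origin is loaded and its snapshot branches are recorded.
table: ExtidTable = {}
record_snapshot_branches(table, {
    "refs/heads/main": Branch("revision", "a" * 40, "a" * 40),
    "refs/tags/v1": Branch("release", "b" * 40, "b" * 40),
    "HEAD": Branch("alias", "", ""),
})

# A fork is loaded later: only the ref unknown to the table needs fetching.
known, to_fetch = split_remote_refs(table, {
    "refs/heads/main": "a" * 40,
    "refs/heads/feature": "c" * 40,
})
```

In this sketch the fork's `refs/heads/main` is resolved from the table alone, while `refs/heads/feature` is the only reference left to fetch and compute.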

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2206 Quality of Service
Migrated	gitlab-migration	T4080 Minimize archival lag w.r.t. upstream code hosting platforms
Migrated	gitlab-migration	T2207 Improve ingestion efficiency
Migrated	gitlab-migration	T3655 loader git: enable global deduplication of head branches before fetching them
Migrated	gitlab-migration	T3653 Stabilize loader git
Migrated	gitlab-migration	T3635 git loader: enable "partial" global deduplication of revisions via the extid mapping table

Event Timeline

ardumont triaged this task as Normal priority. Oct 8 2021, 2:12 PM

ardumont created this task.

ardumont added a project: Git loader. Oct 8 2021, 2:17 PM

douardda added a subscriber: douardda. Oct 11 2021, 9:54 AM

ardumont added a parent task: T3653: Stabilize loader git. Oct 14 2021, 10:37 AM

ardumont mentioned this in T3653: Stabilize loader git. Oct 14 2021, 10:39 AM

olasd renamed this task from Reduce git loader work (use extid mapping table) to git loader: enable "partial" global deduplication of revisions via the extid mapping table. Oct 14 2021, 11:15 AM

olasd mentioned this in T3655: loader git: enable global deduplication of head branches before fetching them. Oct 14 2021, 11:18 AM

olasd added a parent task: T3655: loader git: enable global deduplication of head branches before fetching them.

Ok, I think what puzzles me in this description is the fact that the first 2 bullets of the "git loader adaptations" are actually only one point: at the end of a successful loading, store a mapping in the extid table.

Now, I still don't understand what mapping is to be stored in the extid table. What is the meaning of (version 0, sha1-git of the commit/tag, revision/release id) above? (I would expect a mapping to be a couple).

Then I don't really get how this can help if we don't load revisions in topological order.

In T3635#72206, @douardda wrote:

Then I don't really get how this can help if we don't load revisions in topological order.

When the snapshot is being added, at the end of a load, the current git loader implementation guarantees that all the branch heads point to objects whose history has been fully loaded in the archive:

we know that, in the current loader run, all objects of the previous object types have been successfully loaded
we know that the packfile that we're currently loading contains all the objects that have been added in the origin between the snapshot being added and the one in the previous visit.
and recursively we know that the previous snapshots are history-complete. Hopefully? (I guess we need T3656 to be sure of it)

Now, I still don't understand what mapping is to be stored in the extid table. What is
the meaning of (version 0, sha1-git of the commit/tag, revision/release id) above? (I
would expect a mapping to be a couple).

It's a triplet (we have an extra 'version' field now because we need to distinguish
versions of the data; recall, for example, the old and new implementations of the
mercurial loader, which made this necessary).

For git, the possible values are, afaik:

(version, sha1-git of the commit, swh revision id)
(version, sha1-git of the tag, swh release id)

What you did not get was a contraction of the two forms above.
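
As an illustration of the two forms, here is a small sketch of the triplet as a data structure. `GitExtID`, `index` and `lookup` are hypothetical names, the sha1 values are placeholders, and the target is written as a SWHID string (`swh:1:rev:…` / `swh:1:rel:…`) only to make the revision/release distinction visible. It also shows why the version is part of the lookup key: entries written under one version are not visible to a reader asking for another.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class GitExtID(NamedTuple):
    """Hypothetical representation of one extid row for git."""
    version: int   # distinguishes incompatible generations of the mapping
    git_sha1: str  # sha1-git of the commit or tag
    target: str    # swh revision id or swh release id


commit_sha = "a" * 40  # placeholder value
tag_sha = "b" * 40     # placeholder value

entries = [
    GitExtID(0, commit_sha, f"swh:1:rev:{commit_sha}"),  # commit -> revision
    GitExtID(0, tag_sha, f"swh:1:rel:{tag_sha}"),        # tag -> release
]

# Index by (version, sha1-git): the third element is what we look up.
index: Dict[Tuple[int, str], str] = {
    (e.version, e.git_sha1): e.target for e in entries
}


def lookup(version: int, git_sha1: str) -> Optional[str]:
    """Return the mapped target, or None if unknown for that version."""
    return index.get((version, git_sha1))
```

Seen this way, the mapping is still a pair of (key, value) where the key happens to be (version, sha1-git).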

effort : low

this adds "redundant" data to the extid table (space usage increase)
but this is a cheaper way to enable global deduplication until the topological order is guaranteed

This task has been migrated to GitLab.
