Add an utility function to resolve a snapshot branch alias to its real target.
Related to T2734
Differential D4369
algos/snapshot: Add function to resolve branch alias to real target anlambert on Oct 27 2020, 4:48 PM. Authored by
Details
Add an utility function to resolve a snapshot branch alias to its real target. Related to T2734
Diff Detail
Event TimelineComment Actions Build is green Patch application report for D4369 (id=15450)Rebasing onto 9645aef7ab... Current branch diff-target is up to date. Changes applied before testcommit 6ecea2bb3aba79325574eebc4800085d4fab8a53 Author: Antoine Lambert <antoine.lambert@inria.fr> Date: Tue Oct 27 16:45:11 2020 +0100 algos/snapshot: Add function to resolve (possibly chained) aliases See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1027/ for more details. Comment Actions At first glance it looks like it runs in quadratic time, but it seems to be actually linear, as each value in snapshot["branches"] is read at most twice. I think this deserves a small comment at the beginning of the function explaining this. And it looks like it won't stop if there is a cycle; could you add an assertion to make sure it doesn't happen (eg. processed_aliases could probably be used for that)
Comment Actions This seems like a lot of processing for an operation that should be simple (and will often be needed). The three assumptions that make this work fast in most cases:
aren't very strong, so I'm worried that they won't hold in the future. I also think that before being implemented, the usefulness of doing this lookup should be questioned a bit more:
So, to quash these snapshot-related issues once and for all, I think the following should happen:
As a data point, there's currently 68 million alias branches in the main archive, so the population of the resolved branch cache wouldn't generate a very large table.
Comment Actions @olasd, thanks for the shared brainstorming.
I guess we could separate target types count in the branches view of the webapp by displaying a sentence like: "That snapshot contains X branches targetting revisions and Y branch aliases".
I am not sure if we need to go down to directory level but a function resolving breadcrumbs for an alias and returning the list of followed snapshot branches seems indeed simpler.
There is actually six possible target types for a snapshot branch. softwareheritage=> select distinct(target_type) from snapshot_branch; target_type ------------- content alias directory release revision (6 rows) softwareheritage=> select count(*) from snapshot_branch where target_type = 'content'; count ------- 1796 (1 row) softwareheritage=> select count(*) from snapshot_branch where target_type = 'directory'; count ------- 8467 (1 row) softwareheritage=> select count(*) from snapshot_branch where target_type = 'release'; count ---------- 15889250 (1 row) softwareheritage=> select count(*) from snapshot_branch where target_type = 'revision'; count ----------- 324271212 (1 row) softwareheritage=> select count(*) from snapshot_branch where target_type = 'alias'; count -------- 322338 (1 row) If we look at the number of branch per target type, I think we could add the following new columns to the snapshot table: alias_count, release_count, revision_count. Also, if we add those counters, quid of the update of already ingested snapshots. Do we update all in batch or on demand when querying snapshot data ?
I think such a cache could be added at the webapp level instead.
Comment Actions Ok so until further decisions are made regarding the storage of branches count in the backend, I think we can perform the following actions to mitigate T2734
Comment Actions Update:
Comment Actions Build is green Patch application report for D4369 (id=15563)Rebasing onto 6e3e35096f... Current branch diff-target is up to date. Changes applied before testcommit 06adee71227e6f0150251f643b4fab68f344a3a2 Author: Antoine Lambert <antoine.lambert@inria.fr> Date: Tue Oct 27 16:45:11 2020 +0100 algos/snapshot: Add function to resolve branch alias to real target Related to T2734 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1030/ for more details. Comment Actions What about returning a tuple of (list of intermediate branches, last branch)? That way the return type is Optional[Tuple[List[SnapshotBranch], Optional[SnapshotBranch]] and callers of this function don't have to assert the intermediate branches are not-None. The docstring also needs to mention what happens in case of a cycle.
Comment Actions Build is green Patch application report for D4369 (id=15578)Rebasing onto 6e3e35096f... Current branch diff-target is up to date. Changes applied before testcommit 8b1815572c695f7c9751b353f51d507e3fe76e0a Author: Antoine Lambert <antoine.lambert@inria.fr> Date: Tue Oct 27 16:45:11 2020 +0100 algos/snapshot: Add function to resolve branch alias to real target Related to T2734 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1035/ for more details. |