Page MenuHomeSoftware Heritage

algos/snapshot: Add function to resolve branch alias to real target
ClosedPublic

Authored by anlambert on Oct 27 2020, 4:48 PM.

Details

Summary

Add an utility function to resolve a snapshot branch alias to its real target.

Related to T2734

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4369 (id=15450)

Rebasing onto 9645aef7ab...

Current branch diff-target is up to date.
Changes applied before test
commit 6ecea2bb3aba79325574eebc4800085d4fab8a53
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Oct 27 16:45:11 2020 +0100

    algos/snapshot: Add function to resolve (possibly chained) aliases

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1027/ for more details.

vlorentz added a subscriber: vlorentz.

At first glance it looks like it runs in quadratic time, but it seems to be actually linear, as each value in snapshot["branches"] is read at most twice. I think this deserves a small comment at the beginning of the function explaining this.

And it looks like it won't stop if there is a cycle; could you add an assertion to make sure it doesn't happen (eg. processed_aliases could probably be used for that)

swh/storage/algos/snapshot.py
179

I think you mean "assumed" instead of "considered".

But the parenthesis is confusing, you should either remove it or rewrite it to say directly there is no restriction if it's not provided

197–200

use snapshot_get_all_branches so you can remove this while loop

213

"iteration" is unclear as there are three levels of loops. (or two after a rewrite to use snapshot_get_all_branches). You should mention it's the for loop / the loop iterating on branches.

220

how could snapshot["branches"] be None?

226–233

not tested according to Phab; and you can remove most of it by using snapshot_get_all_branches.

Also, are you sure branches is not None? It looks like it could happen on dangling branches. (and we're missing a test for dandling branches anyway, if I'm reading the tests correctly)

244–250

also not tested; but you can also remove this code by using snapshot_get_all_branches

252

if next_branch is always None, then it doesn't need to be PartialBranches. I think Dict[bytes, Optional[SnapshotBranch]] would be a better type for this.

swh/storage/tests/algos/test_snapshot.py
301

you can create a regular dict literal here, it will be converted into an ImmutableDict by the constructor.

This revision now requires changes to proceed.Nov 2 2020, 10:58 AM
swh/storage/algos/snapshot.py
197–200

Iterating per batch is intended here as some snapshots can be huge, for instance V8 JavaScript Engine repository has more than
200k branches.

Most of snapshot contents have a single alias named HEAD that will be processed once the first batch of branches is retrieved.
I know the while loop complicates the code but it enables to save a consequent amount of database queries for large snapshots
if we had to get all branches first before resolving aliases.

213

ack

220

Indeed, this is not needed. mypy was somehow complaining if I recall but I guess that I improved the code afterwards.

226–233

ack, will improve the handling of dangling branches

244–250

Batch iteration should be kept for optimal performances on large snapshots.

252

ack, this is better indeed

This seems like a lot of processing for an operation that should be simple (and will often be needed).

The three assumptions that make this work fast in most cases:

  • there will probably be a single alias
  • it's probably named HEAD
  • it'll probably get processed in the first iteration of the branch lookup loop because snapshots only have a few branches

aren't very strong, so I'm worried that they won't hold in the future.

I also think that before being implemented, the usefulness of doing this lookup should be questioned a bit more:

  1. I'm not sure that branch aliases should count towards the number of actual branches or tags in a given snapshot, as displayed by swh-web.
  2. This doesn't solve the resolution of a given snapshot branch (answering the question "what actual object does HEAD point to for snapshot swh:1:snp:xxyyzz?")
  3. I think a more useful "algos" function, in practice, would be to retrieve the directory pointed at by a given snapshot branch, with a list of breadcrumbs of all the objects present "in between" (branch foo of swh:1:snp:xxyyzz is an alias to branch refs/tags/bar which points to a release swh:1:rel:xx which points to a revision swh:1:rev:yy which points to directory swh:1:dir:zz).

So, to quash these snapshot-related issues once and for all, I think the following should happen:

  • Store the snapshot sizes in the database instead of computing them every time we look up a snapshot. This is only adding a few integer columns to the snapshot table, and it's really easy to do on snapshot insertion. This also forces us to specify and agree on what branches counts for which type.
  • Potentially, keep a cache of "resolved" branches on snapshots (with the following columns: snapshot id, branch name, final target type, final target id, alias chain (if relevant)). Not useful in cassandra which already stores the data in that format. Populate the default branch, or populate all the aliases in the cache on snapshot insertion.
  • Provide a backend function to look up a single, fully resolved, branch of a given snapshot. This would use the cache if the line is present, and fall back to the current (poorly indexed) branch lookup function.
  • (unrelated to this) Potentially, provide an algos function to look up all the breadcrumbs from a given branch to a directory.

As a data point, there's currently 68 million alias branches in the main archive, so the population of the resolved branch cache wouldn't generate a very large table.

swh/storage/algos/snapshot.py
197–200

The current code in swh.web is always doing this operation after counting the branches. So, in effect, you've already asked the database to read all the branches of the corresponding snapshot. Doing that again will hit the cache and shouldn't be a problem at all, even for large snapshots. (I know this is a pretty weak assumption)

@olasd, thanks for the shared brainstorming.

I'm not sure that branch aliases should count towards the number of actual branches or tags in a given snapshot, as displayed by swh-web.

I guess we could separate target types count in the branches view of the webapp by displaying a sentence like: "That snapshot contains X branches targetting revisions and Y branch aliases".
A dedicated icon should also be associated for a branch alias for consistency.

I think a more useful "algos" function, in practice, would be to retrieve the directory pointed at by a given snapshot branch, with a list of breadcrumbs of all the objects present "in between" (branch foo of swh:1:snp:xxyyzz is an alias to branch refs/tags/bar which points to a release swh:1:rel:xx which points to a revision swh:1:rev:yy which points to directory swh:1:dir:zz)

I am not sure if we need to go down to directory level but a function resolving breadcrumbs for an alias and returning the list of followed snapshot branches seems indeed simpler.

Store the snapshot sizes in the database instead of computing them every time we look up a snapshot. This is only adding a few integer columns to the snapshot table, and it's really easy to do on snapshot insertion. This also forces us to specify and agree on what branches counts for which type.

There is actually six possible target types for a snapshot branch.

softwareheritage=> select distinct(target_type) from snapshot_branch;
 target_type 
-------------
 
 content
 alias
 directory
 release
 revision
(6 rows)

softwareheritage=> select count(*) from snapshot_branch where target_type = 'content';
 count 
-------
  1796
(1 row)

softwareheritage=> select count(*) from snapshot_branch where target_type = 'directory';
 count 
-------
  8467
(1 row)

softwareheritage=> select count(*) from snapshot_branch where target_type = 'release';
  count   
----------
 15889250
(1 row)

softwareheritage=> select count(*) from snapshot_branch where target_type = 'revision';
   count   
-----------
 324271212
(1 row)

softwareheritage=> select count(*) from snapshot_branch where target_type = 'alias';
 count  
--------
 322338
(1 row)

If we look at the number of branch per target type, I think we could add the following new columns to the snapshot table: alias_count, release_count, revision_count.
I do not think adding counts for content and directory target types is relevant considering the few instances that can be found.

Also, if we add those counters, quid of the update of already ingested snapshots. Do we update all in batch or on demand when querying snapshot data ?

Potentially, keep a cache of "resolved" branches on snapshots (with the following columns: snapshot id, branch name, final target type, final target id, alias chain (if relevant)). Not useful in cassandra which already stores the data in that format. Populate the default branch, or populate all the aliases in the cache on snapshot insertion.

I think such a cache could be added at the webapp level instead.

swh/storage/algos/snapshot.py
197–200

@anlambert Then we just need a lazy alternative to snapshot_get_all_branches (which would be called by both snapshot_resolve_aliases and snapshot_get_all_branches)

swh/storage/algos/snapshot.py
197–200

Yeah, that'd make sense too.

Ok so until further decisions are made regarding the storage of branches count in the backend, I think we can perform the following actions to mitigate T2734
without touching any database schema:

  • add a function snapshot_resolve_alias in swh.storage.algos.snapshot that returns branch breadcrumbs of snapshot resolving
  • stop trying to resolve aliases in swh.web.common.archive.lookup_snapshot_sizes and thus do not increment revision/release counters when resolving an alias
  • resolve aliases only when a link to their real target needs to be displayed in swh-web and use a dedicated icon in the branches / releases list view to indicate the branch name is an alias
  • add snapshot sizes in django cache to speedup navigation for a given snapshot (there is already a commit in D4356 for that)
swh/storage/algos/snapshot.py
197–200

Indeed, this seems to be a right move here.

Nevertheless, as @olasd proposed I think we should modify that function to only resolve a single alias rather than
all the ones present in the snapshot.

That function could then be used in swh-web codebase as alias resolving is performed in multiple places
and this would greatly simplify the code.

But I would also really like to have a lazy version of snapshot_get_all_branches, I will submit that in another diff.

Update:

  • Prefer to add a function that resolves a single alias in a snapshot instead of multiple ones
  • Update and add more tests
anlambert retitled this revision from algos/snapshot: Add function to resolve (possibly chained) aliases to algos/snapshot: Add function to resolve branch alias to real target.Nov 2 2020, 6:02 PM
anlambert edited the summary of this revision. (Show Details)

Build is green

Patch application report for D4369 (id=15563)

Rebasing onto 6e3e35096f...

Current branch diff-target is up to date.
Changes applied before test
commit 06adee71227e6f0150251f643b4fab68f344a3a2
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Oct 27 16:45:11 2020 +0100

    algos/snapshot: Add function to resolve branch alias to real target
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1030/ for more details.

What about returning a tuple of (list of intermediate branches, last branch)? That way the return type is Optional[Tuple[List[SnapshotBranch], Optional[SnapshotBranch]] and callers of this function don't have to assert the intermediate branches are not-None.

The docstring also needs to mention what happens in case of a cycle.

swh/storage/algos/snapshot.py
199

seen_aliases = {alias_name}

swh/storage/tests/algos/test_snapshot.py
301

er, I meant the constructor of Snapshot. ie. you can do this:

Snapshot(branches={...}) instead of Snapshot(branches=ImmutableDict({...}))

This revision now requires changes to proceed.Nov 3 2020, 11:00 AM

Build is green

Patch application report for D4369 (id=15578)

Rebasing onto 6e3e35096f...

Current branch diff-target is up to date.
Changes applied before test
commit 8b1815572c695f7c9751b353f51d507e3fe76e0a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Oct 27 16:45:11 2020 +0100

    algos/snapshot: Add function to resolve branch alias to real target
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1035/ for more details.

This revision is now accepted and ready to land.Nov 3 2020, 1:30 PM