Page MenuHomeSoftware Heritage

Archive loader should have a snapshot append mode
Closed, MigratedEdits Locked

Description

In a near future, Software Heritage ambassadors will be able to submit save code now requests with a new archives visit type from the Web UI.

That visit type will allow to load multiple tarballs or zip files under a same software origin, one branch being created for each artifact to load.

Under the hood, the ArchiveLoader is used to load the tarballs into the archive.

However, that loader creates a new visit with a snapshot containing only the given tarballs provided as parameter each time it is invoked.

For the save code now case, it means that if a user submits a first request for an origin with a given set of artifacts and then another one
with a different set of artifacts, he will have to switch between visits to access all the code loaded into the archive.

It would be much more convenient to include all artifacts previously uploaded in any new snapshot created by the loader, thus any artifacts
uploaded so far will be available in the latest snapshot of the origin.

This could be implemented by adding a new option to the loader.

Event Timeline

anlambert triaged this task as Normal priority.May 26 2021, 4:46 PM
anlambert created this task.

This could be implemented by adding a new option to the loader.

Or even directly the archive loader's default behavior (append previously seen branch
from early snapshot/visit). As discussed, I'm wondering whether an archive loader (gnu
or cran [1]) would not benefit from always displaying previously seen branches (whether
they are still present in the current main api we list/visit or not [2]).

[1] which is another slightly more specific archive loader due to the metadata
extraction implementation detail

[2] For example, iirc, with cran, they always show the most recent version of a package.
The previous versions are posted/archived in another forge (a snapshot one). For gnu,
that may not be true for gnu though which may already append new versions. In any case,
that would not bother the loader which would just add a branch already there... (so most
likely a noop).

@rdicosmo What do you think?

Or even directly the archive loader's default behavior (append previously seen branch
from early snapshot/visit). As discussed, I'm wondering whether an archive loader (gnu
or cran [1]) would not benefit from always displaying previously seen branches (whether
they are still present in the current main api we list/visit or not [2]).

Yes, I can imagine that some artifacts previously loaded got removed from the listing
and thus we will lost track of it in new visits.

For the record, that simple diff is enough to implement the wanted behavior:

diff --git a/swh/loader/package/archive/loader.py b/swh/loader/package/archive/loader.py
index 1f3c924..a2b2229 100644
--- a/swh/loader/package/archive/loader.py
+++ b/swh/loader/package/archive/loader.py
@@ -8,7 +8,7 @@ import hashlib
 import logging
 from os import path
 import string
-from typing import Any, Dict, Iterator, Optional, Sequence, Tuple, Union
+from typing import Any, Dict, Iterator, Mapping, Optional, Sequence, Tuple, Union
 
 import attr
 import iso8601
@@ -84,6 +84,7 @@ class ArchiveLoader(PackageLoader[ArchivePackageInfo]):
         artifacts: Sequence[Dict[str, Any]],
         extid_manifest_format: Optional[str] = None,
         max_content_size: Optional[int] = None,
+        snapshot_append: bool = False,
     ):
         f"""Loader constructor.
 
@@ -107,6 +108,8 @@ class ArchiveLoader(PackageLoader[ArchivePackageInfo]):
             extid_manifest_format: template string used to format a manifest,
                 which is hashed to get the extid of a package.
                 Defaults to {ArchivePackageInfo.MANIFEST_FORMAT!r}
+            snapshot_append: if :const:`True`, append latest snapshot content to
+                the new snapshot created by the loader
 
         """
         super().__init__(storage=storage, url=url, max_content_size=max_content_size)
@@ -116,6 +119,7 @@ class ArchiveLoader(PackageLoader[ArchivePackageInfo]):
             if extid_manifest_format is None
             else string.Template(extid_manifest_format)
         )
+        self.snapshot_append = snapshot_append
 
     def get_versions(self) -> Sequence[str]:
         versions = []
@@ -164,3 +168,9 @@ class ArchiveLoader(PackageLoader[ArchivePackageInfo]):
             directory=directory,
             synthetic=True,
         )
+
+    def extra_branches(self) -> Dict[bytes, Mapping[str, Any]]:
+        if not self.snapshot_append:
+            return {}
+        last_snapshot = self.last_snapshot()
+        return dict(last_snapshot.to_dict()["branches"]) if last_snapshot else {}
diff --git a/swh/loader/package/archive/tasks.py b/swh/loader/package/archive/tasks.py
index 4ac17f1..727cffd 100644
--- a/swh/loader/package/archive/tasks.py
+++ b/swh/loader/package/archive/tasks.py
@@ -9,7 +9,9 @@ from swh.loader.package.archive.loader import ArchiveLoader
 
 
 @shared_task(name=__name__ + ".LoadArchive")
-def load_archive_files(*, url=None, artifacts=None):
+def load_archive_files(*, url=None, artifacts=None, snapshot_append=False):
     """Load archive's artifacts (e.g gnu, etc...)"""
-    loader = ArchiveLoader.from_configfile(url=url, artifacts=artifacts)
+    loader = ArchiveLoader.from_configfile(
+        url=url, artifacts=artifacts, snapshot_append=snapshot_append
+    )
     return loader.load()

Your diff sounds quite enough.

So we can open that new behavior in the archive loader, left if off by default. Make the
webapp actually send the snapshot_append (to True) alongside the current task for the
'tarballs' types. And then, if we decide we want that default behavior of appending
branches from previous snapshots to other archives (e.g gnu), we can always flip that
parameter to True by default later (we'll also need the adaptation on the other cran
loader at that point but out of scope here ;).