diff --git a/docs/package-loader-tutorial.rst b/docs/package-loader-tutorial.rst
--- a/docs/package-loader-tutorial.rst
+++ b/docs/package-loader-tutorial.rst
@@ -352,10 +352,238 @@
* etc.
-Making your loader more efficient
----------------------------------
+Making your loader incremental
+------------------------------
-TODO
+In the previous sections, you wrote a fully functional loader for a new type of
+package repository. This is great! Please tell us about it, and
+:ref:`submit it for review <patch-submission>` so we can give you some feedback.
+
+Now, we will see a key optimization for any package loader: skipping packages
+it has already downloaded, using :term:`extids <extid>`.
+
+The rough idea is to find some way to uniquely identify packages before downloading
+them and encode it in a short string, the ExtID.
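+
+For instance, an ExtID can be as simple as a pair of a short type string and the
+checksum bytes it was computed from (a purely illustrative sketch)::
+
+    import hashlib
+
+    # A hypothetical ExtID value: a type string, plus the raw checksum bytes.
+    extid = ("pypi-archive-sha256", hashlib.sha256(b"<the archive's bytes>").digest())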
+
+Using checksums
++++++++++++++++
+
+Ideally, this short string is a checksum of the archive, provided by the API
+before downloading the archive itself.
+This is the best case, because it ensures that we detect changes in the package's
+content even if it keeps the same name and version number.
+
+If this is not the case for the repository you want to load from, skip to the
+next subsection.
+
+This is used for example by the PyPI loader (with a sha256sum) and the NPM loader
+(with a sha1sum).
+The Debian loader uses a similar scheme: as a single package is assembled from
+a set of tarballs, it only uses the hash of the ``.dsc`` file, which itself contains
+a hash of all the tarballs.
+
+This is implemented by overriding the ``extid`` method of your ``NewPackageInfo``
+class, so that it returns the type of the ExtID (see below) and the ExtID itself::
+
+    import attr
+
+    from swh.loader.package.loader import BasePackageInfo, PartialExtID
+    from swh.model.hashutil import hash_to_bytes
+
+    EXTID_TYPE: str = "pypi-archive-sha256"
+
+    @attr.s
+    class NewPackageInfo(BasePackageInfo):
+        sha256 = attr.ib(type=str)
+        """sha256 checksum of the archive, as reported by the upstream API"""
+
+        def extid(self) -> PartialExtID:
+            return (EXTID_TYPE, hash_to_bytes(self.sha256))
+
+and the loader's ``get_package_info`` method sets the right value in the ``sha256``
+attribute.
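+
+For example, a simplified and hypothetical ``get_package_info`` could fill it in like
+this (``self.info()`` and the ``releases``, ``url``, ``filename`` and ``sha256`` fields
+below are placeholders for whatever your repository's API actually returns)::
+
+    from typing import Iterator, Tuple
+
+    def get_package_info(self, version: str) -> Iterator[Tuple[str, NewPackageInfo]]:
+        # self.info() and the field names below are placeholders: use whatever
+        # your repository's API actually provides.
+        for release in self.info()["releases"][version]:
+            p_info = NewPackageInfo(
+                url=release["url"],
+                filename=release["filename"],
+                sha256=release["sha256"],  # checksum reported by the API
+            )
+            yield release["filename"], p_info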
+
+
+Using a custom manifest
++++++++++++++++++++++++
+
+Unfortunately, this does not work for all packages, as some package repositories do
+not provide a checksum of the archives via their API.
+If this is the case for the repository you want to load from, you need to find a way
+around it.
+
+This depends heavily on the repository, so this tutorial cannot cover how to do it.
+We do however provide an easy option that should work in most cases:
+creating a "manifest" of the archive with some metadata in it, and hashing it.
+
+For example, when loading from the GNU FTP servers, we have access to some metadata
+that is good enough to deduplicate on. We write it all in a single string
+and hash that string.
+
+It is done like this::
+
+    import datetime
+    import string
+    from typing import Union
+
+    import attr
+
+    from swh.loader.package.loader import BasePackageInfo
+
+    @attr.s
+    class ArchivePackageInfo(BasePackageInfo):
+        length = attr.ib(type=int)
+        """Size of the archive file"""
+        time = attr.ib(type=Union[str, datetime.datetime])
+        """Timestamp of the archive file on the server"""
+        version = attr.ib(type=str)
+
+        EXTID_TYPE = "package-manifest-sha256"
+
+        MANIFEST_FORMAT = string.Template("$time $length $version $url")
+
+
+The default implementation of :py:func:`swh.loader.package.loader.BasePackageInfo.extid`
+will read this template, substitute the variables based on the object's attributes,
+compute the hash of the result, and return it.
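+
+To give a rough idea of what happens behind the scenes, the default behavior is
+approximately equivalent to the following sketch (simplified; ``manifest_extid`` is a
+made-up name, the actual code lives in ``swh.loader.package.loader``)::
+
+    import hashlib
+
+    import attr
+
+    from swh.loader.package.loader import PartialExtID
+
+    def manifest_extid(p_info) -> PartialExtID:
+        # Fill the manifest template with the package info's attributes...
+        manifest = p_info.MANIFEST_FORMAT.substitute(
+            {k: str(v) for (k, v) in attr.asdict(p_info).items()}
+        )
+        # ...and use a checksum of the resulting manifest as the ExtID.
+        return (p_info.EXTID_TYPE, hashlib.sha256(manifest.encode()).digest())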
+
+Note that, as mentioned before, this is not perfect because a tarball may be replaced
+with a different tarball of exactly the same length and modification time,
+and we won't detect it.
+But this is extremely unlikely, so we consider it to be good enough.
+
+
+Alternatively, if this is not good enough for your loader, you can simply not implement
+ExtIDs, and your loader will always load all tarballs.
+This can be bandwidth-heavy for both |swh| and the origin you are loading from,
+so this decision should not be taken lightly.
+
+
+Choosing the ExtID type
++++++++++++++++++++++++
+
+The type of your ExtID should be a short ASCII string that is both unique to your
+loader and descriptive of how it was computed.
+
+Why unique to the loader? Because different loaders may load the same archive
+differently.
+For example, if I were to create an archive with both a ``PKG-INFO``
+and a ``package.json`` file, and submit it to both NPM and PyPI,
+both package repositories would have exactly the same tarball.
+But the NPM loader would create the revision based on authorship info in
+``package.json``, while the PyPI loader would use ``PKG-INFO``.
+We do not want the PyPI loader to assume it already created a revision itself,
+when the revision was actually created by the NPM loader!
+
+And why descriptive? This is simply for future-proofing, in case your loader ever
+changes the format of the ExtID (e.g. by using a different hash algorithm).
+
+
+Testing your incremental loading
+++++++++++++++++++++++++++++++++
+
+If you followed the steps above, your loader is now able to detect which packages it
+has already downloaded and skip them. This is what we call an incremental loader.
+
+It is now time to write tests to make sure your loader fulfills this promise.
+
+This time, we want to use ``requests_mock_datadir_visits`` instead of
+``requests_mock_datadir``, because we want to mock the repository's API to emulate
+its results changing over time (e.g. because a new version was published between
+two runs of the loader).
+See the documentation of :py:func:`swh.core.pytest_plugin.requests_mock_datadir_factory`
+for a description of the file layout to use.
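+
+As a rough illustration (the exact naming rules are described in the documentation
+referenced above), this means having one data file per mocked URL, plus a
+``_visit1`` variant holding the response to return on the second visit, e.g.::
+
+    tests/data/https_pypi.org/pypi_0805nexter_json          # first visit
+    tests/data/https_pypi.org/pypi_0805nexter_json_visit1   # second visit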
+
+Let's take, once again, a look at ``swh/loader/package/pypi/tests/test_pypi.py``,
+to use as an example::
+
+    def test_pypi_incremental_visit(swh_storage, requests_mock_datadir_visits):
+        """With prior visit, 2nd load will result with a different snapshot
+
+        """
+        # Initialize the loader
+        url = "https://pypi.org/project/0805nexter"
+        loader = PyPILoader(swh_storage, url)
+
+        # First visit
+        visit1_actual_load_status = loader.load()
+        visit1_stats = get_stats(swh_storage)
+
+        # Make sure everything is in order
+        expected_snapshot_id = hash_to_bytes("ba6e158ada75d0b3cfb209ffdf6daa4ed34a227a")
+        assert visit1_actual_load_status == {
+            "status": "eventful",
+            "snapshot_id": expected_snapshot_id.hex(),
+        }
+
+        assert_last_visit_matches(
+            swh_storage, url, status="full", type="pypi", snapshot=expected_snapshot_id
+        )
+
+        assert {
+            "content": 6,
+            "directory": 4,
+            "origin": 1,
+            "origin_visit": 1,
+            "release": 0,
+            "revision": 2,
+            "skipped_content": 0,
+            "snapshot": 1,
+        } == visit1_stats
+
+        # Reset internal state
+        del loader._cached__raw_info
+        del loader._cached_info
+
+        # Second visit
+        visit2_actual_load_status = loader.load()
+        visit2_stats = get_stats(swh_storage)
+
+        # Check the result of the visit
+        assert visit2_actual_load_status["status"] == "eventful", visit2_actual_load_status
+        expected_snapshot_id2 = hash_to_bytes("2e5149a7b0725d18231a37b342e9b7c4e121f283")
+        assert visit2_actual_load_status == {
+            "status": "eventful",
+            "snapshot_id": expected_snapshot_id2.hex(),
+        }
+
+        assert_last_visit_matches(
+            swh_storage, url, status="full", type="pypi", snapshot=expected_snapshot_id2
+        )
+
+        assert {
+            "content": 6 + 1,  # 1 more content
+            "directory": 4 + 2,  # 2 more directories
+            "origin": 1,
+            "origin_visit": 1 + 1,
+            "release": 0,
+            "revision": 2 + 1,  # 1 more revision
+            "skipped_content": 0,
+            "snapshot": 1 + 1,  # 1 more snapshot
+        } == visit2_stats
+
+        # Check all content objects were loaded
+        expected_contents = map(
+            hash_to_bytes,
+            [
+                "a61e24cdfdab3bb7817f6be85d37a3e666b34566",
+                "938c33483285fd8ad57f15497f538320df82aeb8",
+                "a27576d60e08c94a05006d2e6d540c0fdb5f38c8",
+                "405859113963cb7a797642b45f171d6360425d16",
+                "e5686aa568fdb1d19d7f1329267082fe40482d31",
+                "83ecf6ec1114fd260ca7a833a2d165e71258c338",
+                "92689fa2b7fb4d4fc6fb195bf73a50c87c030639",
+            ],
+        )
+
+        assert list(swh_storage.content_missing_per_sha1(expected_contents)) == []
+
+        # Check all directory objects were loaded
+        expected_dirs = map(
+            hash_to_bytes,
+            [
+                "05219ba38bc542d4345d5638af1ed56c7d43ca7d",
+                "cf019eb456cf6f78d8c4674596f1c9a97ece8f44",
+                "b178b66bd22383d5f16f4f5c923d39ca798861b4",
+                "c3a58f8b57433a4b56caaa5033ae2e0931405338",
+                "e226e7e4ad03b4fc1403d69a18ebdd6f2edd2b3a",
+                "52604d46843b898f5a43208045d09fcf8731631b",
+            ],
+        )
+
+        assert list(swh_storage.directory_missing(expected_dirs)) == []
+
+        # etc.