diff --git a/README.md b/README.md
deleted file mode 100644
--- a/README.md
+++ /dev/null
@@ -1,10 +0,0 @@
-SWH-loader-core
-===============
-
-The Software Heritage Core Loader is a low-level loading utilities and
-helpers used by other loaders.
-
-The main entry points are classes:
-- :class:`swh.loader.core.loader.BaseLoader` for loaders (e.g. svn)
-- :class:`swh.loader.core.loader.DVCSLoader` for DVCS loaders (e.g. hg, git, ...)
-- :class:`swh.loader.package.loader.PackageLoader` for Package loaders (e.g. PyPI, Npm, ...)
diff --git a/README.rst b/README.rst
new file mode 120000
--- /dev/null
+++ b/README.rst
@@ -0,0 +1 @@
+docs/README.rst
\ No newline at end of file
diff --git a/docs/README.rst b/docs/README.rst
new file mode 100644
--- /dev/null
+++ b/docs/README.rst
@@ -0,0 +1,31 @@
+.. _swh-loader-core:
+
+Software Heritage - Loader foundations
+======================================
+
+The Software Heritage Loader Core is a set of low-level loading utilities and
+helpers used by :term:`loaders <loader>`.
+
+The main entry points are classes:
+- :class:`swh.loader.core.loader.BaseLoader` for loaders (e.g. svn)
+- :class:`swh.loader.core.loader.DVCSLoader` for DVCS loaders (e.g. hg, git, ...)
+- :class:`swh.loader.package.loader.PackageLoader` for Package loaders (e.g. PyPI, Npm, ...)
+
+Package loaders
+---------------
+
+This package also implements many package loaders directly, out of convenience,
+as they usually are quite similar and each fits in a single file.
+
+They all roughly follow these steps, explained in the
+:py:meth:`swh.loader.package.loader.PackageLoader.load` documentation.
+
+VCS loaders
+-----------
+
+Unlike package loaders, VCS loaders remain in separate packages,
+as they often need more advanced conversions and very VCS-specific operations.
+
+This usually involves getting the branches of a repository and recursively loading
+revisions in the history (and directory trees in these revisions),
+until a known revision is found.
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,17 +1,9 @@
-.. _swh-loader-core:
-
-Software Heritage - Loader foundations
-======================================
-
-Low-level loading utilities and helpers used by other loaders.
-
+.. include:: README.rst
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
-   README
-
 Reference Documentation
 -----------------------
 
diff --git a/setup.py b/setup.py
--- a/setup.py
+++ b/setup.py
@@ -12,7 +12,7 @@ here = path.abspath(path.dirname(__file__))
 
 
 # Get the long description from the README file
-with open(path.join(here, "README.md"), encoding="utf-8") as f:
+with open(path.join(here, "README.rst"), encoding="utf-8") as f:
     long_description = f.read()
 
 
diff --git a/swh/loader/package/loader.py b/swh/loader/package/loader.py
--- a/swh/loader/package/loader.py
+++ b/swh/loader/package/loader.py
@@ -468,24 +468,30 @@ def load(self) -> Dict:
         """Load for a specific origin the associated contents.
 
-        for each package version of the origin
+        1. Get the list of versions in an origin.
 
-        1. Fetch the files for one package version By default, this can be
+        2. Get the snapshot from the previous run of the loader,
+           and filter out versions that were already loaded, if their
+           :term:`extids <extid>` match.
+
+        Then, for each remaining version in the origin:
+
+        3. Fetch the files for one package version. By default, this can be
            implemented as a simple HTTP request. Loaders with more specific
            requirements can override this, e.g.: the PyPI loader checks the
            integrity of the downloaded files; the Debian loader has to download
            and check several files for one package version.
 
-        2. Extract the downloaded files By default, this would be a universal
+        4. Extract the downloaded files. By default, this would be a universal
            archive/tarball extraction.
 
            Loaders for specific formats can override this method (for instance,
            the Debian loader uses dpkg-source -x).
 
-        3. Convert the extracted directory to a set of Software Heritage
+        5. Convert the extracted directory to a set of Software Heritage
            objects Using swh.model.from_disk.
 
-        4. Extract the metadata from the unpacked directories This would only
+        6. Extract the metadata from the unpacked directories. This would only
            be applicable for "smart" loaders like npm (parsing the
            package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing
            debian/changelog and debian/control).
@@ -495,15 +501,15 @@
            revision/release objects (authors, dates) as an argument to the
            task.
 
-        5. Generate the revision/release objects for the given version. From
-           the data generated at steps 3 and 4.
+        7. Generate the revision/release objects for the given version. From
+           the data generated at steps 5 and 6.
 
         end for each
 
-        6. Generate and load the snapshot for the visit
+        8. Generate and load the snapshot for the visit
 
-        Using the revisions/releases collected at step 5., and the branch
-        information from step 0., generate a snapshot and load it into the
+        Using the revisions/releases collected at step 7., and the branch
+        information from step 2., generate a snapshot and load it into the
         Software Heritage archive
 
         """
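
The control flow described in the `load` docstring can be outlined roughly as follows. This is only a sketch under stated assumptions: `load_versions`, `extid_of`, `process`, and the `releases/<version>` branch naming are hypothetical names chosen for illustration and are not part of `swh.loader.package`; the real `PackageLoader.load` works with storage, snapshots, and model objects that are omitted here.

```python
from typing import Callable, Dict, Iterable, List, Optional


def load_versions(
    versions: Iterable[str],
    known: Dict[str, str],
    extid_of: Callable[[str], str],
    process: Callable[[str], Optional[str]],
) -> Dict[str, object]:
    """Hypothetical outline of the documented steps, not the real loader API.

    ``versions`` stands in for step 1 (the versions listed for an origin).
    ``known`` stands in for step 2: a mapping from extid to the release
    identifier recorded by a previous run.
    ``process`` stands in for steps 3 to 7 (fetch, extract, convert,
    read metadata, build the release) and returns a release identifier,
    or None when the version could not be loaded.
    """
    branches: Dict[str, str] = {}
    loaded: List[str] = []

    for version in versions:
        extid = extid_of(version)
        if extid in known:
            # Already loaded in a previous run: reuse the known target
            # instead of fetching and extracting the files again.
            branches[f"releases/{version}"] = known[extid]
            continue

        release_id = process(version)
        if release_id is not None:
            branches[f"releases/{version}"] = release_id
            loaded.append(version)

    # Step 8: the snapshot is built from every collected branch,
    # whether it was reused or newly loaded.
    return {"snapshot": branches, "loaded": loaded}
```

Calling it with one already-known version and one new version shows that only the new one goes through `process`, while both end up in the snapshot.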