Page MenuHomeSoftware Heritage

swh.loader.package: Implement a method to prepare package visit
AbandonedPublic

Authored by ardumont on Jul 7 2019, 10:10 PM.

Details

Summary

Implement methods to convert the metadata available for
each version of the focused package to a standard format.
The metadata of all the versions of a focused package can
either be passed by the lister or could be obtained by an API
call.
fetch_metadata() method can be overridden to get the
metadata of the focused package by the appropriate means
and convert it into a standard format.

Later, an origin is created by using the metadata provided
by the fetch_metadata() method.

Related T1389#31790

Closes T1389

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
b1
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 7133
Build 10057: tox-on-jenkinsJenkins
Build 10056: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

There is a concern regarding base loader that its implementation style could invite the problems that are present in lister base code. Here is my take on that.
There are my thoughts on this topic. These are the points I keep in mind while making the base loader.

  • Lister usually deals with the API provided by the code hosting platform(i.e. it has a lot of contact with the outside world). Hence this increases the chances of rendering the base lister useless because there is a lot how variety in the API provided. Ranging from pagination followed by GitHub API to one-page API of pypi all the way to download a JSON file to get all repositories in GNU, and never to forget using two languages like in CRAN.

There is a massive variety of APIs that can be found in the wild and drafting a uniform process to satisfy all of these needs is near impossible.
But what different done in base loader is the part where there could be a need to contact API to get metadata, where things can go wild, that part is not even touched in the implementation. i.e. it just wants all the metadata in a specific format(convert_to_standard_format()) and doesn't care how it is done, giving the freedom to choose the best way and implement it.

  • Moving on, comes the part of getting the last snapshot and extracting already archived version. As this process is mainly concealed within the swh environment, here reducing the possibility to encounter some edge cases in the wild. So this process is automated by the base loader.
  • Next step is to filter out previously archived packages and download and unzip the new one. Things could be a bit tricky here because there could different needs of different package managers, but for this stage, I have tried to break the process into small methods, so that only a small portion of code is needed to be overridden if required. Currently, there are two ways used to identify if the package is archived, one using a field in metadata and other by if-modified-then request. This part might need attention in future, but I assure it would work pretty decent for now( and for the near future).
  • Now comes the step to create a revision, this will also vary as per the availability of metadata, hence for this, I have provided a hook point for every key of the revision dict separately and a default value for all the field, to make sure smooth execution is future. There is also a hook point given in the end modify_revision() to eliminate the need of overriding the compute_revision()
  • Next is storing the data and clean up unnecessary files. Again this has no interaction with the outside world, only limited to swh environment. The base loader automates this process

So as you can see, to avoid committing the same mistake that was committed in Lister, I tried to stay away from the steps that are too uncertain, I also have been attempting to provide enough hook points on the steps where there is a slight chance of encountering something unexpected. These are the precautionary measures taken in the base loader to ensure its smooth working and so that it can serve its purpose of easing up the creating of a now loader.

  • Increase the number of hookpoints

You might notice some vage comments like # Do something, they are just for my referance purpose and I wil rewrite them into something meaningful in next revision

swh/loader/base/build_revision.py
11 ↗(On Diff #5856)

Please as mentioned already, use CamelCase as defined everywhere else (same goes for other classes).

58 ↗(On Diff #5856)

I realize that it's not necessary true.
The offset should be part of the find_date implementation...

swh/loader/base/dowload.py
226 ↗(On Diff #5856)

The docstring is below the variable definition.

compare_field = None
"""Field used to identify if the p..."""
nahimilega marked 4 inline comments as done.
  • Improve docstrings
  • Perform recommended changes
nahimilega added inline comments.
swh/loader/base/build_revision.py
58 ↗(On Diff #5856)

Yes, you are correct, actually, it is already the part of find_date, but I realized it now

Add docstring to method (explicit rather than implicit)

I have broken big methods into small ones, like for download_generate_hash() I have converted some of its parts as a separate method(eg uncompress_tarball() ,check_file(), write_file() to make the code more readable. And I have mentioned docstrings in all of them. So do I also need to add an explicit comment for the same or the docstrings are enough?

swh/loader/base/loader.py
276–285 ↗(On Diff #5884)

Can you please check once for this method that docstring appropriate and correctly explains what is done in the function.
I am not sure about this method that the thing I stated in docstring is actually done by this method

Submitting my comments already.

swh/loader/base/dowload.py
226 ↗(On Diff #5856)

ping ;)

swh/loader/base/loader.py
60 ↗(On Diff #5884)

Those should be docstrings, not values of the variables.

loader_name = None
"""Package manager name"""
class_name = None
"""Loader class name"""
etc...
64 ↗(On Diff #5884)

If those are for the configuration file to load, we are moving away from this (even though, that's not apparent in the code).
Please remove.

We use the environment variable stanza (i explained to you in irc) to deploy those, be that in the docker-dev or the production environment.

276–285 ↗(On Diff #5884)

It's the method we use everywhere to compute hashes on a file.

This method does 2 things, given 1 response which holds a raw file to some request:

  • compute the hashes (sha1, sha1_git, sha256, blake2s) of the file
  • write the raw file in filepath

Implementation wise, this does this in one round-trip.

It then returns the hashes (dict) of the file.

IIRC, the writing part was done in a cache (on disk).

453 ↗(On Diff #5884)

I don't understand that method.

I did not read everything yet.
I just submit some comments i forgot to submit yesterday.

swh/loader/base/loader.py
453 ↗(On Diff #5884)

There is a HEAD for every branch that is created by the loader.
In NPM loader this head branch is predicted by the latest version ie the latest version of the packages is the head of the branch.

Now, this is possible when the loader with NPM to determine the latest release because the version of the package is provided in the metadata.

But there could be some package manager(in fact many) which do not provide the version metadata or some metadata which could be used to determine the latest release to determine the HEAD.

So I am not sure how to make properly identify the head of the branch hence for testing purpose I have done if True:

One possible way out could be we just leave this method to be made in all the loaders separately(as done in convert_to_standard_format().

Is there any better approach to solve this problem?

And is it necessary to have HEAD for the branch?

458–460 ↗(On Diff #5884)

Improve docstring

458–466 ↗(On Diff #5884)

Same problem of unavailability of metadata as mentioned in the inline comment find_head().
Same solution:
One possible way out could be we just leave this method to be made in all the loaders separately(as done in convert_to_standard_format().

Is there any better approach to solve this problem?

nahimilega marked 2 inline comments as done.
  • Made recommended changes
  • Remove cran loader from this diff
swh/loader/base/loader.py
64 ↗(On Diff #5884)

Shall I also mention in the docstring or someplace tp environment variable stanza. Like.

export SWH_CONFIG_FILENAME=/path/to/loader-gnu.yml
ardumont added inline comments.
swh/loader/base/loader.py
71 ↗(On Diff #5935)

Also, i'd need to check but the additional_config is also something to deprecate (so to remove here)...
While that documents a bit what's the expectations around config is.
That's not something that we want to keep.

Rationale: if it's not maintained anymore, it's dead code, so that must go away.

64 ↗(On Diff #5884)

I don't really know where to define that part without duplicating... (it's already in the docker-dev for other loaders and in prod since we need it to work there ;)
So maybe, let's not spend too much time on this documentation part for now.

nahimilega marked an inline comment as done.
  • Fixed bugs found in tests

Fixed bugs found in tests

I tried to track down the fixes with phabricator to no avail.
Can you please comment on those, i'm curious?
TIA ;)

Here are the bugs I found and fixed ;)

swh/loader/base/dowload.py
335 ↗(On Diff #5972)

Changed known_versions=None to known_versions={} because if known_versions is None then it would create a problem later in code

swh/loader/base/loader.py
185 ↗(On Diff #5972)

Change the function name from known_versions() to known_version() because there was a class variable self.known_versions.

281 ↗(On Diff #5972)

previously function definition was def check_file(self, length, filepath): but it was called as self.check_file(filepath, length) which lead to assigning of filepath value to length variable. Fixed this bug

@nahimilega , I added a first batch of inline comments to what you implemented so far.

This is moving to a good direction but there is plenty of room for improvements and code simplification before we can land this.

I need to get deeper in your code before suggesting more implementation improvements.

swh/loader/base/dowload.py
1 ↗(On Diff #5972)

I am not a big fan of using inheritance in derived package loaders implementation (CRAN, GNU, ...) to declare the package tarballs retrieval behavior.
I would rather use a single class, named PackageDownloader for instance, that will internally be used by the PackageLoader class through composition instead.

Derived classes implementing real packages loader (GNU, CRAN, ..) should only inherit from the PackageLoader class and reimplement some hook methods to override the default behavior.

I need to further analyze what is implemented in that file before suggesting improvements.

swh/loader/base/loader.py
28 ↗(On Diff #5972)

How about renaming that class to PackageLoader ? This feels more explicit about what it is intended to do.

Also, I would rename the folder containing these new source files from base to package.
Maybe this could also be moved in a new dedicated swh-loader-package repository, but let's
keep it there at the moment.

83–118 ↗(On Diff #5972)

From my point of view, this method is useless and could be removed if we slightly update the package listers implementation (GNU, CRAN, npm, PyPI, ...). Those listers should provide the following parameters to the loader: name, origin_url and tarballs.

144 ↗(On Diff #5972)

The correct signature of this method is:

def prepare_origin_visit(self, *args, **kwargs):

The loader will fail to execute when executed by a celery worker otherwise (this is what we used in production to schedule and execute the loading of software origins in the archive)..

156 ↗(On Diff #5972)

If we remove the convert_to_standard_format method and ensure consistent inputs provided by the listers to that loader, this could be changed in:

self.package_details = kwargs
169 ↗(On Diff #5972)

Same here, correct signature is:

def prepare(self, *args, **kwargs):
185 ↗(On Diff #5972)

I would rather rename it to get_known_versions, the plural use is important here

187 ↗(On Diff #5972)

new_versions seems a better name here

swh/loader/base/revision.py
11 ↗(On Diff #5972)

Please capitalize the first letter of that class name.

swh/loader/base/dowload.py
1 ↗(On Diff #5972)

I am not a big fan of using inheritance in derived package loaders implementation (CRAN, GNU, ...) to declare the package tarballs retrieval behavior.

Yes, thank you. I did not find the right way to explicit that.

We must try and keep separated what's logically separated (artefacts retrieval, ingestion, etc...).

I would rather use a single class, named PackageDownloader for instance, that will internally be used by the PackageLoader class through composition instead.

Yes, that's what i call a collaborator.
PackageDownloader is a collaborator of the PackageLoader.
It's what's being done in the pypi and npm loader IIRC with ArchiveFetcher (or something).

swh/loader/base/loader.py
28 ↗(On Diff #5972)

How about renaming that class to PackageLoader ? This feels more explicit about what it is intended to do.

I like it better indeed!

Also, I would rename the folder containing these new source files from base to package.

I asked to move those back in core.
But your proposal is way better.
Please do so in the end ;)

Maybe this could also be moved in a new dedicated swh-loader-package repository, but let's

keep it there at the moment.

That's what we (olasd, nahimilega and me) agreed on from the start.
First let's keep it here.
Then, when we are mostly satisfied, we'll move it.

83–118 ↗(On Diff #5972)

I agree again.
If we converge with a use case, we can change the lister implementation to unify their output (scheduled tasks).

Note that it means a migration from the existing recurring tasks in the scheduler db (out of scope from this task ;)

185 ↗(On Diff #5972)

i agree!

swh/loader/base/revision.py
17 ↗(On Diff #5972)

Also, we'll also need to be able to build "release"?

loader-deposit for one will need it.

nahimilega added inline comments.
swh/loader/base/loader.py
83–118 ↗(On Diff #5972)

I am again not sure we should do this because, as once olasd told that we parallelize the loading task. Hence we should not keep the API call(to get metadata) step in the listers.
For package managers for a huge amount of packages, lister will take up a lot of time to complete whereas this could be done in the loader and make it fast using parallelization.

156 ↗(On Diff #5972)

I don't think that's a good idea. convert_to_standard_format is provided for loaders where we need to make an API call to get the metadata(eg pypi).

swh/loader/base/loader.py
83–118 ↗(On Diff #5972)

Something rings a bell indeed.
Moving the metadata retrieval done lister side to the loader-side (as the debian loader does).
But i don't remember the details of that exchange (most probably, this is detailed in the task).

Your sentences are a bit unclear.

I am again not sure we should do this because, as once olasd told that we parallelize the loading task. Hence we should not keep the API call(to get metadata) step in the listers.

I'm not sure i see the causality between the parallelize the loading-task and moving the api call to solve metadata loader side.

For package managers for a huge amount of packages, lister will take up a lot of time to complete whereas this could be done in the loader and make it fast using parallelization.

Same here.
Be that loaders or listers, they all doing their job concurrently.

That a lister takes some time to list is not per say a problem.
Because surely, it does not exceed the worst case we already have on loaders.
As of today, given certain origins, some loader-git/mercurial/svn takes forever to finish...


In any case, if this method is the one that solves the metadata loading, it's not clear from its name.
I now realize i must have skim through though as i see the docstring tries to explain this.
Still, i think the name should be improved upon to explicit that.

swh/loader/base/loader.py
83–118 ↗(On Diff #5972)

I'm not sure I see the causality between the parallelise the loading-task and moving the API call to solve metadata loader side.

My point was, when I was making packagist lister, in the first pass, I kept the metadata retrieval step (it would send an api request for each package to get metadata) in the lister itself. While testing it, on my pc, it took like 3 hours to list only about 5 % of the total packages.

Now if this task is done in loading. The lister will pass the metadata URL of the package to the loader. And we will have a separate loading task for each package. If we can parallelise the task(i.e. run many loading tasks at the same time), then we could make metadata request for many packages at the same time, and it will make the whole process fast.
As compared it lister where first all the metadata calls are made, the loading task is created.
(I am emphasising on metadata request part so much because while testing listers, I felt this step is like a bottleneck. it is slow)

One more point it will make the whole system more robust. As let us suppose that there are exceptions in the metadata API for one or two package, like some field is not present or was present at the time lister was developed now removed.
Now if we make that metadata request in lister, then it will encounter an error, and the whole process might stop(although this possibility could be removed if the lister code is robust and written to handle these error).
But if the metadata request is made in the loader, then the loading task of that particular package will fail. And everything will still work fine.

Hope it is a bit clear what I am trying to say ;)

In any case, if this method is the one that solves the metadata loading, it's not clear from its name.

I will try to make it more explicit, but actually, the task of that method is to make the metadata request(if needed) and convert the output coming from lister to a standard format which is used in the loader. The standard format is the format as shown in docstring.

swh/loader/base/loader.py
83–118 ↗(On Diff #5972)

@nahimilega , I totally agree with you that the metadata extraction for a package must be handled loader-side. This calls indeed for a dedicated method to perform that task in the PackageLoader class that derived ones must reimplement.

But keep in mind that metadata extraction must be performed for each version of a package and that most of the times there is no need to perform an API call to a remote service as package content usually provide a file filled with these information that can be easily parsed (package.json, PKG-INFO, ...).

So trying to extract metadata before iterating on each version of a package to load into the archive is not the right solution.

I was also confused by the naming of the method, convert_to_standard_format is not reallly explicit about its role. As @ardumont explained, every processiong step for loading the different versions of a package into the archive should be clearly identified with enough granularity to cover all package managers available in the wild. This implies a lot of brainstorming to do it well and define an API that will be easily usable and maintainable ;-)

Let's continue that work next week !

I think the role to each and every method in PackageLoader is not explicit enough or I am not able to make it. explicit enough(electric communication is hard;) ).
To elaborate upon the working of PackageLoader I have made this diagram.

Hope this helps ;)

swh/loader/base/loader.py
83–118 ↗(On Diff #5972)

But keep in mind that metadata extraction must be performed for each version of a package.

But for the package managers where we don't even know the versions(eg pypi) and we need to make a API call to get a list of package version, in that case, convert_to_standard_format can be used.

And for extraction of metadata from the file provided(package.json, PKG-INFO, ...) extract_metadata method can be used. It iterated on each version of a package.

  • Change class name
  • Improve docstring of convert_to_standard_format() and extract_metadata()
  • Change signature of prepare_origin_visit() and prepare()

@anlambert and i discussed this, just need some time to digest and summarize to get back to you on this.

@anlambert and i discussed this, just need some time to digest and summarize to get back to you on this.

Thanks for informing, please do tell if some part is a bit unclear, I will try to improve it ;)

Hello, finally took the time to put in paper our reflexion.

So here it goes, please:

  • add a thorough explanation of what's the diff is all about (targetting the related tasks/diffs)
  • be incremental in your approach. Currently, there are too many indirections which takes too long to follow through to understand what's what.
  • be incremental in your commits. Puts logical building blocks that belong together in the same commit.
  • be incremental in your diffs. One rule we are trying to put in place, "one commit, one diff".
  • Do not refactor prior to the need you are trying to implement. This usually leads to over-engineering. Also a longer review that does not come through.
  • Add tests immediately on the code you wrote (if it's not testable, that could be a sign that something is unclear)
  • If there is no test in the diffs, there must be an explanation in the diff description about it

So now concretely for the current diff, that means:

  • Add a brief description of what you are adding
  • class IfModifiedSince is not used in the base loader, so please remove it from here (probably used in the cran loader, so please put that there)
  • Merge IfModifedSince, compareField classes' behavior in the PackageLoader *when* it will be needed (so not that diff yet)
  • Same for ConstructRevision. There is no need to open so many hook points yet. At first, i'll open just a build_revision (or compute_revision) method in the PackageLoader (with a standard revision build)
  • An idea, take the graph you drew. For each box, make a logical commit about it. It's not mandatory to build the base package loader in one commit (please don't). To the contrary, building block by block will make things clearer for everyone.

Which concretely means, you need to rework your current code to show
incremental tested steps.

Hoping that helps.

Cheers,

This revision now requires changes to proceed.Jul 31 2019, 12:13 PM
  • Follow the incremental approach and break the diff into different small parts.
nahimilega retitled this revision from swh.loader.base: Implement base package manager loader to swh.loader.package: Implement a method to prepare package visit.Aug 3 2019, 10:06 PM
nahimilega edited the summary of this revision. (Show Details)
ardumont abandoned this revision.
ardumont edited reviewers, added: nahimilega; removed: ardumont.