Page MenuHomeSoftware Heritage

swh.lister.packagist
ClosedPublic

Authored by nahimilega on Jun 14 2019, 12:59 PM.

Details

Summary

Implement a packagist lister to list the release and
source code origin for all the packages

Close T1923

Diff Detail

Repository
rDLS Listers
Branch
pack
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6934
Build 9740: tox-on-jenkinsJenkins
Build 9739: arc lint + arc unit

Event Timeline

vlorentz added inline comments.
swh/lister/packagist/lister.py
76–123

This should be in a code block (triple backquotes)

206–230

same

  • Made test to support different orders
  • Put triple backquote around docstring
  • Made change in the way api response is requested

I tried to run this for first 5 packages in packagist, and it worked fine

Here are the few tasks created by the lister

Task 15952
  Next run: just now (2019-06-14 12:22:58+00:00)
  Interval: 64 days, 0:00:00
  Type: load-git
  Policy: recurring
  Status: next_run_not_scheduled
  Priority: 
  Args:
    'https://github.com/0099FF/dialogflow-php.git'
  Keyword args:
Task 15953
  Next run: just now (2019-06-14 12:22:58+00:00)
  Interval: 90 days, 0:00:00
  Type: load-packagist
  Policy: recurring
  Status: next_run_not_scheduled
  Priority: 
  Args:
    '0.0.0/laravel-env-shim'
    None
  Keyword args:
    tarballs: [{'url': 'https://api.github.com/repos/phpexpertsinc/Laravel57-env-polyfill/zipball/5d4eb47b7469defe3e0dcc0c1f74ac585ca9f90f', 'vcs': 'zip', 'description': "Backward compatible support for Laravel's pre-5.8 env() function."}, {'url': 'https://api.github.com/repos/phpexpertsinc/Laravel57-env-polyfill/zipball/4dc9876f8fa9064c672e1811375d2e11828cb10b', 'vcs': 'zip', 'description': "Backward compatible support for Laravel's pre-5.8 env() function."}, {'url': 'https://api.github.com/repos/phpexpertsinc/Laravel57-env-polyfill/zipball/bfc0ba73d4567b88aae325754e9c389d75d20f99', 'vcs': 'zip', 'description': "Backward compatible support for Laravel's pre-5.8 env() function."}]
swh/lister/packagist/lister.py
76–102

Here I can't shrink the example shown because it clearly to show the source and release of more than one package to get the perfect idea of what the function is doing

swh/lister/packagist/lister.py
127–132

Here try is used because some packages, even though listed in the list of all the packages, still give 404 message.
example
https://repo.packagist.org/p/aura/cli-p.json

package name - aura/cli-p

@nahimilega Heads up, this needs a rebase (same goes for the packagist lister).
There will probably be impacts from D1628.

Currently, this lister visits the page of each and every package to get all the versions of a particular package. This approach is really slow, for me it took 2 hours to list 6% of the total package.
One thing that could be done is, this lister only the name of the package and the URL of metadata for a package and then loader visits that URL to get all the versions(like done in pypi lister). This approach will take only a couple of seconds and reduce the code for the lister.

Currently, this lister visits the page of each and every package to get all the versions of a particular package. This approach is really slow, for me it took 2 hours to list 6% of the total package.

ok

One thing that could be done is, this lister only the name of the package and the URL of metadata for a package and

I suppose you meant this lists only the name....

then loader visits that URL to get all the versions(like done in pypi lister).

And then we could have that behavior fall under the scope of the distribution loader then, is that it?

This approach will take only a couple of seconds and reduce the code for the lister.

sound better indeed.

I suppose you meant this lists only the name....

Ya, sorry for the typo

And then we could have that behaviour fall under the scope of the distribution loader then, is that it?

Yes, as we can deploy multiple workers for loading(I am not sure if I used correct technical terms here), hence it would speed up the process.

sound better indeed.

So I guess I should change the approach of this lister implementation ;)

From afar:

  • it's missing some tests
  • probably a rebase as well (i don't see cgit in the list).
swh/lister/packagist/lister.py
49

Why do you need to overwrite this method?

And the docstring is now out of sync with what's actually done.

probably a rebase as well (i don't see cgit in the list).

no, i saw cgit.
Maybe that still needs a rebase though ;)

  • Rebase on latest master
  • Remove safety_issue_request() method
  • Check for out of sync docstrings
  • Update git commit message according to new approach

it's missing some tests

I didn't add test cases for lister because the code of lister is minimal in itself and it is just iterating over the name of the packages. There is no function outside the lister class, so it would be a bit difficult to write test cases here. Moreover, this lister is quite similar to pypi lister, and it does not have any separate test for the lister.

Do we need test cases for the lister?

One thing on I am not sure about regarding this approach :
The description page also provides the upstream link of the package along with the link to tarballs. In the previous approach, a separate loader task was created for those upstream link to utilise all the information provided and further increase the archive coverage.

But as now the visit to the description page will the part of the packagist loader, so I want to ask can we still utilise the upstream link. I mean can one loader(packagist loader in this case) call another loader task(e.g git loader)?

I didn't add test cases for lister because the code of lister is minimal in itself and it is just iterating over the name of the packages.

So the tests should be simple enough to add.
Following the same scheme (files which reproduce the api's response) you already implemented in other listers.

Moreover, this lister is quite similar to pypi lister, and it does not have any separate test for the lister.

It's wrong that the pypi lister has no tests.
My bad. I was focused on getting it to production.
Then again, we don't want to reproduce that bad behavior.

T1890 to the rescue!

Do we need test cases for the lister?

Yes, we do.
We try to have a minimal tests coverage to ease maintenance in the long run.

While the lister tests are not that fun to read, understand, nor maintain, the scaffolding is there.
So might as well use it for now.

One thing on I am not sure about regarding this approach :
The description page also provides the upstream link of the package along with the link to tarballs. In the previous approach, a separate loader task was created for those upstream link to utilise all the information provided and further increase the archive coverage.
But as now the visit to the description page will the part of the packagist loader, so I want to ask can we still utilise the upstream link. I mean can one loader(packagist loader in this case) call another loader task(e.g git loader)?

I'm sorry, i did not follow.

I'm sorry, I did not follow.

According to the previous approach, it first finds names of all the packages and then visits their metadata url to get the tarball url of all the release.
eg for a package name 'monolog/monolog'
Its metadata page(https://repo.packagist.org/p/monolog/monolog.json) will give tarball url of all the release.
Something like this

{
   "release": [
       {
       "vcs": "zip",
       "url": "https://api.github.com/repos/Seldaek/monolog/zipball/
               433b98d4218c181bae01865901aac045585e8a1a",
       "description": "Logging for PHP 5.3"
       },
       {
       "vcs": "zip",
       "url": "https://api.github.com/repos/Seldaek/monolog/zipball/
               5e651a82b4b03d267da6084720ada0cd398c8d16",
       "description": "Logging for PHP 5.3"
       },
       ...
       ]
   "source": [
       {
       "vcs": "git",
       "url": "https://github.com/Seldaek/monolog.git",
       "description": "Logging for PHP 5.3"
       },
       ...
       ]

Here, as you can see the API response also provide upstream repository under the key source. To utilize this, in the previous approach, lister was creating a task of respective vcs loader(git loader in this example) and also created a packagist loader task to ingest tarballs provided.

Now with a new approach, lister only created the packagist loader task with names of all the packages. The step of visiting metadata page will be implemented in packagist loader. So to ingest the upstream repository under the key source the packagist loader will need to create a git loader task. So I want to ask is it possible that packagist loader can create git loader task. If not, is there any other way to ingest the upstream repository?

Thanks for clarifying, i was out of context for a minute.

Now with a new approach, lister only created the packagist loader task with names of all the packages. The step of visiting metadata page will be implemented in packagist loader.

yes.

So to ingest the upstream repository under the key source the packagist loader will need to create a git loader task. So I want to ask is it possible that packagist loader can create git loader task. If not, is there any other way to ingest the upstream repository?

As exchanged during our weekly plan, we will skip that problematic for the moment.

  • Add tests for lister.py
  • rebase on latest master

I see you define only tests for the new endpoints you define in packagist lister.

Is there a blocking point preventing you from implementing the "usual" tests here?

(Request changes for the sake of the question)

This revision now requires changes to proceed.Jul 15 2019, 10:28 AM

There is a need for variables like test_re (Compiled regex matching the server url. Must capture the index value), whereas these listers (packagist and pypi) do not have an index value, they just make a request to one page and parse over that request to get the info, hence I didn't use the conventional method of testing (method using HttpListerTester class)
(Same implies for pypi)

Heads up, this will need a rebase when D1733 lands.

I tested this docker. It is running fine. Here is a task created by the lister

Task 8
  Next run: 2 minutes ago (2019-07-18 18:25:07+00:00)
  Interval: 90 days, 0:00:00
  Type: load-packagist
  Policy: recurring
  Status: next_run_scheduled
  Priority: 
  Args:
    'hypejunction/hypegamemechanics'
    'https://repo.packagist.org/p/hypejunction/hypegamemechanics.json'
  Keyword args:
  • rebase on latest master
  • fixed few bugs
  • ran in docker

Hope this diff is ready to land ;)

Hope this diff is ready to land ;)

Almost (technically it's good).

It misses some documentation I forgot to ask earlier (even in other listers).

We should document the output of the lister, for example:

  • task-type: load-packagist, etc...
  • arguments: [name, url]

Here is the list i can think of for this lister, please complete it and amend the PackagitLister's docstring:

  • load-git (i'm not sure about this one, i recall it did at some point)
  • load-packagist

maybe more, i'm not sure.

swh/lister/packagist/lister.py
14

Please add a docstring to the class.

Also mentions the actual output of the lister (task-type).
That will simplify when we need to deploy.

Granted, this should be added to other listers as well.
But let's keep this for another time.

  • Update docstring of the class

This lister do not take any arguments, but for the listers like phabricator, we can also have an Args: section in the docstring.

load-git (i'm not sure about this one, i recall it did at some point)

Yes, it was there in the old implementation, but in the new one we only have load-packagist

swh/lister/packagist/lister.py
19

in the Packagist package manager. ...`

23

Task:

40

Please, drop from the sample the extra fields, keep the ones you showed in the Task_created parameter.
Rationale is those are from the scheduler so not really necessary to show them.
Also that keeps the docstring shorter ;)

This revision now requires changes to proceed.Jul 19 2019, 4:21 PM

Almost there, please fix the typos and change slightly the format.
Then i'm good for landing this.

And now, there will be a need to add a 'load-packagist' task-type in scheduler.

And now, there will be a need to add a 'load-packagist' task-type in scheduler.

Do we need to do this in this diff itself?

I was thinking we can make this task of adding docstrings for all lister class as an Easy-Hack for first-time contributors.

I was thinking we can make this task of adding docstrings for all lister class as an Easy-Hack for first-time contributors.

Yes, why not.
The description must be thorough though (or else, it's not such an easy hack ;)

This revision is now accepted and ready to land.Jul 19 2019, 4:37 PM
This revision was automatically updated to reflect the committed changes.