packagist (PHP) Lister
Closed, Migrated; Edits Locked

Description

Package listing URL:
https://packagist.org/packages/list.json

Package metadata URL:
https://repo.packagist.org/p/[vendor]/[package].json

Example: https://repo.packagist.org/p/monolog/monolog.json

For each package version, the metadata includes a link to the source code.
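As a rough illustration of how the two endpoints fit together, the sketch below builds the metadata URL for a package name and walks the per-version entries of an already-decoded metadata document. The response layout (a top-level `packages` object keyed by package name, then by version, each entry carrying `source` and `dist` objects) is an assumption, and the function names and sample data are hypothetical.

```python
def metadata_url(name):
    """Build the metadata URL for a 'vendor/package' name."""
    vendor, _, package = name.partition("/")
    if not vendor or not package:
        raise ValueError(f"expected 'vendor/package', got {name!r}")
    return f"https://repo.packagist.org/p/{vendor}/{package}.json"


def iter_versions(metadata, name):
    """Yield (version, source_url, dist_url) for one package.

    Assumes the layout {"packages": {name: {version: {...}}}}.
    A missing 'source' or 'dist' entry is reported as None.
    """
    versions = metadata.get("packages", {}).get(name, {})
    for version, info in versions.items():
        source = info.get("source") or {}
        dist = info.get("dist") or {}
        yield version, source.get("url"), dist.get("url")


# Hand-written sample for illustration, not real Packagist output.
SAMPLE = {
    "packages": {
        "acme/demo": {
            "1.0.0": {
                "source": {"type": "git", "url": "https://example.org/acme/demo.git"},
                "dist": {"type": "zip", "url": "https://example.org/acme/demo-1.0.0.zip"},
            },
            "dev-master": {
                "source": {"type": "git", "url": "https://example.org/acme/demo.git"},
                "dist": None,
            },
        }
    }
}

for version, source_url, dist_url in iter_versions(SAMPLE, "acme/demo"):
    print(version, source_url, dist_url)
```

In a real lister, the decoded document would come from an HTTP request to the URL returned by `metadata_url`; that step is left out here so the sketch runs offline.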

Event Timeline

nahimilega created this task.
nahimilega created this object in space S1 Public.

There are 220570 packages in total.
I wrote a short script to analyse which VCS and code hosting platforms the packages use. Here is the result:

Packages iterated: 15126 (~7% of total packages)
VCS found: git, hg
Number of packages not hosted on GitHub, Bitbucket or GitLab: 51

Extrapolating proportionally from this sample, around 744 packages in total are expected to be hosted somewhere other than GitHub, Bitbucket or GitLab.
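The estimate above is a simple proportional extrapolation, which can be reproduced as follows. This is only a sketch of the arithmetic; it assumes the sampled packages are representative of the whole index, which the thread does not establish.

```python
TOTAL_PACKAGES = 220570      # total number of packages reported above
SAMPLED = 15126              # packages iterated by the script
NOT_ON_MAJOR_FORGES = 51     # sampled packages outside GitHub/Bitbucket/GitLab


def extrapolate(count_in_sample, sample_size, population):
    """Scale a count observed in a sample up to the full population."""
    return count_in_sample / sample_size * population


estimate = extrapolate(NOT_ON_MAJOR_FORGES, SAMPLED, TOTAL_PACKAGES)
sample_share = SAMPLED / TOTAL_PACKAGES

print(f"sample share: {sample_share:.1%}")
print(f"estimated packages outside the major forges: {round(estimate)}")
```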

One of the biggest challenges in implementing this lister is deduplicating the links it lists: most packages are hosted on GitHub, Bitbucket or GitLab, and a large portion of those sites is already listed and ingested.
Hence, while listing the packages in Packagist, we need to leave out those that are already listed by another lister.

One way would be to check whether the listed URL points to GitHub, Bitbucket or GitLab and, if so, skip the package.
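A minimal sketch of such a check, assuming the listed URLs are ordinary `https://` or `git://` style URLs. The function name and the set of hostnames are illustrative choices, not part of any existing lister.

```python
from urllib.parse import urlparse

# Hostnames treated as "already covered by another lister" (assumption).
KNOWN_FORGES = {"github.com", "bitbucket.org", "gitlab.com"}


def is_on_known_forge(url):
    """Return True if the URL's host is a known forge or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == forge or host.endswith("." + forge) for forge in KNOWN_FORGES)


urls = [
    "https://github.com/Seldaek/monolog.git",
    "https://git.example.org/acme/demo.git",
]
# Keep only the URLs that no forge-specific lister would already cover.
to_list = [url for url in urls if not is_on_known_forge(url)]
print(to_list)
```

Note that scp-style addresses such as `git@github.com:vendor/package.git` have no hostname as far as `urlparse` is concerned, so this sketch would not recognise them as forge URLs without extra handling.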

In your analysis, did you look at the source key (which seems to represent the upstream version control repository) or the dist key (which points at the tarball/zipfile that's actually downloaded by the package manager when installing that version of the package)?

I think the main value of the packagist lister would be to reference these dists, as they will point to something that the authors of the project consider a released artifact (even if in practice it's just a reference to an export of a specific git commit). In the event that the dist points to an export of a directory that we have already archived (e.g. through the git lister), then the load operation will not create any new objects in the archive, just point to the existing one.

Once we have the (released) dists figured out, we can also consider -as a second step- having the packagist lister submit tasks for the upstream repositories mentioned in the source key as well, as an extra data source.

We have provisions in the scheduler API to deduplicate tasks, so that creating a new git loading task for an existing origin wouldn't duplicate the task but rather just reference the existing one.

In T1776#32892, @olasd wrote:

In your analysis, did you look at the source key (which seems to represent the upstream version control repository) or the dist key (which points at the tarball/zipfile that's actually downloaded by the package manager when installing that version of the package)?

I looked at the source key in my analysis.

I think the main value of the packagist lister would be to reference these dists, as they will point to something that the authors of the project consider a released artifact (even if in practice it's just a reference to an export of a specific git commit).

I thought listing their upstream repositories would be enough because, as you also mentioned, in practice a dist is just a reference to an export of a specific git commit.

And if we archive both the upstream repositories and the zip files, wouldn't that be duplication, since the contents of the zip files can be found in the upstream repositories? Through the upstream repositories, we also get the whole development history of a package.

We have provisions in the scheduler API to deduplicate tasks, so that creating a new git loading task for an existing origin wouldn't duplicate the task but rather just reference the existing one.

Thanks for this, I didn't know that.

In T1776#32892, @olasd wrote:

In your analysis, did you look at the source key (which seems to represent the upstream version control repository) or the dist key (which points at the tarball/zipfile that's actually downloaded by the package manager when installing that version of the package)?

I looked at the source key in my analysis.

OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from github.

I think the main value of the packagist lister would be to reference these dists, as they will point to something that the authors of the project consider a released artifact (even if in practice it's just a reference to an export of a specific git commit).

I thought listing their upstream repositories would be enough because, as you also mentioned, in practice a dist is just a reference to an export of a specific git commit.

And if we archive both the upstream repositories and the zip files, wouldn't that be duplication, since the contents of the zip files can be found in the upstream repositories? Through the upstream repositories, we also get the whole development history of a package.

For the package manager listers our goal is to record the metadata on what source artifacts were available as *released* through the package manager at the time of the visit. In packagist, this translates to archiving the dists that are available for each version of each package, even if we can see that they are tarballs that come from a given revision of a repository.

Considering that the Software Heritage archive is deduplicated at all levels (in this case, what matters most is the individual file and directory level), archiving the same directories from different sources won't have an impact on storage. We will only archive the files that we haven't archived yet, and we will reference any already archived files directly.
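To make the file-level argument concrete, here is a toy content-addressed store: each file is keyed by a hash of its bytes, so loading the same file from a git checkout and from a zip export stores it only once. This is a deliberately simplified model with hypothetical names; the real archive's object model is richer than this.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store keyed by the SHA-1 of the file bytes."""

    def __init__(self):
        self._objects = {}

    def add(self, data):
        """Store data if unseen; return True only when something new was stored."""
        key = hashlib.sha1(data).hexdigest()
        if key in self._objects:
            return False
        self._objects[key] = data
        return True

    def __len__(self):
        return len(self._objects)


def load_files(store, files):
    """Load a {path: bytes} mapping and return how many files were new."""
    return sum(store.add(data) for data in files.values())


store = ContentStore()
git_checkout = {"README": b"hello", "src/a.php": b"<?php echo 1;"}
zip_export = {"README": b"hello", "src/a.php": b"<?php echo 1;", "composer.json": b"{}"}

print(load_files(store, git_checkout))  # both files are new
print(load_files(store, zip_export))    # only composer.json is new
```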

Using the extra information provided by the API about the package's source repository could also be interesting (for extra coverage), but that shouldn't be the priority for this lister.

OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from GitHub.

I wrote a short script to analyse the location of dists for packages. Here is the result:
Packages iterated: 15336 (~7% of total packages)
Number of packages whose dists were not hosted on GitHub, Bitbucket or GitLab: 2
Dist types found: zip, git

I also found one interesting thing: some of the dists have git links instead of zip links. There were about 6 such packages.
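The original analysis script is not shown in the thread. As a hedged sketch of how such a tally could be made, the function below counts, per dist type, how many packages have at least one version with a dist of that type. The input layout and the sample data are assumptions made for illustration.

```python
from collections import Counter


def dist_type_counts(packages):
    """Count packages per dist type.

    `packages` maps a package name to {version: info}, where info may hold
    a "dist" object with a "type" field (assumed layout). A package is
    counted once per distinct dist type, however many versions it has.
    """
    counts = Counter()
    for versions in packages.values():
        types = {(info.get("dist") or {}).get("type") for info in versions.values()}
        types.discard(None)  # versions without a dist do not count
        counts.update(types)
    return counts


# Hand-written sample for illustration only.
sample = {
    "acme/a": {"1.0": {"dist": {"type": "zip"}}, "1.1": {"dist": {"type": "zip"}}},
    "acme/b": {"dev-master": {"dist": {"type": "git"}}},
    "acme/c": {"2.0": {"dist": {"type": "zip"}}, "dev-master": {"dist": None}},
}
print(dist_type_counts(sample))
```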

Using the extra information provided by the API about the package's source repository could also be interesting (for extra coverage), but that shouldn't be the priority for this lister.

The output of this lister could be a list of all the tarballs for a package and a list of upstream repositories. Both lists would be passed to a packagist loader. The packagist loader could then pass the upstream repositories to their respective loaders according to their version control system, and load the list of tarballs itself (in the same way as the GNU loader loads tarballs from a list of tarballs).

In this way, we can use the extra information provided by the API and fulfil the goal of archiving the dists that are available for each version of each package.

OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from GitHub.

I wrote a short script to analyse the location of dists for packages. Here is the result:
Packages iterated: 15336 (~7% of total packages)
Number of packages whose dists were not hosted on GitHub, Bitbucket or GitLab: 2
Dist types found: zip, git

Alright, thanks.

I also found one interesting thing: some of the dists have git links instead of zip links. There were about 6 such packages.

Do you have an example of such a package?

Using the extra information provided by the API about the package's source repository could also be interesting (for extra coverage), but that shouldn't be the priority for this lister.

The output of this lister could be a list of all the tarballs for a package and a list of upstream repositories. Both lists would be passed to a packagist loader. The packagist loader could then pass the upstream repositories to their respective loaders according to their version control system, and load the list of tarballs itself (in the same way as the GNU loader loads tarballs from a list of tarballs).

In this way, we can use the extra information provided by the API and fulfil the goal of archiving the dists that are available for each version of each package.

There's a good chance that we can just generalize the tar loader enough that we won't need a specific packagist loader (see the last comment in T1389).

The lister would create two sets of tasks:

  • primarily, it would generate "tar"/"packagist" loading tasks, referencing the dists, achieving the coverage for the packagist package manager in Software Heritage
  • concurrently, it would generate "git" loading tasks referencing the source repositories, as an extra. Most of these will be redundant with tasks generated by other forge-specific listers, but that's not an issue as we can deduplicate tasks when sending them to the scheduler backend.
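The two sets of tasks described above could be sketched roughly as follows. The task type names (`load-tar`, `load-git`), the tuple representation, and the in-memory deduplication are all hypothetical stand-ins; in practice the deduplication would be handled by the scheduler backend rather than by the lister.

```python
def tasks_for_package(versions):
    """Build (task_type, url) pairs for one package.

    `versions` maps a version string to an info dict that may hold "dist"
    and "source" objects with "type" and "url" fields (assumed layout).
    """
    tasks = []
    for info in versions.values():
        dist = info.get("dist") or {}
        if dist.get("url"):
            # Primary goal: one loading task per released dist.
            tasks.append(("load-tar", dist["url"]))
        source = info.get("source") or {}
        if source.get("type") == "git" and source.get("url"):
            # Extra coverage: the upstream repository.
            tasks.append(("load-git", source["url"]))
    return tasks


def deduplicate(tasks):
    """Drop repeated tasks while keeping the first occurrence order."""
    seen = set()
    unique = []
    for task in tasks:
        if task not in seen:
            seen.add(task)
            unique.append(task)
    return unique


# Hand-written sample: two versions that share one upstream repository.
versions = {
    "1.0.0": {
        "dist": {"type": "zip", "url": "https://example.org/demo-1.0.0.zip"},
        "source": {"type": "git", "url": "https://example.org/demo.git"},
    },
    "1.1.0": {
        "dist": {"type": "zip", "url": "https://example.org/demo-1.1.0.zip"},
        "source": {"type": "git", "url": "https://example.org/demo.git"},
    },
}
for task in deduplicate(tasks_for_package(versions)):
    print(task)
```

Every version of a package usually points at the same source repository, so the git task collapses to a single entry while each dist keeps its own task.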

I think you've done enough exploratory work and you can move forward with the implementation now.

bchauvet added a parent task: Unknown Object (Maniphest Task). Sep 2 2022, 10:52 AM