add SourceForge projects lister based on the use of rsync
Abandoned, Public

Authored by anlambert on Nov 2 2017, 6:22 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T735: SourceForge lister
Summary

First draft of a lister for projects hosted on the legacy SourceForge platform.

This lister differs a little from the others because the SourceForge
REST API does not provide a way to list hosted projects. Instead, we use an rsync
mirror of files hosted on SourceForge (typically binaries and tarballs)
to get the project names.
Those files are located in folders that correspond to project names in the
rsync mirror. Once we have a project name, we can easily get its metadata
through the SourceForge REST API: https://sourceforge.net/p/<project_name>,
notably which type of VCS the project uses, and from that derive the origin URL
for scheduling a SWH visit.
Some preprocessing is done for each project to ensure that the code
repositories found are not empty (many projects have empty repositories).
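
To illustrate the flow described above, here is a minimal sketch, not the code from this diff. It assumes that the mirror is listed with `rsync --list-only`, that each output line has the shape "permissions size date time path", and that project folders sit at the top level of the listed module. The helper names (`project_names`, `metadata_url`, `vcs_tools`) are hypothetical, and the `tools` / `name` / `mount_point` keys in the metadata are an assumption about the JSON returned for a project.

```python
import re

# Assumed shape of one `rsync --list-only` output line:
#   drwxr-xr-x          4,096 2017/11/02 18:22:10 some/path
LISTING_RE = re.compile(
    r"^(?P<perms>\S{10})\s+(?P<size>[\d,]+)\s+"
    r"(?P<date>\d{4}/\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<path>.+)$"
)

# VCS types mentioned in the summary statistics.
SUPPORTED_VCS = {"git", "svn", "hg", "cvs", "bzr"}


def project_names(listing_lines):
    """Return the sorted, de-duplicated top-level folder names.

    Assumption: each top-level folder of the listing is one project.
    """
    names = set()
    for line in listing_lines:
        match = LISTING_RE.match(line.rstrip("\n"))
        # Skip lines that do not parse and entries that are not directories.
        if match is None or not match["perms"].startswith("d"):
            continue
        # Keep only the first path component, ignoring the "." root entry.
        name = match["path"].split("/", 1)[0]
        if name and name != ".":
            names.add(name)
    return sorted(names)


def metadata_url(project_name):
    """Build the per-project URL quoted in the summary."""
    return f"https://sourceforge.net/p/{project_name}"


def vcs_tools(metadata):
    """Return (vcs_type, mount_point) pairs found in project metadata.

    Assumption: `metadata` is a dict with a "tools" list whose items
    carry "name" and "mount_point" keys.
    """
    return [
        (tool["name"], tool.get("mount_point", ""))
        for tool in metadata.get("tools", [])
        if tool.get("name") in SUPPORTED_VCS
    ]
```

The emptiness check and the construction of the origin URL are left out on purpose, since the summary does not say how they are done.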

Some statistics after the full run of the lister in my local environment:

  • number of projects referenced on rsync mirror: 13274
  • number of projects with non-empty code repositories: 6023 with:
    • 1110 git repos
    • 3237 svn repos
    • 132 hg repos
    • 1522 cvs repos
    • 22 bzr repos
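
As a quick consistency check on the figures above, the following sketch recomputes the total from the per-VCS counts and derives the share of each VCS type. Only the numbers come from the summary; the variable names are illustrative.

```python
# Figures copied from the statistics above.
projects_on_mirror = 13274
repos_by_vcs = {"git": 1110, "svn": 3237, "hg": 132, "cvs": 1522, "bzr": 22}

# The per-VCS counts should add up to the reported 6023 projects.
non_empty_total = sum(repos_by_vcs.values())

# Fraction of mirrored projects that have a non-empty code repository.
non_empty_share = non_empty_total / projects_on_mirror

# Fraction of non-empty repositories for each VCS type.
share_by_vcs = {vcs: count / non_empty_total for vcs, count in repos_by_vcs.items()}

for vcs, share in sorted(share_by_vcs.items(), key=lambda item: -item[1]):
    print(f"{vcs}: {share:.1%}")
print(f"non-empty: {non_empty_share:.1%} of projects on the mirror")
```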

Diff Detail

Repository
rDLS Listers
Branch
sourceforge-lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1076
Build 1416: arc lint + arc unit

Event Timeline

This needs more work, as the current implementation misses many SourceForge projects. I have removed this diff from the review queue until I find a better solution.

I forgot about this diff. This is clearly not the right approach for listing SourceForge projects, as many of them are missing from the rsync mirrors, which only back up release files hosted on SourceForge. Let's abandon it.