First draft of a lister for projects hosted on the legacy SourceForge platform.
This lister differs slightly from the others because the SourceForge
REST API provides no way to list hosted projects. Instead, we use an rsync
mirror of the files hosted on SourceForge (typically binaries and tarballs)
to get the project names.
Those files are located in folders named after the projects in the
rsync mirror. Once we have a project name, we can easily get its metadata
through the SourceForge REST API: https://sourceforge.net/p/<project_name>,
notably which type of VCS the project uses, and from that derive the origin
URL for scheduling a SWH visit.
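
The steps above can be sketched roughly as follows. This is only an
illustration, not the lister's actual code: the shape of the
`rsync --list-only` output lines and the `tools` / `name` keys of the
metadata document are assumptions, and `project_names`, `metadata_url` and
`vcs_tools` are hypothetical helper names. The URL template is the one
quoted above.

```python
import re
from typing import List

# Assumed shape of one `rsync --list-only` output line:
#   drwxr-xr-x          4,096 2021/03/01 10:15:42 some-project
LISTING_RE = re.compile(
    r"^(?P<perms>[dl\-][rwxsStT\-]{9})\s+(?P<size>[\d,]+)\s+"
    r"(?P<date>\d{4}/\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<name>.+)$"
)

# URL template taken from the description above.
API_URL_TEMPLATE = "https://sourceforge.net/p/{project}"

# VCS types that appear in the statistics of this description.
KNOWN_VCS = {"git", "svn", "hg", "cvs", "bzr"}


def project_names(listing: str) -> List[str]:
    """Return the directory names found in an rsync listing."""
    names = []
    for line in listing.splitlines():
        match = LISTING_RE.match(line.strip())
        if match is None:
            continue  # not a listing line (banner, blank line, ...)
        if not match.group("perms").startswith("d"):
            continue  # only directories correspond to projects
        name = match.group("name")
        if name in (".", ".."):
            continue
        names.append(name)
    return names


def metadata_url(project: str) -> str:
    """Build the metadata URL for a project name."""
    return API_URL_TEMPLATE.format(project=project)


def vcs_tools(metadata: dict) -> List[str]:
    """Extract VCS types from a metadata document.

    Assumes (hypothetically) that the document has a "tools" list whose
    entries carry the tool type under the "name" key.
    """
    return [
        tool["name"]
        for tool in metadata.get("tools", [])
        if tool.get("name") in KNOWN_VCS
    ]
```

Keeping these helpers free of network access means the parsing and the
metadata handling can be checked against canned input before running the
lister against the real mirror.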
Some preprocessing is done for each project to ensure that the code
repositories found are not empty (many projects have empty repositories).
Some statistics after the full run of the lister in my local environment:
- number of projects referenced on the rsync mirror: 13274
- number of projects with non-empty code repositories: 6023 with:
  - 1110 git repos
  - 3237 svn repos
  - 132 hg repos
  - 1522 cvs repos
  - 22 bzr repos
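
For illustration, the filtering of empty repositories and the per-VCS tally
could look something like the sketch below. How emptiness is actually
detected is not described here, so it is passed in as a caller-supplied
predicate; `non_empty_repos`, `count_by_vcs` and the `Repo` alias are
hypothetical names. The `REPORTED` figures are the ones listed above.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# (project name, VCS type); a hypothetical minimal representation.
Repo = Tuple[str, str]

# Per-VCS figures copied from the statistics above.
REPORTED = {"git": 1110, "svn": 3237, "hg": 132, "cvs": 1522, "bzr": 22}


def non_empty_repos(
    repos: Iterable[Repo], is_empty: Callable[[Repo], bool]
) -> List[Repo]:
    """Keep only the repositories that the predicate does not flag as empty."""
    return [repo for repo in repos if not is_empty(repo)]


def count_by_vcs(repos: Iterable[Repo]) -> Counter:
    """Tally repositories per VCS type."""
    return Counter(vcs for _, vcs in repos)


if __name__ == "__main__":
    # Made-up sample data; the set of empty repositories is a stand-in
    # for whatever check the lister really performs.
    sample = [("alpha", "git"), ("beta", "svn"), ("gamma", "svn")]
    empty = {("gamma", "svn")}
    kept = non_empty_repos(sample, lambda repo: repo in empty)
    print(count_by_vcs(kept))
    # The per-VCS figures add up to the reported total.
    print(sum(REPORTED.values()))
```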