
SourceForge lister
Open, NormalPublic

Description

We need a lister for SourceForge, in order to be able to archive what's there.

@anlambert has investigated related work and found various existing scripts that can list the entirety of SourceForge via web scraping, and that have been used to do so in various contexts. An example is: https://github.com/marcroberts/archiveteam-sourceforge-lister

Event Timeline

zack created this task. Jul 12 2017, 5:14 PM

@anlambert: if you found additional related work, can you post it to this task? TIA

Below is the information I managed to gather for this task.

Listing projects on SourceForge

Two solutions could be used.

The first is to scrape the SourceForge directory: https://sourceforge.net/directory/. This is the solution used by archiveteam; the source code of their scraper (in Ruby) can be found on GitHub: https://github.com/marcroberts/archiveteam-sourceforge-lister. However, this does not seem reliable, as not all pages of the SourceForge directory can be browsed. Currently, there are 18831 pages of SourceForge projects, but trying to browse a page number greater than 1850 returns an HTTP 500 error (for instance, https://sourceforge.net/directory/?sort=name&page=2000).

The second, as pointed out by pombreda on IRC, is to use the rsync mirrors of the files made available for download (typically release tarballs) by SourceForge projects: rsync://netix.dl.sourceforge.net/sfmir/, rsync://rsync.mirrorservice.org/downloads.sourceforge.net/. This solution seems better, as it lets us list the names of all relevant projects on SourceForge (thus discarding empty projects and those without any releases). Below is a sample of the output when using rsync to list the projects whose names start with gl.

antoine@antoine-X550CC:~$ rsync --list-only rsync://rsync.mirrorservice.org/downloads.sourceforge.net/g/gl/
----------------------------------------------------------------------------
Welcome to the University of Kent's UK Mirror Service.

More information can be found at our web site: http://www.mirrorservice.org/
Please send comments or questions to help@mirrorservice.org.
----------------------------------------------------------------------------

drwxr-xr-x         20,480 2017/07/13 02:27:00 .
lrwxrwxrwx             19 2010/01/05 07:08:57 index-sf.html
drwxr-xr-x          4,096 2016/08/25 07:30:46 gl-117
drwxr-xr-x          4,096 2016/08/25 07:30:46 glabels
drwxr-xr-x          4,096 2016/08/25 07:30:46 gladewin32
drwxr-xr-x          4,096 2017/06/10 02:25:52 gladys
drwxr-xr-x          4,096 2016/08/25 07:30:55 glass-theme
drwxr-xr-x          4,096 2016/08/25 07:30:57 glattony
drwxr-xr-x          4,096 2016/08/25 07:30:59 glaunch
drwxr-xr-x          4,096 2016/08/25 07:31:35 glc-lib
drwxr-xr-x          4,096 2016/08/25 07:31:37 glc-player
drwxr-xr-x          4,096 2016/08/25 07:32:34 glcdtools
drwxr-xr-x          4,096 2016/08/25 07:32:38 glchess
drwxr-xr-x          4,096 2016/08/25 07:32:46 gldirect
drwxr-xr-x          4,096 2016/08/25 07:32:49 gle
drwxr-xr-x          4,096 2016/08/25 07:33:36 glesius
drwxr-xr-x          4,096 2017/06/11 02:28:24 glest
drwxr-xr-x          4,096 2016/08/25 07:33:53 glew
...
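
To make the rsync approach concrete, here is a minimal Python sketch that turns a listing like the one above into a list of project names. The function names (parse_project_names, list_projects) are hypothetical, and the `<first letter>/<two-letter prefix>/` directory layout is only inferred from the `g/gl/` path in the sample; it should be checked against the mirror before relying on it.

```python
import re
import subprocess
from typing import List

# Matches directory entries in `rsync --list-only` output, for example:
# "drwxr-xr-x          4,096 2016/08/25 07:30:46 glew"
_DIR_ENTRY = re.compile(
    r"^d[rwxsStT-]{9}\s+[\d,]+\s+\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2}\s+(?P<name>.+)$"
)


def parse_project_names(listing: str) -> List[str]:
    """Return the directory names found in `rsync --list-only` output."""
    names = []
    for line in listing.splitlines():
        match = _DIR_ENTRY.match(line)
        if match is None:
            continue  # mirror banner, symlinks, plain files, blank lines
        name = match.group("name")
        if name != ".":  # skip the entry for the listed directory itself
            names.append(name)
    return names


def list_projects(mirror: str, prefix: str) -> List[str]:
    """List the projects under a two-letter prefix on an rsync mirror.

    Assumption: the mirror is laid out as <first letter>/<prefix>/<project>,
    as the g/gl/ path in the sample suggests.
    """
    url = f"{mirror.rstrip('/')}/{prefix[0]}/{prefix}/"
    result = subprocess.run(
        ["rsync", "--list-only", url],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_project_names(result.stdout)
```

Calling list_projects("rsync://rsync.mirrorservice.org/downloads.sourceforge.net", "gl") would run the same command as in the sample. Only directory entries are kept, so the mirror banner and the index-sf.html symlink are ignored.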

Ingesting SourceForge projects into the SWH archive

Once a list of relevant projects is obtained, some preprocessing is needed before a project can be ingested into the SWH archive.
From a SourceForge project name, the associated metadata can easily be obtained using the public Allura REST API (Allura being the software forge used by SourceForge, see https://allura.apache.org/).
For instance, the metadata about the glew project is available at https://sourceforge.net/rest/p/glew. The URL of the VCS repository used by the project (which can be cvs, svn, hg or git) can be reconstructed from the retrieved metadata.
I found a project on GitHub, released into the public domain, dedicated to retrieving the metadata of open source projects hosted on SourceForge: https://github.com/chpwssn/sourceforge-items/. In particular, we could reuse the following Python script: https://github.com/chpwssn/sourceforge-items/blob/master/rsync-disco/apiscrape.py.
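
As a rough illustration of the metadata step, the Python sketch below fetches a project's metadata and derives candidate repository URLs from it. Several things here are assumptions rather than verified facts: that the JSON response contains a "tools" list whose entries have "name" and "mount_point" fields, and the URL templates themselves. The function names are hypothetical. CVS is left out of the sketch because its URL scheme does not follow the same pattern.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple

# Assumed clone URL templates; verify them against real projects before use.
_URL_TEMPLATES = {
    "git": "https://git.code.sf.net/p/{project}/{mount}",
    "hg": "http://hg.code.sf.net/p/{project}/{mount}",
    "svn": "https://svn.code.sf.net/p/{project}/{mount}",
}


def fetch_metadata(project: str) -> Dict[str, Any]:
    """Fetch the project metadata from the Allura REST API (network access)."""
    url = f"https://sourceforge.net/rest/p/{project}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def vcs_urls(project: str, metadata: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (vcs type, repository URL) pairs for the project's VCS tools.

    Assumption: metadata["tools"] is a list of dicts with "name" and
    "mount_point" keys; non-VCS tools (wiki, tickets, ...) are skipped.
    """
    urls = []
    for tool in metadata.get("tools", []):
        template = _URL_TEMPLATES.get(tool.get("name"))
        mount = tool.get("mount_point")
        if template is None or not mount:
            continue
        urls.append((tool["name"], template.format(project=project, mount=mount)))
    return urls
```

For example, vcs_urls("glew", fetch_metadata("glew")) would return one pair per repository tool, provided the response has the assumed shape. Keeping the URL construction separate from the HTTP call makes it possible to check the templates against saved responses without hitting the API.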