SourceForge lister
Closed, Migrated

Assigned To

Authored By

	zack
	Jul 12 2017, 5:14 PM

Description

We need a lister for SourceForge, in order to be able to archive what's there.

SourceForge uses the Apache Allura forge under the hood to host open source projects.
Unfortunately, the associated REST API does not offer a way to list all hosted projects. A ticket was created on the subject a couple of years ago, but no action has been taken so far.

It is nonetheless possible to do full and incremental listing, using sitemaps to enumerate projects and the REST API to query project-by-project information. See the specification blueprint by @zack in T735#51468 below, which was designed in discussion with a SourceForge tech contact.

Revisions and Commits

rDLS Listers
	Abandoned		D261 add SourceForge projects lister based on the use of rsync
rDSNIP Code snippets
		D5294	rDSNIP26f5d657d753 Improve correctness of sourceforge-ls

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T3315 archive SourceForge
Migrated	gitlab-migration	T735 SourceForge lister
Migrated	gitlab-migration	T3310 Deploy sourceforge lister on staging
Migrated	gitlab-migration	T3350 Deploy sourceforge lister in production
Migrated	gitlab-migration	T3374 Ingest sourceforge repositories (origins of type git, svn, hg)
Migrated	gitlab-migration	T3470 lister-sourceforge: Activate sourceforge origins when listed

Event Timeline

@anlambert: if you found additional related work, can you post it to this task? TIA

zack added a project: Origin-SourceForge. Jul 12 2017, 5:15 PM

Below is some information I managed to gather toward fulfilling this task.

Listing projects on sourceforge

Two solutions could be used.

The first is to scrape the SourceForge directory at https://sourceforge.net/directory/. This is the solution used by archiveteam; the source code of their scraper (in Ruby) can be found on GitHub: https://github.com/marcroberts/archiveteam-sourceforge-lister. However, this does not seem reliable, as not all pages of the SourceForge directory can be browsed. There are currently 18831 pages of SourceForge projects, but requesting any page number greater than or equal to 1000 returns an HTTP 500 error (for instance, https://sourceforge.net/directory/?sort=name&page=2000).

The second, as pointed out by pombreda on IRC, is to use the rsync mirrors of the files made available for download (typically release tarballs) by SourceForge projects: rsync://netix.dl.sourceforge.net/sfmir/ and rsync://rsync.mirrorservice.org/downloads.sourceforge.net/. That solution seems better, as it allows us to list the names of all relevant projects on SourceForge (thus discarding empty projects and those without any releases). Below is a sample output when using rsync to list projects whose names start with gl.

antoine@antoine-X550CC:~$ rsync --list-only rsync://rsync.mirrorservice.org/downloads.sourceforge.net/g/gl/
----------------------------------------------------------------------------
Welcome to the University of Kent's UK Mirror Service.

More information can be found at our web site: http://www.mirrorservice.org/
Please send comments or questions to help@mirrorservice.org.
----------------------------------------------------------------------------

drwxr-xr-x         20,480 2017/07/13 02:27:00 .
lrwxrwxrwx             19 2010/01/05 07:08:57 index-sf.html
drwxr-xr-x          4,096 2016/08/25 07:30:46 gl-117
drwxr-xr-x          4,096 2016/08/25 07:30:46 glabels
drwxr-xr-x          4,096 2016/08/25 07:30:46 gladewin32
drwxr-xr-x          4,096 2017/06/10 02:25:52 gladys
drwxr-xr-x          4,096 2016/08/25 07:30:55 glass-theme
drwxr-xr-x          4,096 2016/08/25 07:30:57 glattony
drwxr-xr-x          4,096 2016/08/25 07:30:59 glaunch
drwxr-xr-x          4,096 2016/08/25 07:31:35 glc-lib
drwxr-xr-x          4,096 2016/08/25 07:31:37 glc-player
drwxr-xr-x          4,096 2016/08/25 07:32:34 glcdtools
drwxr-xr-x          4,096 2016/08/25 07:32:38 glchess
drwxr-xr-x          4,096 2016/08/25 07:32:46 gldirect
drwxr-xr-x          4,096 2016/08/25 07:32:49 gle
drwxr-xr-x          4,096 2016/08/25 07:33:36 glesius
drwxr-xr-x          4,096 2017/06/11 02:28:24 glest
drwxr-xr-x          4,096 2016/08/25 07:33:53 glew
...
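The project names can be pulled out of such a listing with a few lines of Python. This is only a minimal sketch: the function name `project_names` and the regular expression are illustrative assumptions, derived from the sample output above, not from any existing lister code. It keeps directory entries only, so the mirror banner, the `.` entry, and symlinks such as index-sf.html are skipped.

```python
import re

# Matches directory entries in `rsync --list-only` output, e.g.
# "drwxr-xr-x          4,096 2016/08/25 07:30:46 glew"
# (pattern inferred from the sample output; treat it as an assumption).
_DIR_LINE = re.compile(
    r"^d[rwxsStT-]{9}\s+[\d,]+\s+\d{4}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2}\s+(?P<name>.+)$"
)


def project_names(listing: str) -> list[str]:
    """Return the directory names found in a listing, skipping the '.' entry."""
    names = []
    for line in listing.splitlines():
        match = _DIR_LINE.match(line.strip())
        if match and match.group("name") != ".":
            names.append(match.group("name"))
    return names


sample = """\
drwxr-xr-x         20,480 2017/07/13 02:27:00 .
lrwxrwxrwx             19 2010/01/05 07:08:57 index-sf.html
drwxr-xr-x          4,096 2016/08/25 07:30:46 gl-117
drwxr-xr-x          4,096 2016/08/25 07:30:46 glabels
"""
print(project_names(sample))
```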

Ingesting sourceforge projects into the SWH archive

Once a list of relevant projects is obtained, some preprocessing is needed before a project can be ingested into the SWH archive.
From a SourceForge project name, the associated metadata can easily be obtained using the public Allura REST API (Allura being the software forge used by SourceForge, see https://allura.apache.org/).
For instance, the metadata for the glew project is available at https://sourceforge.net/rest/p/glew. The URL of the VCS repository used by the project (which can be cvs, svn, hg, or git) can be reconstructed from the retrieved metadata.
I found a project on GitHub, released into the public domain, dedicated to retrieving metadata of open source projects hosted on SourceForge: https://github.com/chpwssn/sourceforge-items/. In particular, we could reuse the following Python script: https://github.com/chpwssn/sourceforge-items/blob/master/rsync-disco/apiscrape.py.

anlambert added a revision: D261: add SourceForge projects lister based on the use of rsync. Nov 6 2017, 2:49 PM

ardumont mentioned this in T1351: (periodically) ingest GNU package releases. Mar 12 2019, 7:01 PM

nahimilega awarded a token. Mar 15 2019, 7:34 PM

nahimilega added a subscriber: nahimilega.

anlambert updated the task description. Apr 2 2019, 5:11 PM

anlambert updated the task description. May 17 2019, 2:15 AM

zack updated the task description. Jun 7 2019, 9:16 PM

The scripts and data at https://github.com/chpwssn/sourceforge-items/ look to be exactly what is required: that person (chpwssn) has already identified over 350,000 SVN, Mercurial, and Git repositories on SourceForge, with associated rsync commands for downloading them.

I started looking into this task myself with simple scripts that scraped the directory, but this looks like it's already super close to completion (or essentially already complete, but someone needs to create the SH-bits).

zack updated the task description. Oct 22 2020, 1:07 PM

zack mentioned this in rDSNIP422bfaa8f889: add prototype SourceForge lister. Oct 22 2020, 1:12 PM

Here's a blueprint for implementing a SourceForge lister, based on an exchange with a SourceForge tech contact:

- start from the Allura sitemap index
- recurse into sitemaps (e.g., sitemap-0.xml)
- extract the list of all project URLs, matching https://sourceforge.net/p/PROJECT_NAME (e.g., seedai)
- for each project name, query its REST API endpoint https://sourceforge.net/rest/p/PROJECT_NAME/ (e.g., seedai)
- from there, extract the list of project "tools"; these include tools that correspond to a VCS, with names like "git", "svn", "cvs"
- each VCS tool has an associated URL, from which we can build clone/checkout commands (or, equivalently, origin URLs for a full lister). The URL pattern (to be verified) should be {type}.code.sf.net/p/{project}/{mount_point} (e.g., svn, git)
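The last steps of the blueprint can be sketched in Python as follows. This is not the prototype from the snippet repo: the helper names are made up for illustration, and the shape of the REST response (a "tools" list whose entries carry "name" and "mount_point" keys) is an assumption that should be checked against a real response. The URL pattern is the one given above, which is itself marked as to be verified.

```python
import re
from typing import Optional

# Project pages are assumed to match https://sourceforge.net/p/PROJECT_NAME
_PROJECT_URL = re.compile(r"^https://sourceforge\.net/p/(?P<name>[^/]+)(?:/|$)")

# VCS tool names mentioned in this task.
VCS_TOOLS = {"git", "svn", "hg", "cvs"}


def project_name(url: str) -> Optional[str]:
    """Extract PROJECT_NAME from a project URL, or None if it does not match."""
    match = _PROJECT_URL.match(url)
    return match.group("name") if match else None


def rest_endpoint(name: str) -> str:
    """Build the REST API endpoint for a project name."""
    return f"https://sourceforge.net/rest/p/{name}/"


def origin_urls(project: str, metadata: dict) -> list[tuple[str, str]]:
    """Return (vcs type, origin URL) pairs for the VCS tools of a project.

    `metadata` is the decoded JSON of the REST endpoint; its layout here
    is an assumption.
    """
    origins = []
    for tool in metadata.get("tools", []):
        vcs = tool.get("name")
        if vcs in VCS_TOOLS:
            url = f"{vcs}.code.sf.net/p/{project}/{tool['mount_point']}"
            origins.append((vcs, url))
    return origins


# Hypothetical metadata, for illustration only.
sample_metadata = {
    "tools": [
        {"name": "wiki", "mount_point": "wiki"},
        {"name": "git", "mount_point": "code"},
    ]
}
name = project_name("https://sourceforge.net/p/seedai/")
print(rest_endpoint(name))
print(origin_urls(name, sample_metadata))
```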

I've put a prototype implementation of this (up to the listing of all tool types and URLs included, but with no integration with the swh-lister API) in the snippet repo.

I've run it once, successfully listing all of SourceForge in ~4 hours with 8 parallel threads to query the REST endpoint.
As of that run I've listed 480'711 projects and 402'908 VCS "tools" (see P832 for details), with the following breakdown by VCS type:

182'858 git
145'225 svn
44'493 cvs (read-only)
29'148 hg
1'184 bzr
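For reference, querying the REST endpoint with 8 parallel threads, as in the run described above, could look roughly like the sketch below. It is not the prototype's actual code: the function names and the user-agent string are placeholders invented for this example. The fetch function is injectable so that the example can run without touching the network.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder user-agent; a real crawler would use an identifying value.
USER_AGENT = "example-sourceforge-lister/0.1 (hypothetical)"


def fetch_metadata(url: str) -> dict:
    """Fetch and decode the JSON document behind one REST endpoint."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_all(urls, fetch=fetch_metadata, max_workers: int = 8) -> dict:
    """Query all URLs with at most `max_workers` concurrent threads."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


# Demonstration with a fake fetch function, so nothing is downloaded.
def fake_fetch(url: str) -> dict:
    return {"url": url, "tools": []}


results = fetch_all(
    ["https://sourceforge.net/rest/p/glew/", "https://sourceforge.net/rest/p/seedai/"],
    fetch=fake_fetch,
)
print(len(results))
```

Passing the default `fetch_metadata` instead of `fake_fetch` would perform real HTTP requests.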

Other improvements needed are:

- incremental listing: this can be done by exploiting the <lastmod> value in sitemaps. We have been told by SourceForge that this last modification timestamp is unique per project and that it is updated when the VCS is updated. It is therefore possible to be smart and do incremental listing that only lists repositories updated since the last lister run
- subprojects: there are some subprojects on SourceForge, although we have been told by SourceForge that they are very rare. We should consider including them too. An example is computerastherapy/ict-framework (note how the "project" here is computerastherapy/ict-framework)
- in order to play nice with SourceForge while crawling we should:
  - set the crawler user-agent to something identifying it as coming from Software Heritage
  - make sure the crawler IP address(es) have a reverse DNS entry (ideally pointing to a Software Heritage hostname too)
  - keep parallelism at 8 concurrent workers maximum
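The incremental listing idea can be illustrated with a short Python sketch that filters sitemap entries by their <lastmod> value. The function name `updated_since` is made up for this example, and two things are assumed rather than confirmed: that the sitemaps use the standard sitemaps.org XML namespace, and that <lastmod> starts with a YYYY-MM-DD date. Entries without a <lastmod> are kept, on the conservative assumption that they may have changed.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Standard sitemap namespace (assumed to be the one SourceForge uses).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def updated_since(sitemap_xml: str, last_run: date) -> list[str]:
    """Return the URLs whose <lastmod> is on or after `last_run`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        # Keep entries with no <lastmod>; otherwise compare the date part only.
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) >= last_run:
            urls.append(loc.strip())
    return urls


# Hypothetical sitemap content, for illustration only.
sample_sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://sourceforge.net/p/example-old/</loc><lastmod>2017-11-06</lastmod></url>
  <url><loc>https://sourceforge.net/p/example-new/</loc><lastmod>2021-05-30</lastmod></url>
</urlset>
"""
print(updated_since(sample_sitemap, date(2021, 1, 1)))
```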

zack updated the task description. Oct 22 2020, 1:22 PM

zack updated the task description. Nov 8 2020, 2:08 PM

vlorentz raised the priority of this task from Normal to High. Feb 12 2021, 11:25 AM

It looks like there are projects outside of the /p/ namespace. Just looking at the very first sitemap, I got an /adobe/ namespace (https://sourceforge.net/rest/adobe/manjobi), which implies that we should also consider namespaces outside of /p/ when listing.

Note also that a lot of entries are duplicated across the /projects/ and /p/ namespaces, while both point to the same thing.

New stats:

317973 distinct projects in the sitemaps (including subprojects)
360 subprojects
356 projects are outside of the normal /p/ namespace, including subprojects
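One way to handle both observations (extra namespaces, and duplicates between /projects/ and /p/) is to reduce each URL to a (namespace, project) key before counting. The Python sketch below is illustrative only: the function names are invented, it assumes the input contains nothing but project URLs taken from the sitemaps, and it deliberately ignores subprojects, which would need an extra path component.

```python
from typing import Optional
from urllib.parse import urlparse


def project_key(url: str) -> Optional[tuple[str, str]]:
    """Reduce a project URL to (namespace, project), or None if too short.

    /projects/ is folded into /p/ since both point to the same thing.
    Subprojects are not handled in this sketch.
    """
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 2:
        return None
    namespace, project = parts[0], parts[1]
    if namespace == "projects":
        namespace = "p"
    return (namespace, project)


def distinct_projects(urls) -> list[tuple[str, str]]:
    """Return unique (namespace, project) keys, in first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        key = project_key(url)
        if key is not None and key not in seen:
            seen.add(key)
            ordered.append(key)
    return ordered


# Example URLs; the /adobe/ page URL is assumed from the REST URL above.
sample_urls = [
    "https://sourceforge.net/projects/glew/",
    "https://sourceforge.net/p/glew/",
    "https://sourceforge.net/adobe/manjobi/",
]
print(distinct_projects(sample_urls))
```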

Alphare mentioned this in D5293: Add a non-incremental sourceforge lister. Mar 19 2021, 6:08 PM

Alphare mentioned this in rDLSf7b27c693022: Add a non-incremental sourceforge lister. Mar 23 2021, 6:41 PM

vlorentz added a revision: D5294: Improve correctness of sourceforge-ls. Apr 29 2021, 3:15 PM

ardumont changed the status of subtask T3310: Deploy sourceforge lister on staging from Open to Work in Progress. May 7 2021, 12:07 PM

zack added a parent task: T3315: archive SourceForge. May 7 2021, 5:25 PM

ardumont closed subtask T3310: Deploy sourceforge lister on staging as Resolved. May 28 2021, 11:12 AM

Status:

Updated the sourceforge lister code so that sourceforge origins can be recorded (as disabled)
Packaged and deployed the change
Added the task to the scheduler [1] so the listing occurs [2] [3]

[1]

swhscheduler@saatchi:~$ swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task add list-sourceforge-full
Created 1 tasks

Task 381572107
  Next run: today (2021-06-01T08:03:40.852707+00:00)
  Interval: 90 days, 0:00:00
  Type: list-sourceforge-full
  Policy: recurring
  Args:
  Keyword args:

[2] scheduler-runner:

Jun 01 08:04:30 saatchi swh[2685155]: INFO:swh.scheduler.celery_backend.runner:Grabbed 1 tasks list-sourceforge-full

[3] worker:

Jun 01 07:40:21 worker11 python3[1407475]: [2021-06-01 07:40:21,310: INFO/MainProcess] lister@worker11.internal.softwareheritage.org ready.
Jun 01 08:05:24 worker11 python3[1407475]: [2021-06-01 08:05:24,006: INFO/MainProcess] Received task: swh.lister.sourceforge.tasks.FullSourceForgeLister[e29c07ff-b01f-4739-a820-1d326e76ad63]
Jun 01 08:05:27 worker11 python3[1407482]: [2021-06-01 08:05:27,962: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/adobe/wiki' does not have any tools
Jun 01 08:05:28 worker11 python3[1407482]: [2021-06-01 08:05:28,338: WARNING/ForkPoolWorker-4] Project 'https://sourceforge.net/rest/adobe/blog' does not have any tools

Newly listed origins are enabled=f as expected, so they won't be selected for ingestion just yet.

> softwareheritage-scheduler=> select enabled, * from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='sourceforge' limit 10;
 enabled |              lister_id               |                  url                  | visit_type | extra_loader_arguments | enabled |          first_seen           |           last_seen           |      last_update       |            $
---------+--------------------------------------+---------------------------------------+------------+------------------------+---------+-------------------------------+-------------------------------+------------------------+------------$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/abdousoft/git       | git        | {}                     | f       | 2021-06-01 08:08:39.897035+00 | 2021-06-01 08:08:39.897035+00 | 2017-11-06 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/actsasdimension/git | git        | {}                     | f       | 2021-06-01 08:09:24.409325+00 | 2021-06-01 08:09:24.409325+00 | 2017-10-26 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/afmg/git            | git        | {}                     | f       | 2021-06-01 08:06:40.714127+00 | 2021-06-01 08:06:40.714127+00 | 2017-12-19 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/akrush/git          | git        | {}                     | f       | 2021-06-01 08:08:41.904664+00 | 2021-06-01 08:08:41.904664+00 | 2017-11-04 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/albinatank/git      | git        | {}                     | f       | 2021-06-01 08:09:48.838677+00 | 2021-06-01 08:09:48.838677+00 | 2018-01-17 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/alfaprogrammer/git  | git        | {}                     | f       | 2021-06-01 08:10:07.898698+00 | 2021-06-01 08:10:07.898698+00 | 2017-11-05 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/alipur/code         | git        | {}                     | f       | 2021-06-01 08:06:49.783528+00 | 2021-06-01 08:06:49.783528+00 | 2017-10-28 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/allepost/git        | git        | {}                     | f       | 2021-06-01 08:06:30.83668+00  | 2021-06-01 08:06:30.83668+00  | 2017-12-20 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/aloa-ims/git        | git        | {}                     | f       | 2021-06-01 08:10:04.021204+00 | 2021-06-01 08:10:04.021204+00 | 2017-11-04 00:00:00+00 | b678cfc3-27$
 f       | b678cfc3-2780-4186-9186-d78a14bd4958 | git.code.sf.net/p/altitude/altitude   | git        | {}                     | f       | 2021-06-01 08:06:48.571867+00 | 2021-06-01 08:06:48.571867+00 | 2017-12-26 00:00:00+00 | b678cfc3-27$

(The last 2 comments were meant for T3350...)

ardumont changed the status of subtask T3350: Deploy sourceforge lister in production from Open to Work in Progress. Jun 3 2021, 6:18 PM

ardumont closed subtask T3350: Deploy sourceforge lister in production as Resolved. Jun 11 2021, 12:24 PM

ardumont changed the status of subtask T3374: Ingest sourceforge repositories (origins of type git, svn, hg) from Open to Work in Progress. Jun 24 2021, 4:48 PM

ardumont mentioned this in rDDOCc7239b84093a: changelog: Simplify task link on sourceforge ingestion. Jul 22 2021, 10:46 AM

vlorentz added a commit: rDSNIP26f5d657d753: Improve correctness of sourceforge-ls. Aug 5 2021, 5:20 PM

ardumont closed subtask T3374: Ingest sourceforge repositories (origins of type git, svn, hg) as Resolved. Sep 28 2021, 5:55 PM

ardumont closed this task as Resolved. Oct 15 2021, 9:44 AM

ardumont claimed this task.

gitlab-migration changed the status of subtask T3310: Deploy sourceforge lister on staging from Resolved to Migrated. Oct 19 2022, 6:02 PM

gitlab-migration changed the status of subtask T3350: Deploy sourceforge lister in production from Resolved to Migrated.

gitlab-migration changed the status of subtask T3374: Ingest sourceforge repositories (origins of type git, svn, hg) from Resolved to Migrated.

This task has been migrated to GitLab.
