Optimize the number of HTTP requests sent by the cgit lister
Closed, Resolved, Public

Description

The current implementation of the cgit lister requests the main HTML page of each listed repository to extract its git clone URL.

However, sending a lot of requests in a short amount of time is not really friendly to a cgit server.

We could avoid requesting each repository page by constructing the git repository clone URL from the cgit server home URL
and a clone URL prefix passed as a lister parameter.

Repository home page URLs have the form: <cgit_server_url>/<path_to_repo>
and are extracted from the HTML pages listing the hosted repositories.

Repository clone URLs will then have the form: <git_clone_prefix_url>/<path_to_repo>.
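
The URL rewriting described above can be sketched as follows. This is only a minimal illustration, not the lister's actual code: the function name build_clone_url, its parameter names and the example repository path are hypothetical.

```python
def build_clone_url(repo_page_url: str, cgit_server_url: str, clone_prefix_url: str) -> str:
    """Swap the cgit server prefix of a repository page URL for the clone prefix.

    Hypothetical helper: names and error handling are assumptions, not swh-lister API.
    """
    # Normalize both prefixes so they end with exactly one slash.
    server = cgit_server_url.rstrip("/") + "/"
    prefix = clone_prefix_url.rstrip("/") + "/"
    if not repo_page_url.startswith(server):
        raise ValueError(f"{repo_page_url} is not hosted under {server}")
    # Everything after the server prefix is <path_to_repo>.
    path_to_repo = repo_page_url[len(server):]
    return prefix + path_to_repo


if __name__ == "__main__":
    # "example/repo.git/" is a made-up repository path used for illustration only.
    print(
        build_clone_url(
            "https://git.baserock.org/cgit/example/repo.git/",
            "https://git.baserock.org/cgit/",
            "https://git.baserock.org/git/",
        )
    )
```

No request is sent to the repository page here; only the listing pages are needed to obtain the repository page URLs.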

Examples based on the cgit instances listed in T1835, plus a couple of others found in the wild:

|---------------------------------------------+-----------------------------------|
| cgit_server_url                             | git_clone_prefix_url              |
|---------------------------------------------+-----------------------------------|
| https://git.kernel.org/                     | https://git.kernel.org/           |
| https://gitweb.torproject.org/              | https://gitweb.torproject.org/    |
| https://fedorapeople.org/cgit/              | https://fedorapeople.org/cgit/    |
| https://git.openembedded.org/               | https://git.openembedded.org/     |
| https://git.zx2c4.com/                      | https://git.zx2c4.com/            |
| http://git.gnu.org.ua/cgit/                 | http://git.gnu.org.ua/repo/       |
| https://git.alpinelinux.org/                | https://git.alpinelinux.org/      |
| https://git.baserock.org/cgit/              | https://git.baserock.org/git/     |
| https://code.qt.io/cgit/                    | http://code.qt.io/                |
| http://git.yoctoproject.org/clean/cgit.cgi/ | https://git.yoctoproject.org/git/ |
| https://forge.frm2.tum.de/cgit/cgit.cgi/    | https://forge.frm2.tum.de/review/ |
| https://git.eclipse.org/c/                  | https://git.eclipse.org/r/        |
| http://hdiff.luite.com/cgit/                | http://hdiff.luite.com/cgit/      |
| https://git.systemreboot.net/               | https://git.systemreboot.net/     |
| https://git.netfilter.org/                  | git://git.netfilter.org/          |
| https://jff.email/cgit/                     | git://jff.email/opt/git/          |
| https://inqlab.net/git/                     | https://inqlab.net/git/           |
|---------------------------------------------+-----------------------------------|

While testing some clone URLs, I remembered that some cgit instances do not offer git clone URLs
supporting the git smart transfer protocol. So we will be able to list their repositories but not to load them
into the archive, as dulwich does not support the git dumb transfer protocol (T2489, related sentry issue).
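
One possible way to tell the two cases apart for http(s) clone URLs, sketched under assumptions: a smart HTTP server answers a GET on <clone_url>/info/refs?service=git-upload-pack with the content type application/x-git-upload-pack-advertisement, while a dumb server serves a plain file. The helper names below are hypothetical and this is not something the lister or the loader currently does.

```python
import urllib.request
from urllib.parse import urlsplit

# Content type a smart HTTP server returns for the ref advertisement.
SMART_CONTENT_TYPE = "application/x-git-upload-pack-advertisement"


def smart_probe_url(clone_url: str) -> str:
    """URL a smart HTTP client requests first to discover refs."""
    return clone_url.rstrip("/") + "/info/refs?service=git-upload-pack"


def is_smart_content_type(content_type: str) -> bool:
    """True if the Content-Type header value denotes a smart HTTP reply."""
    # Drop optional parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == SMART_CONTENT_TYPE


def supports_smart_http(clone_url: str, timeout: float = 10.0) -> bool:
    """Probe a clone URL over the network (hypothetical helper).

    urlopen raises urllib.error.HTTPError on 4xx/5xx replies; callers are
    expected to handle that themselves.
    """
    if urlsplit(clone_url).scheme not in ("http", "https"):
        raise ValueError("only http(s) clone URLs can be probed this way")
    with urllib.request.urlopen(smart_probe_url(clone_url), timeout=timeout) as response:
        return is_smart_content_type(response.headers.get("Content-Type", ""))
```

Such a probe costs one extra request per origin, so it would fit better on the loader side than in the lister.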

Event Timeline

anlambert triaged this task as Normal priority. Jan 27 2021, 1:52 PM
anlambert created this task.

Analyzing the suggestion further, using the deprecated swh-lister cache db table as a data
point (production data) [1]: so far, 3 instances will sometimes generate wrong origin
URLs with the suggested approach.

Here is the summary:

|---------------------------------------------+-------------------------------------+-----------------------------------------------------------------------------+----------------|
| cgit_server_url                             | git_clone_prefix_url                | Note (using old swh-lister cache db with production data)                   | instance       |
|---------------------------------------------+-------------------------------------+-----------------------------------------------------------------------------+----------------|
| https://git.kernel.org/                     | https://git.kernel.org/             | ok-ish some extra path on origin url,                                       | git-kernel     |
|                                             |                                     | e.g. https://git.kernel.org/pub/scm/network/connman/connman.git             |                |
| https://gitweb.torproject.org/              | https://gitweb.torproject.org/      | ok                                                                          | tor            |
| https://fedorapeople.org/cgit/              | https://fedorapeople.org/cgit/      | ok-ish some extra path on origin url,                                       | fedora         |
|                                             |                                     | https://fedorapeople.org/cgit/ktdreyer/public_git/rubygem-after_commit.git/ |                |
| https://git.openembedded.org/               | https://git.openembedded.org/       | ok                                                                          | openembedded   |
| https://git.zx2c4.com/                      | https://git.zx2c4.com/              | ok                                                                          | zx2c4          |
| http://git.gnu.org.ua/cgit/                 | http://git.gnu.org.ua/repo/         | ok                                                                          | git.gnu.org.ua |
| https://git.alpinelinux.org/                | https://git.alpinelinux.org/        | Lots of entries (not all) have extra "/user/" mixed in the path             | alpinelinux    |
|                                             |                                     | e.g. https://git.alpinelinux.org/user/zelebar/acf-asterisk/                 |                |
| https://git.baserock.org/cgit/              | https://git.baserock.org/git/       | ok                                                                          | baserock       |
| https://code.qt.io/cgit/                    | http://code.qt.io/                  | ok                                                                          | qt.io          |
| http://git.yoctoproject.org/clean/cgit.cgi/ | https://git.yoctoproject.org/git/   | ok                                                                          | yoctoproject   |
| https://forge.frm2.tum.de/cgit/cgit.cgi/    | https://forge.frm2.tum.de/review/   | looks ok (no data point, checked only 1 or 2 repos)                         | x              |
| https://git.eclipse.org/c/                  | https://git.eclipse.org/r/          | looks ok (no data point, checked only 1 or 2 repos)                         | x              |
| http://hdiff.luite.com/cgit/                | http://hdiff.luite.com/cgit/        | ok                                                                          | hdiff.luite    |
| https://git.systemreboot.net/               | https://git.systemreboot.net/       | looks ok (no data point, checked only 1 or 2 repos)                         | x              |
| https://git.netfilter.org/                  | git://git.netfilter.org/            | looks ok (no data point, checked only 1 or 2 repos)                         | x              |
| https://jff.email/cgit/                     | git://jff.email/opt/git/            | looks ok (no data point, checked only 1 or 2 repos)                         | x              |
| https://inqlab.net/git/                     | https://inqlab.net/git/             | looks ok (no data point, checked only 1 or 2 repos)                         | x              |
| https://git.joeyh.name/index.cgi/           | https://git.joeyh.name/git/         | ok                                                                          | git.joeyh.name |
| https://git.savannah.gnu.org/cgit/          | https://git.savannah.gnu.org/git/   | ok                                                                          | gnu-savannah   |
|---------------------------------------------+-------------------------------------+-----------------------------------------------------------------------------+----------------|
| https://www.happyassassin.net/cgit/         | https://www.happyassassin.net/cgit/ | ok (site down)                                                              | happyassassin  |
| https://cgit.kde.org/                       | https://anongit.kde.org/            | ok (site down)                                                              | kde            |
|---------------------------------------------+-------------------------------------+-----------------------------------------------------------------------------+----------------|

[1]

$ psql service=swh-lister
> select uid, origin_url from cgit_repo where uid != origin_url and instance='$instance';

Note: replace $instance with fedora, kde, git-kernel, etc.
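
The same comparison can be sketched offline on the rows returned by that query. This assumes that uid holds the repository page URL, which is a guess about the old table. The function name is hypothetical and the sample rows are made up for illustration, except for the alpinelinux origin URL taken from the table above.

```python
from typing import Iterable, List, Optional, Tuple


def find_mismatches(
    rows: Iterable[Tuple[str, str]],
    cgit_server_url: str,
    clone_prefix_url: str,
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (uid, origin_url, derived_url) for rows the prefix scheme gets wrong.

    Assumption: uid is the repository page URL recorded by the old lister.
    """
    server = cgit_server_url.rstrip("/") + "/"
    prefix = clone_prefix_url.rstrip("/") + "/"
    mismatches = []
    for uid, origin_url in rows:
        if not uid.startswith(server):
            # The page URL is not under the server URL: nothing can be derived.
            mismatches.append((uid, origin_url, None))
            continue
        derived = prefix + uid[len(server):]
        # Ignore a trailing slash difference, which does not change the target.
        if derived.rstrip("/") != origin_url.rstrip("/"):
            mismatches.append((uid, origin_url, derived))
    return mismatches


# Hypothetical sample rows; only the second origin URL comes from the table.
sample_rows = [
    ("https://git.alpinelinux.org/example-repo/", "https://git.alpinelinux.org/example-repo"),
    (
        "https://git.alpinelinux.org/zelebar/acf-asterisk/",
        "https://git.alpinelinux.org/user/zelebar/acf-asterisk/",
    ),
]

if __name__ == "__main__":
    for row in find_mismatches(
        sample_rows, "https://git.alpinelinux.org/", "https://git.alpinelinux.org/"
    ):
        print(row)
```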

I guess you mean that some cgit origin URLs previously loaded into the archive will not be the same if
the approach in this task is implemented?

No, I mean that some URLs generated by the new lister implementation will result in a 404 at loading time (which should
also be fine; we'll eventually amend the loaders to add that "not_found" status to origin-visit-status).

The 3 divergent URLs (for 3 different instances) I gave as examples in the table can't be generated correctly by the new cgit lister implementation. Not all origins generated with the proposed scheme will be incorrect for those instances; most should be ok.

Don't get me wrong, I guess it's fine, as I think there won't be that many. Then again, actually testing it will tell.

In any case, I intend to keep both behaviors just in case (diff incoming btw :).
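
Keeping both behaviors could look roughly like the sketch below: use the prefix when one is configured, otherwise fall back to fetching the repository page. This is only an assumption about the shape of the incoming diff. The names origin_url_for and fetch_clone_url are hypothetical, and the page-scraping step is left to the caller.

```python
from typing import Callable, Optional


def origin_url_for(
    repo_page_url: str,
    cgit_server_url: str,
    clone_prefix_url: Optional[str] = None,
    fetch_clone_url: Optional[Callable[[str], str]] = None,
) -> str:
    """Pick the cheap prefix-based scheme when possible, else the old behavior."""
    if clone_prefix_url is not None:
        server = cgit_server_url.rstrip("/") + "/"
        if repo_page_url.startswith(server):
            # New behavior: no HTTP request for the repository page.
            return clone_prefix_url.rstrip("/") + "/" + repo_page_url[len(server):]
    if fetch_clone_url is None:
        raise ValueError("no clone prefix applies and no page fetcher was given")
    # Old behavior: one request per repository page (fetcher supplied by caller).
    return fetch_clone_url(repo_page_url)
```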