rewrite the CGit lister as a proper lister
Closed, Migrated

Assigned To

Authored By

	zack
	Apr 18 2019, 10:38 AM

Description

We have a hackish CGit lister as a shell script. We should integrate it with other listers and move it to the common repo.

Revisions and Commits

rDLS Listers
	D1610	rDLSb972a2a88d25 swh.lister.cgit

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T1798 ingest Tor project source code (meta task)
Migrated	gitlab-migration	T1799 ingest Tor git repositories
Migrated	gitlab-migration	T1451 ingest GNU Savannah Git repositories
Migrated	gitlab-migration	T1659 rewrite the CGit lister as a proper lister

Event Timeline

zack triaged this task as Low priority. Apr 18 2019, 10:38 AM

zack created this task.

zack added a parent task: T1451: ingest GNU Savannah Git repositories.

ardumont mentioned this in T1800: gitweb lister. Jun 12 2019, 5:42 PM

anarcat added a parent task: T1798: ingest Tor project source code (meta task). Jun 12 2019, 9:30 PM

anarcat added a parent task: T1799: ingest Tor git repositories.

anarcat removed a parent task: T1798: ingest Tor project source code (meta task).

i couldn't find the time to work through the developer setup and the lister tutorial, so I used the shell script to generate a list of projects for tor gitweb.

i had to tweak the script to work with our specific use case, which ended up looking like this:

#!/bin/bash

# Copyright (C) 2015 Stefano Zacchiroli <zack@upsilon.cc>
# License: GNU General Public License (GPL), version 3 or above

# Depends: libxml2-utils

REPO_URL_XPATH="//td[@class='sublevel-repo']/a/@href"

if [ -z "$1" ] ; then
    echo "Usage:   cgit-lister CGIT_BASE_URL"
    echo "Example: cgit-lister http://git.savannah.gnu.org/cgit/"
    echo "         cgit-lister http://anonscm.debian.org/cgit/"
    echo "         cgit-lister http://cgit.freedesktop.org/"
    exit 1
fi
CGIT_URL="$1"
shift

# extract base_url, excluding URL path
base_url=$(echo "$CGIT_URL" | sed 's|^\(https\?://[^/]*\)/.*|\1|')

# we use the xmllint shell, as there is no way to use a separator other than
# space for xpath results with "xmllint --xpath" :-( The output format of the
# shell sucks as well, but at least is line based.
curl -sSL "$CGIT_URL" | xmllint --html --xpath "$REPO_URL_XPATH" - \
    | sed -e "s|href=\"|${base_url}|g" -e 's/" /\n/g'
echo

notice the use of curl instead of relying on xmllint, which was refusing to load the site. the resulting URLs are incorrect, as they still have gitweb instead of git, but that is fixed with:

./cgit-lister https://gitweb.torproject.org/ | sed 's#https://gitweb#https://git#'

kind of nasty, but it works.
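For comparison, the two `sed` steps above (stripping the URL path to get the base URL, and rewriting `gitweb` to `git`) could be expressed in Python roughly as follows. This is only a sketch using the standard library; the function names `base_url` and `gitweb_to_git` are made up for illustration and do not come from any existing lister code.

```python
from urllib.parse import urlsplit, urlunsplit


def base_url(cgit_url: str) -> str:
    """Return scheme://host of a URL, dropping path, query and fragment.

    Mirrors the first sed expression in the shell script.
    """
    parts = urlsplit(cgit_url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


def gitweb_to_git(url: str) -> str:
    """Replace the first 'https://gitweb' with 'https://git'.

    Mirrors: sed 's#https://gitweb#https://git#'
    """
    return url.replace("https://gitweb", "https://git", 1)
```

With these helpers, `base_url("https://gitweb.torproject.org/")` gives the host part without the trailing path, and `gitweb_to_git` can be mapped over the listed URLs instead of piping through `sed`.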

I would be pretty interested in integrating it with other listers and moving it to the common repo. I guess we can proceed in two ways.

1. Use the script that we already have: just run it from Python to get the list of repos.

2. Convert the script to Python, i.e. write Python code that reuses the logic of the script.

I prefer the first option, as it is easy to implement: the major piece of the lister (the part that lists the repos) is already written and works nicely, and the script would not go to waste.
Also, if we rewrite this parsing code in Python we would need bs4, which is currently not present in requirements.txt.

Which way shall I proceed?
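To make the first option concrete, a minimal sketch of calling the existing shell script from Python could look like the following. The function name `list_repos_via_script` is hypothetical, and the sketch assumes the script prints one repository URL per line, as the version quoted above does.

```python
import subprocess
from typing import List


def list_repos_via_script(script: str, cgit_url: str) -> List[str]:
    """Run the lister script and return its non-empty output lines."""
    result = subprocess.run(
        [script, cgit_url],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

A call such as `list_repos_via_script("./cgit-lister", "https://gitweb.torproject.org/")` would still depend on `curl` and `xmllint` being installed wherever the lister runs.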

not that I get a vote in this, but i'd say convert to python. depending on xmllint is very brittle... i already had to tweak the thing once to make it work at all, and the pipeline is kind of nasty. i think you will have to import some HTML parser at some point anyways, so you might as well bite that bullet now.

i think there is already one parser in use somewhere that's stricter than bs4 - maybe it would work here as well?

In T1659#33593, @anarcat wrote:

but i'd say convert to python. depending on xmllint is very brittle... i already had to tweak the thing once to make it work at all, and the pipeline is kind of nasty. i think you will have to import some HTML parser at some point anyways, so you might as well bite that bullet now.

Now I think it makes more sense to convert the script to Python.

Thanks for your interest in working on this, @nahimilega. It would be very useful for moving forward on a bunch of pending ingestions, including Tor!

Regarding those two options, definitely (2): rewrite this in Python. Just use requests in place of wget and, yes, bs4 for the parsing. We're going to eventually need bs4 for any other scraping-based lister, so adding it as a dependency is not a big deal at this point.
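As a rough illustration of the second option, the XPath used by the shell script (`//td[@class='sublevel-repo']/a/@href`) can be reproduced in pure Python. The sketch below deliberately uses the standard library's `html.parser` instead of bs4 so that it has no extra dependencies, and it leaves the HTTP fetch out; it is not the code that ended up in swh.lister.cgit. The names `CGitRepoParser` and `list_cgit_repos` are made up for this example.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class CGitRepoParser(HTMLParser):
    """Collect href values of <a> tags inside <td class="sublevel-repo">."""

    def __init__(self) -> None:
        super().__init__()
        self.in_repo_cell = False
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Token match on the class attribute (the XPath is an exact match).
            classes = (attrs.get("class") or "").split()
            self.in_repo_cell = "sublevel-repo" in classes
        elif tag == "a" and self.in_repo_cell and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_repo_cell = False


def list_cgit_repos(html: str, page_url: str) -> List[str]:
    """Return absolute repository URLs found in a cgit index page."""
    parser = CGitRepoParser()
    parser.feed(html)
    parser.close()
    # urljoin resolves relative hrefs such as "/foo.git/" against the page URL.
    return [urljoin(page_url, href) for href in parser.hrefs]
```

With requests, the page body could be obtained via `requests.get(url).text` and passed in as `html`. Switching the parsing to bs4, as suggested above, would replace the parser class with a selector lookup while keeping the same function signature.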

vlorentz added a revision: D1610: swh.lister.cgit. Jun 19 2019, 1:23 PM

nahimilega added subscribers: ardumont, olasd. Jun 19 2019, 1:26 PM

anlambert mentioned this in D1660: Split models into smaller chunks to avoid oversized db transactions. Jun 28 2019, 3:49 PM

anlambert mentioned this in rDLSd85bcdac5b8d: simple_lister: Split models into smaller chunks to avoid oversized db…. Jun 28 2019, 3:51 PM

nahimilega closed this task as Resolved by committing rDLSb972a2a88d25: swh.lister.cgit. Jun 28 2019, 5:22 PM

nahimilega added a commit: rDLSb972a2a88d25: swh.lister.cgit.

This task has been migrated to GitLab.
