
rewrite the CGit lister as a proper lister
Closed, ResolvedPublic

Description

We have a hackish CGit lister as a shell script. We should integrate it with other listers and move it to the common repo.

Event Timeline

zack created this task.Apr 18 2019, 10:38 AM
zack triaged this task as Low priority.

i couldn't find the time to work through the developer setup and the lister tutorial, so I used the shell script to generate a list of projects for tor gitweb.

i had to tweak the script to work with our specific use case, which ended up looking like this:

#!/bin/bash

# Copyright (C) 2015 Stefano Zacchiroli <zack@upsilon.cc>
# License: GNU General Public License (GPL), version 3 or above

# Depends: libxml2-utils

REPO_URL_XPATH="//td[@class='sublevel-repo']/a/@href"

if [ -z "$1" ] ; then
    echo "Usage:   cgit-lister CGIT_BASE_URL"
    echo "Example: cgit-lister http://git.savannah.gnu.org/cgit/"
    echo "         cgit-lister http://anonscm.debian.org/cgit/"
    echo "         cgit-lister http://cgit.freedesktop.org/"
    exit 1
fi
CGIT_URL="$1"
shift

# extract base_url, excluding URL path
base_url=$(echo "$CGIT_URL" | sed 's|^\(https\?://[^/]*\)/.*|\1|')

# we use the xmllint shell, as there is no way to use a separator other than
# space for xpath results with "xmllint --xpath" :-( The output format of the
# shell sucks as well, but at least is line based.
curl -sSL "$CGIT_URL" | xmllint --html --xpath "$REPO_URL_XPATH" - \
    | sed -e "s|href=\"|${base_url}|g" -e 's/" /\n/g'
echo

notice the use of curl instead of relying on xmllint, which was refusing to load the site. the resulting URLs are incorrect, as they still have gitweb instead of git, but that is fixed with:

./cgit-lister https://gitweb.torproject.org/ | sed 's#https://gitweb#https://git#'

kind of nasty, but it works.
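For readers who find the two sed expressions hard to parse, here is a minimal Python sketch of what they do: strip the path from the cgit URL, and rewrite the gitweb host prefix to git. The function names `base_url` and `gitweb_to_git` are made up for illustration and are not part of the script.

```python
from urllib.parse import urlsplit


def base_url(cgit_url: str) -> str:
    """Return scheme://host of a URL, dropping the path.

    Mirrors the sed expression that computes base_url in the script.
    """
    parts = urlsplit(cgit_url)
    return f"{parts.scheme}://{parts.netloc}"


def gitweb_to_git(url: str) -> str:
    """Replace the first 'https://gitweb' with 'https://git'.

    Mirrors sed 's#https://gitweb#https://git#', which also only
    replaces the first match on each line.
    """
    return url.replace("https://gitweb", "https://git", 1)
```

For example, `base_url("https://gitweb.torproject.org/")` gives `https://gitweb.torproject.org`, and passing a URL on that host through `gitweb_to_git` changes only the host prefix.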

I would be pretty interested in integrating it with other listers and moving it to the common repo. I guess we can proceed in two ways.

  1. Use the script that we already have: just run it from Python to get the list of repos.
  2. Convert this script to Python, i.e. write Python code that implements the same logic.

I prefer the first option, as it is easy to implement: the major piece of the lister (the part that lists the repos) is already written and working nicely, and the script would not go to waste.
Also, if we rewrite this parsing code in Python we would need bs4, which is currently not present in requirements.txt.

Which way shall I proceed?
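To make the first option concrete, it could look roughly like the sketch below: call the existing script as a subprocess and split its output into lines. This is only an illustration; `list_repos_via_script` is a hypothetical name, and nothing here reflects how the other listers are structured.

```python
import subprocess
from typing import List, Sequence


def list_repos_via_script(command: Sequence[str]) -> List[str]:
    """Run an external lister command and return its non-empty output lines.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

It would be called with something like `list_repos_via_script(["./cgit-lister", "https://gitweb.torproject.org/"])`, assuming the script is in the current directory and executable. The lister would then depend on bash, curl and xmllint being installed wherever it runs.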

not that I get a vote in this, but i'd say convert to python. depending on xmllint is very brittle... i already had to tweak the thing once to make it work at all, and the pipeline is kind of nasty. i think you will have to import some HTML parser at some point anyways, so you might as well bite that bullet now.

i think there is already one parser in use somewhere that's stricter than bs4 - maybe it would work here as well?

> but i'd say convert to python. depending on xmllint is very brittle... i already had to tweak the thing once to make it work at all, and the pipeline is kind of nasty. i think you will have to import some HTML parser at some point anyways, so you might as well bite that bullet now.

Now I think it makes more sense to convert the script to Python.

zack added a comment.Jun 17 2019, 10:01 PM

Thanks for your interest in working on this @nahimilega, it would be very useful to move forward on a bunch of pending ingestions, including Tor!

Regarding those two options, definitely (2): rewrite this in Python. Just use requests in place of wget and, yes, bs4 for the parsing. We're going to eventually need bs4 for any other scraping-based lister, so adding it as a dependency is not a big deal at this point.
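As a rough idea of what the parsing part of the Python rewrite involves, here is a minimal sketch. To stay dependency-free it uses the standard library's `html.parser` instead of bs4, so it is a stand-in for the suggested approach rather than the suggested approach itself. It applies the same selection rule as the XPath in the shell script (links inside `td` cells whose class is `sublevel-repo`) and leaves fetching the page, e.g. with requests, to the caller. `CGitRepoParser` and `list_repo_urls` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class CGitRepoParser(HTMLParser):
    """Collect hrefs of <a> tags found inside <td class="sublevel-repo">."""

    def __init__(self) -> None:
        super().__init__()
        self.in_repo_cell = False
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Exact class match, like @class='sublevel-repo' in the XPath.
            self.in_repo_cell = attrs.get("class") == "sublevel-repo"
        elif tag == "a" and self.in_repo_cell and attrs.get("href"):
            # Simplification: accepts any <a> inside the cell, not only
            # direct children as the XPath does.
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_repo_cell = False


def list_repo_urls(index_html: str, cgit_url: str) -> List[str]:
    """Return absolute repository URLs found in a cgit index page."""
    parser = CGitRepoParser()
    parser.feed(index_html)
    # urljoin resolves both absolute-path and relative hrefs against cgit_url.
    return [urljoin(cgit_url, href) for href in parser.hrefs]
```

With bs4 the class and the two handlers would likely collapse into a single selector query, but the selection rule stays the same. Whether `sublevel-repo` covers every repository on a given cgit instance is an assumption carried over from the shell script.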