Compute and display distribution of origins by forge
Closed, Migrated

Description

This meta-task tracks the activities related to computing and displaying the distribution of sources by forge/source code provider.
This involves:

  • identifying the forges/source code providers (easy for regularly crawled ones, more tricky for the save code now entries)
  • finding an efficient way of maintaining a counter of sources per forge/source code provider (HyperLogLog again?)
  • setting up an API entry point to get this information
  • displaying this information in a nice readable way on archive.softwareheritage.org (maybe on a dedicated page); options:
    • pie chart (beware, GitHub may use up all the space, so the info will be of little use)
    • sorted list (from bigger to smaller), maybe in a scrollable widget 20 lines high
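To make the pie chart concern concrete, here is a minimal sketch of the sorted-list option, assuming the per-forge counts are already available as a plain mapping. The function name `top_forges`, the "others" bucket and the sample counts are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple


def top_forges(counts: Dict[str, int], limit: int = 20) -> List[Tuple[str, int]]:
    """Return the `limit` largest forges, plus an "others" row for the rest."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    head, tail = ranked[:limit], ranked[limit:]
    if tail:
        # Fold the long tail into a single row so it stays visible
        head.append(("others", sum(count for _, count in tail)))
    return head


# Hypothetical counts, for illustration only
example = {"forge-a": 1000, "forge-b": 50, "forge-c": 30, "forge-d": 20}
rows = top_forges(example, limit=2)
total = sum(example.values())
for name, count in rows:
    print(f"{name:10} {count:6d} {100 * count / total:5.1f}%")
```

Printing the share next to each count shows at a glance how much a single dominant forge would take in a pie chart.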

Some related work has already been done and was tracked in T1463 and T1500 (now closed, why?)

Event Timeline

vlorentz triaged this task as Normal priority. Mar 15 2021, 12:28 PM

After some analysis, the data we need to properly implement this are:

  • the set of lister names and their instance names in order to organize origins by forge types (gitlab, cgit, sourceforge, ...)
  • a precise or estimated count for the origins listed by a given lister instance

Getting the set of listers and their instances can be done with a simple query to the scheduler database:

softwareheritage-scheduler=> select name, instance_name from listers order by name;
 CRAN          | cran
 GNU           | GNU
 bitbucket     | bitbucket
 cgit          | alpinelinux
 cgit          | git.gnu.org.ua
 cgit          | zx2c4
 cgit          | tor
 cgit          | hdiff.luite
 cgit          | gnu-savannah
 cgit          | openembedded
 cgit          | git.joeyh.name
 cgit          | baserock
 cgit          | git-kernel
 cgit          | fedora
 cgit          | qt.io
 cgit          | eclipse
 cgit          | yoctoproject
 debian        | Debian
 debian        | Debian-Security
 gitea         | git.fsfe.org
 gitea         | codeberg.org
 github        | github
 gitlab        | riseup
 gitlab        | lip6
 gitlab        | inria
 gitlab        | freedesktop
 gitlab        | ow2
 gitlab        | common-lisp
 gitlab        | gnome
 gitlab        | gite.lirmm
 gitlab        | gitlab
 gitlab        | framagit
 launchpad     | launchpad
 npm           | npm
 phabricator   | swh
 phabricator   | wikimedia
 phabricator   | blender
 phabricator   | llvm
 phabricator   | kde
 pypi          | pypi
 save-code-now | archive.softwareheritage.org
 sourceforge   | main

To get the count of loaded origins for a given lister instance, the best solution from my point of view is to extend the swh-counters features
by processing the URLs from the origin topic of swh-journal. It has the advantage of also processing origins submitted through the save code
now service.

For the record, I ran a small experiment yesterday by hacking on the swh-counters code and adding the following function to process origins:

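from collections import defaultdict
from typing import Dict
from urllib.parse import urlparse

import msgpack

from swh.counters.redis import Redis


def process_origins(origins: Dict[bytes, bytes], counters: Redis):
    # Group origin URLs by network location
    origins_netloc = defaultdict(set)
    for origin_bytes in origins.values():
        origin = msgpack.loads(origin_bytes)
        parsed_url = urlparse(origin["url"])
        netloc = parsed_url.netloc
        # Merge the per-project googlecode subdomains into a single counter
        if netloc.endswith("googlecode.com"):
            netloc = "googlecode.com"
        origins_netloc[netloc].add(origin["url"])

    # Add each set of URLs to the counter named after its netloc
    for k, v in origins_netloc.items():
        counters.add(k, v)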

I used the following config for swh-counters:

counters:
  cls: redis
  host: localhost

journal:
  brokers:
    - kafka1.internal.softwareheritage.org
    - kafka2.internal.softwareheritage.org
    - kafka3.internal.softwareheritage.org
    - kafka4.internal.softwareheritage.org
    
  prefix: swh.journal.objects
  group_id: anlambert.origin_counts.dev4
  object_types:
    - origin
  batch_size: 1000

I then processed all origins from the production archive with the following command:

$ swh counters -C ~/.config/swh/counters.yml journal-client

And these are the estimated counters (HyperLogLog-based) obtained, sorted in descending order of number of origins:

b'github.com' 156394620
b'bitbucket.org' 2128683
b'www.npmjs.com' 1679889
b'gitlab.com' 1023330
b'googlecode.com' 790026
b'pypi.org' 325025
b'gitorious.org' 122014
b'git.code.sf.net' 115484
b'svn.code.sf.net' 62191
b'Debian' 38533
b'salsa.debian.org' 33665
b'snapshot.debian.org' 32911
b'git.launchpad.net' 19435
b'framagit.org' 18803
b'cran.r-project.org' 17899
b'hdiff.luite.com' 13843
b'gitlab.gnome.org' 9076
b'gitlab.freedesktop.org' 4755
b'gitlab.inria.fr' 3905
b'codeberg.org' 3632
b'git.savannah.gnu.org' 2970
b'git.baserock.org' 2920
b'anongit.kde.org' 2499
b'phabricator.wikimedia.org' 2236
b'code.google.com' 2230
b'git.kernel.org' 2067
b'fedorapeople.org' 1699
b'ftp.gnu.org' 1579
b'scm.gforge.inria.fr' 1234
b'gitlab.ow2.org' 1120
b'Debian-Security' 1031
b'phabricator.kde.org' 1021
b'0xacab.org' 1018
b'git.torproject.org' 1017
b'gitlab.common-lisp.net' 782
b'www.softwareheritage.org' 741
b'gitlab.riscosopen.org' 528
b'gite.lirmm.fr' 470
b'gricad-gitlab.univ-grenoble-alpes.fr' 447
b'git.alpinelinux.org' 378
b'forgemia.inra.fr' 364
b'git.fsfe.org' 352
b'code.qt.io' 325
b'git.zx2c4.com' 294
b'plmlab.math.cnrs.fr' 288
b'git.renater.fr' 274
b'sourcesup.renater.fr' 274
b'subversion.renater.fr' 229
b'scm.sourcesup.renater.fr' 228
b'git.unistra.fr' 223
b'forge.softwareheritage.org' 174
b'hg.tryton.org' 174
b'git.yoctoproject.org' 169
b'hal.archives-ouvertes.fr' 167
b'opendev.org' 162
b'gitlab.huma-num.fr' 143
b'git.php.net' 113
b'gitlab.irstea.fr' 112
b'doi.org' 98
b'gitlab.adullact.net' 82
b'git.ik.bme.hu' 78
b'git.gnu.org.ua' 77
b'git.joeyh.name' 76
b'forge.univ-lyon1.fr' 75
b'git.libreoffice.org' 69
b'git.eclipse.org' 65
b'forge.grandlyon.com' 63
b'git.sr.ht' 49
b'gitlab.u-psud.fr' 46
b'developer.blender.org' 38
b'git.agesic.gub.uy' 37
b'source.netsurf-browser.org' 37
b'gitbox.apache.org' 36
b'dci-gitlab.cines.fr' 30
b'git.openembedded.org' 30
b'gopkg.in' 30
b'notabug.org' 30
b'git-wip-us.apache.org' 28
b'gitub.u-bordeaux.fr' 27
b'dev.ch-poitiers.fr' 26
b'code.ill.fr' 24
b'gitlab.orfeo-toolbox.org' 24
b'git.sch.bme.hu' 21
b'gitlab.math.unistra.fr' 21
b'edugit.org' 20
b'gist.github.com' 20
b'repo.or.cz' 18
b'foss.heptapod.net' 17
b'gitlabjf.ccomptes.fr' 16
b'forge.frm2.tum.de' 15
b'gitlab.cern.ch' 15
b'git.archlinux.org' 14
b'git.bde-insa-lyon.fr' 14
b'git.neodarz.net' 14
b'sourceware.org' 14
b'edu-git.ac-versailles.fr' 13
b'evilpiepirate.org' 13
b'git.kpe.io' 12
b'git.savannah.nongnu.org' 12
b'gitlab.cerema.fr' 12
b'gitlab.lip6.fr' 11
b'gitlab.xiph.org' 11
b'code.briarproject.org' 10
b'git.ricketyspace.net' 10
b'git.sesse.net' 10
b'gitlab.developers.cam.ac.uk' 10
b'gitlab.fing.edu.uy' 10
b'gogs.librecmc.org' 10
b'hg.code.sf.net' 10
b'hg.libsdl.org' 10
b'software.intel.com' 10
b'android.googlesource.com' 9
b'forge.extranet.logilab.fr' 9
b'git.beta.pole-emploi.fr' 9
b'git.rockbox.org' 9
b'git.singpolyma.net' 9
b'git.unicaen.fr' 9
b'gitlab.oit.duke.edu' 9
b'hg.icculus.org' 9
b'git.ademe.fr' 8
b'git.elephly.net' 8
b'git.infradead.org' 8
b'gitlab.alpinelinux.org' 8
b'gitlab.redox-os.org' 8
b'sourceforge.net' 8
b'anongit.freedesktop.org' 7
b'gerrit.googlesource.com' 7
b'inria.halpreprod.archives-ouvertes.fr' 7
b'review.coreboot.org' 7
b'git.hadrons.org' 6
b'git.pleroma.social' 6
b'gitlab.dune-project.org' 6
b'gitlab.onelab.info' 6
b'gitweb.torproject.org' 6
b'pagure.io' 6
b'spivey.oriel.ox.ac.uk' 6
b'svn.blender.org' 6
b'www.happyassassin.net' 6
b'GitHub.com' 5
b'anonscm.debian.org' 5
b'art1pirat.spdns.org' 5
b'code.videolan.org' 5
b'ec.europa.eu' 5
b'git.blender.org' 5
b'git.enlightenment.org' 5
b'git.mfiano.net' 5
b'git.osmocom.org' 5
b'git.suckless.org' 5
b'go.googlesource.com' 5
b'hal-preprod.archives-ouvertes.fr' 5
b'invent.kde.org' 5
b'mainstream.inf.elte.hu' 5
b'secure.phabricator.com' 5
b'source.puri.sm' 5
b'svn.apache.org' 5
b'svn.linuxfromscratch.org' 5
b'www.home.marutan.net' 5
b'crux.nu' 4
b'git.linux-nfs.org' 4
b'git.netfilter.org' 4
b'git.progress-linux.org' 4
b'git.sv.gnu.org' 4
b'git.zap.org.au' 4
b'hg.logilab.org' 4
b'hg.sr.ht' 4
b'jff.email' 4
b'jugit.fz-juelich.de' 4
b'legacy.helldragon.eu' 4
b'svn.icculus.org' 4
b'www.github.com' 4
b'bzr.ed.am' 3
b'chromium.googlesource.com' 3
b'git.cbaines.net' 3
b'git.dthompson.us' 3
b'git.freebsd.org' 3
b'git.ghostscript.com' 3
b'git.gnu.io' 3
b'git.lepiller.eu' 3
b'git.linuxfromscratch.org' 3
b'git.loetlabor-jena.de' 3
b'git.osdn.net' 3
b'git.pengutronix.de' 3
b'git.pofilo.fr' 3
b'git.startinblox.com' 3
b'git.tuxfamily.org' 3
b'git.zrythm.org' 3
b'gitlab.aei.uni-hannover.de' 3
b'gitlab.in2p3.fr' 3
b'gitlab.linphone.org' 3
b'hg.osdn.net' 3
b'inqlab.net' 3
b'libregit.org' 3
b'source.winehq.org' 3
b'svn.jdownloader.org' 3
b'svn.wildfiregames.com' 3
b'trac.wildfiregames.com' 3
b'www.cs.unm.edu' 3
b'' 2
b'archive.bologna.enea.it' 2
b'atlassian@bitbucket.org' 2
b'bazaar.launchpad.net' 2
b'cgit.freedesktop.org' 2
b'g.iterate.ch' 2
b'gcc.gnu.org' 2
b'git.2f30.org' 2
b'git.busybox.net' 2
b'git.codesynthesis.com' 2
b'git.coolaj86.com' 2
b'git.dpkg.org' 2
b'git.easter-eggs.org' 2
b'git.ffmpeg.org' 2
b'git.gnome.org' 2
b'git.hcoop.net' 2
b'git.ikilote.net' 2
b'git.libssh.org' 2
b'git.maneage.org' 2
b'git.ngyro.com' 2
b'git.openprivacy.ca' 2
b'git.opensvc.com' 2
b'git.osgeo.org' 2
b'git.ring0.de' 2
b'git.rip' 2
b'git.sagemath.org' 2
b'git.samba.org' 2
b'git.synz.io' 2
b'git.systemreboot.net' 2
b'git.theobroma-systems.com' 2
b'git.videolan.org' 2
b'gitbio.ens-lyon.fr' 2
b'gitea.petton.fr' 2
b'gitlab.gwdg.de' 2
b'gitlab.haskell.org' 2
b'gitlab.inf.elte.hu' 2
b'gitlab.isc.org' 2
b'gitlab.mbb.univ-montp2.fr' 2
b'gitlab.mim-libre.fr' 2
b'gitlab.nic.cz' 2
b'gitlab.opengeosys.org' 2
b'gitlab.univ-lr.fr' 2
b'gvipers.imt-lille-douai.fr' 2
b'hal-test.archives-ouvertes.fr' 2
b'hg.mozilla.org' 2
b'hg.nginx.org' 2
b'hg.openjdk.java.net' 2
b'hg.reportlab.com' 2
b'jxself.org' 2
b'lab.louiz.org' 2
b'launchpad.net' 2
b'muddlers.org' 2
b'people.freedesktop.org' 2
b'plugins.svn.wordpress.org' 2
b'profs.scienze.univr.it' 2
b'public-inbox.org' 2
b'repo.hu' 2
b'reviews.llvm.org' 2
b'scm.adullact.net' 2
b'sr.ht' 2
b'sunshinegardens.org' 2
b'svn.savannah.gnu.org' 2
b'svn.thedarkmod.com' 2
b'taylorhakes@github.com' 2
b'tinc-vpn.org' 2
b'voidpoint.io' 2
b'www.cl.cam.ac.uk' 2
b'www.davidsharp.com' 2
b'84.38.177.154' 1
b'abcl.org' 1
b'aomedia.googlesource.com' 1
b'argouml-spl.tigris.org' 1
b'boringssl.googlesource.com' 1
b'bos.seul.org' 1
b'bunnyhero@bitbucket.org' 1
b'buttslol.net' 1
b'c9x.me' 1
b'cm-gitlab.stanford.edu' 1
b'code-repo.d4science.org' 1
b'code.9front.org' 1
b'code.divoplade.fr' 1
b'code.gab.com' 1
b'code.heb12.com' 1
b'code.launchpad.net' 1
b'code.librehq.com' 1
b'code.research.uts.edu.au' 1
b'code.reversed.top' 1
b'ctp2.darkdust.net' 1
b'depp.brause.cc' 1
b'dev.ds-servers.com' 1
b'dev.hostsharing.net' 1
b'dmitri.shuralyov.com' 1
b'dpdk.org' 1
b'dthompson.us' 1
b'dtrebbien@bitbucket.org' 1
b'eldargab@github.com' 1
b'etckeeper.branchable.com' 1
b'filfox.info' 1
b'floppsie.comp.glam.ac.uk' 1
b'forge.clermont-universite.fr' 1
b'foundry.openuru.org' 1
b'framgit.org' 1
b'galexander.org' 1
b'genome-source.gi.ucsc.edu' 1
b'geopsy.org' 1
b'git-annex.branchable.com' 1
b'git-tails.immerda.ch' 1
b'git.0pointer.de' 1
b'git.alsa-project.org' 1
b'git.ardour.org' 1
b'git.assembla.com' 1
b'git.beyermatthi.as' 1
b'git.bouncycastle.org' 1
b'git.centos.org' 1
b'git.clfs.org' 1
b'git.dev.opencascade.org' 1
b'git.dgit.debian.org' 1
b'git.drobilla.net' 1
b'git.e2factory.org' 1
b'git.ebc.li' 1
b'git.embl.de' 1
b'git.foldling.org' 1
b'git.gnunet.org' 1
b'git.gnupg.org' 1
b'git.guilhem.org' 1
b'git.haiku-os.org' 1
b'git.hypra.fr' 1
b'git.ikiwiki.info' 1
b'git.imp.fu-berlin.de' 1
b'git.in-silico.ch' 1
b'git.in-ulm.de' 1
b'git.interior.edu.uy' 1
b'git.jami.net' 1
b'git.kernel.dk' 1
b'git.kyleam.com' 1
b'git.lekensteyn.nl' 1
b'git.ligo.org' 1
b'git.linaro.org' 1
b'git.liw.fi' 1
b'git.lysator.liu.se' 1
b'git.matrix.org' 1
b'git.meli.delivery' 1
b'git.minetest.land' 1
b'git.mpi-cbg.de' 1
b'git.musl-libc.org' 1
b'git.neil.brown.name' 1
b'git.net-core.org' 1
b'git.netsurf-browser.org' 1
b'git.nzoss.org.nz' 1
b'git.open-music-kontrollers.ch' 1
b'git.openldap.org' 1
b'git.openssl.org' 1
b'git.openstack.org' 1
b'git.parat.swiss' 1
b'git.plexbak.nl' 1
b'git.postgresql.org' 1
b'git.proxmox.com' 1
b'git.psyced.org' 1
b'git.pwmt.org' 1
b'git.qemu.org' 1
b'git.qsomula.top' 1
b'git.schottelius.org' 1
b'git.scilab.org' 1
b'git.sdaoden.eu' 1
b'git.sdf.org' 1
b'git.simple-cc.org' 1
b'git.spwhitton.name' 1
b'git.strongswan.org' 1
b'git.sv.nongnu.org' 1
b'git.teknik.io' 1
b'git.tiker.net' 1
b'git.toastfreeware.priv.at' 1
b'git.trustedfirmware.org' 1
b'git.tukaani.org' 1
b'git.vuxu.org' 1
b'git.wow.st' 1
b'git.xiph.org' 1
b'git.zvx8.com' 1
b'git@github.com' 1
b'gitea.eponym.info' 1
b'github.com.cnpmjs.org' 1
b'gitlab.caltech.edu' 1
b'gitlab.ccsd.cnrs.fr' 1
b'gitlab.denx.de' 1
b'gitlab.echothree.com' 1
b'gitlab.exascale-computing.eu' 1
b'gitlab.huawei.com' 1
b'gitlab.irap.omp.eu' 1
b'gitlab.labs.nic.cz' 1
b'gitlab.mister-muffin.de' 1
b'gitlab.mpi-sws.org' 1
b'gitlab.obspm.fr' 1
b'gitlab.omofumi.pl' 1
b'gitlab.petton.fr' 1
b'gitlab.rlp.net' 1
b'gitlab.savoirfairelinux.com' 1
b'gitlab.uni.lu' 1
b'gitweb.dragonflybsd.org' 1
b'gn.googlesource.com' 1
b'gnomint.git.sourceforge.net' 1
b'gnunet.org' 1
b'graphics.rwth-aachen.de:9000' 1
b'guix.gnu.org' 1
b'hg.lilotux.net' 1
b'hg.mozdev.org' 1
b'hg.savannah.gnu.org' 1
b'hih-git.neurologie.uni-tuebingen.de' 1
b'hub.darcs.net' 1
b'ikiwiki.branchable.com' 1
b'jelmer.uk' 1
b'joinup.ec.europa.eu' 1
b'juigitlab.esac.esa.int' 1
b'kernel.googlesource.com' 1
b'keysafe.branchable.com' 1
b'kylheku.com' 1
b'lab.jerasure.org' 1
b'lab.nexedi.com' 1
b'linux-libre.fsfla.org' 1
b'linuxtv.org' 1
b'llvm.org' 1
b'mbb-team.github.io' 1
b'mcabber.com' 1
b'myrepos.branchable.com' 1
b'nix-community.github.io' 1
b'nsz.repo.hu:49100' 1
b'opencircuitdesign.com' 1
b'opensource.ieee.org' 1
b'plugins.trac.wordpress.org' 1
b'pumpa.branchable.com' 1
b'r-36.net' 1
b'resources.oreilly.com' 1
b'review.haiku-os.org' 1
b'riscosopen.org' 1
b'schierlm@git.code.sf.net' 1
b'scm.osdn.net' 1
b'scribus.net' 1
b'software.legiasoft.com' 1
b'source.joinmastodon.org' 1
b'src.fedoraproject.org' 1
b'sre.ring0.de' 1
b'stoyokaramihalev.github.io' 1
b'svn.appwork.org' 1
b'svn.eby-sarna.com' 1
b'svn.filezilla-project.org' 1
b'svn.freebsd.org' 1
b'svn.kibibyte.se' 1
b'svn.osdn.net' 1
b'svn.r-project.org' 1
b'svn.savannah.nongnu.org' 1
b'svn.science.uu.nl' 1
b'svn.so-much-stuff.com' 1
b'svn.tuxfamily.org' 1
b'svn.xvid.org' 1
b'svn.zoy.org' 1
b'thelambdalab.xyz' 1
b'thingshare.ion.nu' 1
b'tildegit.org' 1
b'timorleste.github.io' 1
b'tomakehurst@github.com' 1
b'tuleap.net' 1
b'vicerveza.homeunix.net' 1
b'wenshao@github.com' 1
b'www.aleph1.co.uk' 1
b'www.codesrc.com' 1
b'www.fsfla.org' 1
b'www.gitlab.com' 1
b'www.ipol.im' 1
b'www.kermitproject.org' 1
b'www.kylheku.com' 1
b'www.mercurial-scm.org' 1
b'www.mitsuba-renderer.org' 1
b'www.octave.org' 1
b'www.riverbankcomputing.com' 1
b'www.xtideuniversalbios.org' 1
b'xenbits.xenproject.org' 1
b'xpra.org' 1
b'youpibouh.thefreecat.org' 1
b'zimbra-mirror@bitbucket.org' 1

It should be easy to map a lister instance name to one of these counters and produce an automatic display of those data in the webapp.
For those with ambiguities, we can still provide a manual mapping for some edge cases.
We can also get the numbers of origins not linked to a lister in production (gitorious, googlecode, ...).
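As a rough sketch of that mapping, assuming instance names as currently stored in the listers table and counter keys being netlocs: try a manual override first, then accept a counter key only when the instance name matches exactly one of them. The function name and the heuristic are hypothetical, and the `MANUAL_MAPPING` entries are only illustrative.

```python
from typing import Dict, Iterable, Optional

# Illustrative manual overrides for ambiguous cases (assumption)
MANUAL_MAPPING: Dict[str, str] = {
    "github": "github.com",
    "inria": "gitlab.inria.fr",
}


def counter_key_for_instance(
    instance_name: str, counter_keys: Iterable[str]
) -> Optional[str]:
    """Return the counter key matching a lister instance name, if any."""
    if instance_name in MANUAL_MAPPING:
        return MANUAL_MAPPING[instance_name]
    # Automatic match: the instance name is one of the dot-separated labels
    candidates = [key for key in counter_keys if instance_name in key.split(".")]
    # Only trust the automatic match when it is unambiguous
    return candidates[0] if len(candidates) == 1 else None


keys = ["github.com", "gist.github.com", "framagit.org",
        "gitlab.inria.fr", "scm.gforge.inria.fr"]
print(counter_key_for_instance("framagit", keys))
print(counter_key_for_instance("inria", keys))
print(counter_key_for_instance("tor", keys))
```

Instances for which this returns None are the ones that would need a new manual entry.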

Any more thoughts?

Great work! Awesome.

> Any more thoughts?
> It should be easy to map a lister instance name to one of these counters and produce an automatic display of those data in the webapp.

Regarding this, to ease the mapping between a lister and an instance name, we may want to rework the instance names in the scheduler
model (listers table) so that the value is actually the netloc of the origin.
Re-attaching an origin to a lister would then be a simple matter of checking the netloc (apart from a few exceptions like googlecode, which you
already handle in the snippet above).

My 2 cents.

> Regarding this, to ease the mapping between a lister and an instance name, we may want to rework the instance names in the scheduler
> model (listers table) so that the value is actually the netloc of the origin.

That would be awesome and would greatly simplify the code to write! +1

Nice to see this moving forward!

These entries in the counter log look suspicious, though; they are not origins:

b'atlassian@bitbucket.org' 2
b'taylorhakes@github.com' 2
b'bunnyhero@bitbucket.org' 1
b'dtrebbien@bitbucket.org' 1
b'eldargab@github.com' 1
b'git@github.com' 1
b'schierlm@git.code.sf.net' 1
b'tomakehurst@github.com' 1
b'wenshao@github.com' 1
b'zimbra-mirror@bitbucket.org' 1

Those correspond to origins submitted through save code now requests; all of them ended up as not found, of course.
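If we wanted to keep such entries out of the per-forge counters, one option is to key the counters on the parsed host name rather than on the raw netloc. This is a minimal sketch, not what the experiment above does: `forge_key` is a hypothetical helper and the URLs are made up. Whether these origins should be merged into their forge or rejected earlier is a separate decision.

```python
from urllib.parse import urlparse


def forge_key(origin_url: str) -> str:
    """Return a normalized host name to use as counter key."""
    parsed = urlparse(origin_url)
    # .hostname drops any "user@" prefix and ":port" suffix and is lowercased,
    # whereas .netloc keeps them verbatim
    host = parsed.hostname or ""
    if host.endswith("googlecode.com"):
        host = "googlecode.com"
    return host


# Made-up URLs, for illustration only
for url in (
    "https://git@github.com/user/repo",
    "https://GitHub.com/user/repo",
    "http://nsz.repo.hu:49100/repo",
):
    print(url, "->", forge_key(url))
```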

@anlambert @rdicosmo

For information: while discussing with @olasd, he reminded me that we already have a CLI entry point [1]
to compute the stats we want on the scheduler side.

What's missing, implementation-wise, is an endpoint exposing that information so it can be displayed.

So the question is: even though the swh-counters implementation has started, do we really want it there,
or would this scheduler-side approach be enough?

[1] https://forge.softwareheritage.org/source/swh-scheduler/browse/master/swh/scheduler/cli/origin.py$148-182

Sorry @anlambert, I was late at Monday's meeting and I completely missed this in your weekly plan, I would have pointed this out earlier.

The existing scheduler metrics are probably not complete enough for all we want to display (we should review them so they are), but the swh.scheduler journal client already gathers all the information needed, so we should be able to compute all that we need from the scheduler tables.

The main pain point is that we do have a bunch of origins for which we don't have a listed_origins entry (because they've been archived before the current lister version was deployed, then disappeared, or because we've never listed them in the first place, e.g. for save code now origins).

I think we should be able to "backfill" these known but disabled origins in the listed_origins tables, once (setting them as enabled=false so they don't clutter the scheduling).

For the archive coverage widget on the webapp homepage, I think we should only display the number of origins
that were processed by a loader, to reflect the current number of archived projects. Besides, some origins like
gitorious or googlecode do not have lister metrics in the scheduler database.

For the lister metrics extracted from the scheduler, we could add a different widget displaying them, after
adding a new endpoint to the scheduler interface to easily get the data.

After more thought about all those metrics, we could revamp the coverage widget into two tabs:

  • one tab displaying metrics about loaded origins with detailed counts by forge and links to search interface to browse them
  • one tab displaying metrics about listed origins from the data extracted from the scheduler database

The idea is to show what we have archived so far and what is planned to be archived or saved again.

> The existing scheduler metrics are probably not complete enough for all we want to display (we should review them so they are), but the swh.scheduler journal client already gathers all the information needed, so we should be able to compute all that we need from the scheduler tables.

Oh, I had not seen that the scheduler journal client processes origin visits and their statuses (I just read the code), so it is indeed better to use them than to rely on swh-counters.
I guess the cli to update metrics is executed periodically in production?
I will continue in that direction then.

> I guess the cli to update metrics is executed periodically in production?

I don't think that they are yet but that just got a priority increase now ;)

Indeed they are not.

softwareheritage-scheduler=> select * from scheduler_metrics;
 lister_id | visit_type | last_update | origins_known | origins_enabled | origins_never_visited | origins_with_pending_changes 
-----------+------------+-------------+---------------+-----------------+-----------------------+------------------------------
(0 rows)

As @olasd said in a previous comment, even if we compute the metrics, we will miss counters about origins not tied to a lister
(googlecode and gitorious for instance). So I am thinking again about a hybrid approach using the swh-counters metrics
implemented yesterday, which give a rough estimate of the number of origins by network location (as visit statuses are not
processed, only origins), and the scheduler metrics.

The origin_visit_stats table in the scheduler database has info from all known origins (we processed the full backlog of origin_visit_status entries to fill it up), so we should be able to reprocess/query it in a way that breaks down the origins by netloc too, if that's what we want the web frontend to consume.

But I think the current structure of the listers and listed_origins tables is an opportunity to solve the problem of keeping structured storage for:

  • upstream forges (currently mapped to "lister instances", and with metadata stored ad-hoc in swh-web)
  • mapping of origins to upstream forges ("listed origins")

I think it would be nice to store the structured data we have about forge instances inside the listers table, and use that as a source of truth. That way if next month a gitea instance ends up overtaking github.com, we can semi-automatically move that to the forefront in the frontend without having to change the logic.

To be able to do so, we will have to backfill some manually curated entries in the lister/listed_origins tables for older one-shot imports, as well as for disappeared forges.

Thinking about this further, rather than backfill all the forges we've listed a while ago and have since disappeared, we could make a fallback origin_visit_stats export broken down by netloc, for all the origins with no attached listers (which would also handle those origins recorded by save code now).
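To illustrate the shape of such a fallback export, here is a minimal in-memory sketch. In practice this would more likely be a query against the scheduler database, so the function, its arguments and the sample URLs are all hypothetical. It counts, per netloc, the known origins that have no listed_origins entry.

```python
from collections import Counter
from typing import Iterable, Set
from urllib.parse import urlparse


def unlisted_origins_by_netloc(
    origin_urls: Iterable[str], listed_urls: Set[str]
) -> Counter:
    """Count origins without an attached lister, broken down by netloc."""
    counts: Counter = Counter()
    for url in origin_urls:
        if url in listed_urls:
            # Already covered by the per-lister scheduler metrics
            continue
        counts[urlparse(url).netloc] += 1
    return counts


# Made-up URLs, for illustration only
known = [
    "https://gitorious.org/project-a/repo",
    "https://gitorious.org/project-b/repo",
    "https://github.com/user/repo",
]
listed = {"https://github.com/user/repo"}
fallback = unlisted_origins_by_netloc(known, listed)
print(fallback.most_common())
```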

Here is a report on what has been done so far, and some future directions regarding the display of these data in swh-web.

Relevant metrics regarding the distribution of origins and their loading status have been computed from the data produced by
the listers and stored in the scheduler database (thanks to @olasd), see below:

softwareheritage-scheduler=> select name, instance_name, last_update, origins_known, origins_enabled, origins_never_visited, origins_with_pending_changes from listers inner join scheduler_metrics on id = lister_id order by name;
     name      |        instance_name         |          last_update          | origins_known | origins_enabled | origins_never_visited | origins_with_pending_changes 
---------------+------------------------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------
 CRAN          | cran                         | 2021-07-13 12:14:07.919027+00 |         18292 |           18292 |                 18292 |                            0
 GNU           | GNU                          | 2021-07-13 12:14:07.919027+00 |           386 |             386 |                    32 |                           90
 bitbucket     | bitbucket                    | 2021-07-13 12:14:07.919027+00 |       2810325 |         2810325 |               1125199 |                         6892
 cgit          | code.qt.io                   | 2021-07-13 12:14:07.919027+00 |           278 |             278 |                    13 |                            6
 cgit          | git.yoctoproject.org         | 2021-07-13 12:14:07.919027+00 |           175 |             175 |                    10 |                            9
 cgit          | git.eclipse.org              | 2021-07-13 12:14:07.919027+00 |          1375 |            1375 |                  1307 |                            2
 cgit          | git.baserock.org             | 2021-07-13 12:14:07.919027+00 |          1524 |            1524 |                    70 |                            0
 cgit          | gitweb.torproject.org        | 2021-07-13 12:14:07.919027+00 |           519 |             519 |                   519 |                            0
 cgit          | git.openembedded.org         | 2021-07-13 12:14:07.919027+00 |            16 |              16 |                    16 |                            0
 cgit          | git.joeyh.name               | 2021-07-13 12:14:07.919027+00 |            62 |              62 |                    62 |                            0
 cgit          | fedorapeople.org             | 2021-07-13 12:14:07.919027+00 |           866 |             866 |                    53 |                           10
 cgit          | git.zx2c4.com                | 2021-07-13 12:14:07.919027+00 |           159 |             159 |                    14 |                           19
 cgit          | git.alpinelinux.org          | 2021-07-13 12:14:07.919027+00 |             6 |               6 |                     0 |                            1
 cgit          | gnu-savannah                 | 2021-07-13 12:14:07.919027+00 |          1029 |            1029 |                    39 |                           71
 cgit          | git.kernel.org               | 2021-07-13 12:14:07.919027+00 |          1091 |            1091 |                   375 |                           43
 cgit          | git.gnu.org.ua               | 2021-07-13 12:14:07.919027+00 |           145 |             145 |                   145 |                            0
 debian        | Debian-Security              | 2021-07-13 12:14:07.919027+00 |           788 |             788 |                   268 |                            0
 debian        | Debian                       | 2021-07-13 12:14:07.919027+00 |         35100 |           35100 |                    85 |                            0
 gitea         | codeberg.org                 | 2021-07-13 12:14:07.919027+00 |          8233 |            8233 |                  4928 |                          270
 gitea         | git.fsfe.org                 | 2021-07-13 12:14:07.919027+00 |           401 |             401 |                    51 |                            6
 github        | github                       | 2021-07-13 12:14:07.919027+00 |     180516812 |       180516812 |              61374514 |                      1381607
 gitlab        | gitlab                       | 2021-07-13 12:14:07.919027+00 |        200200 |          200200 |                 20638 |                          626
 gitlab        | gitlab.inria.fr              | 2021-07-13 12:14:07.919027+00 |          3343 |            3343 |                  1440 |                          114
 gitlab        | framagit.org                 | 2021-07-13 12:14:07.919027+00 |         20427 |           20427 |                  5265 |                          342
 gitlab        | gitlab.common-lisp.net       | 2021-07-13 12:14:07.919027+00 |           825 |             825 |                    65 |                            8
 gitlab        | gitlab.lip6.fr               | 2021-07-13 12:14:07.919027+00 |            69 |              69 |                    62 |                            0
 gitlab        | gite.lirmm.fr                | 2021-07-13 12:14:07.919027+00 |           638 |             638 |                   255 |                           37
 gitlab        | gitlab.ow2.org               | 2021-07-13 12:14:07.919027+00 |          1297 |            1297 |                   284 |                           37
 gitlab        | gitlab.gnome.org             | 2021-07-13 12:14:07.919027+00 |         13176 |           13176 |                  4771 |                          280
 gitlab        | gitlab.freedesktop.org       | 2021-07-13 12:14:07.919027+00 |          8008 |            8008 |                  3448 |                          226
 gitlab        | 0xacab.org                   | 2021-07-13 12:14:07.919027+00 |          1255 |            1255 |                   428 |                           20
 launchpad     | launchpad                    | 2021-07-13 12:14:07.919027+00 |         24018 |           24018 |                  4957 |                         2367
 npm           | npm                          | 2021-07-13 12:14:07.919027+00 |       1629224 |         1629224 |                  3302 |                          112
 phabricator   | swh                          | 2021-07-13 12:14:07.919027+00 |           189 |             189 |                    29 |                            0
 phabricator   | blender                      | 2021-07-13 12:14:07.919027+00 |             6 |               6 |                     1 |                            0
 phabricator   | blender                      | 2021-07-13 12:14:07.919027+00 |            41 |              41 |                    41 |                            0
 phabricator   | kde                          | 2021-07-13 12:14:07.919027+00 |          1036 |            1036 |                  1025 |                            0
 pypi          | pypi                         | 2021-07-13 12:14:07.919027+00 |        391270 |          391270 |                 73834 |                        18870
 save-code-now | archive.softwareheritage.org | 2021-07-13 12:14:07.919027+00 |            13 |              13 |                     0 |                            0
 save-code-now | archive.softwareheritage.org | 2021-07-13 12:14:07.919027+00 |             6 |               6 |                     0 |                            0
 save-code-now | archive.softwareheritage.org | 2021-07-13 12:14:07.919027+00 |          2199 |            2199 |                     0 |                            0
 sourceforge   | main                         | 2021-07-13 12:14:07.919027+00 |        101843 |               0 |                     0 |                            0
 sourceforge   | main                         | 2021-07-13 12:14:07.919027+00 |           290 |               0 |                     0 |                            0
 sourceforge   | main                         | 2021-07-13 12:14:07.919027+00 |         27617 |               0 |                     0 |                            0
 sourceforge   | main                         | 2021-07-13 12:14:07.919027+00 |         28622 |               0 |                     0 |                            0
 sourceforge   | main                         | 2021-07-13 12:14:07.919027+00 |        181290 |               0 |                     0 |                            0
(46 rows)

The computation of those metrics will be executed in production on a regular basis, probably each day, to keep them up to date.

The web application will have easy access to those metrics thanks to the swh-scheduler RPC API.

Regarding how they will be displayed, the idea is to keep the coverage widget in the homepage presenting origin categories
in a responsive grid (see screenshot below) while enhancing it in the following way:

  • only display each main forge type with its logo and a counter giving the sum of the number of origins across its instances
  • the logo will be made clickable and a modal (or popover) window will be displayed giving more details about the forge type: its listed instances with their number of origins, plus other relevant info
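
The per-forge counter described in the first bullet could be computed along these lines. This is only a minimal sketch: it assumes the first numeric column of the metrics table above is the origin count, the sample rows are copied from a few lines of that table, and the function name `origins_per_forge` is hypothetical.

```python
from collections import defaultdict

# (lister name, instance name, origin count) triples copied from a few rows
# of the metrics table above. Assumption: the origin count is the first
# numeric column of that table.
rows = [
    ("phabricator", "swh", 189),
    ("phabricator", "blender", 6),
    ("phabricator", "blender", 41),
    ("phabricator", "kde", 1036),
    ("save-code-now", "archive.softwareheritage.org", 13),
    ("save-code-now", "archive.softwareheritage.org", 6),
    ("save-code-now", "archive.softwareheritage.org", 2199),
]


def origins_per_forge(rows):
    """Sum origin counts over all rows sharing the same forge type."""
    totals = defaultdict(int)
    for lister_name, _instance_name, origin_count in rows:
        totals[lister_name] += origin_count
    return dict(totals)


totals = origins_per_forge(rows)
print(totals)
```

The totals only cover the sample rows; a real implementation would read every row from the scheduler metrics.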

At first, forge/origin type metadata (logo, description, ...) will continue to be stored on the swh-web side, but eventually a new table
should be created in the swh-scheduler database to store this information. This will make it possible to reuse it in other applications that
also need it.

Most of the counters that will be displayed will come from the swh-scheduler database. A few of them (gitorious,
googlecode, nixos, guix) are missing from the scheduler metrics until a backfill process is implemented for those, so they will
be hardcoded in swh-web until then.

I think we could also get an accurate count of deposit origins (HAL, IPOL) using the swh-deposit API. I will ping @ardumont about that
part once he gets back from holidays.

Thanks for this update, great work!

Only one nit about the display. Using modal windows/popovers will mean that there will be no easy way for a user to get the full list: one will have to click on each logo one by one, which could be quite annoying. Would it be possible to have a page with a rendering of the table above? (Not sure if we want all columns, but at least the last update time and the number of origins per forge instance look relevant and interesting to me.) It could be either in addition to what you propose (e.g., as a "coverage details" link, leading to the full page), or as a replacement of it (e.g., by making each forge icon just a link to the relevant anchor within the table on the "coverage details" page).

My initial thought was to avoid displaying too much info by default, to keep the coverage widget simple and readable, hence the details-on-demand approach.
But I agree that using modals or popovers is not so great in terms of UX.
Instead, we could split the coverage widget into two tabs:

  • one giving a high level overview of the archived origins, similar to what we have now with logos and counters
  • one giving the details of all forges we archived so far, displayed in a table as you suggested with relevant metrics and links to search origins for a given forge

I think we could also get an accurate count of deposit origins (HAL, IPOL) using swh-deposit API

So this is what I obtained by counting the netlocs of deposit origins with status done. The doi.org netloc corresponds to IPOL deposits.

Counter({'www.softwareheritage.org': 764,
         'hal.archives-ouvertes.fr': 323,
         'doi.org': 115,
         'inria.halpreprod.archives-ouvertes.fr': 99,
         'software.intel.com': 10,
         'elife.stencila.io': 7})

Some extra netloc mapping for IPOL must be handled on the swh-web side but at least counts will be accurate and retrieved dynamically.
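
For illustration, counting netlocs and mapping them to a deposit provider could look like the sketch below. The sample URLs and the names `count_netlocs` and `NETLOC_TO_PROVIDER` are hypothetical. Only the doi.org to IPOL mapping comes from the comment above; how the list of deposit origin URLs is fetched is left out.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample; the real URLs would be the origins of deposits
# whose status is "done".
deposit_origin_urls = [
    "https://hal.archives-ouvertes.fr/hal-00000001",
    "https://hal.archives-ouvertes.fr/hal-00000002",
    "https://doi.org/10.0000/example",
]

# Extra mapping handled on the web application side: a doi.org netloc
# corresponds to an IPOL deposit.
NETLOC_TO_PROVIDER = {"doi.org": "IPOL"}


def count_netlocs(urls):
    """Count origin URLs per network location (host part of the URL)."""
    return Counter(urlparse(url).netloc for url in urls)


netloc_counts = count_netlocs(deposit_origin_urls)

# Relabel netlocs that have a known provider, keep the others as-is.
provider_counts = Counter()
for netloc, count in netloc_counts.items():
    provider_counts[NETLOC_TO_PROVIDER.get(netloc, netloc)] += count

print(provider_counts)
```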

Instead, we could split the coverage widget into two tabs

  • one giving a high level overview of the archived origins, similar to what we have now with logos and counters
  • one giving the details of all forges we archived so far, displayed in a table as you suggested with relevant metrics and links to search origins for a given forge

After more thought on that UI, I think displaying origin details (forge instances, search links, lister metrics) on demand directly
in the coverage widget is the best way to go.

To avoid having to click on each forge icon to get origin details, I think a good tradeoff with the table display approach is to
put a bootstrap collapsible under each icon that will hold the origin details. Clicking on any collapsible handle will then expand/collapse
all these collapsible elements to show/hide all origin details.

This is how it will look after my first shot at that approach.


I am a bit puzzled by the numbers shown: do we really have only 200k origins for GitLab.com?
And we know we had some 1.5m origins for Google code, why only 700k shown here?

Also, we really need to review the English: remove 'hopefully'

I am a bit puzzled by the numbers shown: do we really have only 200k origins for GitLab.com?

Indeed, there is something weird here, as we have more than one million gitlab.com origins in the database.

softwareheritage=> select count(*) from origin where url like 'https://gitlab.com/%';
  count  
---------
 1023499
(1 row)

It looks like something was missed when computing lister metrics from the scheduler database; this needs further investigation.

And we know we had some 1.5m origins for Google code, why only 700k shown here?

That number comes from the netloc counters on the whole set of origins used in T3127#66621.
This is an estimate, but we get similar results by querying the database for google code origins (git, hg, svn + code.google.com netloc):

softwareheritage=> select count(*) from origin where url like '%.googlecode.com';
 count 
-------
 86641
(1 row)

softwareheritage=> select count(*) from origin where url like '%.googlecode.com/svn';
 count  
--------
 573968
(1 row)

softwareheritage=> select count(*) from origin where url like '%.googlecode.com/hg%';
 count  
--------
 126669
(1 row)

softwareheritage=> select count(*) from origin where url like 'http://code.google.com%';
 count 
-------
  1866
(1 row)

Total number of origins is 789144 after summing all these counters.
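
As a quick check of that sum, here are the four query results above added up, keyed by the LIKE pattern that produced them (the variable names are only illustrative):

```python
# Counts returned by the four queries above, keyed by their LIKE pattern.
googlecode_counts = {
    "%.googlecode.com": 86641,
    "%.googlecode.com/svn": 573968,
    "%.googlecode.com/hg%": 126669,
    "http://code.google.com%": 1866,
}

total = sum(googlecode_counts.values())
print(total)
```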

I do not know the exact number of origins we had before archiving them but my guess is some of them were
not of interest to load into the archive (T947 for instance).

Also, we really need to review the English: remove 'hopefully'

Sure

It looks like something was missed when computing lister metrics from the scheduler database; this needs further investigation.

Indeed, please do look into this, thanks.

Thanks for these details. This count is missing the 800k git origins; @ardumont and @olasd should be able to tell you how to find them.

Git origins correspond to URLs matching the following regexp: http://.*.googlecode.com (first SQL query in the above comment).
Based on T673#13217, there are 88307 googlecode git origins, not 800k.
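
To make the URL patterns discussed in this thread explicit, here is a rough classifier that mirrors the four LIKE patterns used in the earlier queries. It is only a sketch: the function name and the sample URLs are hypothetical, and the real origin URLs may have variants that are not covered here.

```python
from typing import Optional


def classify_googlecode_origin(url: str) -> Optional[str]:
    """Return the kind of Google Code origin a URL looks like, or None."""
    if url.startswith("http://code.google.com"):  # like 'http://code.google.com%'
        return "code.google.com"
    if url.endswith(".googlecode.com/svn"):  # like '%.googlecode.com/svn'
        return "svn"
    if ".googlecode.com/hg" in url:  # like '%.googlecode.com/hg%'
        return "hg"
    if url.endswith(".googlecode.com"):  # like '%.googlecode.com' (git)
        return "git"
    return None


# Hypothetical sample URLs, one per pattern plus a non-matching one.
samples = [
    "http://example.googlecode.com",
    "http://example.googlecode.com/svn",
    "http://example.googlecode.com/hg/",
    "http://code.google.com/p/example",
    "https://gitlab.com/example/project",
]
kinds = [classify_googlecode_origin(url) for url in samples]
print(kinds)
```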

@rdicosmo, the issues with gitlab listing have been identified. You can find the details in T3442.

The computation of those metrics will be executed in production on a regular basis, probably each day, to keep them up to date.

That part is deployed, so on both staging and production those scheduler metrics are updated daily.

Is there a reason not to close this task?

Nope, this is done, closing it as resolved.