
Deal with IRIs
Closed, Migrated

Description

Currently, origins in the archive are identified by their IRI, such as this one: https://archive.softwareheritage.org/browse/origin/https://gitorious.org/systemy-zdarzeniowe%C4%85%C5%9B%C4%87/systemy-zdarzeniowe-gitorious-wiki.git/directory/

However, database schemas (origin-related tables and skipped_content), variables (grep -Ri url swh-environment/*/swh), APIs (origin-related tables and skipped_content for swh-storage, all origin arguments of swh-web), specifications, and documentation refer to them as URLs.

IMO, the easiest solution would be to keep the current format and names, and just update the specifications and documentation to mention that they are actually IRIs.
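
To make the URI/IRI distinction concrete: an IRI may contain non-ASCII characters directly, and it maps to a URI by percent-encoding those characters as UTF-8 bytes. The sketch below is a simplified illustration using only plain Python. The helper name `iri_to_uri` is hypothetical (it is not part of any swh module), and it ignores the normalization and internationalized host name handling that a full RFC 3987 mapping would cover.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character of an IRI as UTF-8 bytes.

    Simplified sketch: ASCII characters are left untouched, and no
    normalization or host name conversion is performed.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # One "%XX" escape per UTF-8 byte of the character
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


iri = "https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe-gitorious-wiki.git"
print(iri_to_uri(iri))
```

Applied to the gitorious origin, this yields the percent-encoded form that appears in the browse URL above.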

Event Timeline

vlorentz triaged this task as Normal priority. Jan 30 2020, 4:26 PM
vlorentz created this task.

I'm fine with switching to IRIs in the doc, just please expand what it means on first use (with a mention like "they are like URIs but"), as I don't think the acronym is that well-known yet, especially in the US.

zack renamed this task from Dealing with IRIs to SWHID: deal with IRIs. Apr 24 2020, 10:28 AM
vlorentz renamed this task from SWHID: deal with IRIs to Deal with IRIs. Apr 24 2020, 1:29 PM

I wrote this little script to count the origin URLs in the archive that are valid URIs, those that are valid IRIs but not URIs, and those that are neither:

from pprint import pprint

from rfc3987 import parse

from swh.web.common import service

batch_size = 100000
nb_iris = 0
nb_uris = 0
nb_origins = 0

iris = []
no_uri_iris = []

def process_origins(origins):
    # Classify each origin URL as a URI, an IRI that is not a URI, or neither
    global nb_origins, nb_iris, nb_uris
    nb_origins += len(origins)
    for origin in origins:
        try:
            parse(origin['url'], rule='URI')
            nb_uris += 1
        except ValueError:
            try:
                parse(origin['url'], rule='IRI')
                nb_iris += 1
                iris.append(origin['url'])
            except ValueError:
                # Neither a valid URI nor a valid IRI
                no_uri_iris.append(origin['url'])
    print(f'nb_origins = {nb_origins}, nb_iris = {nb_iris}, nb_uris = {nb_uris}')

origins = list(service.lookup_origins(origin_count=batch_size))

while len(origins) == batch_size:
    process_origins(origins)
    origins = list(service.lookup_origins(origin_from=origins[-1]["id"]+1, origin_count=batch_size))
process_origins(origins)

pprint(iris)
pprint(no_uri_iris)

There are exactly two origins with IRIs:

nb_origins = 115660314, nb_iris = 2, nb_uris = 115660251

['https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe-gitorious-wiki.git',
 'https://gitorious.org/systemy-zdarzenioweąść/systemy-zdarzeniowe.git']

Also, the remaining 61 origins have a URL that is neither a valid URI nor a valid IRI:

['http://code.google.com/eclipselabs/m/mobile-web-development-with-phonegap/sv\\',
 'http://code.qt.io/{non-gerrit}/qt-labs/coroutine.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml1-shadersplugin.git',
 'http://code.qt.io/{graveyard}/qt-historical.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtmodularization.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/webclient.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/simplegl.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/webscraps.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtscript-browser-env.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtjambi-awtbridge.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/remotecontrolwidget.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/itemviews-ng.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtuitest.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/modelviewer.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/systemtests.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/devnet-examples.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtspotify.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/segmentedbutton.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/graphics-dojo.git',
 'http://code.qt.io/{graveyard}/qt-creator-historical.git',
 'http://code.qt.io/{graveyard}/quick3d.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qt-compositor.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/bm2.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qmlcanvas.git',
 'http://code.qt.io/{graveyard}/qlogger.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtestlib-tools.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtscript-remote-debugging.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/mobile-demos.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qtcollator.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/doxygen2qthelp.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/simulator.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/kineticscroller.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/devdays-graphicssystem-plugin.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/maemo5-homescreen.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/scxml.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-gesturearea.git',
 'http://code.qt.io/{graveyard}/qtjsbackend.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qmlogre.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/symbian-overlay.git',
 'http://code.qt.io/{graveyard}/qtbinaryjson.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qt5-launch-demo.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-object-model.git',
 'http://code.qt.io/{graveyard}/qtmultimediakit.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/scene-graph.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/wolfenqt.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qt-autotester.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/nacl.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/devdays-windowsystem-server.git',
 'http://code.qt.io/{non-gerrit}/qt/qt3support.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/doctools.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-toucharea.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/qml-gestures-examples.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/widgets-ng.git',
 'http://code.qt.io/{graveyard}/qtx11support.git',
 'http://code.qt.io/{graveyard}/qtphonon.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/opencl.git',
 'http://code.qt.io/{graveyard}/qtjsondb.git',
 'http://code.qt.io/{graveyard}/v4vm.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/bm.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/scene-graph-demo.git',
 'http://code.qt.io/{non-gerrit}/qt-labs/meespot.git']
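
All the entries in this list appear to fail for the same reason: they contain `{`, `}` or a trailing `\`, which RFC 3986 does not allow unencoded in a URI and which RFC 3987 does not allow in an IRI either. The sketch below is a rough character-level check using only plain Python, not a replacement for the `rfc3987` validation used in the script. The names `FORBIDDEN_ASCII` and `offending_characters` are hypothetical.

```python
# Printable ASCII characters that may not appear unencoded in a URI
# (RFC 3986) or in an IRI (RFC 3987)
FORBIDDEN_ASCII = set(' "<>\\^`{|}')


def offending_characters(url):
    """Return the sorted set of characters that make `url` invalid.

    Rough sketch: it only looks for forbidden ASCII characters and
    control characters, and does not validate the overall structure.
    """
    return sorted(
        {
            ch
            for ch in url
            if ch in FORBIDDEN_ASCII or ord(ch) < 0x20 or ord(ch) == 0x7F
        }
    )


print(offending_characters('http://code.qt.io/{non-gerrit}/qt-labs/coroutine.git'))
print(offending_characters('http://code.google.com/eclipselabs/m/mobile-web-development-with-phonegap/sv\\'))
```

The first call reports the two braces and the second reports the backslash. The two gitorious IRIs listed earlier contain none of these characters.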