
Some objects from the original GitHub import have never actually been imported.
Started, Work in Progress, High, Public

Description

For around 20,000 origins that were imported in the original GitHub run, we failed to actually import the data, but we still created occurrences referencing commits (which are now dangling).

When doing incremental updates of these repositories, the git loader assumed that the revisions the occurrences pointed to had indeed been imported, so we never filled the gaps by reimporting the data.
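The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the git loader's actual code: `revisions_to_fetch`, its arguments and the dictionary-based history are hypothetical, and the `ignore_history` parameter merely borrows the name of the flag used in the reload snippet further down this task.

```python
def revisions_to_fetch(remote_history, known_heads, ignore_history=False):
    """Toy model of an incremental load (hypothetical helper).

    remote_history maps each revision id to the list of its parent ids.
    known_heads are the revisions that earlier occurrences point to; the
    incremental walk trusts them and does not descend below them.
    """
    trusted = set() if ignore_history else set(known_heads)
    # Heads of the remote repository: revisions that are nobody's parent.
    all_parents = {p for parents in remote_history.values() for p in parents}
    stack = [rev for rev in remote_history if rev not in all_parents]
    to_fetch = set()
    while stack:
        rev = stack.pop()
        if rev in trusted or rev in to_fetch:
            continue  # assumed to be in the archive already, or already queued
        to_fetch.add(rev)
        stack.extend(remote_history.get(rev, []))
    return to_fetch


# "b" is referenced by an old occurrence but was never actually stored.
history = {"c": ["b"], "b": ["a"], "a": []}
incremental = revisions_to_fetch(history, known_heads={"b"})
full = revisions_to_fetch(history, known_heads={"b"}, ignore_history=True)
```

In this model the incremental walk stops at `b`, so `a` and `b` stay missing however many times the origin is revisited. Only a walk that ignores the recorded heads picks them up.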

We should reimport it from the original git clones.

Event Timeline

olasd created this task. Nov 13 2017, 6:59 PM
olasd updated the task description.
douardda added a project: Restricted Project. Nov 19 2018, 3:29 PM
douardda added a project: Restricted Project.
olasd changed the task status from Open to Work in Progress. Thu, Jan 23, 6:37 PM

List of revisions whose parent is missing from the revision table (1259):

\copy (select id from revision_history where not exists (select 1 from revision where revision.id = parent_id)) to 'revisions_missing_parent';

List of origins (by sha1) containing orphan revisions, according to swh-graph: 255 origins, which feels a bit low.

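# Start from an empty output file.
: > walks
# cut -c 4- drops the first three characters of each exported id.
sort revisions_missing_parent | cut -c 4- | while read -r rev; do
  url="http://graph.internal.softwareheritage.org:5009/graph/randomwalk/swh:1:rev:$rev/ori?direction=backward"
  (GET "$url"; echo; echo) >> walks
done
# Keep the unique lines that mention an origin.
grep swh:1:ori walks | sort -u > origins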
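The two text transformations in the loop above can also be sketched in Python, which makes them easier to check on a sample line. This is a rough equivalent, not a replacement for the shell version: the helper names are hypothetical, and it assumes that each exported line starts with a three-character prefix (such as `\\x`), which is what `cut -c 4-` implies.

```python
# Base URL taken from the shell loop above.
GRAPH_BASE = "http://graph.internal.softwareheritage.org:5009/graph"


def randomwalk_url(copy_line):
    """Build the backward random-walk URL for one exported revision id."""
    # Drop the first three characters, like `cut -c 4-` (assumed prefix).
    rev = copy_line.rstrip("\n")[3:]
    return f"{GRAPH_BASE}/randomwalk/swh:1:rev:{rev}/ori?direction=backward"


def origins_in(walk_lines):
    """Return the unique lines mentioning an origin, like grep | sort | uniq."""
    return sorted({line.rstrip("\n") for line in walk_lines if "swh:1:ori" in line})


sample_url = randomwalk_url(r"\\x" + "0" * 40 + "\n")
sample_origins = origins_in(
    ["swh:1:rev:aa\n", "swh:1:ori:bb\n", "\n", "swh:1:ori:bb\n", "swh:1:ori:aa\n"]
)
```

Fetching the URLs is left out on purpose. The sketch covers only the string handling, so it can be tried without access to the internal graph service.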

Origin URLs:

Reloading (by hand) the origins with missing revisions:

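# Import path assumed from the swh-loader-git package layout.
from swh.loader.git.loader import GitLoader

# Reload each origin with ignore_history=True so that, as the flag name
# suggests, the loader does not rely on previously recorded history.
with open('origin_urls') as f:
    for line in f:
        url = line.strip()
        print(url)
        try:
            loader = GitLoader(url=url, ignore_history=True)
            ret = loader.load()
        except Exception as e:
            # Record the failure and move on to the next origin.
            ret = e
        print(url, ret)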

... currently in progress