Fix crash when indexing the same directory twice with non-deterministic order

Description

persist_index_computations deduplicated row entries based on the entire
content of each row, but PostgreSQL requires 'id' to be unique.

This was not an issue in older versions of swh-indexer, because all
operations were deterministic, given a specific directory as input.

The recent switch to rdflib introduced non-determinism, so different
outputs may be returned for the same directory id. Whole-row
deduplication then treats them as distinct rows and lets duplicate ids
through.
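
The failure mode can be reproduced with a minimal sketch. The row shape
(a dict with 'id' and 'metadata' keys) and the helper name
dedup_by_whole_row are hypothetical stand-ins, not the actual
swh-indexer code:

```python
def dedup_by_whole_row(rows):
    """Hypothetical stand-in for the old behaviour: a row is dropped only
    if an identical row (same id AND same content) was already kept."""
    kept = []
    for row in rows:
        if row not in kept:  # full-content comparison against every kept row
            kept.append(row)
    return kept


# Two indexing runs over the same directory id whose outputs differ only
# in ordering (illustrative data).
rows = [
    {"id": "dir-1", "metadata": {"keywords": ["a", "b"]}},
    {"id": "dir-1", "metadata": {"keywords": ["b", "a"]}},
]

result = dedup_by_whole_row(rows)
ids = [row["id"] for row in result]
# Both rows survive because their content differs, so the same id would
# be written twice and violate the uniqueness requirement.
```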

With this commit, deduplication is now done on 'id', as expected.

As a side effect, persist_index_computations is now more efficient
because:

  1. it runs in linear time instead of quadratic in the number of metadata items
  2. it only compares dir ids, instead of the content of indexed metadata (which is arbitrarily large JSON-like data)
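
Deduplication keyed on 'id' could look roughly like the sketch below.
The helper name dedup_by_id and the row shape are hypothetical, and
keeping the last row seen for each id is an assumption; this message
does not say which row the real code keeps.

```python
def dedup_by_id(rows):
    """Keep exactly one row per 'id' in a single pass.

    Assumption: when several rows share an id, the last one seen wins.
    """
    by_id = {}
    for row in rows:
        # Only the id is hashed and compared; the (possibly large)
        # metadata payload is never inspected.
        by_id[row["id"]] = row
    return list(by_id.values())


# Illustrative data: 'dir-1' was indexed twice with different output.
rows = [
    {"id": "dir-1", "metadata": {"keywords": ["a", "b"]}},
    {"id": "dir-2", "metadata": {"keywords": ["c"]}},
    {"id": "dir-1", "metadata": {"keywords": ["b", "a"]}},
]

unique_rows = dedup_by_id(rows)
unique_ids = [row["id"] for row in unique_rows]
```

Each row costs one dictionary insertion, so the pass is linear in the
number of rows on average, and the result never contains two rows with
the same id.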

Details

Provenance
vlorentz authored on Sep 8 2022, 11:35 AM
vlorentz pushed on Sep 8 2022, 4:58 PM
Differential Revision
D8417: Fix crash when indexing the same directory twice with non-deterministic order
Parents
rDCIDXdd0274193f52: github: Add support for 'topics'
Build Status
Buildable 31404
Build 49125: test-and-build