persist_index_computations deduplicated row entries based on the entire
content of the row, but PostgreSQL enforces that 'id' is unique.
This was not an issue in older versions of swh-indexer, because all
operations were deterministic, given a specific directory as input.
The recent switch to rdflib introduced non-determinism, so different
outputs may be returned for the same directory id. Rows with the same id
but different content are not seen as duplicates by content-based
deduplication, so duplicate ids could still reach the database.
With this commit, deduplication is now done on 'id', as expected.
As a side effect, persist_index_computations is now more efficient
because:
- it runs in linear time instead of quadratic in the number of metadata items
- it only compares dir ids, instead of the content of indexed metadata (which is arbitrarily large JSON-like data)
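
For illustration only, here is a minimal sketch of the difference between
the two strategies. It is not the actual swh-indexer code: the function
names, the plain-dict row shape, and the choice to keep the last row seen
for a given id are all assumptions made for this example.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]  # hypothetical row shape: {"id": ..., "metadata": ...}


def deduplicate_by_content(rows: List[Row]) -> List[Row]:
    """Content-based deduplication (hypothetical sketch of the old approach).

    Each row is compared against every row kept so far, so this is
    quadratic in the number of rows, and each comparison walks the
    whole metadata structure.
    """
    results: List[Row] = []
    for row in rows:
        if row not in results:
            results.append(row)
    return results


def deduplicate_by_id(rows: List[Row]) -> List[Row]:
    """Id-based deduplication (hypothetical sketch of the new approach).

    A dict keyed on 'id' gives one row per id in linear time.
    Assumption: when two rows share an id, the last one wins.
    """
    unique: Dict[Any, Row] = {}
    for row in rows:
        unique[row["id"]] = row
    return list(unique.values())


# Two rows for the same directory id with different metadata, as could
# happen with non-deterministic output.
rows = [
    {"id": "dir1", "metadata": {"name": "foo"}},
    {"id": "dir1", "metadata": {"name": "foo", "extra": True}},
    {"id": "dir2", "metadata": {}},
]

by_content = deduplicate_by_content(rows)  # still contains "dir1" twice
by_id = deduplicate_by_id(rows)            # one row per id
```

With the sample rows above, the content-based variant keeps both "dir1"
rows because their metadata differs, which is the case that would violate
the uniqueness constraint on 'id'.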