
Manipulate origin URLs instead of origin ids.
Closed, Public

Authored by vlorentz on Jun 7 2019, 4:23 PM.

Details

Summary

Depends on D1559 (allows querying the storage without an origin id).

(This diff looks huge, but most of it is just updating all the test data)

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-urls
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6304
Build 8729: tox-on-jenkins (Jenkins)
Build 8728: arc lint + arc unit

Event Timeline

douardda added a subscriber: douardda.
douardda added inline comments.
swh/indexer/origin_head.py
46–49

Why change the behavior of the head selection mechanism here?

Why can't we know at this point which type the processed origin is?

In any case, if this try-based solution is now mandatory, it would be nice to have it encapsulated in a generic self.get_head() method IMHO.
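
To illustrate the encapsulation suggested above, here is a minimal sketch of a generic get_head() that tries several strategies in turn. It assumes the try-based approach means "attempt each type-specific lookup and fall through on failure"; the class name HeadSelector, both strategy methods, and the branch dictionary layout are hypothetical and are not taken from swh/indexer/origin_head.py.

```python
from typing import Callable, Dict, List, Optional

# Assumed layout: branch name -> {"target": ...}, or None for a dangling branch.
Branches = Dict[bytes, Optional[dict]]


class HeadSelector:
    """Hypothetical helper, not the actual indexer class."""

    def __init__(self, branches: Branches) -> None:
        self.branches = branches

    def _head_from_head_branch(self) -> bytes:
        # Assumed strategy for VCS-like origins: follow the b"HEAD" branch.
        branch = self.branches[b"HEAD"]  # raises KeyError if absent
        if branch is None:
            raise ValueError("dangling HEAD")
        return branch["target"]

    def _head_from_single_branch(self) -> bytes:
        # Assumed fallback: accept a snapshot that has exactly one branch.
        (branch,) = self.branches.values()  # raises ValueError otherwise
        if branch is None:
            raise ValueError("dangling branch")
        return branch["target"]

    def get_head(self) -> Optional[bytes]:
        """Try each strategy in order; return None if none applies."""
        strategies: List[Callable[[], bytes]] = [
            self._head_from_head_branch,
            self._head_from_single_branch,
        ]
        for strategy in strategies:
            try:
                return strategy()
            except (KeyError, ValueError):
                continue
        return None
```

Callers then only see get_head(), so the try/except logic stays in one place and can later be replaced by type-based dispatch without touching call sites.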

This revision now requires changes to proceed. Jun 13 2019, 2:26 PM
vlorentz added inline comments.
swh/indexer/origin_head.py
46–49

Why can't we know at this point which type the processed origin is?

That requires a new API endpoint in the storage, but indeed, we could (and should).

swh/indexer/origin_head.py
46–49
  • Drop origin ids from tests as well
  • Use new-style snapshot_add in the tests (long overdue!)
  • Use origin_visit_get_latest instead of snapshot_get_latest (which is deprecated), in order to know the visit type.
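
Once the visit type is known, head selection can dispatch on it instead of guessing. The following is only a sketch under stated assumptions: the visit is modelled as a plain dict with a "type" key, the snapshot as a dict with a "branches" mapping, and the handler names and the type-to-handler table are hypothetical. It does not call the storage API.

```python
from typing import Callable, Dict, Optional


def head_for_vcs(snapshot: dict) -> Optional[bytes]:
    # Assumed rule for VCS-like visits: the head is the target of b"HEAD".
    branch = snapshot["branches"].get(b"HEAD")
    return branch["target"] if branch else None


def head_for_unknown(snapshot: dict) -> Optional[bytes]:
    # No rule known for this visit type in this sketch.
    return None


# Hypothetical mapping from visit type to head-selection rule.
HANDLERS: Dict[str, Callable[[dict], Optional[bytes]]] = {
    "git": head_for_vcs,
    "hg": head_for_vcs,
}


def select_head(visit: dict, snapshot: dict) -> Optional[bytes]:
    """Pick the head-selection rule from the visit type."""
    handler = HANDLERS.get(visit["type"], head_for_unknown)
    return handler(snapshot)
```
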
ardumont added a subscriber: ardumont.

Sounds good.

I don't see a migration script that backfills the new origin URL for the existing rows.
Or am I missing something?

Requesting changes for the sake of discussion on that point.

Cheers,

swh/indexer/sql/40-swh-func.sql
426

Did not read the rest yet... how are we backfilling the existing rows in the indexer db? Is there a script for that?

swh/indexer/storage/__init__.py
771

Unify with the other docstring one way (append : Url of the origin) or the other (drop the redundant definition)...

/me singing You've got the power! tududu du du du tududu du du du ;)

(~> lookup snap music if you don't grok that ;)

This revision now requires changes to proceed. Jun 19 2019, 10:32 AM
swh/indexer/sql/40-swh-func.sql
426
  1. deploy indexers with that patch
  2. run a full pass on all origins (which I planned on doing anyway, since we had to drop the task queue)
  3. delete rows without an origin_url if any (these would be origins that had metadata but no longer do)
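
Step 3 of the plan above amounts to a single DELETE statement. The sketch below shows the idea on an in-memory SQLite database so it can be run anywhere. The table name origin_metadata and its columns are hypothetical stand-ins, and a real indexer database would need the equivalent statement adapted to its actual schema.

```python
import sqlite3


def delete_rows_without_url(conn: sqlite3.Connection) -> int:
    """Delete rows whose origin_url was never backfilled; return how many."""
    # Hypothetical table and column names, for illustration only.
    cur = conn.execute("DELETE FROM origin_metadata WHERE origin_url IS NULL")
    conn.commit()
    return cur.rowcount
```

Running this only after the full pass in step 2 matters: before that pass, every existing row would still lack an origin_url and would be deleted.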

Don't know why the build failed just now.

swh/indexer/sql/40-swh-func.sql
426

Sounds fine to me, thanks.

Because Jenkins timed out while compiling the package, so it did not send it to PyPI the first time. I'm triggering a rebuild.

This revision is now accepted and ready to land. Jun 24 2019, 3:23 PM
This revision was landed with ongoing or failed builds. Jun 24 2019, 4:36 PM
This revision was automatically updated to reflect the committed changes.