
Manipulate origin URLs instead of origin ids.
ClosedPublic

Authored by vlorentz on Jun 7 2019, 4:23 PM.

Details

Summary

Depends on D1559 (allows querying the storage without an origin id).

(This diff looks huge, but most of it is just updating all the test data)

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-urls
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6126
Build 8443: tox-on-jenkins (Jenkins)
Build 8442: arc lint + arc unit

Event Timeline

vlorentz created this revision. Jun 7 2019, 4:23 PM
douardda requested changes to this revision. Jun 13 2019, 2:26 PM
douardda added a subscriber: douardda.
douardda added inline comments.
swh/indexer/origin_head.py
34–35

Why change the behavior of the head selection mechanism here?

Why can't we know at this point which type the processed origin is?

In any case, if this try-based solution is now mandatory, it would be nice to have it encapsulated in a generic self.get_head() method IMHO.
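
For illustration, a minimal sketch of what such an encapsulation could look like. All names other than get_head() are hypothetical, and the snapshot is simplified to a plain mapping from branch name to revision id; this is not the code under review.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a snapshot is reduced to "branch name -> revision id".
Snapshot = Dict[str, str]


def try_git_head(snapshot: Snapshot) -> Optional[str]:
    # Hypothetical strategy: git-like origins expose a HEAD branch.
    return snapshot.get("HEAD")


def try_generic_head(snapshot: Snapshot) -> Optional[str]:
    # Hypothetical fallback: look for a conventional default branch name.
    for name in ("refs/heads/master", "master"):
        if name in snapshot:
            return snapshot[name]
    return None


class HeadSelector:
    """Tries each strategy in order and returns the first head found."""

    strategies: List[Callable[[Snapshot], Optional[str]]] = [
        try_git_head,
        try_generic_head,
    ]

    def get_head(self, snapshot: Snapshot) -> Optional[str]:
        # Callers only see get_head(); the try-based logic stays in here.
        for strategy in self.strategies:
            head = strategy(snapshot)
            if head is not None:
                return head
        return None
```

With this shape, adding a strategy for another origin type only means appending one function to the list.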

This revision now requires changes to proceed. Jun 13 2019, 2:26 PM
vlorentz planned changes to this revision. Jun 13 2019, 2:41 PM
vlorentz added inline comments.
swh/indexer/origin_head.py
34–35

Why can't we know at this point which type the processed origin is?

That requires a new API endpoint in the storage, but indeed, we could (and should).

vlorentz added inline comments. Jun 14 2019, 10:34 AM
swh/indexer/origin_head.py
34–35
vlorentz updated this revision to Diff 5247. Jun 14 2019, 3:05 PM
  • Drop origin ids from tests as well
  • Use new-style snapshot_add in the tests (long overdue!)
  • Use origin_visit_get_latest instead of snapshot_get_latest (which is deprecated), in order to know the visit type.
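
As a rough sketch of the last point: once the latest visit is known, the head can be chosen from the visit type rather than by trial and error. The FakeStorage class, the shape of the returned visit dict, and the branch names are assumptions made for this example only; the real storage API may differ.

```python
from typing import Any, Dict, Optional


class FakeStorage:
    """Stand-in for the storage; method shape and return value are assumed."""

    def __init__(self, visits: Dict[str, Dict[str, Any]]) -> None:
        self._visits = visits

    def origin_visit_get_latest(self, origin_url: str) -> Optional[Dict[str, Any]]:
        # Looked up by origin URL, not by origin id.
        return self._visits.get(origin_url)


def head_for_origin(storage: FakeStorage, origin_url: str) -> Optional[str]:
    visit = storage.origin_visit_get_latest(origin_url)
    if visit is None:
        return None
    # Assumption: the visit carries its branches inline as name -> revision id.
    branches = visit["branches"]
    if visit["type"] == "git":
        return branches.get("HEAD")
    # Hypothetical fallback for every other visit type.
    return branches.get("master")
```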
vlorentz updated this revision to Diff 5248. Jun 14 2019, 3:08 PM

add missing aliases.

vlorentz updated this revision to Diff 5320. Jun 18 2019, 5:34 PM

also patch the storage.

vlorentz updated this revision to Diff 5321. Jun 18 2019, 5:36 PM

bump dependency version

vlorentz edited the summary of this revision. Jun 18 2019, 5:38 PM
ardumont requested changes to this revision. Jun 19 2019, 10:32 AM
ardumont added a subscriber: ardumont.

Sounds good.

I'm missing a migration script for the actual data to backfill the new origin-url from the existing rows.
Or am I missing something?

Requesting changes for the sake of discussion on that point.

Cheers,

swh/indexer/sql/40-swh-func.sql
427 ↗(On Diff #5322)

Did not read the rest yet... how are we backfilling the existing rows in the indexer db? Is there a script for that?

swh/indexer/storage/__init__.py
769

Unify with the other docstring one way (append : Url of the origin) or the other (drop the redundant definition)...

/me singing You've got the power! tududu du du du tududu du du du ;)

(~> lookup snap music if you don't grok that ;)

This revision now requires changes to proceed. Jun 19 2019, 10:32 AM
vlorentz updated this revision to Diff 5331. Jun 19 2019, 10:48 AM

unify docstrings

vlorentz added inline comments. Jun 19 2019, 11:03 AM
swh/indexer/sql/40-swh-func.sql
427 ↗(On Diff #5322)
  1. deploy indexers with that patch
  2. run a full pass on all origins (which I planned on doing anyway, since we had to drop the task queue)
  3. delete rows without an origin_url if any (these would be origins that had metadata but no longer do)
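
A small, self-contained sketch of how steps 2 and 3 play out, using an in-memory SQLite database. The origin_metadata table and its columns are hypothetical stand-ins chosen for this example, not the indexer's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: pre-existing rows are keyed by origin id and have no URL yet.
conn.execute(
    "CREATE TABLE origin_metadata ("
    "origin_id INTEGER PRIMARY KEY, origin_url TEXT, metadata TEXT)"
)
conn.executemany(
    "INSERT INTO origin_metadata (origin_id, metadata) VALUES (?, ?)",
    [(1, "{}"), (2, "{}")],
)

# Step 2: the full pass rewrites rows for origins that still have metadata,
# filling in origin_url. Here only origin 1 still has metadata.
conn.execute(
    "UPDATE origin_metadata SET origin_url = ? WHERE origin_id = ?",
    ("https://example.org/repo.git", 1),
)

# Step 3: rows the full pass never touched still lack an origin_url; drop them.
deleted = conn.execute(
    "DELETE FROM origin_metadata WHERE origin_url IS NULL"
).rowcount
conn.commit()

remaining = conn.execute(
    "SELECT origin_id, origin_url FROM origin_metadata"
).fetchall()
```

The deletion in step 3 is only safe once the full pass has completed; run earlier, it would also drop rows that simply had not been reprocessed yet.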
vlorentz marked an inline comment as done. Jun 19 2019, 11:04 AM

Don't know why the build is failing now.

swh/indexer/sql/40-swh-func.sql
427 ↗(On Diff #5322)

Sounds fine to me, thanks.

Because Jenkins timed out while compiling the package, so it did not upload it to PyPI the first time. I'm triggering a rebuild.

ardumont accepted this revision. Jun 19 2019, 11:43 AM
douardda accepted this revision. Jun 24 2019, 3:23 PM
This revision is now accepted and ready to land. Jun 24 2019, 3:23 PM
This revision was landed with ongoing or failed builds. Jun 24 2019, 4:36 PM
This revision was automatically updated to reflect the committed changes.