Page MenuHomeSoftware Heritage

storage*: Add origin_visit_status_add endpoint
ClosedPublic

Authored by ardumont on Jun 3 2020, 12:33 PM.

Details

Summary

Open origin_visit_status_add endpoint to add origin visit statuses.

Related to T2310

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Why return the object? For consistency with the other _add endpoints, it should return None. And also take an iterable as argument

swh/storage/in_memory.py
837–841

I don't think we should check this; it will crash on replays.

(And so would checking the origin existence, but origin_visit_add has the same issue so I think we should keep that one for now, for the sake of consistency.)

Why return the object? For consistency with the other _add endpoints, it should return None.

yes, well, i never quite like the other endpoints.
I guess that leaks here. I'll adapt.

And also take an iterable as argument

I recall something about calling endpoints which adds only one object to be called <add>_one.
Do we add those or simply the one adding iterable of those?

swh/storage/cassandra/storage.py
836

By the way, do I remove those by inlining directly the main endpoint?
I'm unsure so i thought, better ask ;)

Why return the object? For consistency with the other _add endpoints, it should return None.

yes, well, i never quite like the other endpoints.
I guess that leaks here. I'll adapt.

Also, counter example though: origin_visit_add returns an OriginVisit... ;)

olasd added inline comments.
swh/storage/in_memory.py
837–841

I suggest having a origin_visit_status_add_one, used by loaders, which does the sanity checks, and a origin_visit_status_add, which takes an iterable, and can be used by replayers to mass-replay these objects from a journal topic.

swh/storage/in_memory.py
837–841

Thanks!

I was currently renaming to origin_visit_status_add_one and was slowly convincing myself to add the other one on the iterable but i was blocking on the checks...

Now that's making sense!

ardumont retitled this revision from storage*: Add origin_visit_status_add endpoint to storage*: Add origin_visit_status_add_one endpoint.Jun 3 2020, 5:50 PM

Adapt according to review/discussion limiting to origin_visit_status_add_one.

I'll most probably open another diff for origin_visit_status_add which deals with
Iterable[OriginVisitStatus].

Build has FAILED

Patch application report for D3212 (id=11410)

Rebasing onto eef4900db5...

Current branch diff-target is up to date.
Changes applied before test
commit 496fb5f52a47a2133f5c7d7ccd9f744775e56b9e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add_one endpoint
    
    Related to T2310

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/213/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/213/console

ardumont added a project: Storage manager.
ardumont marked an inline comment as done.
ardumont added inline comments.
swh/storage/cassandra/storage.py
852–853

conclusion: It was checked indeed by attr-strict ;)

ardumont edited the summary of this revision. (Show Details)

Update endpoint interface docstring according to latest development

Build has FAILED

Patch application report for D3212 (id=11411)

Rebasing onto eef4900db5...

Current branch diff-target is up to date.
Changes applied before test
commit 9854f3647f67b3bdd411327e93e002c09f29c9ca
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add_one endpoint
    
    Related to T2310

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/214/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/214/console

Build is green

Patch application report for D3212 (id=11424)

Rebasing onto f9b2ca3fec...

Current branch diff-target is up to date.
Changes applied before test
commit 43dfedf5640a0e8b2f16c663326b56e367773e2e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add_one endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/222/ for more details.

Build is green

Patch application report for D3212 (id=11426)

Rebasing onto f9b2ca3fec...

Current branch diff-target is up to date.
Changes applied before test
commit 33848ec9c895549e4b50f5f2e38b7f190ed882f5
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add_one endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/223/ for more details.

Drop unnecessary instruction for in-memory implementation

Build is green

Patch application report for D3212 (id=11429)

Rebasing onto f9b2ca3fec...

Current branch diff-target is up to date.
Changes applied before test
commit 3cd4c8100a621caf4a96a2e46c66d7e1aba7ce5f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add_one endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/225/ for more details.

You didn't address my comment about non-existing visits.

And you're building model objects in tests, which diverges from other tests. Is it intentional?

swh/storage/cassandra/storage.py
856–857

swap these two lines

swh/storage/in_memory.py
844–845

swap (though it doesn't really matter for the in-mem backend)

swh/storage/storage.py
904–905

swap

You didn't address my comment about non-existing visits.

I was under the impression we settled on doing sanity checks on the add_one
implementation. Thus why it's not changed for that part.

And you're building model objects in tests, which diverges from other tests.
Is it intentional?

Well, it's simple model objects, I did not put much thought into it really.

What's the necessary change here then, creating origin-visit-status within the
storage_data?

swap*

I'll adapt those.

Thanks for the input ;)

Swap journal and backend add instructions order

Build is green

Patch application report for D3212 (id=11436)

Rebasing onto f9b2ca3fec...

Current branch diff-target is up to date.
Changes applied before test
commit c631224fd85f90e3f21d31c960f4cfcc4b0c4f62
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add_one endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/226/ for more details.

swh/storage/interface.py
804

nitpick: origin/visit/status/add_one

douardda added inline comments.
swh/storage/in_memory.py
837–841

I am not very found of this way of doing things.

The whole xxx_add_one/xxx_add for single stuff/iterable of stuffs looks useless to me and somehow add unecessary complexity (for the API user). And I am really not found of the idea having 2 endpoints with similar names and same semantic (add origin visit) that actually do not have the same semantic (with or without validation check).

I would prefer a single endpoint xxx_add that lakes an iterable, since this is the most common api "kind" of entry point we have, and have an optional flag argument "enable_sanity_check" (or similar) to disable these in the context of a mass insertion (replayer use case).

ardumont retitled this revision from storage*: Add origin_visit_status_add_one endpoint to storage*: Add origin_visit_status_add endpoint.
ardumont edited the summary of this revision. (Show Details)
ardumont edited the summary of this revision. (Show Details)

Adapt according to latest exchange with new signature:
origin_visit_status_add(Iterable[OriginVisitStatus, bool) -> None

Build is green

Patch application report for D3212 (id=11458)

Rebasing onto 25f584ffc4...

First, rewinding head to replay your work on top of it...
Applying: storage*: Add origin_visit_status_add endpoint
Changes applied before test
commit d8506d28dcd1005f68d19db9de56c11af31cbffd
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/230/ for more details.

Clarify docstring on new endpoint

Fix test name (a add_one name got left behind)

Build is green

Patch application report for D3212 (id=11459)

Rebasing onto 25f584ffc4...

First, rewinding head to replay your work on top of it...
Applying: storage*: Add origin_visit_status_add endpoint
Changes applied before test
commit 957cc66844eec2dd296c7530a7ecd1956773e423
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/231/ for more details.

Build is green

Patch application report for D3212 (id=11460)

Rebasing onto 25f584ffc4...

First, rewinding head to replay your work on top of it...
Applying: storage*: Add origin_visit_status_add endpoint
Changes applied before test
commit ac524142640eeafb89083080ec067d215445754a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    storage*: Add origin_visit_status_add endpoint
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/232/ for more details.

ardumont edited the summary of this revision. (Show Details)

Rework commit message

Build is green

Patch application report for D3212 (id=11461)

Rebasing onto 25f584ffc4...

First, rewinding head to replay your work on top of it...
Applying: Open `origin_visit_status_add` endpoint to add origin visit statuses
Changes applied before test
commit 51e66d90b3978abbad89af0affe31a72147dec92
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    Open `origin_visit_status_add` endpoint to add origin visit statuses
    
    This endpoints define 2 behaviors:
    
    - one with sanity checks. This allows to prevent any writes in case checks
      failure are detected. This one is expected to be the default behavior used by
      loaders.
    
    - one without sanity checks. This allows mass replay when we know in advance
      that some data could be there yet. This is expected to be used with replayer
      for example. As mentioned in the docstring, it's up to the caller to ensure
      errors are caught.
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/233/ for more details.

swh/storage/in_memory.py
837–841

I don't know. My suggestion was mainly driven by only seeing a single API user which needs both the "multiple entries being added at once" and the "disable sanity checking" features.

The replayer is kind of a special snowflake here as it needs to be able to process objects in inconsistent orders; I don't like the way these quirks are starting to permeate through our API to support this single user.

Now (forgot to submit)...

I'm unclear on what to do when with_sanity_checks are False (in regards to
test for example)

Because, for example, with unknown origin or unknown origin visit, I would
expect some implementations to fail nonetheless raising another exception type error
(e.g. storage for one with integrity check, in-memory with KeyError or something)...

  • drop with-sanity-checks flag
  • check only origin existence
  • adapt commit message
  • adapt endpoint docstring accordingly

Build is green

Patch application report for D3212 (id=11466)

Rebasing onto 25f584ffc4...

First, rewinding head to replay your work on top of it...
Applying: Open `origin_visit_status_add` endpoint to add origin visit statuses
Changes applied before test
commit 88a39d0f8e01710c95094298884e22f184ab7a1d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    Open `origin_visit_status_add` endpoint to add origin visit statuses
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/234/ for more details.

This revision is now accepted and ready to land.Jun 5 2020, 4:48 PM
douardda added inline comments.
swh/storage/in_memory.py
837–841

The replayer is kind of a special snowflake here as it needs to be able to process objects in inconsistent orders; I don't like the way these quirks are starting to permeate through our API to support this single user.

I agree the replayer usage is not the same as a loader usage, as I said on IRC. Not sure it is a "special snowflake" however. It will be a core feature of the archive.

I agree that this specific use case may require specific access point to a storage (be it a specific set of endpoints, or other solution).

Currently, I'm not sure I see what quirks you are referring to.

  • pgstorage: Fix missing db, cur parameters in origin_get call
  • Rename origin to origin_url in origin existence check (3 implems)

Build is green

Patch application report for D3212 (id=11475)

Rebasing onto c75da7a90e...

First, rewinding head to replay your work on top of it...
Applying: Open `origin_visit_status_add` endpoint to add origin visit statuses
Changes applied before test
commit 22ce17f06b5e352de0f830d22b9e389fc7e7a185
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    Open `origin_visit_status_add` endpoint to add origin visit statuses
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/235/ for more details.

Build is green

Patch application report for D3212 (id=11479)

Rebasing onto 88271f80dc...

Current branch diff-target is up to date.
Changes applied before test
commit dcef916e5edb5c21f02f2ca462ac58238890a419
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 3 11:12:34 2020 +0200

    Open `origin_visit_status_add` endpoint to add origin visit statuses
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/237/ for more details.