Page MenuHomeSoftware Heritage

Web API endpoint /known/
ClosedPublic

Authored by DanSeraf on Jan 24 2020, 3:36 PM.

Details

Summary

New endpoint to check if a list of PIDs exist in the archive (T2176)

Get information about the Software Heritage persistent identifiers:

  • input: a JSON list of SWH PIDs
  • output: a dictionary with: pid as key and a boolean value['found'] to check if the

corrisponding pid exists or not

Diff Detail

Repository
rDWAPPS Web applications
Branch
known-api-endpoint
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10276
Build 15254: Cypress tests for swh-web diffsJenkins
Build 15253: tox-on-jenkinsJenkins
Build 15252: arc lint + arc unit

Event Timeline

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/web/api/views/known.py
15

This method should probably be in swh/web/api/views/identifiers.py.

39

You should use the parser from swh.model instead of manually splitting. (and pass the parsed object to group_swh_persistent_identifiers, so it doesn't have to parse it again)

42

response is unassigned if the method is not POST.

swh/web/common/utils.py
423–425

You should add a small wrapper around parse_persistent_identifier to handle the error, that can be shared with resolve_swh_persistent_id.

swh/web/tests/common/test_service.py
881–899

You should also add a test for when they are not (all) missing.

swh/web/tests/common/test_utils.py
117–126

this should be in swh/web/tests/api/test_identifiers.py

This revision now requires changes to proceed.Jan 24 2020, 4:08 PM
swh/web/api/views/known.py
42

There is no other method available, should i assigned it anyways?

  • known moved to identifiers view, parse pids before grouping them
  • test present pids
  • /known/ endpoint tests
  • test_service: testing missing hashes lookup with present pids

You shouldn't remove the handling of ValidationError exceptions; because it raises an error 500 Internal Server Error, which should only be returned if there is a bug in the server code.

Instead you should raise a BadInputExc, like your code initially did.

swh/web/api/views/identifiers.py
88–92 ↗(On Diff #9228)

shorter version:

persistent_ids = [parse_persistent_identifier(pid) for pid in request.data]
94 ↗(On Diff #9228)

You can use a dict comprehension:

response = {str(pid): {'found': False} for pid in persistent_ids}
102–106 ↗(On Diff #9228)

You can assign the dicts with content directly:

response was already initialized with found set to False for all pids, so you only need to do this:

for pid in persistent_ids:
    if pid.object_id not in missing_hashes:
        response[str(pid)]['found'] = True
swh/web/api/views/known.py
42

You should return an error 405 Method Not Allowed.

swh/web/common/service.py
1175

We're trying to use type annotations in new functions, so it should be:

def lookup_missing_hashes(grouped_pids: Dict[str, List[bytes]]) -> Set[str]:
This revision now requires changes to proceed.Jan 27 2020, 12:27 PM
  • returning 400 instead of 500 for invalid pid, type annotation on lookup_missing_hashes
  • rebase
zack requested changes to this revision.Jan 27 2020, 2:52 PM
zack added a subscriber: zack.
zack added inline comments.
swh/web/api/views/identifiers.py
82 ↗(On Diff #9236)

the indentation of the dictionary keys do not look right here

83 ↗(On Diff #9236)

I don't like "found" as a key here, as the name of the action is neither "search" nor "find" here.

I think it should be:

  • either "known: true/false",
  • or "status: known/unknown"

The latter would allow to convey more information in the future, e.g., an object that is present but hidden.

AFAICT this would be the first time we convey a non-boolean status of objects in a public API, so we probably do not want to go there; we can reconsider for API 2.0. Hence, my proposal is "known: true/false" for now.

(Naming is the most difficult problem in CS (cit.) *g*)

swh/web/common/service.py
1189–1199

these should be either a series of if/elif/elif/elif, or a dictionary with a single lookup

swh/web/common/utils.py
402–403

the indentation of the dictionary keys do not look right here

This revision now requires changes to proceed.Jan 27 2020, 2:52 PM
fix /known/ test expected values
  • returning 400 instead of 500 for invalid pid, type annotation on lookup_missing_hashes

This part is ok now, but cf my earlier comment: "You should add a small wrapper around parse_persistent_identifier to handle the error, that can be shared with resolve_swh_persistent_id."

swh/web/api/views/identifiers.py
83 ↗(On Diff #9236)

The latter would allow to convey more information in the future, e.g., an object that is present but hidden.

It would imply breaking the API compatibility, so no.

This revision now requires changes to proceed.Jan 27 2020, 3:44 PM
swh/web/api/views/identifiers.py
83 ↗(On Diff #9236)

It would imply breaking the API compatibility, so no.

This is not the place to have this conversation, but nack. Because (1) with API version 1 we make no API compatibility promises yet (that will happen starting version 2); and (2) adding fields to an existing dictionary should be possible (and documented) as a way to evolve an API that clients should expect.

  • returning 400 instead of 500 for invalid pid, type annotation on lookup_missing_hashes

This part is ok now, but cf my earlier comment: "You should add a small wrapper around parse_persistent_identifier to handle the error, that can be shared with resolve_swh_persistent_id."

Should i handle the error in parse_persistent_identifier while creating the pid?

Should i handle the error in parse_persistent_identifier while creating the pid?

No, parse_persistent_identifier is in swh.model, which shouldn't deal with webserver stuff (and BadInputExc is an HTTP error). You should add a function in swh-web that calls parse_persistent_identifier, catches ValidationError, and re-raises a BadInputExc.

utils.py: function to get a parsed persistent identifier

The only method accepted by the endpoint is POST so it returns 405 for other methods

Then you don't need if request.method == 'POST':, right?

douardda added inline comments.
swh/web/api/views/known.py
42

I agree this is the job of the framework (since the method has already been declared as POST only) to return the proper 405, but also that the if request.method == 'POST': should not be there at all.

This revision is now accepted and ready to land.Jan 28 2020, 1:37 PM