Page MenuHomeSoftware Heritage

Return origin object for analogues URLs
AbandonedPublic

Authored by jayeshv on Jun 13 2022, 2:10 PM.

Details

Reviewers
anlambert
Group Reviewers
Reviewers
Summary

Get origin object for URLs with and without a trailing slash.

Diff Detail

Repository
rDGQL GraphQL API
Branch
origin-apis
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 29840
Build 46641: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 46640: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7980 (id=28755)

Rebasing onto f32d98cd15...

Current branch diff-target is up to date.
Changes applied before test
commit 188b8d0282df42992db86281c79a504bc1b21b41
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Mon Jun 13 14:00:14 2022 +0200

    Return origin object for analogues URLs
    
    Get origin object for URLs with and without a trailing slash.

Link to build: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/50/
See console output for more information: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/50/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 13 2022, 2:12 PM
Harbormaster failed remote builds in B29838: Diff 28755!

Build is green

Patch application report for D7980 (id=28756)

Rebasing onto f32d98cd15...

Current branch diff-target is up to date.
Changes applied before test
commit 04aa78500427750fd1b8e17a3ec25599f8c800e6
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Mon Jun 13 14:00:14 2022 +0200

    Return origin object for analogues URLs
    
    Get origin object for URLs with and without a trailing slash.

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/51/ for more details.

Build is green

Patch application report for D7980 (id=28757)

Rebasing onto f32d98cd15...

Current branch diff-target is up to date.
Changes applied before test
commit bf3cf795db27e62e43a530c717dd93650a888cd8
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Mon Jun 13 14:00:14 2022 +0200

    Return origin object for analogues URLs
    
    Get origin object for URLs with and without a trailing slash.

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/52/ for more details.

anlambert added a subscriber: anlambert.

I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.

This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).

This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.

Also could you add a functional test for this ?

This revision now requires changes to proceed.Jun 13 2022, 3:06 PM
swh/graphql/utils/utils.py
35

Could be a one liner:

return [url, url.rstrip("/") + "/"]

I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.

This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).

This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.

Also could you add a functional test for this ?

Thanks. I will add functional tests for this.
I agree that it is a good idea to have a shared storage proxy.
good idea that we can discuss about the shared layer in the upcoming sprint.

But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs
as we define in SWH. Maybe this can be a core function as well. (I forgot to add the logic for ://" character sequencemangled into ":,
fixing it now).

I kept a separate stub in origin_node to add any specific
logic that we may have to add for just origin URLs (on top of analogous URLs).

I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.

This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).

This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.

Also could you add a functional test for this ?

Thanks. I will add functional tests for this.
I agree that it is a good idea to have a shared storage proxy.
good idea that we can discuss about the shared layer in the upcoming sprint.

But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs
as we define in SWH. Maybe this can be a core function as well. (I forgot to add the logic for ://" character sequencemangled into ":,
fixing it now).

I kept a separate stub in origin_node to add any specific
logic that we may have to add for just origin URLs (on top of analogous URLs).

I found a few missing cases in the existing archive in swh-web.

eg: We replace http:/abc.com with http://abc.com but we will miss
checking for a URL with trailing slash here, It will be easier to fix
such cases if we make this a utility function and use it everywhere
to compare URLs(even outside the origin context)

Why should this be implemented by the API (as opposed to UIs)?

(I forgot to add the logic for ://" character sequencemangled into ":, fixing it now)

I do not think you need it for graphql. That code is there to workaround some bugs in windows outlook client and external SWHID resolvers
to ensure to not have a 404 when attempting to browse a SWHID (see details in rDWAPPS126e3ea6c9bfaa9b93940d000182d4deab472b9a).

With graphql, requesting such an origin that does not exist should return an error.

Why should this be impleme

(I forgot to add the logic for ://" character sequencemangled into ":, fixing it now)

I do not think you need it for graphql. That code is there to workaround some bugs in windows outlook client and external SWHID resolvers
to ensure to not have a 404 when attempting to browse a SWHID (see details in rDWAPPS126e3ea6c9bfaa9b93940d000182d4deab472b9a).

With graphql, requesting such an origin that does not exist should return an error.

Ok. Then I just need logic to manage trailing slash. Thanks

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs. I think it makes sense to return
data with or without a slash in both UI and API.

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs. I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs

Yes but basically, you will only need to do that processing when querying a single origin to the storage.
So adding an indirection for a one liner function that you will use only once is not really useful here.

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs.

which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.

I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.

If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.

As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs.

which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.

I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.

If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.

As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.

I too agree with this.

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs.

which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.

I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.

If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.

As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.

Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.

Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.

Well that was not he intention of this discussion, this is only about the new graphql API :-)

It make sense to discuss this also for the current API, but it should be a dedicated task; this would change the behavior of the currently published public API, so let's discuss it in a dedicated task.

swh/graphql/utils/utils.py
35

This is not the same as the original version; if the given url argument ends with a /, it will return a list with twice the same value.

Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.

Well that was not he intention of this discussion, this is only about the new graphql API :-)

It make sense to discuss this also for the current API, but it should be a dedicated task; this would change the behavior of the currently published public API, so let's discuss it in a dedicated task.

In fact, this was the original Web API behavior. The regression was introduced in rDWAPPS56fb869644ad45e0648f4ad45c07619d468019b4 and the endpoint doc says
it should return a 404 if the origin URL cannot be found in the archive. That particular case was not covered by tests so that's why I missed the regression.

swh/graphql/utils/utils.py
35

Indeed, I am tired these days ...

This change is not required