Paths

Table of Contentst

Differential D7980

Return origin object for analogues URLs
AbandonedPublic
Actions

Authored by jayeshv on Jun 13 2022, 2:10 PM.

Details

Reviewers

anlambert

Group Reviewers

Reviewers

Summary

Get origin object for URLs with and without a trailing slash.

Diff Detail

Repository

rDGQL GraphQL API

Branch

origin-apis

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 29840
Build 46641: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 46640: arc lint + arc unit

Event Timeline

jayeshv created this revision.Jun 13 2022, 2:10 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptJun 13 2022, 2:10 PM

Build has FAILED

Patch application report for D7980 (id=28755)

Rebasing onto f32d98cd15...

Current branch diff-target is up to date.

Changes applied before test

commit 188b8d0282df42992db86281c79a504bc1b21b41
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Mon Jun 13 14:00:14 2022 +0200

    Return origin object for analogues URLs
    
    Get origin object for URLs with and without a trailing slash.

Link to build: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/50/
See console output for more information: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/50/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 13 2022, 2:12 PM

Harbormaster failed remote builds in B29838: Diff 28755!

mypy fix

Missing test

Build is green

Patch application report for D7980 (id=28756)

Rebasing onto f32d98cd15...

Current branch diff-target is up to date.

Changes applied before test

commit 04aa78500427750fd1b8e17a3ec25599f8c800e6
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Mon Jun 13 14:00:14 2022 +0200

    Return origin object for analogues URLs
    
    Get origin object for URLs with and without a trailing slash.

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/51/ for more details.

Harbormaster completed remote builds in B29839: Diff 28756.Jun 13 2022, 2:19 PM

Build is green

Patch application report for D7980 (id=28757)

Rebasing onto f32d98cd15...

Current branch diff-target is up to date.

Changes applied before test

commit bf3cf795db27e62e43a530c717dd93650a888cd8
Author: Jayesh Velayudhan <jayesh@softwareheritage.org>
Date:   Mon Jun 13 14:00:14 2022 +0200

    Return origin object for analogues URLs
    
    Get origin object for URLs with and without a trailing slash.

See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/52/ for more details.

Harbormaster completed remote builds in B29840: Diff 28757.Jun 13 2022, 2:21 PM

jayeshv requested review of this revision.Jun 13 2022, 2:21 PM

I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.

This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).

This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.

Also could you add a functional test for this ?

This revision now requires changes to proceed.Jun 13 2022, 3:06 PM

anlambert added inline comments.Jun 13 2022, 3:08 PM

swh/graphql/utils/utils.py
35	Could be a one liner: return [url, url.rstrip("/") + "/"]

In D7980#207423, @anlambert wrote:

I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.

This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).

This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.

Also could you add a functional test for this ?

Thanks. I will add functional tests for this.
I agree that it is a good idea to have a shared storage proxy.
good idea that we can discuss about the shared layer in the upcoming sprint.

But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs
as we define in SWH. Maybe this can be a core function as well. (I forgot to add the logic for ://" character sequencemangled into ":,
fixing it now).

I kept a separate stub in origin_node to add any specific
logic that we may have to add for just origin URLs (on top of analogous URLs).

In D7980#207450, @jayeshv wrote:

In D7980#207423, @anlambert wrote:

I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.

This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).

This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.

Also could you add a functional test for this ?

Thanks. I will add functional tests for this.
I agree that it is a good idea to have a shared storage proxy.
good idea that we can discuss about the shared layer in the upcoming sprint.

But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs
as we define in SWH. Maybe this can be a core function as well. (I forgot to add the logic for ://" character sequencemangled into ":,
fixing it now).

I kept a separate stub in origin_node to add any specific
logic that we may have to add for just origin URLs (on top of analogous URLs).

I found a few missing cases in the existing archive in swh-web.

eg: We replace http:/abc.com with http://abc.com but we will miss
checking for a URL with trailing slash here, It will be easier to fix
such cases if we make this a utility function and use it everywhere
to compare URLs(even outside the origin context)

Why should this be implemented by the API (as opposed to UIs)?

(I forgot to add the logic for ://" character sequencemangled into ":, fixing it now)

I do not think you need it for graphql. That code is there to workaround some bugs in windows outlook client and external SWHID resolvers
to ensure to not have a 404 when attempting to browse a SWHID (see details in rDWAPPS126e3ea6c9bfaa9b93940d000182d4deab472b9a).

With graphql, requesting such an origin that does not exist should return an error.

In D7980#207464, @vlorentz wrote:

Why should this be impleme

In D7980#207466, @anlambert wrote:

(I forgot to add the logic for ://" character sequencemangled into ":, fixing it now)

I do not think you need it for graphql. That code is there to workaround some bugs in windows outlook client and external SWHID resolvers
to ensure to not have a 404 when attempting to browse a SWHID (see details in rDWAPPS126e3ea6c9bfaa9b93940d000182d4deab472b9a).

With graphql, requesting such an origin that does not exist should return an error.

Ok. Then I just need logic to manage trailing slash. Thanks

In D7980#207464, @vlorentz wrote:

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs. I think it makes sense to return
data with or without a slash in both UI and API.

In D7980#207468, @jayeshv wrote:

In D7980#207464, @vlorentz wrote:

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs. I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs

Yes but basically, you will only need to do that processing when querying a single origin to the storage.
So adding an indirection for a one liner function that you will use only once is not really useful here.

In D7980#207469, @anlambert wrote:

In D7980#207468, @jayeshv wrote:

In D7980#207464, @vlorentz wrote:

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs.

which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.

I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.

If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.

As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.

In D7980#207558, @douardda wrote:

In D7980#207469, @anlambert wrote:

In D7980#207468, @jayeshv wrote:

In D7980#207464, @vlorentz wrote:

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs.

which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.

I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.

If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.

As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.

I too agree with this.

In D7980#207558, @douardda wrote:

In D7980#207469, @anlambert wrote:

In D7980#207468, @jayeshv wrote:

In D7980#207464, @vlorentz wrote:

Why should this be implemented by the API (as opposed to UIs)?

This is part of our current REST APIs.

which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.

I think it makes sense to return
data with or without a slash in both UI and API.

I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.

I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.

If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.

As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.

Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.

In D7980#207563, @anlambert wrote:

Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.

Well that was not he intention of this discussion, this is only about the new graphql API :-)

It make sense to discuss this also for the current API, but it should be a dedicated task; this would change the behavior of the currently published public API, so let's discuss it in a dedicated task.

douardda added inline comments.Jun 14 2022, 2:52 PM

swh/graphql/utils/utils.py
35	This is not the same as the original version; if the given `url` argument ends with a /, it will return a list with twice the same value.

In D7980#207569, @douardda wrote:

In D7980#207563, @anlambert wrote:

Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.

Well that was not he intention of this discussion, this is only about the new graphql API :-)

It make sense to discuss this also for the current API, but it should be a dedicated task; this would change the behavior of the currently published public API, so let's discuss it in a dedicated task.

In fact, this was the original Web API behavior. The regression was introduced in rDWAPPS56fb869644ad45e0648f4ad45c07619d468019b4 and the endpoint doc says
it should return a 404 if the origin URL cannot be found in the archive. That particular case was not covered by tests so that's why I missed the regression.

anlambert added inline comments.Jun 14 2022, 2:56 PM