Get origin object for URLs with and without a trailing slash.
Diff Detail
- Repository
- rDGQL GraphQL API
- Branch
- origin-apis
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 29840 Build 46641: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 46640: arc lint + arc unit
Event Timeline
Build has FAILED
Patch application report for D7980 (id=28755)
Rebasing onto f32d98cd15...
Current branch diff-target is up to date.
Changes applied before test
commit 188b8d0282df42992db86281c79a504bc1b21b41 Author: Jayesh Velayudhan <jayesh@softwareheritage.org> Date: Mon Jun 13 14:00:14 2022 +0200 Return origin object for analogues URLs Get origin object for URLs with and without a trailing slash.
Link to build: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/50/
See console output for more information: https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/50/console
Build is green
Patch application report for D7980 (id=28756)
Rebasing onto f32d98cd15...
Current branch diff-target is up to date.
Changes applied before test
commit 04aa78500427750fd1b8e17a3ec25599f8c800e6 Author: Jayesh Velayudhan <jayesh@softwareheritage.org> Date: Mon Jun 13 14:00:14 2022 +0200 Return origin object for analogues URLs Get origin object for URLs with and without a trailing slash.
See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/51/ for more details.
Build is green
Patch application report for D7980 (id=28757)
Rebasing onto f32d98cd15...
Current branch diff-target is up to date.
Changes applied before test
commit bf3cf795db27e62e43a530c717dd93650a888cd8 Author: Jayesh Velayudhan <jayesh@softwareheritage.org> Date: Mon Jun 13 14:00:14 2022 +0200 Return origin object for analogues URLs Get origin object for URLs with and without a trailing slash.
See https://jenkins.softwareheritage.org/job/DGQL/job/tests-on-diff/52/ for more details.
I think it would be better to handle that process directly in the Archive.get_origin function, adding an utility function for a small set of instructions seems a bit overkill.
This is how it is done in swh-web for instance (with some extra processing that should not be needed for the graphql case).
This makes me think that we should start moving storage object conversion code out of swh-web as basically you will end up re-implementing
a lot of features we already have. I think adding a new storage proxy that will do object conversion to JSON on the fly while keeping the storage
API could be a nice solution. This is a subject we should talk about during the code review sprint of that week.
Also could you add a functional test for this ?
swh/graphql/utils/utils.py | ||
---|---|---|
35 | Could be a one liner: return [url, url.rstrip("/") + "/"] |
Thanks. I will add functional tests for this.
I agree that it is a good idea to have a shared storage proxy.
good idea that we can discuss about the shared layer in the upcoming sprint.
But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs
as we define in SWH. Maybe this can be a core function as well. (I forgot to add the logic for ://" character sequencemangled into ":,
fixing it now).
I kept a separate stub in origin_node to add any specific
logic that we may have to add for just origin URLs (on top of analogous URLs).
I found a few missing cases in the existing archive in swh-web.
eg: We replace http:/abc.com with http://abc.com but we will miss
checking for a URL with trailing slash here, It will be easier to fix
such cases if we make this a utility function and use it everywhere
to compare URLs(even outside the origin context)
(I forgot to add the logic for ://" character sequencemangled into ":, fixing it now)
I do not think you need it for graphql. That code is there to workaround some bugs in windows outlook client and external SWHID resolvers
to ensure to not have a 404 when attempting to browse a SWHID (see details in rDWAPPS126e3ea6c9bfaa9b93940d000182d4deab472b9a).
With graphql, requesting such an origin that does not exist should return an error.
This is part of our current REST APIs. I think it makes sense to return
data with or without a slash in both UI and API.
I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.
But I don't think adding this logic to that layer (archive layer) is a good idea at least in this case. What
I'm trying to do is to make a utility function to identify all the analogous URLs
Yes but basically, you will only need to do that processing when querying a single origin to the storage.
So adding an indirection for a one liner function that you will use only once is not really useful here.
which is not a good enough argument to keep it. Let's not reproduce (potential) mistakes, if we can.
I think it makes sense to return
data with or without a slash in both UI and API.I also agree, extra trailing slash or missing trailing slash in origin URL is a common source of errors when searching for an origin,
better handle that case API side.
I disagree; one expect the API to behave as dumb/simple as possible: you ask for an origin URL, the API should return data on that very specific URL only. Imagine we do have stored both versions of the origin URL (for historical reasons), with and without a trailing /, we want the API to allow the user query one or the other.
If need be, provide an dedicated API endpoint to search for known similar origins, but make it a dedicated one, not part of the main get_origin endpoint.
As @vlorentz suggested, this "smart" behavior is indeed an UI concern, not an API one.
Ack, I will make the change to the /api/1/origin/ endpoint to remove the URL processing then.
Well that was not he intention of this discussion, this is only about the new graphql API :-)
It make sense to discuss this also for the current API, but it should be a dedicated task; this would change the behavior of the currently published public API, so let's discuss it in a dedicated task.
swh/graphql/utils/utils.py | ||
---|---|---|
35 | This is not the same as the original version; if the given url argument ends with a /, it will return a list with twice the same value. |
In fact, this was the original Web API behavior. The regression was introduced in rDWAPPS56fb869644ad45e0648f4ad45c07619d468019b4 and the endpoint doc says
it should return a 404 if the origin URL cannot be found in the archive. That particular case was not covered by tests so that's why I missed the regression.
swh/graphql/utils/utils.py | ||
---|---|---|
35 | Indeed, I am tired these days ... |