Page MenuHomeSoftware Heritage

browse: Disambiguate URLs related to origin / snapshot context
ClosedPublic

Authored by anlambert on Apr 22 2020, 6:07 PM.

Details

Summary

Still on the road to T2342, I thought removing ambiguities in current browse URL scheme
was a prerequisite to ensure all archive objects can be correctly browsed and referenced
with updated SWHIDs.

So disambiguate and/or deprecate the following browse URLs:

  • GET /browse/origin/(origin_url)/content/(path)/
  • GET /browse/origin/(origin_url)/directory/(path)/
  • GET /browse/origin/(origin_url)/visit/(timestamp)/branches/
  • GET /browse/origin/(origin_url)/visit/(timestamp)/content/(path)/
  • GET /browse/origin/(origin_url)/visit/(timestamp)/directory/(path)/
  • GET /browse/origin/(origin_url)/visit/(timestamp)/log/
  • GET /browse/origin/(origin_url)/visit/(timestamp)/releases/
  • GET /browse/snapshot/(snapshot_id)/content/(path)/
  • GET /browse/snapshot/(snapshot_id)/directory/(path)/

Replace their use by providing timestamp and path as query parameters to these endpoints
for consistency:

  • GET /browse/origin/(origin_url)/content/
  • GET /browse/origin/(origin_url)/directory/
  • GET /browse/origin/(origin_url)/log/
  • GET /browse/origin/(origin_url)/branches/
  • GET /browse/origin/(origin_url)/releases/
  • GET /browse/snapshot/(snapshot_id)/content/
  • GET /browse/snapshot/(snapshot_id)/directory/

Endpoints documentation have also been updated to reflect these changes.

Sorry for the huge diff but I did not have a choice to process all URLs in one batch.

Closes T2115
Closes T2135

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D3046 (id=10828)

Rebasing onto 6708b108fb...

Current branch diff-target is up to date.
Changes applied before test
commit 79b2177e8c4e8f2875718c251d0bb0a5fab0feaf
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Apr 22 17:28:02 2020 +0200

    browse: Disambiguate URLs related to origin / snapshot context browsing
    
    Disambiguate and deprecate the following browse URLs:
    
      - GET /browse/origin/(origin_url)/content/(path)/
    
      - GET /browse/origin/(origin_url)/directory/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/branches/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/content/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/directory/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/log/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/releases/
    
      - GET /browse/snapshot/(snapshot_id)/content/(path)/
    
      - GET /browse/snapshot/(snapshot_id)/directory/(path)/
    
    Replace their use by providing timestamp and path as query parameters to these endpoints:
    
      - GET /browse/origin/(origin_url)/content/
    
      - GET /browse/origin/(origin_url)/directory/
    
      - GET /browse/origin/(origin_url)/log/
    
      - GET /browse/origin/(origin_url)/branches/
    
      - GET /browse/origin/(origin_url)/releases/
    
      - GET /browse/snapshot/(snapshot_id)/content/
    
      - GET /browse/snapshot/(snapshot_id)/directory/
    
    Update endpoints documentation to reflect these changes.
    
    Related to T2135
    Closes T2115

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/77/ for more details.

anlambert retitled this revision from browse: Disambiguate URLs related to origin / snapshot context browsing to browse: Disambiguate URLs related to origin / snapshot context.Apr 22 2020, 6:19 PM

Build is green

Patch application report for D3046 (id=10831)

Rebasing onto 6708b108fb...

Current branch diff-target is up to date.
Changes applied before test
commit b54b0d46dfbd11aa2e26b183229d1ebec624b7ce
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Apr 22 17:28:02 2020 +0200

    browse: Disambiguate URLs related to origin / snapshot context browsing
    
    Disambiguate and/or deprecate the following browse URLs:
    
      - GET /browse/origin/(origin_url)/content/(path)/
    
      - GET /browse/origin/(origin_url)/directory/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/branches/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/content/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/directory/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/log/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/releases/
    
      - GET /browse/snapshot/(snapshot_id)/content/(path)/
    
      - GET /browse/snapshot/(snapshot_id)/directory/(path)/
    
    Replace their use by providing timestamp and path as query parameters to these endpoints
    for consistency:
    
      - GET /browse/origin/(origin_url)/content/
    
      - GET /browse/origin/(origin_url)/directory/
    
      - GET /browse/origin/(origin_url)/log/
    
      - GET /browse/origin/(origin_url)/branches/
    
      - GET /browse/origin/(origin_url)/releases/
    
      - GET /browse/snapshot/(snapshot_id)/content/
    
      - GET /browse/snapshot/(snapshot_id)/directory/
    
    Update endpoints documentation to reflect these changes.
    
    Related to T2135
    Closes T2115

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/78/ for more details.

ardumont added inline comments.
docs/uri-scheme-browse-origin.rst
578

I am not sure we want to keep that dateutil.parser.parse. Its function input
is a plain string (quite large) and its output is sometimes unexpected [1]

We already discussed in other modules that we should converge on using iso8601
dates (well use that module at least). So might be, it'd be best to use more
conventional timestamp formats, iso8601 sounds like one sensible choice for at
least one format.

I imagine we have that behavior already so that might be another task entirely.
I just mention it here since we are reworking endpoints.

[1] https://forge.softwareheritage.org/D2463#change-xBZ2jUSgNbl8

Cool!

Wouldn't this be a good time to also move the URL to a query parameter? (Because of the double-slash issue that can happen on clients)

Wouldn't this be a good time to also move the URL to a query parameter? (Because of the double-slash issue that can happen on clients)

I refrain from asking this as this would be quite a change :D
Maybe that could be done as an extra step.

Looks good.

(Remains my question about keeping dateutil.parser.parse_date though)

This revision is now accepted and ready to land.Apr 23 2020, 11:54 AM

oh and also

/or deprecate the following browse URLs:

for how long do we deprecate? What's our timeframe policy to remove those, if we have any?

Wouldn't this be a good time to also move the URL to a query parameter? (Because of the double-slash issue that can happen on clients)

From my point of view, having the browsed object identifier as URL argument is better for consistency with the rest of the browse URL scheme.

Regarding the double-slash issue, I think we already mitigated it in rDWAPPS56fb869644ad45e0648f4ad45c07619d468019b4

I am not sure we want to keep that dateutil.parser.parse. Its function input
is a plain string (quite large) and its output is sometimes unexpected [1]

We already discussed in other modules that we should converge on using iso8601
dates (well use that module at least). So might be, it'd be best to use more
conventional timestamp formats, iso8601 sounds like one sensible choice for at
least one format.

I imagine we have that behavior already so that might be another task entirely.
I just mention it here since we are reworking endpoints.

Well, hopefully we already use ISO dates in origin visit URLs (see https://archive.softwareheritage.org/browse/origin/https://github.com/python/cpython/visit/2020-01-18T09:31:26Z/directory/ for instance).

I agree that restricting the date format is a good idea for consistency. I propose to only support Unix timestamps and ISO dates.

for how long do we deprecate? What's our timeframe policy to remove those, if we have any?

I do not really know, at least a couple of months in order to check if these endpoints are still requested in production.

I agree that restricting the date format is a good idea for consistency. I propose to only support Unix timestamps and ISO dates.

Awesome, i know at least @douardda will be happy to hear it ;)
Maybe make make that happen in a new diff when you have the time (as you wish ;)

Cheers mate ;)

I agree that restricting the date format is a good idea for consistency. I propose to only support Unix timestamps and ISO dates.

Awesome, i know at least @douardda will be happy to hear it ;)
Maybe make make that happen in a new diff when you have the time (as you wish ;)

Cheers mate ;)

Of course, this is out of scope to integrate these changes in that diff.

I created a new task on the subject (T2375)

Build is green

Patch application report for D3046 (id=10842)

Rebasing onto 6708b108fb...

Current branch diff-target is up to date.
Changes applied before test
commit 6945c4efa49b05877a3be5e6bb72af988ebbe5bd
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Wed Apr 22 17:28:02 2020 +0200

    browse: Disambiguate URLs related to origin / snapshot context browsing
    
    Disambiguate and/or deprecate the following browse URLs:
    
      - GET /browse/origin/(origin_url)/content/(path)/
    
      - GET /browse/origin/(origin_url)/directory/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/branches/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/content/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/directory/(path)/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/log/
    
      - GET /browse/origin/(origin_url)/visit/(timestamp)/releases/
    
      - GET /browse/snapshot/(snapshot_id)/content/(path)/
    
      - GET /browse/snapshot/(snapshot_id)/directory/(path)/
    
    Replace their use by providing timestamp and path as query parameters to these endpoints
    for consistency:
    
      - GET /browse/origin/(origin_url)/content/
    
      - GET /browse/origin/(origin_url)/directory/
    
      - GET /browse/origin/(origin_url)/log/
    
      - GET /browse/origin/(origin_url)/branches/
    
      - GET /browse/origin/(origin_url)/releases/
    
      - GET /browse/snapshot/(snapshot_id)/content/
    
      - GET /browse/snapshot/(snapshot_id)/directory/
    
    Update endpoints documentation to reflect these changes.
    
    Closes T2115
    Closes T2135

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/79/ for more details.

Wouldn't this be a good time to also move the URL to a query parameter? (Because of the double-slash issue that can happen on clients)

From my point of view, having the browsed object identifier as URL argument is better for consistency with the rest of the browse URL scheme.

Except those with a path, starting with this diff...

Regarding the double-slash issue, I think we already mitigated it in rDWAPPS56fb869644ad45e0648f4ad45c07619d468019b4

That only fixes trailing slashes, not all double slashes (eg. http:// in the origin url)

Wouldn't this be a good time to also move the URL to a query parameter? (Because of the double-slash issue that can happen on clients)

From my point of view, having the browsed object identifier as URL argument is better for consistency with the rest of the browse URL scheme.

Except those with a path, starting with this diff...

Regarding the double-slash issue, I think we already mitigated it in rDWAPPS56fb869644ad45e0648f4ad45c07619d468019b4

That only fixes trailing slashes, not all double slashes (eg. http:// in the origin url)

All right, will turn origin_url as a query parameter then before landing that diff.

You can do it in another diff, that will be less code to review

You can do it in another diff, that will be less code to review

Yes, you're right. Landing this then.