Page MenuHomeSoftware Heritage

Add utility to sanitize and retrieve canonical github urls
ClosedPublic

Authored by ardumont on May 16 2022, 3:03 PM.

Details

Summary

Based on the work in webapp [1]

This new code is within a new arborescence as some more code will get moved alongside in
a new commit (the new session github currently in swh.lister module).

Related to T4232

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/assets/src/utils/functions.js$118-160

Diff Detail

Event Timeline

Build is green

Patch application report for D7836 (id=28302)

Rebasing onto dae963440b...

Current branch diff-target is up to date.
Changes applied before test
commit dd1158bb1f308da55e8dd852a1916d60867e839b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon May 16 15:01:25 2022 +0200

    Add utility to sanitize and retrieve canonical github urls
    
    Related to T4232

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/407/ for more details.

anlambert added inline comments.
swh/core/utils.py
219–233

I think you should add support to pass GitHub API token in Authorization HTTP header, there is good chances we hit the rate limit without it.

swh/core/utils.py
219–233

Right, i had it in mind when i started and then I forgot it...

Move code in a dedicated module (explanation will be added in the diff description)

Build is green

Patch application report for D7836 (id=28307)

Rebasing onto dae963440b...

Current branch diff-target is up to date.
Changes applied before test
commit b9252925472099fe28714da568fe462a46ba70b7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon May 16 15:01:25 2022 +0200

    Add utility to sanitize and retrieve canonical github urls
    
    This new code is within a new arborescence as some more code will get moved alongside in
    a new commit (the new session github currently in swh.lister module).
    
    Related to T4232

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/408/ for more details.

swh/core/utils.py
219–233

I'll plug this in another diff.
I need to refactor [1] out of swh.lister here ^.

[1] https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/github/utils.py

vlorentz added inline comments.
swh/core/github/utils.py
22–26 ↗(On Diff #28307)

simpler, and also catches the "https://github.com/Foo/Bar/.git" case

swh/core/github/utils.py
22–26 ↗(On Diff #28307)

sneaky!

swh/core/github/utils.py
22–26 ↗(On Diff #28307)

ymmv, i don't find it easier in the end and, currently:

return re.match(r"(.*?)(.git)?/?", url.lower())[1]

fails on the first test case though:

E       AssertionError: assert '' == 'https://github.com/user/repo'
E         - https://github.com/user/repo
ardumont edited the summary of this revision. (Show Details)

Take care of the extra case not dealt with yet

I could not make the suggested improvment work though ¯\_(ツ)_/¯

This revision is now accepted and ready to land.May 16 2022, 4:19 PM

Build is green

Patch application report for D7836 (id=28308)

Rebasing onto dae963440b...

Current branch diff-target is up to date.
Changes applied before test
commit a5ac30bc5d39e33fd5dfc4d0052d486b49d54bfb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon May 16 15:01:25 2022 +0200

    Add utility to sanitize and retrieve canonical github urls
    
    This new code is within a new arborescence as some more code will get moved alongside in
    a new commit (the new session github currently in swh.lister module).
    
    Related to T4232

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/409/ for more details.

  • Add missing copyright headers
  • Simplify a bit more the current sanitization code (no regexp though)
  • Improve docstring (to mention the anonymous http code <- for now)

Build is green

Patch application report for D7836 (id=28309)

Rebasing onto dae963440b...

Current branch diff-target is up to date.
Changes applied before test
commit 0d2b85f7f9af818c387a8b19a8ee15be01c341e2
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon May 16 15:01:25 2022 +0200

    Add utility to sanitize and retrieve canonical github urls
    
    This new code is within a new arborescence as some more code will get moved alongside in
    a new commit (the new session github currently in swh.lister module).
    
    Related to T4232

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/410/ for more details.

Improve docstring and commit message further

Build is green

Patch application report for D7836 (id=28310)

Rebasing onto dae963440b...

Current branch diff-target is up to date.
Changes applied before test
commit 60f384d35ad8d1e1fe442ea031d63262a5cafa22
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon May 16 15:01:25 2022 +0200

    Add utility function to retrieve canonical github urls
    
    This new code is within a new arborescence as some more code will get moved alongside in
    a new commit (the new session github currently in swh.lister module).
    
    That current code is making anonymous requests to the github api for now.
    
    Related to T4232

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/411/ for more details.