
cpan: Use a fake origin URL instead of an HTTP one
AbandonedPublic

Authored by anlambert on Oct 10 2022, 4:32 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Maniphest Tasks
T2833: cpan.loader - archive Perl modules from CPAN
Summary

CPAN hosts a lot of legacy modules, known as backpan, that do not have
an HTML landing page, so use the fake origin URL pattern below instead:

cpan://{author}/{module_name}

author corresponds to the normalized CPAN user account, not the full
author name, while module_name is the distribution name.

For instance, the distribution File-ManualFlock is a backpan one,
so the URL https://metacpan.org/dist/File-ManualFlock does not exist and returns 404.
The only HTML page we can find for this distribution is the backpan directory
for the associated CPAN user WCATLAN.
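
As an illustration of the pattern above, here is a minimal sketch of how such an origin URL could be built and split back into its parts. The helper names cpan_origin_url and split_cpan_origin_url are hypothetical and are not taken from the lister code; the sketch also assumes the author argument is already the normalized CPAN user account.

```python
from typing import Tuple
from urllib.parse import urlsplit


def cpan_origin_url(author: str, module_name: str) -> str:
    """Build a fake origin URL of the form cpan://{author}/{module_name}.

    Hypothetical helper: `author` is assumed to be the normalized CPAN
    user account and `module_name` the distribution name.
    """
    return f"cpan://{author}/{module_name}"


def split_cpan_origin_url(url: str) -> Tuple[str, str]:
    """Return (author, module_name) from a cpan:// origin URL."""
    parts = urlsplit(url)
    if parts.scheme != "cpan":
        raise ValueError(f"not a cpan:// URL: {url}")
    # The author sits in the network location, the module name in the path
    return parts.netloc, parts.path.lstrip("/")


origin_url = cpan_origin_url("WCATLAN", "File-ManualFlock")
author, module_name = split_cpan_origin_url(origin_url)
```

With the example from the summary, origin_url would be cpan://WCATLAN/File-ManualFlock. Such a URL identifies the origin but cannot be dereferenced in a browser.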

Related to T2833

Depends on D8648

Diff Detail

Repository
rDLS Listers
Branch
cpan-fake-origin-url
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32185
Build 50401: Phabricator diff pipeline on jenkins
Build 50400: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8649 (id=31232)

Could not rebase; Attempt merge onto 108816f232...

Updating 108816f..cd19b69
Fast-forward
 swh/lister/cpan/__init__.py                        |   8 +-
 swh/lister/cpan/lister.py                          | 149 +++++++++++--
 ...TU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw== |  50 -----
 ...NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==_visit1 |  16 --
 .../v1__search_scroll_page1                        | 247 +++++++++++++++++++++
 .../v1__search_scroll_page2                        |  39 ++++
 .../v1__search_scroll_page3                        |  85 +++++++
 .../v1__search_scroll_page4                        | 131 +++++++++++
 ...ibution__search,fields=name,size=1000,scroll=1m |  52 -----
 .../https_fastapi.metacpan.org/v1_release__search  | 246 ++++++++++++++++++++
 swh/lister/cpan/tests/test_lister.py               | 166 ++++++++++++--
 11 files changed, 1030 insertions(+), 159 deletions(-)
 delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll,scroll=1m,scroll_id=cXVlcnlUaGVuRmV0Y2g7Mzs5NTU1MTQ1NTk6eXptdmszQUNUam1XbVJjRjRkRk9Udzs5NTQ5NjQ5NjI6ZHZIZWxCb3BUZi1Cb3NwRDB5NmRQUTs5NTU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==
 delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll,scroll=1m,scroll_id=cXVlcnlUaGVuRmV0Y2g7Mzs5NTU1MTQ1NTk6eXptdmszQUNUam1XbVJjRjRkRk9Udzs5NTQ5NjQ5NjI6ZHZIZWxCb3BUZi1Cb3NwRDB5NmRQUTs5NTU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==_visit1
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page1
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page2
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page3
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page4
 delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_distribution__search,fields=name,size=1000,scroll=1m
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_release__search
Changes applied before test
commit cd19b69f92903bdb369b398787aac870dabc21b3
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Mon Oct 10 16:19:11 2022 +0200

    cpan: Use a fake origin URL instead of an HTTP one
    
    CPAN hosts a lot of legacy modules known as backpan that do not have
    an HTML landing page so use fake origin URL pattern below instead:
    
            cpan://{author}/{module_name}
    
    author corresponds to the normalized CPAN user account, not the full
    author name, while module_name is the distribution name.
    
    Related to T2833

commit 8d26db1cf78bddfb005addd2bc41fdca44fc19f4
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Mon Oct 10 15:55:54 2022 +0200

    cpan: Fix module version extraction for some edge cases
    
    CPAN API can return versions that are not of str type: either
    int or float.
    
    When version equals 0, it means that version failed to be parsed
    by CPAN so we try to extract it from release name in that case.
    
    Otherwise we ensure to convert the version to str type.
    
    Related to T2833

commit 2177ac9f5a08c2bd276f494b2aa4c8f0d4239e65
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint
    
    Instead of querying the metacpan distribution endpoint to list origins,
    prefer to use the release endpoint instead enabling to list all artifacts
    associated to CPAN packages by scrolling results.
    
    Compared to previous implementation, it enables to compute a last_update
    date for all CPAN packages but also to obtain artifact sha256 checksums
    that will be used by the CPAN loader to check downloads integrity.
    
    As the multiple versions of a module are spread across multiple pages
    from the CPAN API, origins are sent to the scheduler once all pages
    processed, it is also faster to proceed that way.
    
    Also compute extrinsic metadata URL for each perl module versions in
    order for the cpan loader to query it.
    
    Related to T2833

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/776/ for more details.
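
The version handling described in commit 8d26db1 above could look roughly like the sketch below. This is not the lister's actual code: the function name normalize_version, the regular expression, and the release name used in the usage line are assumptions made for illustration only.

```python
import re
from typing import Union

# Assumed pattern: the version is the trailing "-<version>" part of a
# release name, optionally prefixed with "v"
_VERSION_RE = re.compile(r"-v?(\d[\w.]*)$")


def normalize_version(version: Union[str, int, float], release_name: str) -> str:
    """Return the module version as a str.

    A numeric version equal to 0 is treated as "CPAN failed to parse the
    version", in which case the release name is used as a fallback.
    """
    if version == 0:
        match = _VERSION_RE.search(release_name)
        if match:
            return match.group(1)
    # int and float values returned by the API are converted to str
    return str(version)


# "Foo-Bar-1.02" is a made-up release name
recovered = normalize_version(0, "Foo-Bar-1.02")
```

If no version can be found in the release name, the sketch falls back to the string form of the value returned by the API.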

anlambert edited the summary of this revision.
anlambert edited the summary of this revision.

Why is it an issue that it doesn't point anywhere? The https:// URL will at least work for most packages, while cpan:// won't work for any package, so it's less usable in practice. And it removes the option of adding other instances.

Plus, we shouldn't invent new schemes like this; they may conflict with new standards (even if we already do it for Debian)

It just felt weird to me to have so many 404s for the produced origin URLs, but you are right, better to abandon this.

I can tell whether a module is a backpan one during the listing, but I could not find any URL that links to all versions
of a specific backpan module.