Cpan: Implement incremental mode
Abandoned · Public

Authored by franckbret on Nov 9 2022, 3:42 PM.

Details

Reviewers
anlambert
Group Reviewers
Reviewers
Summary

Improve the Elasticsearch HTTP API GET query to retrieve only the origins that are new or updated since the last lister execution.
Related T2833

Diff Detail

Repository
rDLS Listers
Branch
cpan-incremental
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32756
Build 51321: Phabricator diff pipeline on jenkins
Build 51320: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8824 (id=31816)

Rebasing onto e1f3f87c73...

Current branch diff-target is up to date.
Changes applied before test
commit f71934515462eedddc510a8b39bba7ae6a3fc97e
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Nov 9 15:37:16 2022 +0100

    Cpan: Implement incremental mode
    
    Improve the Elastic Search, http api get query to retrieve only new or updated origins since the last lister execution.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/855/ for more details.

anlambert added a subscriber: anlambert.

@franckbret, as explained in my inline comment we cannot use the date filtering on the release index of CPAN elasticsearch.

The only incremental mode we can implement here is to filter the ListedOrigin instances sent to the scheduler according to their
last_update value: if it is greater than the date from the lister state, we can yield the origin.

Nevertheless, I am not sure if it is worth it as a full listing takes around 10 minutes, which is pretty fast.
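
To make the idea concrete, here is a minimal sketch of that filtering, not the actual lister code. The `Origin` class is a hypothetical stand-in for ListedOrigin that keeps only the two fields needed here, `updated_since` is an illustrative helper name, and the URLs and dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass
class Origin:
    """Hypothetical stand-in for a ListedOrigin; only the fields used here."""

    url: str
    last_update: Optional[datetime]


def updated_since(
    origins: Iterable[Origin], last_listing: Optional[datetime]
) -> Iterator[Origin]:
    """Yield origins whose last_update is greater than the lister state date.

    Assumption: with no previous listing date, or with an unknown
    last_update, the origin is yielded to stay on the safe side.
    """
    for origin in origins:
        if (
            last_listing is None
            or origin.last_update is None
            or origin.last_update > last_listing
        ):
            yield origin


# Made-up data: one origin updated after the previous listing, one before.
origins = [
    Origin("https://metacpan.org/dist/Foo-Bar", datetime(2022, 11, 8, tzinfo=timezone.utc)),
    Origin("https://metacpan.org/dist/Baz", datetime(2020, 1, 15, tzinfo=timezone.utc)),
]
state_date = datetime(2022, 11, 1, tzinfo=timezone.utc)
to_send = [origin.url for origin in updated_since(origins, state_date)]
```

Note that the full listing is still fetched in this sketch; only what gets sent to the scheduler is reduced.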

swh/lister/cpan/lister.py
190–201

We cannot use that filter here as we are querying the release index of CPAN elasticsearch.
This index lists release artifacts for all CPAN modules, but if we filter on the release date, all
release artifacts older than the provided date will not be listed, whereas we want to collect all release
artifacts associated with a module.
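
A small illustration of the problem, using made-up data rather than the real index: `releases` is a hypothetical flat list of (module, artifact, release date) rows and `artifacts_by_module` is an illustrative helper, neither comes from the lister.

```python
from collections import defaultdict
from datetime import date

# Made-up rows standing in for documents of a release index.
releases = [
    ("Foo-Bar", "Foo-Bar-1.0.tar.gz", date(2021, 5, 1)),
    ("Foo-Bar", "Foo-Bar-1.1.tar.gz", date(2022, 11, 8)),
    ("Baz", "Baz-0.3.tar.gz", date(2020, 1, 15)),
]


def artifacts_by_module(rows, since=None):
    """Group artifacts per module, optionally keeping only releases >= since."""
    grouped = defaultdict(list)
    for module, artifact, released in rows:
        if since is None or released >= since:
            grouped[module].append(artifact)
    return dict(grouped)


full = artifacts_by_module(releases)
filtered = artifacts_by_module(releases, since=date(2022, 11, 1))
```

With the date filter, `Foo-Bar` keeps only its newest artifact and `Baz` disappears entirely, so the per-module artifact list is incomplete.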

This revision now requires changes to proceed. Nov 10 2022, 11:01 AM

Yep, I did miss that we want all artifacts. Better to close this one.

Abandoning this revision because, in this case, we cannot really benefit from an incremental mode.