
cran: Fix some lister issues
ClosedPublic

Authored by anlambert on Feb 5 2021, 12:58 PM.

Details

Summary

Found a couple of issues while retesting the CRAN lister locally:

  • some dates could not be parsed
  • some packages might be listed twice

This diff contains two commits fixing them.
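
For readers skimming the summary, the duplicate-listing fix can be sketched roughly as below. This is only an illustration, not the code in the diff: the function name `unique_packages` and the `"Package"` key are assumptions about how the JSON entries are shaped. Since the most recent version appears first in the list, keeping the first occurrence of each name keeps the most recent one.

```python
from typing import Dict, Iterable, Iterator


def unique_packages(packages: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each package once, keeping the first entry seen for a given name.

    Hypothetical helper: the "Package" key is an assumption about the JSON
    entries produced by list_all_packages.R.
    """
    seen = set()
    for package in packages:
        name = package["Package"]
        if name in seen:
            # Later duplicates are older versions, so they are skipped.
            continue
        seen.add(name)
        yield package


listed = [
    {"Package": "foo", "Version": "2.0"},
    {"Package": "bar", "Version": "1.0"},
    {"Package": "foo", "Version": "1.0"},
]
deduplicated = list(unique_packages(listed))
```

With the sample list above, only the first `foo` entry (version 2.0) and the `bar` entry remain, so each origin is sent to the scheduler once.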

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D5025 (id=17914)

Rebasing onto 4245c5046f...

Current branch diff-target is up to date.
Changes applied before test
commit 1b5cbd5df579762eae104e4e1fa1367d7c9f16d7
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 12:51:20 2021 +0100

    cran: Prevent multiple listing of an origin
    
    A CRAN package can appear twice in the JSON list returned by the
    list_all_packages.R script, most recent version of the package
    appearing first.
    
    So handle that edge case to avoid error when sending origins to
    the scheduler.

commit 56d3ae62fb3c62efe1a55a36b8360ddfd52467b9
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 12:37:49 2021 +0100

    cran: Robustify package date parsing code
    
    Add support for parsing date with milliseconds and ensure locale
    is set to en_US in order to properly parse month and day of week
    in text format.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/241/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/241/console

Harbormaster returned this revision to the author for changes because remote builds failed. Feb 5 2021, 1:00 PM
Harbormaster failed remote builds in B19033: Diff 17914!

Remove locale modification when parsing date.

Build is green

Patch application report for D5025 (id=17916)

Rebasing onto 4245c5046f...

Current branch diff-target is up to date.
Changes applied before test
commit c95bf9a79088bf892754082f381534b7ffa219ff
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 12:51:20 2021 +0100

    cran: Prevent multiple listing of an origin
    
    A CRAN package can appear twice in the JSON list returned by the
    list_all_packages.R script, most recent version of the package
    appearing first.
    
    So handle that edge case to avoid error when sending origins to
    the scheduler.

commit a4319538606691666180939d1b2db67c610c8ef1
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 12:37:49 2021 +0100

    cran: Add support for parsing date with milliseconds

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/242/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/lister/cran/lister.py
48–57
111–119

we need py 3.9 to do packaged_at_str = packaged_at_str.removesuffix(" UTC") :(

114–118

now I'm crying

128

shouldn't we make it an error?
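
On the `removesuffix` remark for lines 111–119: until Python 3.9 is available, a small helper can stand in for it. The sketch below is an assumption-laden illustration (the name `remove_suffix` is hypothetical and not part of the diff); it only mirrors the behaviour of `str.removesuffix` for a non-empty suffix.

```python
def remove_suffix(text: str, suffix: str) -> str:
    """Return text without the trailing suffix, if present.

    Hypothetical stand-in for str.removesuffix on Python < 3.9.
    """
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


packaged_at_str = remove_suffix("2020-04-07 02:19:40 UTC", " UTC")
```

Unlike `str.replace(" UTC", "")`, this only touches the end of the string and leaves values without the suffix unchanged.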

This revision is now accepted and ready to land. Feb 5 2021, 2:14 PM
swh/lister/cran/lister.py
128

I do not think so: only two dates in the CRAN data cannot be parsed (one with CDT instead of UTC, another with a missing time), both for old packages. Sending the origin update date to the scheduler is not mandatory, so I do not think it is a big deal to miss the parsing of a few.
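
To make the trade-off concrete, here is a minimal sketch of tolerant date parsing along the lines discussed in this revision. It is not the code under review: the function name `parse_release_date` and the exact list of formats are assumptions, chosen to cover dates with and without milliseconds plus the textual day/month form mentioned in the first version of the commit message. Unparsable dates are logged at debug level and yield `None` instead of raising.

```python
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed candidate formats; the actual lister may use a different set.
DATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S.%f",  # same, with fractional seconds
    "%a %b %d %H:%M:%S %Y",  # textual day of week and month
)


def parse_release_date(name: str, date_str: str) -> Optional[datetime]:
    """Try each known format in turn; return None when none matches."""
    cleaned = date_str.strip()
    if cleaned.endswith(" UTC"):
        cleaned = cleaned[: -len(" UTC")]
    for date_format in DATE_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, date_format)
        except ValueError:
            continue
        return parsed.replace(tzinfo=timezone.utc)
    # Not treated as an error: the update date is optional for the scheduler.
    logger.debug("Could not parse %s package release date: %s", name, date_str)
    return None


with_millis = parse_release_date("example", "2021-02-05 12:51:20.123 UTC")
date_only = parse_release_date("DamiaNN", "2016-09-13")
```

Here `date_only` is `None` because the value has no time component, which matches the idea of skipping the update date for such packages rather than failing the whole listing.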

I recall CRAN dates are hard to parse; see the loader's corresponding test [1].
I have no idea whether the CRAN script we run in this context returns the same dates, though.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/cran/tests/test_cran.py$38

Build is green

Patch application report for D5025 (id=17922)

Rebasing onto 2461c97bbb...

Current branch diff-target is up to date.
Changes applied before test
commit 1803b707e4ba6e41e84976abfd18ff1d530b7ac7
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 12:51:20 2021 +0100

    cran: Prevent multiple listing of an origin
    
    A CRAN package can appear twice in the JSON list returned by the
    list_all_packages.R script, most recent version of the package
    appearing first.
    
    So handle that edge case to avoid error when sending origins to
    the scheduler.

commit b4c4c20bb92717d5f0d93aa624e24fbc8678f153
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Fri Feb 5 12:37:49 2021 +0100

    cran: Add support for parsing date with milliseconds

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/245/ for more details.

I recall CRAN dates are hard to parse; see the loader's corresponding test [1].
I have no idea whether the CRAN script we run in this context returns the same dates, though.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/cran/tests/test_cran.py$38

Only two dates with different formats cannot be parsed so I think we should be good here.

swh-lister_1                    | [2021-02-05 13:40:18,627: INFO/MainProcess] Received task: swh.lister.cran.tasks.CRANListerTask[c0240b23-8c1d-4929-ad98-39de13252152]  
swh-lister_1                    | [2021-02-05 13:40:18,630: DEBUG/ForkPoolWorker-1] Loading config file /lister.yml
swh-lister_1                    | [2021-02-05 13:40:18,641: DEBUG/ForkPoolWorker-1] Executing R script /srv/softwareheritage/venv/lib/python3.7/site-packages/swh/lister/cran/list_all_packages.R
swh-lister_1                    | [2021-02-05 13:40:20,782: DEBUG/ForkPoolWorker-1] Could not parse DamiaNN package release date: 2016-09-13
swh-lister_1                    | [2021-02-05 13:40:22,598: DEBUG/ForkPoolWorker-1] Could not parse JGR package release date: 2020-04-07 02:19:408 CDT