Page MenuHomeSoftware Heritage

Start rust crates lister
ClosedPublic

Authored by franckbret on Mar 17 2022, 12:31 PM.

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 27629
Build 43247: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 43246: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D7367 (id=26633)

Rebasing onto fd03941c5f...

Current branch diff-target is up to date.
Changes applied before test
commit fa7869cf9f9a863591d7e85d2659671abbcf6aca
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Mar 17 12:13:04 2022 +0100

    WIP: Start rust crates lister

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/477/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/477/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 17 2022, 12:34 PM
Harbormaster failed remote builds in B27516: Diff 26633!

@franckbret Hello, thanks for the diff, you need to "accept" the L3 document (well read it first ;) so we can see the content of your diff.

Cheers,

Build is green

Patch application report for D7367 (id=26663)

Rebasing onto fd03941c5f...

Current branch diff-target is up to date.
Changes applied before test
commit 421ce9b70ebd7f6390c4287bf0290aee384469d9
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Mar 17 12:13:04 2022 +0100

    WIP: Start rust crates lister

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/480/ for more details.

Pulling this as draft

Build is green

Patch application report for D7367 (id=26734)

Rebasing onto fd03941c5f...

Current branch diff-target is up to date.
Changes applied before test
commit 060d0308ee75cf598b1b206e880ea029142fa1d5
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Mar 17 12:13:04 2022 +0100

    WIP: Start rust crates lister

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/481/ for more details.

LGTM. A few minor requests below

swh/lister/crates/lister.py
38

@ardumont do we need this to be configurable, btw?

68–76
124

date of the last commit touching the crate file?

swh/lister/crates/tests/data/fake_crates_repository_init.sh
34

add rm .git/hooks/*.sample to save 23kB on the 25kB tarball

Alphare requested changes to this revision.Mar 21 2022, 6:06 PM
Alphare added a subscriber: Alphare.

Oops, it seems like this got out of draft too fast. arc says --update and --draft are mutually exclusive for some reason so it's not obvious to me how to tell Phabricator that we still want a diff to stay draft even if the CI passes.

Anyway, I'm doing my "pre-review" here then. ;)

swh/lister/crates/__init__.py
1
swh/lister/crates/lister.py
28

Suggesting a slight rewording aside from fixing the typos:

It basically fetches https://github.com/rust-lang/crates.io-index.git to a temp directory and then walks through each file to get the crate's info.

62–65
83
97
100

What should we do about the yanked information? It seems like it would be useful to store. @vlorentz may have an opinion

124

last_update serves only as an implementation detail for the SWH scheduler to know when to schedule a new load.

We can indeed use the date of the last commit touching the particular file we're looking at. FYI The crates git repo's history is periodically squashed to preserve space, but that does not change anything in that case.

git log -1 --pretty="format:%ci" /path/to/file) is the method to get the timestamp.

126

@vlorentz is this the normal way of passing arguments to the loader? I'm wondering because I'm also thinking about using this for the Golang lister, so it's both birds with one stone

swh/lister/crates/tests/__init__.py
1
swh/lister/crates/tests/data/fake_crates_repository_init.sh
5
swh/lister/crates/tests/test_lister.py
78

You need to check that scheduler_origins_sorted and expected_origins_sorted are the same length (and not 0), otherwise you may fail to detect missing origins.

This revision now requires changes to proceed.Mar 21 2022, 6:06 PM
swh/lister/crates/lister.py
38

Good question, it depends.

Is that something which is voluminous?
Can we keep it around or is it better to scratch it?

Anyway, I guess we'd be on the safe side if we could make this configurable.

93

let the logger do the formatting.

swh/lister/crates/tests/__init__.py
29

i know that code from other loader tests i think, maybe we need to move that down so it's commonly used (later).

swh/lister/crates/lister.py
38

currently 140MB to download and 157MB on disk

126

I don't know. @ardumont ?

swh/lister/crates/tests/test_lister.py
78

good catch, I missed that.

Fixes after first code review

Build is green

Patch application report for D7367 (id=26788)

Rebasing onto ff0035a60b...

First, rewinding head to replay your work on top of it...
Applying: WIP: Start rust crates lister
Changes applied before test
commit 9a19ce872fa81a390a585ec9b009dfb0f2c699c3
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Mar 17 12:13:04 2022 +0100

    WIP: Start rust crates lister

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/482/ for more details.

swh/lister/crates/lister.py
124

last_update serves only as an implementation detail for the SWH scheduler to know when to schedule a new load.

albeit an important one ;)

126

yes, it is.

swh/lister/crates/lister.py
38

does that change much in between runs?

lgtm

We can always evolve the part of the /tmp working folder later.

@franckbret Can you please drop the WIP from the diff description (and the commit message)?

Out of curiosityr, have you make this lister run in docker or something?

swh/lister/crates/lister.py
38

it's a git repo with dozens of commits a day, so... yes, but slowly

franckbret planned changes to this revision.

Hi, thanks you all for the review.

Yes, I had successfully run the script with Docker.
Will drop the WIP now.

lgtm

We can always evolve the part of the /tmp working folder later.

@franckbret Can you please drop the WIP from the diff description (and the commit message)?

Out of curiosityr, have you make this lister run in docker or something?

lister: Add new rust crates lister

The Crates lister retrieves crates package for Rust lang.

It basically fetches https://github.com/rust-lang/crates.io-index.git
to a temp directory and then walks through each file to get the
crate's info.

Build is green

Patch application report for D7367 (id=26843)

Rebasing onto ff0035a60b...

First, rewinding head to replay your work on top of it...
Applying: lister: Add new rust crates lister
Changes applied before test
commit fe087aa2aa83dfadb824822d84e90fb7a7a18dce
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Mar 17 12:13:04 2022 +0100

    lister: Add new rust crates lister
    
    The Crates lister retrieves crates package for Rust lang.
    
    It basically fetches https://github.com/rust-lang/crates.io-index.git
    to a temp directory and then walks through each file to get the
    crate's info.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/483/ for more details.

Looks good to me now. There still remains the question of whether we want to store the yanked info for each release somewhere. Off the top of my head it would make the (upcoming) incremental listing/loading less efficient. Maybe that's out of scope for now. :)

This revision is now accepted and ready to land.Mar 24 2022, 9:12 AM
franckbret retitled this revision from WIP: Start rust crates lister to Start rust crates lister.Mar 24 2022, 9:14 AM

Build is green

Patch application report for D7367 (id=26932)

Rebasing onto ff0035a60b...

Current branch diff-target is up to date.
Changes applied before test
commit fea6fc04aab5e0ba54476f5793bc7f3b51de3b63
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Mar 17 12:13:04 2022 +0100

    lister: Add new rust crates lister
    
    The Crates lister retrieves crates package for Rust lang.
    
    It basically fetches https://github.com/rust-lang/crates.io-index.git
    to a temp directory and then walks through each file to get the
    crate's info.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/484/ for more details.

This revision was automatically updated to reflect the committed changes.