Page MenuHomeSoftware Heritage

crates.lister: Implement incremental mode
ClosedPublic

Authored by franckbret on Jul 8 2022, 12:54 PM.

Details

Summary

Add incremental mode support based on a 'last_commit' state, used to get
new package versions from git diff range of commits.

Related T4104

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Make usage of dulwich for to replace some previous subprocess git commands.
Add a test to ensure everything runs fine in incremental mode even if there is no new commit since last lister invocation.

Build has FAILED

Patch application report for D8101 (id=29382)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit e88a360a8b419a1dbc9bdac9f424420de89c1561
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/553/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/553/console

franckbret added inline comments.
swh/lister/crates/lister.py
86–94

Sure

swh/lister/crates/tests/test_lister.py
124–139

It's test data for incremental mode, here I suppose 0.1.0 and 0.1.1 has already been loaded by a previous lister run. When the lister runs in incremental mode, what it get is based on the diff of the commits since the last run. Do I miss something?

12:00 <+ardumont> franckbret: can you please have a look at my comments in D8033 first?
12:00 -- Notice(swhbot): D8033 (author: franckbret, Needs Review) on swh-lister: Arch User Repository (AUR) lister <https://forge.softwareheritage.org/D8033>
12:02 <+ardumont> (but yeah, sure, for the other one, i'll check before the end of the week)

As vlorentz has done a first pass already, i did not and i trust their judgement ;)

@ardumont Ooops, sorry just saw that comment. Will try to answer to D8033 today. Sorry for the delay but when coming home from Paris I got the covid, so last weeks have been complicated...

Have you considered using Dulwich (a Git library we already use) instead of shelling out to git? It looks like it would be easier than parsing output from git's UI

@vlorentz I've replaced the vast majority of python subprocess call to git with dulwich api.

By the way and as you can see in current code I did not found a suitable way to get something like "git log -p" output.

When experimenting with it I've successfully retrieved a diff using dulwich.patch.write_tree_diff method, but the problem was that I was not able to get related commits ids... Without commit ids I can not set an accurate last_update...

Do you have any ideas on how to accomplish something similar with Dulwich?

Also CI is now broken on mypy step, what can I do to fix?

15:33:26  mypy run-test: commands[0] | mypy swh
15:33:38  swh/lister/crates/lister.py:15: error: Cannot find implementation or library stub for module named "dulwich"
15:33:38  swh/lister/crates/lister.py:16: error: Cannot find implementation or library stub for module named "dulwich.repo"

Also CI is now broken on mypy step, what can I do to fix?

15:33:26  mypy run-test: commands[0] | mypy swh
15:33:38  swh/lister/crates/lister.py:15: error: Cannot find implementation or library stub for module named "dulwich"
15:33:38  swh/lister/crates/lister.py:16: error: Cannot find implementation or library stub for module named "dulwich.repo"

most likely this [1] would help you ;)

[1] https://forge.softwareheritage.org/source/swh-loader-git/browse/master/mypy.ini$11-12

By the way and as you can see in current code I did not found a suitable way to get something like "git log -p" output.

When experimenting with it I've successfully retrieved a diff using dulwich.patch.write_tree_diff method, but the problem was that I was not able to get related commits ids... Without commit ids I can not set an accurate last_update...

Do you have any ideas on how to accomplish something similar with Dulwich?

I don't think you need to compute text diffs at all, you can look at both files' content directly, parse each line, and compare them one-to-one (or put them in set objects and use .symmetric_difference() if you don't care about order)

Also CI is now broken on mypy step, what can I do to fix?

15:33:26  mypy run-test: commands[0] | mypy swh
15:33:38  swh/lister/crates/lister.py:15: error: Cannot find implementation or library stub for module named "dulwich"
15:33:38  swh/lister/crates/lister.py:16: error: Cannot find implementation or library stub for module named "dulwich.repo"

Copy this from swh-loader-git/mypy.ini:

[mypy-dulwich.*]
ignore_missing_imports = True
swh/lister/crates/lister.py
128
swh/lister/crates/tests/test_lister.py
124–139

At the end of each run, loaders create a Snapshot object, with a branch for each of the current versions of the package; so it needs to be aware of all versions, even if they were already loaded.

Existing versions should only be missing from subsequent snapshots if they are removed from the origin.

Add a dulwich entry to mypy.ini to set ignore_missing_imports = True

Build has FAILED

Patch application report for D8101 (id=29393)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 7a36677dbffb4e81cebfe2efb07c8b4478074795
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/555/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/555/console

franckbret added inline comments.
swh/lister/crates/tests/test_lister.py
124–139

Ok, didn't understand it this way at first! So in incremental mode I can read each new or changed files since last_commit and return all versions found in the file.

Each crate version entry has a "yanked" value. Is it ok for you if I return only version with yanked=false?

Build is green

Patch application report for D8101 (id=29399)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 05a711c131f0b944e45751866894cda749c11c07
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/556/ for more details.

swh/lister/crates/tests/test_lister.py
124–139

Hard question! We don't have a way to represent "soft deletions" like this in SWH.

I err on the side of including it anyway; otherwise we would miss packages that were published and quickly yanked.

It would be nice to have metadata somewhere (maybe extrinsic metadata on the snapshot object? or just rely on the git index being archived as a regular git repo) to keep a record of the yank status, but it should not affect the main archive.

Change the way we list origins in incremental mode

It now returns all versions of a package when we detect a new version.
Incremental tests has been improved to not be based on harcoded comit id.

Build is green

Patch application report for D8101 (id=29414)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 00577233f35c475eccc4c246232647263b5d2cf4
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/557/ for more details.

Make use of finalize method to remove repository directory

Build is green

Patch application report for D8101 (id=29418)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 285df1a9996b1c5d3ae2a06524ab80f5bfb8f567
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/558/ for more details.

franckbret added inline comments.
swh/lister/crates/tests/test_lister.py
167

hardcoding this commit id in the test means it needs to be recomputing every time swh/lister/crates/tests/data/fake_crates_repository_init.sh runs.

You need to either make swh/lister/crates/tests/data/fake_crates_repository_init.sh fully deterministic (enforce dates and author info), or compute this id dynamically.

I have used a different strategy in tests using dulwich to compute the commit id dynamically

franckbret marked an inline comment as done.

Add 'yanked' to artifacts data

'yanked' is a flag used to determine if a package version has been unreleased

Build is green

Patch application report for D8101 (id=29419)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit fd842ccb74be33887fffe9a96f35109001076ba0
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/559/ for more details.

Don't forget these:

  • Where is last_update used? I don't see it in the artifacts list in the tests
  • Finally, could you reword your commit message to be in the imperative mood and mention it is the crates lister? eg. crates: Implement incremental listing

(and this diff's title too, to be consistent with the commit title)

Add 'yanked' to artifacts data

'yanked' is a flag used to determine if a package version has been unreleased

That's not the right place; this format is meant to describe tarballs: https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#original-artifacts-json
but the "yanked" status comes from higher-level objects.

swh/lister/crates/lister.py
143–150

looks like dead code

192–197

I don't think you actually need to decode paths here. Can you try this patch?

Removing conversions avoids bugs if crates.io ever adds non-UTF8 filenames. (Don't bother adding a test for this)

196

just to be safe

251–253

add a test for this (both when successful and failed)

swh/lister/crates/tests/test_lister.py
169–171

using iteration variables outside the loop is a little unsettling

Some changes and a new test

Remove dead code
Some changes to reduce bytes conversion
Add a test to ensure repository is effectively cleaned up

Cleaner code to get a dynamic git id in a test

Build is green

Patch application report for D8101 (id=29422)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit b2a868f4061fb5194c9898ccb1b489d0fd19d579
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/560/ for more details.

Build is green

Patch application report for D8101 (id=29424)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 4d9b91b9f199a3283a0aee8d255eb8cdb1bcde20
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/561/ for more details.

franckbret added inline comments.
swh/lister/crates/lister.py
251–253

I added a test to ensure it cleans up successfully after lister runs, but can't find a way to make it fail with current code.

One way could be by improving get_index_repository method to raise if directory already exists, but looks like blocking and useless unless you have a better idea?

swh/lister/crates/lister.py
251–253

nah, it's fine

255

not needed to test this either; shutil.rmtree raises an exception when it fails

swh/lister/crates/tests/test_lister.py
231–235

first one doesn't really test anything

franckbret marked an inline comment as done.

Remove useless assert statement

Build is green

Patch application report for D8101 (id=29440)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 64b3a8fd6a0e50b52f1e10f5f60753ccc78198df
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/562/ for more details.

Rewrite an assertion on a test

Build is green

Patch application report for D8101 (id=29442)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit 50f7fd9687333bdafd3f7e510aded5d1f81afd86
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    Crates lister implements incremental mode
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/563/ for more details.

Don't forget these:

  • Where is last_update used? I don't see it in the artifacts list in the tests
  • Finally, could you reword your commit message to be in the imperative mood and mention it is the crates lister? eg. crates: Implement incremental listing

(and this diff's title too, to be consistent with the commit title)

Add 'yanked' to artifacts data

'yanked' is a flag used to determine if a package version has been unreleased

That's not the right place; this format is meant to describe tarballs: https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#original-artifacts-json
but the "yanked" status comes from higher-level objects.

I missed this one. Will move yanked from artifacts data.

Move 'yanked' information from 'artifacts' to 'metadata''
Add some more repository related debug logging

Build is green

Patch application report for D8101 (id=29493)

Rebasing onto 1bf11aa26d...

Current branch diff-target is up to date.
Changes applied before test
commit b90176e55c988509dcd4bf50215dfb3eef67dd14
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    crates: Implement incremental listing
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/567/ for more details.

ardumont retitled this revision from Lister implements incremental mode to crates.lister: Implement incremental mode.Aug 2 2022, 12:11 PM

lgtm

couple of suggestions inline.

swh/lister/crates/lister.py
133–134

let logging instructions do the formatting.

175
229–230

i'm not sure whether you actually want to wrap version['version'] into a string or not (i assumed it was a string already).

(Also, feel free to dismiss this comment if that does not make any sense to you ;)

This revision is now accepted and ready to land.Aug 2 2022, 12:31 PM

(I've dropped val from reviewers since they are in vacation and you attended most if not all of their request changes.)

Move 'yanked' information from 'artifacts' to 'metadata''

Thanks. I'm not a huge fan of the name, 'metadata' is already an overloaded term. What about crates_metadata, to make it clear this is a format specific to this lister?
Either way, please document it and make sure this will be picked up by the loader.

franckbret marked 3 inline comments as done.

Fix some typos
Rename 'metadata' dict to 'crates_metadata'

swh/lister/crates/lister.py
229–230

For now let's let it as is. It's the same flat structure as artifacts and the loader do something similar when it receives the artifacts data. Regarding crates_metadata and the yanked value, for now it's not used in the loader for anything else than being informative.

Build is green

Patch application report for D8101 (id=29578)

Rebasing onto d34a6232a6...

First, rewinding head to replay your work on top of it...
Applying: crates.lister: Implement incremental mode:
Changes applied before test
commit 51e4c9d804d886446682f3b6c6c168abef7affc7
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    crates.lister: Implement incremental mode:
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/571/ for more details.

Build is green

Patch application report for D8101 (id=29586)

Rebasing onto d34a6232a6...

Current branch diff-target is up to date.
Changes applied before test
commit a6f796b26867a078f44825a82ca71942f12eed7b
Author: Franck Bret <franck.bret@octobus.net>
Date:   Fri Jul 8 12:46:11 2022 +0200

    crates.lister: Implement incremental mode:
    
    Add incremental mode support based on a 'last_commit' state, used to get
    new package versions from git diff range of commits.

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/572/ for more details.