
Exporters: add option to write in a deterministic location
Closed, Public

Authored by seirl on Mar 8 2022, 11:31 PM.

Details

Summary

Some use cases, such as building reproducible test datasets, require
exporting data in a deterministic location. This adds a config option to
exporters to make them always write in the same file.

It also refactors the shared logic to generate file UUIDs.
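
As a rough illustration of the behavior described above, here is a minimal sketch of how an exporter might pick its output file name. It is not the code from this diff: `unique_file_id`, `output_path`, the `deterministic` flag and the `all` file name are all hypothetical names chosen for the example.

```python
import uuid
from pathlib import Path


def unique_file_id() -> str:
    """Return a random identifier for an output file (hypothetical shared helper)."""
    return str(uuid.uuid4())


def output_path(
    export_dir: Path,
    object_type: str,
    suffix: str,
    deterministic: bool = False,
) -> Path:
    """Build the path an exporter writes to.

    With deterministic=True the file name is fixed, so repeated exports
    land in the same place; otherwise a fresh UUID keeps names from colliding.
    The "all" name is an assumption made for this sketch.
    """
    name = "all" if deterministic else unique_file_id()
    return export_dir / object_type / f"{name}{suffix}"
```

With `deterministic=True`, two calls with the same arguments return the same path, which is what a reproducible test dataset needs. The trade-off is that a second export run now targets an existing file.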

Diff Detail

Repository
rDDATASET Datasets
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7322 (id=26479)

Rebasing onto 68f9bd2028...

First, rewinding head to replay your work on top of it...
Applying: Exporters: add option to write in a deterministic location
Changes applied before test
commit 040c6000e5e4a03fae2ed096a8dcc4317b0696fd
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Mar 8 23:28:42 2022 +0100

    Exporters: add option to write in a deterministic location
    
    Some use cases, such as building reproducible test datasets, require
    exporting data in a deterministic location. This adds a config option to
    exporters to make them always write in the same file.
    
    It also refactors the shared logic to generate file UUIDs.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/27/ for more details.

seirl requested review of this revision. Mar 8 2022, 11:33 PM
ardumont added a subscriber: ardumont.

lgtm, one suggestion and one comment inline.

swh/dataset/exporter.py
62–65
swh/dataset/utils.py
32

I don't know what that does, and it's not mentioned in the diff description.
It should probably be in a dedicated commit/diff.

man zstd says:

•   -f, --force: overwrite output without prompting, and (de)compress symbolic links
This revision is now accepted and ready to land. Mar 9 2022, 12:20 PM
vlorentz added inline comments.
swh/dataset/exporter.py
62–65

and the function is missing a type annotation

swh/dataset/utils.py
32

It is related to the diff. Before this diff, the file names could not conflict because they contained a UUID, so there was no ambiguity about what would happen in this case. Now that name conflicts are possible, we need to specify what happens when one occurs (and we want the "overwrite" behavior).
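
To make the overwrite point concrete, here is a small sketch of how the command line for a `zstd` writer process could be assembled. It is only an assumption about the shape of the code, not the actual content of `swh/dataset/utils.py`: `zstd_write_command` and its `overwrite` parameter are hypothetical. The `-q`, `-f` and `-o` flags are standard `zstd` options.

```python
from typing import List


def zstd_write_command(output_path: str, overwrite: bool = True) -> List[str]:
    """Build the argv for a zstd process that compresses stdin to output_path.

    Hypothetical helper for illustration. With overwrite=True, -f (--force)
    is passed so an existing output file is replaced without prompting.
    """
    cmd = ["zstd", "-q"]  # -q: suppress progress output
    if overwrite:
        cmd.append("-f")  # -f: overwrite output without prompting
    cmd += ["-o", output_path]  # -o: write to this file
    return cmd
```

The resulting list could be passed to `subprocess.Popen` with `stdin=subprocess.PIPE`. Passing `-f` explicitly means the behavior on a name conflict no longer depends on `zstd` defaults.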

Build has FAILED

Patch application report for D7322 (id=26624)

Rebasing onto 68f9bd2028...

Current branch diff-target is up to date.
Changes applied before test
commit 65a9f90348a0ef4335bbe765dcb9080e0e145c87
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Mar 8 23:28:42 2022 +0100

    Exporters: add option to write in a deterministic location
    
    Some use cases, such as building reproducible test datasets, require
    exporting data in a deterministic location. This adds a config option to
    exporters to make them always write in the same file.
    
    It also refactors the shared logic to generate file UUIDs.

Link to build: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/28/
See console output for more information: https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/28/console

Build is green

Patch application report for D7322 (id=27005)

Rebasing onto f588e20a41...

Current branch diff-target is up to date.
Changes applied before test
commit 31081e4121e8143b68ad0413c20a3b667ae28951
Author: Antoine Pietri <antoine.pietri1@gmail.com>
Date:   Tue Mar 29 15:29:12 2022 +0200

    Exporters: add option to write in a deterministic location
    
    Some use cases, such as building reproducible test datasets, require
    exporting data in a deterministic location. This adds a config option to
    exporters to make them always write in the same file.
    
    It also refactors the shared logic to generate file UUIDs.

See https://jenkins.softwareheritage.org/job/DDATASET/job/tests-on-diff/87/ for more details.