
Unify frontier definition between track-all vs track-first strategies
Closed, Public

Authored by aeviso on Dec 6 2021, 10:59 AM.

Details

Summary

The previous definition for track-all was prone to inconsistencies if
the ingestion process crashed. Also, it was only meant to act differently
for revisions that share content and have the exact same timestamp (not a
major improvement after all).

Depends on D6734.
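
To make the "exact same timestamp" remark concrete, here is a minimal sketch. It is not the swh.provenance code. It assumes the two frontier definitions differed only in whether the date comparison is strict, which is one reading of the summary above. `DirectoryNode`, `is_frontier_strict`, and `is_frontier_inclusive` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DirectoryNode:
    """Hypothetical stand-in for a directory node in the isochrone graph."""

    # Most recent date among the blobs found below this directory.
    maxdate: datetime


def is_frontier_strict(node: DirectoryNode, revision_date: datetime) -> bool:
    # Assumed unified (track-first style) rule: everything below the
    # directory must be strictly older than the revision.
    return node.maxdate < revision_date


def is_frontier_inclusive(node: DirectoryNode, revision_date: datetime) -> bool:
    # Assumed shape of the previous track-all rule: a directory whose
    # newest blob has the same date as the revision also qualifies.
    return node.maxdate <= revision_date


revision_date = datetime(2021, 12, 6, 10, 54, 23, tzinfo=timezone.utc)
older = DirectoryNode(maxdate=datetime(2021, 12, 1, tzinfo=timezone.utc))
same = DirectoryNode(maxdate=revision_date)

# The two rules agree on `older` and disagree only on `same`.
for name, node in (("older", older), ("same", same)):
    print(
        name,
        is_frontier_strict(node, revision_date),
        is_frontier_inclusive(node, revision_date),
    )
```

Under this assumption, the two rules give different answers only when a directory's newest content carries exactly the revision's timestamp, which matches the summary's point that the separate track-all definition bought very little.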

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6746 (id=24501)

Could not rebase; Attempt merge onto dd1d7aa233...

Updating dd1d7aa..ef6ed6c
Fast-forward
 sql/upgrades/002.sql                               |  17 ++
 swh/provenance/api/serializers.py                  |   5 +-
 swh/provenance/api/server.py                       |  44 +++-
 swh/provenance/cli.py                              |  40 +++
 swh/provenance/directory.py                        |  86 +++++++
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/interface.py                        |  54 ++--
 swh/provenance/mongo/backend.py                    |  48 ++--
 swh/provenance/origin.py                           |   2 +-
 swh/provenance/postgresql/provenance.py            | 274 ++++++++++-----------
 swh/provenance/provenance.py                       |  70 +++++-
 swh/provenance/revision.py                         |  87 ++-----
 swh/provenance/sql/30-schema.sql                   |  71 +++---
 swh/provenance/tests/test_cli.py                   |   1 +
 swh/provenance/tests/test_conflict_resolution.py   |  43 ++--
 swh/provenance/tests/test_directory_flatten.py     |  72 ++++++
 swh/provenance/tests/test_directory_iterator.py    |  29 +++
 swh/provenance/tests/test_history_graph.py         |   2 +-
 swh/provenance/tests/test_isochrone_graph.py       |   2 +-
 swh/provenance/tests/test_provenance_storage.py    |  39 ++-
 .../tests/test_revision_content_layer.py           |  51 +++-
 21 files changed, 677 insertions(+), 364 deletions(-)
 create mode 100644 sql/upgrades/002.sql
 create mode 100644 swh/provenance/directory.py
 create mode 100644 swh/provenance/tests/test_directory_flatten.py
 create mode 100644 swh/provenance/tests/test_directory_iterator.py
Changes applied before test
commit ef6ed6c1e0f176dc730d5141819fd0387e1bb613
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Dec 6 10:54:23 2021 +0100

    Unify frontier definition between track-all vs track-first strategies
    
    Previous definition for track-all was prone to inconsistencies in case
    the ingestion process crashes. Also, it was only meant to act differently
    for revisions that share content adn have the exact same timestamp (not a
    major improvement after all).

commit f7ea16a592c024de1de605f004fc9afc4d5a0f0c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Dec 2 17:19:39 2021 +0100

    Refactor `raise_on_commit` logic with a decorator

commit 7b4b3f24b274b64840ee1f050926a113b860137f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 16:53:46 2021 +0100

    Add new flag to skip directory flattening while processing revisions

commit 5448b6ee5bc799c73cfe49d67c97768dadfbb8cc
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 15:45:12 2021 +0100

    Add support to flatten directories in the isochrone frontiers separately
    
    Building on the previous commit, a new entry point is added to the module
    allowing to iterate over a list of directories that are already identified
    as isochrone frontiers in the provenance model, but no flat models for
    their content has been created yet. This iteration produces such flat
    models.

commit 812df71d99daacb25d1df73522cb754b0842af83
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 15:00:28 2021 +0100

    Unify parameter order between provenance and archive objects across the module

commit 765135807ee60342f0b9e62d584c5bd46fedb069
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 13:21:33 2021 +0100

    Add explicit flag for flattenned directories to `ProvenanceStorageInterface`
    
    Both contents and directories should always have an associated date in
    the storage. Flattening of a direcory is know explicitly acknowledged
    by setting the newly added flag.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/529/ for more details.

douardda added a subscriber: douardda.

There is a small typo in the commit message ("adn" instead of "and").

Also, maybe add something like:

    Unify frontier definition between track-all vs track-first strategies

    in favor of the track-first strategy. Previous definition [...]

This revision is now accepted and ready to land. Dec 10 2021, 3:13 PM

Build is green

Patch application report for D6746 (id=24723)

Could not rebase; Attempt merge onto dd1d7aa233...

Updating dd1d7aa..8179fe7
Fast-forward
 sql/upgrades/002.sql                               |  17 ++
 swh/provenance/api/serializers.py                  |   5 +-
 swh/provenance/api/server.py                       |  44 +++-
 swh/provenance/cli.py                              | 111 ++++++++-
 swh/provenance/directory.py                        |  86 +++++++
 swh/provenance/graph.py                            |   4 +-
 swh/provenance/interface.py                        |  54 ++--
 swh/provenance/mongo/backend.py                    |  48 ++--
 swh/provenance/origin.py                           |   2 +-
 swh/provenance/postgresql/provenance.py            | 274 ++++++++++-----------
 swh/provenance/provenance.py                       |  72 +++++-
 swh/provenance/revision.py                         |  87 ++-----
 swh/provenance/sql/30-schema.sql                   |  71 +++---
 swh/provenance/tests/test_cli.py                   |   1 +
 swh/provenance/tests/test_conflict_resolution.py   |  43 ++--
 swh/provenance/tests/test_directory_flatten.py     |  72 ++++++
 swh/provenance/tests/test_directory_iterator.py    |  29 +++
 swh/provenance/tests/test_history_graph.py         |   2 +-
 swh/provenance/tests/test_isochrone_graph.py       |   2 +-
 swh/provenance/tests/test_provenance_storage.py    |  39 ++-
 .../tests/test_revision_content_layer.py           |  51 +++-
 21 files changed, 742 insertions(+), 372 deletions(-)
 create mode 100644 sql/upgrades/002.sql
 create mode 100644 swh/provenance/directory.py
 create mode 100644 swh/provenance/tests/test_directory_flatten.py
 create mode 100644 swh/provenance/tests/test_directory_iterator.py
Changes applied before test
commit 8179fe75a077b9b28b148db27dd4e76b2e680a6a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Dec 6 10:54:23 2021 +0100

    Unify frontier definition between track-all vs track-first strategies
    
    in favor of the track-first strategy. Previous definition for track-all
    was prone to inconsistencies in case the ingestion process crashes. Also,
    it was only meant to act differently for revisions that share content
    and have the exact same timestamp (not a major improvement after all).

commit 78b8b77cdaaa302e140df25e9c98f0a25dfe3278
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Dec 2 17:19:39 2021 +0100

    Refactor `raise_on_commit` logic with a decorator

commit 5a86c235de7b8c1b74aed370a600ade36c3412f6
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 16:53:46 2021 +0100

    Add new flag to skip directory flattening while processing revisions

commit 0f2025f6ef454616537103fa720479987cba1278
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 15:45:12 2021 +0100

    Add support to flatten directories in the isochrone frontiers separately
    
    Building on the previous commit, a new entry point is added to the module
    allowing to iterate over a list of directories that are already identified
    as isochrone frontiers in the provenance model, but no flat models for
    their content has been created yet. This iteration produces such flat
    models.

commit 052e25da505c77da90d1c54ce0ade775117422e4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 15:00:28 2021 +0100

    Unify parameter order between provenance and archive objects across the module

commit f4f48923e86ef0054642165bcb9ecf4387d70bb8
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Dec 1 13:21:33 2021 +0100

    Add explicit flag for flattenned directories to `ProvenanceStorageInterface`
    
    Both contents and directories should always have an associated date in the storage.
    Flattening of a directory is now explicitly acknowledged by setting the newly added
    flag. The idea is to allow to postpone the creation of flat models for directories
    in the isochrone frontier (the algorithm will be refactored in the commits to come).

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/540/ for more details.