Add support to filter files by a minimum size
Closed, Public

Authored by aeviso on Oct 28 2021, 2:28 PM.

Details

Summary

The idea is to filter out files that are not meaningful from the provenance point of view, such as the empty file. This change makes it possible to define a minimum size that files must have to be considered for the provenance index.

Depends on D6680.
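
As a rough illustration of the size filter described in the summary, here is a minimal Python sketch. It assumes a file counts as meaningful when its size is at least the configured minimum. `FileEntry`, `filter_by_min_size` and the `minsize` parameter are hypothetical names chosen for illustration and are not taken from the `swh.provenance` code in this diff.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file entry; not the swh.provenance model."""

    name: bytes
    sha1: bytes
    size: int  # size in bytes


def filter_by_min_size(
    entries: Iterable[FileEntry], minsize: int = 0
) -> Iterator[FileEntry]:
    """Yield only the entries whose size is at least `minsize` bytes.

    Assumption: the threshold is inclusive, so `minsize=0` keeps everything
    and `minsize=1` drops only empty files.
    """
    for entry in entries:
        if entry.size >= minsize:
            yield entry


entries = [
    FileEntry(b"__init__.py", bytes(20), 0),  # empty file
    FileEntry(b"README", bytes(20), 120),
]
kept = list(filter_by_min_size(entries, minsize=1))
```

With `minsize=1` only `README` is kept; with the default of `0` nothing is filtered, which would preserve the behaviour from before such a filter existed.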

Event Timeline

Build is green

Patch application report for D6578 (id=23908)

Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..1048e73
Fast-forward
 .gitignore                                 |   4 +-
 mypy.ini                                   |   3 +
 pytest.ini                                 |   2 +
 requirements-test.txt                      |   1 +
 requirements.txt                           |   1 +
 swh/provenance/__init__.py                 |   8 +
 swh/provenance/api/client.py               | 597 +++++++++++++++++++++
 swh/provenance/api/server.py               | 808 ++++++++++++++++++++++++++++-
 swh/provenance/archive.py                  |   2 +-
 swh/provenance/cli.py                      |  35 +-
 swh/provenance/graph.py                    |   3 +-
 swh/provenance/model.py                    |   4 +-
 swh/provenance/postgresql/archive.py       |  14 +-
 swh/provenance/revision.py                 |  11 +-
 swh/provenance/sql/30-schema.sql           |  20 +-
 swh/provenance/sql/40-funcs.sql            |  50 +-
 swh/provenance/storage/archive.py          |  16 +-
 swh/provenance/tests/conftest.py           |  24 +-
 swh/provenance/tests/data/generate_repo.py |   2 +-
 swh/provenance/util.py                     |   5 +
 tox.ini                                    |   3 +-
 21 files changed, 1545 insertions(+), 68 deletions(-)
Changes applied before test
commit 1048e73c084670d22d0bca4ef2a69c0627bed3a3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files by a minimum size

commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Sep 21 16:13:53 2021 +0200

    Add support for remote backend on existing storage tests

commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initialization

commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shutdown logic and error handling
    
    Add StatsD support to client to be compliant with the other provenance
    storage implementations

commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
    
    Use `pika.SelectConnection` and explicitly handle its life-cycle.
    Improve connection error handling on both client and server side.
    
    Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
    Each exchange handles all entity-related insertions, dispatching to different
    queues depending on the requested `ProvenanceStorageInterface` method (16
    queues per method). For instance, the `content` exchange handles all requests
    for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
    `CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
    are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
    content.
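
The dispatch to "1 of 16 possible workers, depending on the sha1 id" could look roughly like the Python sketch below. The key format `<method>.<hex digit>` and the choice of the first hex digit of the sha1 are assumptions made for illustration; the commit message does not state how the routing key is actually computed, and `routing_key` is a hypothetical helper.

```python
import hashlib

NUM_WORKERS = 16  # "1 of 16 possible workers" in the commit message


def routing_key(method: str, sha1: bytes) -> str:
    """Build a hypothetical routing key `<method>.<hex digit>` from a sha1 id.

    Assumption: the worker is chosen by the high nibble of the first byte,
    which gives exactly 16 buckets (0..f).
    """
    if len(sha1) != 20:
        raise ValueError("expected a 20-byte sha1 digest")
    worker = sha1[0] >> 4  # value in range(NUM_WORKERS)
    return f"{method}.{worker:x}"


sha1 = hashlib.sha1(b"hello").digest()
key = routing_key("content_add", sha1)
```

Because the key depends only on the id, every request about the same content reaches the same worker, which is one way to keep writes for a given object from conflicting.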

commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API
    
    Set methods in the `ProvenanceStorageInterface` are called through a server that
    guarantees conflict-free writes to the underlying database.
    
    Get methods are called directly from the client to avoid RPC overhead for reads.
    
    The server spawns multiple processes to handle independent requests concurrently.

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
    
    The previous version stored arrays of strings representing tuples for the
    denormalized relations (`dst` and `loc` of the relation, respectively). While
    that simplified the check for duplicates, it turned out to be very inefficient
    in terms of disk usage. The new version has two distinct lists of `bigint`
    (i.e. internal ids) for `dst` and `loc`, respectively. To check for duplicates,
    the lists should be zipped and repeated tuples filtered.
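
The "zip, then filter repeated tuples" check can be illustrated in Python as follows. This is only a sketch of the idea: the change itself presumably lives in the SQL files listed in the diffstat, and `dedup_pairs` is a hypothetical helper, not a function from the repository.

```python
from typing import List, Tuple


def dedup_pairs(dst: List[int], loc: List[int]) -> Tuple[List[int], List[int]]:
    """Zip two parallel id lists, drop repeated (dst, loc) tuples, and unzip."""
    if len(dst) != len(loc):
        raise ValueError("dst and loc must have the same length")
    # dict.fromkeys keeps the first occurrence of each pair, in order.
    unique = list(dict.fromkeys(zip(dst, loc)))
    if not unique:
        return [], []
    new_dst, new_loc = zip(*unique)
    return list(new_dst), list(new_loc)


dst, loc = dedup_pairs([1, 2, 1, 3], [10, 20, 10, 30])
# dst == [1, 2, 3], loc == [10, 20, 30]
```

Note that a pair is a duplicate only when both ids match, so `(1, 10)` and `(1, 30)` would both be kept.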

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecation warning

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/476/ for more details.

Build is green

Patch application report for D6578 (id=23911)

Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..3e58a02
Fast-forward
 .gitignore                                 |   4 +-
 mypy.ini                                   |   3 +
 pytest.ini                                 |   2 +
 requirements-test.txt                      |   1 +
 requirements.txt                           |   1 +
 swh/provenance/__init__.py                 |   8 +
 swh/provenance/api/client.py               | 597 +++++++++++++++++++++
 swh/provenance/api/server.py               | 808 ++++++++++++++++++++++++++++-
 swh/provenance/archive.py                  |   2 +-
 swh/provenance/cli.py                      |  35 +-
 swh/provenance/graph.py                    |   3 +-
 swh/provenance/model.py                    |   4 +-
 swh/provenance/postgresql/archive.py       |  14 +-
 swh/provenance/revision.py                 |  12 +-
 swh/provenance/sql/30-schema.sql           |  20 +-
 swh/provenance/sql/40-funcs.sql            |  50 +-
 swh/provenance/storage/archive.py          |  16 +-
 swh/provenance/tests/conftest.py           |  24 +-
 swh/provenance/tests/data/generate_repo.py |   2 +-
 swh/provenance/util.py                     |   5 +
 tox.ini                                    |   3 +-
 21 files changed, 1545 insertions(+), 69 deletions(-)
Changes applied before test
commit 3e58a02592c87e5dda53c3e73fbf4063cebde4f7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files by a minimum size

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/477/ for more details.

vlorentz added a subscriber: vlorentz.

Looks good, but could you update the diff and commit message to give the motivation for the change?

This revision is now accepted and ready to land. Oct 28 2021, 2:55 PM

Do you mean something like this?

Build is green

Patch application report for D6578 (id=23921)

Could not rebase; Attempt merge onto 30d8899bcf...

Updating 30d8899..3e58a02
Fast-forward
 .gitignore                           |   4 +-
 mypy.ini                             |   3 +
 pytest.ini                           |   2 +
 requirements-test.txt                |   1 +
 requirements.txt                     |   1 +
 swh/provenance/__init__.py           |   8 +
 swh/provenance/api/client.py         | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py         | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py            |   2 +-
 swh/provenance/cli.py                |  35 +-
 swh/provenance/graph.py              |   3 +-
 swh/provenance/model.py              |   4 +-
 swh/provenance/postgresql/archive.py |  14 +-
 swh/provenance/revision.py           |  12 +-
 swh/provenance/sql/30-schema.sql     |  20 +-
 swh/provenance/sql/40-funcs.sql      |  50 ++-
 swh/provenance/storage/archive.py    |  16 +-
 swh/provenance/tests/conftest.py     |  24 +-
 swh/provenance/util.py               |   5 +
 tox.ini                              |   3 +-
 20 files changed, 1544 insertions(+), 68 deletions(-)
Changes applied before test
commit 3e58a02592c87e5dda53c3e73fbf4063cebde4f7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files by a minimum size

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/478/ for more details.

Yes, thanks. And make sure to include it in the commit message too (it's more discoverable than the diff when browsing the git history)

Build is green

Patch application report for D6578 (id=24103)

Could not rebase; Attempt merge onto 30d8899bcf...

Updating 30d8899..c85aa01
Fast-forward
 .gitignore                           |   4 +-
 mypy.ini                             |   3 +
 pytest.ini                           |   2 +
 requirements-test.txt                |   1 +
 requirements.txt                     |   1 +
 swh/provenance/__init__.py           |   8 +
 swh/provenance/api/client.py         | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py         | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py            |   2 +-
 swh/provenance/cli.py                |  35 +-
 swh/provenance/graph.py              |   3 +-
 swh/provenance/model.py              |   4 +-
 swh/provenance/postgresql/archive.py |  15 +-
 swh/provenance/revision.py           |  12 +-
 swh/provenance/sql/30-schema.sql     |  20 +-
 swh/provenance/sql/40-funcs.sql      |  50 ++-
 swh/provenance/storage/archive.py    |  16 +-
 swh/provenance/tests/conftest.py     |  24 +-
 swh/provenance/util.py               |   5 +
 tox.ini                              |   3 +-
 20 files changed, 1544 insertions(+), 69 deletions(-)
Changes applied before test
commit c85aa01eaa350d102f0b316a403e430ce9baf02c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files by a minimum size
    
    The idea is to filter out files that are not meaningful from the
    provenance point of view, such as the empty file. This change makes it
    possible to define a minimum size that files must have to be considered
    for the provenance index.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/480/ for more details.

Build is green

Patch application report for D6578 (id=24269)

Could not rebase; Attempt merge onto 94baaab052...

Updating 94baaab..584845d
Fast-forward
 swh/provenance/archive.py            |  2 +-
 swh/provenance/cli.py                |  4 +-
 swh/provenance/graph.py              |  3 +-
 swh/provenance/model.py              |  4 +-
 swh/provenance/postgresql/archive.py | 15 +++----
 swh/provenance/provenance.py         | 77 +++++++++++++-----------------------
 swh/provenance/revision.py           | 12 ++++--
 swh/provenance/storage/archive.py    | 16 ++++----
 swh/provenance/tests/conftest.py     | 34 +++++++++-------
 9 files changed, 81 insertions(+), 86 deletions(-)
Changes applied before test
commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files by a minimum size
    
    The idea is to filter out files that are not meaningful from the
    provenance point of view, such as the empty file. This change makes it
    possible to define a minimum size that files must have to be considered
    for the provenance index.

commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Nov 23 16:11:09 2021 +0100

    Reorder flushing operations to avoid unnecessary updates in the storage

commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Nov 24 11:10:40 2021 +0100

    Rework conftest and improve type annotations

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/487/ for more details.