Differential D6578
Add support to filter files a minimum size
Authored by aeviso on Oct 28 2021, 2:28 PM.
Details
The goal is to filter out files that are not meaningful from a provenance point of view, such as the empty file. This change makes it possible to define a minimum size that a file must have to be considered for the provenance index.
Depends on D6680.
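As a rough illustration of the idea described above, the following sketch filters a list of files by a minimum size. It is not the code from this diff: `FileEntry`, `filter_by_min_size` and the `min_size` parameter are hypothetical names, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file met while walking a directory."""

    name: bytes
    size: int  # content length in bytes


def filter_by_min_size(
    entries: Iterable[FileEntry], min_size: int = 0
) -> Iterator[FileEntry]:
    """Yield only the entries that are at least `min_size` bytes long."""
    for entry in entries:
        if entry.size >= min_size:  # inclusive threshold: an assumption
            yield entry


entries = [FileEntry(b"empty", 0), FileEntry(b"README", 120)]

# With a threshold of 1 byte the empty file is skipped.
kept = [entry.name for entry in filter_by_min_size(entries, min_size=1)]
print(kept)
```

With the default threshold of 0 every file is kept, so in this sketch the filter is opt-in.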
Event Timeline
Build is green
Patch application report for D6578 (id=23908)
Could not rebase; Attempt merge onto ef49e3100c...
Updating ef49e31..1048e73
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 +++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 14 +-
 swh/provenance/revision.py | 11 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 +-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/tests/data/generate_repo.py | 2 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 21 files changed, 1545 insertions(+), 68 deletions(-)
Changes applied before test
commit 1048e73c084670d22d0bca4ef2a69c0627bed3a3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 14:21:52 2021 +0200
Add support to filter files a minimum size
commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Sep 21 16:13:53 2021 +0200
Add support for remote backend on existing storage tests
commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 13:59:00 2021 +0200
Improve timeout logic on remote storage client side
commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 18 11:52:04 2021 +0200
Export batch size and prefetch count as parameters for remote storage
commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 11 16:06:03 2021 +0200
Send several items per message in the remote provenance storage
commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:49:44 2021 +0200
Fix config file parsing for server initialization
commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:41:42 2021 +0200
Improve routing key computation for paths
commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Wed Sep 15 13:39:59 2021 +0200
Improve server/client shutdown logic and error handling
Add StatsD support to client to be compliant with the other provenance
storage implementations
commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Aug 31 13:36:34 2021 +0200
Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
Use `pika.SelectConnection` and handle its life-cycle explicitly.
Improve connection error handling on both client and server side.
Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
Each exchange handles all entity-related insertions, dispatching to different
queues depending on the requested `ProvenanceStorageInterface` method (16
queues per method). For instance, the `content` exchange handles all requests
for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
`CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
content.
commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Aug 20 12:21:27 2021 +0200
Add new RabbitMQ-based client/server API
Set methods in the `ProvenanceStorageInterface` are called through a server that
guarantees conflict-free writes to the underlying database.
Get methods are called directly from the client to avoid RPC overhead for reads.
The server spawns multiple processes to handle independent requests concurrently.
commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 14 12:03:47 2021 +0200
Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Previous version was storing arrays of strings representing tuples for the
denormalized relations (`dst` and `loc` of the relation resp.). While that
simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version has two distinct lists of `bigint`
(i.e. internal ids) for `dst` and `loc` resp. To check for duplicates the
lists should be zipped, and repeated tuples filtered.
commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 13:58:32 2021 +0200
Fix `yaml.load` deprecated warning
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/476/ for more details.

Build is green
Patch application report for D6578 (id=23911)
Could not rebase; Attempt merge onto ef49e3100c...
Updating ef49e31..3e58a02
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 +++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 14 +-
 swh/provenance/revision.py | 12 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 +-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/tests/data/generate_repo.py | 2 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 21 files changed, 1545 insertions(+), 69 deletions(-)
Changes applied before test
commit 3e58a02592c87e5dda53c3e73fbf4063cebde4f7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 14:21:52 2021 +0200
Add support to filter files a minimum size
commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Sep 21 16:13:53 2021 +0200
Add support for remote backend on existing storage tests
commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 13:59:00 2021 +0200
Improve timeout logic on remote storage client side
commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 18 11:52:04 2021 +0200
Export batch size and prefetch count as parameters for remote storage
commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 11 16:06:03 2021 +0200
Send several items per message in the remote provenance storage
commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:49:44 2021 +0200
Fix config file parsing for server initialization
commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:41:42 2021 +0200
Improve routing key computation for paths
commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Wed Sep 15 13:39:59 2021 +0200
Improve server/client shutdown logic and error handling
Add StatsD support to client to be compliant with the other provenance
storage implementations
commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Aug 31 13:36:34 2021 +0200
Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
Use `pika.SelectConnection` and handle its life-cycle explicitly.
Improve connection error handling on both client and server side.
Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
Each exchange handles all entity-related insertions, dispatching to different
queues depending on the requested `ProvenanceStorageInterface` method (16
queues per method). For instance, the `content` exchange handles all requests
for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
`CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
content.
commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Aug 20 12:21:27 2021 +0200
Add new RabbitMQ-based client/server API
Set methods in the `ProvenanceStorageInterface` are called through a server that
guarantees conflict-free writes to the underlying database.
Get methods are called directly from the client to avoid RPC overhead for reads.
The server spawns multiple processes to handle independent requests concurrently.
commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 14 12:03:47 2021 +0200
Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Previous version was storing arrays of strings representing tuples for the
denormalized relations (`dst` and `loc` of the relation resp.). While that
simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version has two distinct lists of `bigint`
(i.e. internal ids) for `dst` and `loc` resp. To check for duplicates the
lists should be zipped, and repeated tuples filtered.
commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 13:58:32 2021 +0200
Fix `yaml.load` deprecated warning
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/477/ for more details.

Looks good, but could you update the diff and commit message to give the motivation for the change?

Build is green
Patch application report for D6578 (id=23921)
Could not rebase; Attempt merge onto 30d8899bcf...
Updating 30d8899..3e58a02
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 14 +-
 swh/provenance/revision.py | 12 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 ++-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 20 files changed, 1544 insertions(+), 68 deletions(-)
Changes applied before test
commit 3e58a02592c87e5dda53c3e73fbf4063cebde4f7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 14:21:52 2021 +0200
Add support to filter files a minimum size
commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Sep 21 16:13:53 2021 +0200
Add support for remote backend on existing storage tests
commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 13:59:00 2021 +0200
Improve timeout logic on remote storage client side
commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 18 11:52:04 2021 +0200
Export batch size and prefetch count as parameters for remote storage
commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 11 16:06:03 2021 +0200
Send several items per message in the remote provenance storage
commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:49:44 2021 +0200
Fix config file parsing for server initialization
commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:41:42 2021 +0200
Improve routing key computation for paths
commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Wed Sep 15 13:39:59 2021 +0200
Improve server/client shutdown logic and error handling
Add StatsD support to client to be compliant with the other provenance
storage implementations
commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Aug 31 13:36:34 2021 +0200
Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
Use `pika.SelectConnection` and handle its life-cycle explicitly.
Improve connection error handling on both client and server side.
Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
Each exchange handles all entity-related insertions, dispatching to different
queues depending on the requested `ProvenanceStorageInterface` method (16
queues per method). For instance, the `content` exchange handles all requests
for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
`CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
content.
commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Aug 20 12:21:27 2021 +0200
Add new RabbitMQ-based client/server API
Set methods in the `ProvenanceStorageInterface` are called through a server that
guarantees conflict-free writes to the underlying database.
Get methods are called directly from the client to avoid RPC overhead for reads.
The server spawns multiple processes to handle independent requests concurrently.
commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 14 12:03:47 2021 +0200
Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Previous version was storing arrays of strings representing tuples for the
denormalized relations (`dst` and `loc` of the relation resp.). While that
simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version has two distinct lists of `bigint`
(i.e. internal ids) for `dst` and `loc` resp. To check for duplicates the
lists should be zipped, and repeated tuples filtered.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/478/ for more details.

Yes, thanks. And make sure to include it in the commit message too (it's more discoverable than the diff when browsing the git history)

Build is green
Patch application report for D6578 (id=24103)
Could not rebase; Attempt merge onto 30d8899bcf...
Updating 30d8899..c85aa01
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 15 +-
 swh/provenance/revision.py | 12 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 ++-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 20 files changed, 1544 insertions(+), 69 deletions(-)
Changes applied before test
commit c85aa01eaa350d102f0b316a403e430ce9baf02c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 14:21:52 2021 +0200
Add support to filter files a minimum size
The idea is to be able to filter files that are not meaningful from the
provenance point of view. For instance, the empty file. This modification
allows to define a minimum size for files to be considered for the
provenance index.
commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Sep 21 16:13:53 2021 +0200
Add support for remote backend on existing storage tests
commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 13:59:00 2021 +0200
Improve timeout logic on remote storage client side
commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 18 11:52:04 2021 +0200
Export batch size and prefetch count as parameters for remote storage
commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Mon Oct 11 16:06:03 2021 +0200
Send several items per message in the remote provenance storage
commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:49:44 2021 +0200
Fix config file parsing for server initialization
commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Oct 8 14:41:42 2021 +0200
Improve routing key computation for paths
commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Wed Sep 15 13:39:59 2021 +0200
Improve server/client shutdown logic and error handling
Add StatsD support to client to be compliant with the other provenance
storage implementations
commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Aug 31 13:36:34 2021 +0200
Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss
Use `pika.SelectConnection` and handle its life-cycle explicitly.
Improve connection error handling on both client and server side.
Change the RabbitMQ scheme to use 5 exchanges (one per entity + location).
Each exchange handles all entity-related insertions, dispatching to different
queues depending on the requested `ProvenanceStorageInterface` method (16
queues per method). For instance, the `content` exchange handles all requests
for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and
`CNT_IN_DIR` (i.e. relations with content as source). In each case, requests
are forwarded to 1 of 16 possible workers, depending on the sha1 id of the
content.
commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Fri Aug 20 12:21:27 2021 +0200
Add new RabbitMQ-based client/server API
Set methods in the `ProvenanceStorageInterface` are called through a server that
guarantees conflict-free writes to the underlying database.
Get methods are called directly from the client to avoid RPC overhead for reads.
The server spawns multiple processes to handle independent requests concurrently.
commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 14 12:03:47 2021 +0200
Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor
Previous version was storing arrays of strings representing tuples for the
denormalized relations (`dst` and `loc` of the relation resp.). While that
simplified the check for duplicates, it turned out to be very inefficient
in terms of disk usage. The new version has two distinct lists of `bigint`
(i.e. internal ids) for `dst` and `loc` resp. To check for duplicates the
lists should be zipped, and repeated tuples filtered.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/480/ for more details.

Build is green
Patch application report for D6578 (id=24269)
Could not rebase; Attempt merge onto 94baaab052...
Updating 94baaab..584845d
Fast-forward
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 4 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 15 +++----
 swh/provenance/provenance.py | 77 +++++++++++++-----------------------
 swh/provenance/revision.py | 12 ++++--
 swh/provenance/storage/archive.py | 16 ++++----
 swh/provenance/tests/conftest.py | 34 +++++++++-------
 9 files changed, 81 insertions(+), 86 deletions(-)
Changes applied before test
commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Thu Oct 28 14:21:52 2021 +0200
Add support to filter files a minimum size
The idea is to be able to filter files that are not meaningful from the
provenance point of view. For instance, the empty file. This modification
allows to define a minimum size for files to be considered for the
provenance index.
commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Tue Nov 23 16:11:09 2021 +0100
Reorder flushing operations to avoid unnecessary updates in the storage
commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date: Wed Nov 24 11:10:40 2021 +0100
Rework conftest and improve type annotations
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/487/ for more details.