Differential D6578
Add support to filter files a minimum size
Authored by aeviso on Oct 28 2021, 2:28 PM.
Details
The idea is to be able to filter out files that are not meaningful from a provenance point of view, such as the empty file. This change makes it possible to define a minimum size that files must have to be considered for the provenance index.
Depends on D6680.
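As a rough illustration of the filtering described above, here is a minimal sketch. `FileEntry`, `filter_by_min_size` and the `minsize` parameter are hypothetical names used only for this example, not the actual `swh.provenance` API, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a file entry found in a directory."""

    name: bytes
    sha1: bytes
    size: int  # content length in bytes


def filter_by_min_size(
    entries: Iterable[FileEntry], minsize: int = 0
) -> Iterator[FileEntry]:
    """Yield only the entries whose size is at least `minsize` bytes."""
    for entry in entries:
        if entry.size >= minsize:
            yield entry


entries = [
    FileEntry(b"empty.txt", b"\x00" * 20, 0),
    FileEntry(b"README", b"\x01" * 20, 120),
]

# With a minimum size of 1 byte, the empty file is skipped.
kept = [e.name for e in filter_by_min_size(entries, minsize=1)]
```

With the default `minsize=0` every file is kept, so in this sketch the filter is opt-in.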
Diff Detail
Event Timeline

Build is green

Patch application report for D6578 (id=23908)
Could not rebase; Attempt merge onto ef49e3100c...

Updating ef49e31..1048e73
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 +++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 14 +-
 swh/provenance/revision.py | 11 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 +-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/tests/data/generate_repo.py | 2 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 21 files changed, 1545 insertions(+), 68 deletions(-)

Changes applied before test

commit 1048e73c084670d22d0bca4ef2a69c0627bed3a3
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size

commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Sep 21 16:13:53 2021 +0200

    Add support for remote backend on existing storage tests

commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initilization

commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shoutdown logic and error handling

    Add StatsD support to client to be compliant with the other provenance storage implementations

commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss

    Use `pika.SelectConnection` and make an explicit handle of its life-cycle. Improve connection error handling on both client and server side. Change the RabbitMQ scheme to use 5 exchanges (one per entity + location). Each exchange handles all entity related insertions, dispatching to different queues depending on the requested `ProvenanceStorageInterface` methods (16 queues per methods). For instance, the `content` exchange handles all requests for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and `CNT_IN_DIR` (ie. relations with content as source). In each case, requests are forwarded to 1 of 16 possible workers, depending on the sha1 id of the content.

commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API

    Get methods in the `ProvenanceStorageInterface` are called through a server that guarantees conflict-free writings to the underlying database. Set methods are called directly from the client to avoid RCP overhead for reads. The server spawns multiple processes to handle independent requests concurrently.

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

    Previous version was storing arrays of strings representing tuples for the denormalized relations (`dst` and `loc` of the relation resp.). While that simplified the check for duplicates, it turned out to be very inefficient in terms of disk usage. The new version has two distinct lists if `bigint` (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the lists should be zipped, and repeated tuples filtered.

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecated warning

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/476/ for more details.

Build is green

Patch application report for D6578 (id=23911)
Could not rebase; Attempt merge onto ef49e3100c...
Updating ef49e31..3e58a02
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 +++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 14 +-
 swh/provenance/revision.py | 12 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 +-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/tests/data/generate_repo.py | 2 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 21 files changed, 1545 insertions(+), 69 deletions(-)

Changes applied before test

commit 3e58a02592c87e5dda53c3e73fbf4063cebde4f7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size

commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Sep 21 16:13:53 2021 +0200

    Add support for remote backend on existing storage tests

commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initilization

commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shoutdown logic and error handling

    Add StatsD support to client to be compliant with the other provenance storage implementations

commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss

    Use `pika.SelectConnection` and make an explicit handle of its life-cycle. Improve connection error handling on both client and server side. Change the RabbitMQ scheme to use 5 exchanges (one per entity + location). Each exchange handles all entity related insertions, dispatching to different queues depending on the requested `ProvenanceStorageInterface` methods (16 queues per methods). For instance, the `content` exchange handles all requests for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and `CNT_IN_DIR` (ie. relations with content as source). In each case, requests are forwarded to 1 of 16 possible workers, depending on the sha1 id of the content.

commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API

    Get methods in the `ProvenanceStorageInterface` are called through a server that guarantees conflict-free writings to the underlying database. Set methods are called directly from the client to avoid RCP overhead for reads. The server spawns multiple processes to handle independent requests concurrently.

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

    Previous version was storing arrays of strings representing tuples for the denormalized relations (`dst` and `loc` of the relation resp.). While that simplified the check for duplicates, it turned out to be very inefficient in terms of disk usage. The new version has two distinct lists if `bigint` (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the lists should be zipped, and repeated tuples filtered.

commit 30d8899bcfd60019b84064eba6916af0b2b5173e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:58:32 2021 +0200

    Fix `yaml.load` deprecated warning

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/477/ for more details.

Looks good, but could you update the diff and commit message to give the motivation for the change?

Build is green

Patch application report for D6578 (id=23921)
Could not rebase; Attempt merge onto 30d8899bcf...
Updating 30d8899..3e58a02
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 14 +-
 swh/provenance/revision.py | 12 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 ++-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 20 files changed, 1544 insertions(+), 68 deletions(-)

Changes applied before test

commit 3e58a02592c87e5dda53c3e73fbf4063cebde4f7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size

commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Sep 21 16:13:53 2021 +0200

    Add support for remote backend on existing storage tests

commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initilization

commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shoutdown logic and error handling

    Add StatsD support to client to be compliant with the other provenance storage implementations

commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss

    Use `pika.SelectConnection` and make an explicit handle of its life-cycle. Improve connection error handling on both client and server side. Change the RabbitMQ scheme to use 5 exchanges (one per entity + location). Each exchange handles all entity related insertions, dispatching to different queues depending on the requested `ProvenanceStorageInterface` methods (16 queues per methods). For instance, the `content` exchange handles all requests for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and `CNT_IN_DIR` (ie. relations with content as source). In each case, requests are forwarded to 1 of 16 possible workers, depending on the sha1 id of the content.

commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API

    Get methods in the `ProvenanceStorageInterface` are called through a server that guarantees conflict-free writings to the underlying database. Set methods are called directly from the client to avoid RCP overhead for reads. The server spawns multiple processes to handle independent requests concurrently.

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

    Previous version was storing arrays of strings representing tuples for the denormalized relations (`dst` and `loc` of the relation resp.). While that simplified the check for duplicates, it turned out to be very inefficient in terms of disk usage. The new version has two distinct lists if `bigint` (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/478/ for more details.

Yes, thanks. And make sure to include it in the commit message too (it's more discoverable than the diff when browsing the git history)

Build is green

Patch application report for D6578 (id=24103)
Could not rebase; Attempt merge onto 30d8899bcf...
Updating 30d8899..c85aa01
Fast-forward
 .gitignore | 4 +-
 mypy.ini | 3 +
 pytest.ini | 2 +
 requirements-test.txt | 1 +
 requirements.txt | 1 +
 swh/provenance/__init__.py | 8 +
 swh/provenance/api/client.py | 597 ++++++++++++++++++++++++++
 swh/provenance/api/server.py | 808 ++++++++++++++++++++++++++++++++++-
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 35 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 15 +-
 swh/provenance/revision.py | 12 +-
 swh/provenance/sql/30-schema.sql | 20 +-
 swh/provenance/sql/40-funcs.sql | 50 ++-
 swh/provenance/storage/archive.py | 16 +-
 swh/provenance/tests/conftest.py | 24 +-
 swh/provenance/util.py | 5 +
 tox.ini | 3 +-
 20 files changed, 1544 insertions(+), 69 deletions(-)

Changes applied before test

commit c85aa01eaa350d102f0b316a403e430ce9baf02c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size

    The idea is to be able to filter files that are not meaningful from the provenance point of view. For instance, the empty file. This modification allows to define a minimum size for files to be considered for the provenance index.

commit 9a1a6169375b3591b81162dc0bce7f9c3d735e6c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Sep 21 16:13:53 2021 +0200

    Add support for remote backend on existing storage tests

commit 9358df82cc7255340caadaa13ae3b53fbe5e1cc7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 13:59:00 2021 +0200

    Improve timeout logic on remote storage client side

commit aa8dc0ea8f67748e53076f2143ba2f6dad150498
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 18 11:52:04 2021 +0200

    Export batch size and prefetch count as parameters for remote storage

commit a9bc8845740f18bcf4befe9c521c2b1b8c4fd769
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Oct 11 16:06:03 2021 +0200

    Send several items per message in the remote provenance storage

commit fa5c6b763913bef84a128d152cb25f081edf399d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:49:44 2021 +0200

    Fix config file parsing for server initilization

commit eaf8ad8026de592629d8c9286cf19db2690acfa0
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Oct 8 14:41:42 2021 +0200

    Improve routing key computation for paths

commit 4243290997d281ece591c711e6748de341599e2d
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Sep 15 13:39:59 2021 +0200

    Improve server/client shoutdown logic and error handling

    Add StatsD support to client to be compliant with the other provenance storage implementations

commit df083f60f1eeeb9257992a639c9c1a9937ce62f4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Aug 31 13:36:34 2021 +0200

    Rework `ProvenanceStorageRabbitMQWorker` to handle connection loss

    Use `pika.SelectConnection` and make an explicit handle of its life-cycle. Improve connection error handling on both client and server side. Change the RabbitMQ scheme to use 5 exchanges (one per entity + location). Each exchange handles all entity related insertions, dispatching to different queues depending on the requested `ProvenanceStorageInterface` methods (16 queues per methods). For instance, the `content` exchange handles all requests for `content_add` and `relation_add` for both relations `CNT_EARLY_IN_REV` and `CNT_IN_DIR` (ie. relations with content as source). In each case, requests are forwarded to 1 of 16 possible workers, depending on the sha1 id of the content.

commit 69596d600a120c13d0cd2ed0d4e48584e8b9dc7c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Aug 20 12:21:27 2021 +0200

    Add new RabbitMQ-based client/server API

    Get methods in the `ProvenanceStorageInterface` are called through a server that guarantees conflict-free writings to the underlying database. Set methods are called directly from the client to avoid RCP overhead for reads. The server spawns multiple processes to handle independent requests concurrently.

commit 743b5954068fcc98203d9d254c53c076856e3426
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 14 12:03:47 2021 +0200

    Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor

    Previous version was storing arrays of strings representing tuples for the denormalized relations (`dst` and `loc` of the relation resp.). While that simplified the check for duplicates, it turned out to be very inefficient in terms of disk usage. The new version has two distinct lists if `bigint` (ie. internal ids) for `dst` and `loc` resp. To check for duplicates the lists should be zipped, and repeated tuples filtered.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/480/ for more details.

Build is green

Patch application report for D6578 (id=24269)
Could not rebase; Attempt merge onto 94baaab052...
Updating 94baaab..584845d
Fast-forward
 swh/provenance/archive.py | 2 +-
 swh/provenance/cli.py | 4 +-
 swh/provenance/graph.py | 3 +-
 swh/provenance/model.py | 4 +-
 swh/provenance/postgresql/archive.py | 15 +++----
 swh/provenance/provenance.py | 77 +++++++++++++-----------------------
 swh/provenance/revision.py | 12 ++++--
 swh/provenance/storage/archive.py | 16 ++++----
 swh/provenance/tests/conftest.py | 34 +++++++++-------
 9 files changed, 81 insertions(+), 86 deletions(-)

Changes applied before test

commit 584845d3715ea6c536e7cf5f697cac628032416f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Thu Oct 28 14:21:52 2021 +0200

    Add support to filter files a minimum size

    The idea is to be able to filter files that are not meaningful from the provenance point of view. For instance, the empty file. This modification allows to define a minimum size for files to be considered for the provenance index.

commit 966fe3e8d506ce8b4fddf6e9ad29db4dae9943ab
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Tue Nov 23 16:11:09 2021 +0100

    Reorder flushing operations to avoid unnecessary updated in the storage

commit 62a31f6f986bb38ced99331ab66eb0717600ea5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Wed Nov 24 11:10:40 2021 +0100

    Rework conftest and improve type annotations

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/487/ for more details.
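The commit message for df083f6 quoted in the earlier build reports says that requests are forwarded to 1 of 16 possible workers depending on the sha1 id of the content. It does not say how the id is mapped to a worker, so the sketch below only illustrates one plausible scheme: taking the first hex digit of the sha1. The function names and the `method.digit` key format are hypothetical and are not taken from `swh.provenance` or from `pika`.

```python
def partition_for(sha1: bytes) -> int:
    """Map a 20-byte sha1 id to one of 16 partitions (assumed scheme)."""
    if len(sha1) != 20:
        raise ValueError("expected a 20-byte sha1 digest")
    # Assumption: the high nibble of the first byte (the first hex digit)
    # selects the partition, giving values 0..15.
    return sha1[0] >> 4


def routing_key(method: str, sha1: bytes) -> str:
    """Build a hypothetical routing key such as 'content_add.a'."""
    return f"{method}.{partition_for(sha1):x}"


# A sha1 starting with hex digit 'a' lands in partition 10, written 'a'.
key = routing_key("content_add", bytes.fromhex("a9" + "00" * 19))
```

Because the partition depends only on the id, all requests for the same content reach the same worker, which fits the conflict-free writes the commit messages aim for.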