Doesn't this deserve a state-of-the-art kind of thing? Is there documentation on the subject? How do other (big) Cassandra users handle this?

Apr 19 2021, 2:14 PM · Storage manager

olasd added a comment to T2602: Investigate how to upgrade the schema of the Cassandra storage.

In T2602#63432, @vlorentz wrote:

For the harder cases, that involve changes to the PK, we could do something like this:

create a new table with a new name (e.g. revision_v[n+1]; like we do in swh-search, except that Cassandra does not support aliases)

start an extra storage backend that reads from that table instead of the old one (e.g. revision_v[n]), and reads from all the other tables as usual

have a multiplexing storage proxy (like the one we have for the objstorage) that queries this new backend (which reads from v[n+1]) and falls back to the old backend (which reads from v[n])
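The fallback step above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the swh-storage implementation: `DictBackend`, `FallbackStorageProxy` and the single `revision_get` method are hypothetical names, and in-memory dicts stand in for the revision_v[n] and revision_v[n+1] tables.

```python
from typing import Dict, List, Optional, Protocol


class RevisionBackend(Protocol):
    """Hypothetical minimal interface: look up one revision by its id."""

    def revision_get(self, revision_id: bytes) -> Optional[dict]: ...


class DictBackend:
    """Hypothetical stand-in for a backend reading one version of the table."""

    def __init__(self, table_name: str, rows: Dict[bytes, dict]) -> None:
        self.table_name = table_name
        self._rows = rows

    def revision_get(self, revision_id: bytes) -> Optional[dict]:
        return self._rows.get(revision_id)


class FallbackStorageProxy:
    """Query backends in order and return the first row that is found.

    The first backend is assumed to read the new table (v[n+1]); later
    backends read older tables and are only queried on a miss.
    """

    def __init__(self, backends: List[RevisionBackend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)

    def revision_get(self, revision_id: bytes) -> Optional[dict]:
        for backend in self._backends:
            row = backend.revision_get(revision_id)
            if row is not None:
                return row
        return None


# Usage: rows already migrated come from the new table, the rest from the old one.
old = DictBackend("revision_v1", {b"a": {"src": "v1"}, b"b": {"src": "v1"}})
new = DictBackend("revision_v2", {b"a": {"src": "v2"}})
proxy = FallbackStorageProxy([new, old])
```

With this ordering, reads keep working while rows are copied to the new table, and the old backend can be dropped from the list once the copy is complete.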

Apr 19 2021, 1:59 PM · Storage manager

vlorentz removed a parent task for T3089: Remove the 'metadata' column of the 'revision' table: T2471: NPM package angular-ts-manage fails to be properly loaded.

Apr 19 2021, 12:43 PM · Storage manager, Archive content

Apr 16 2021

vlorentz added a comment to T2602: Investigate how to upgrade the schema of the Cassandra storage.

What we can do, however:

Apr 16 2021, 1:45 PM · Storage manager

vlorentz added a subtask for T1892: Cassandra as a storage backend: T2602: Investigate how to upgrade the schema of the Cassandra storage.

Apr 16 2021, 1:36 PM · meta-task, Storage manager

vlorentz added a parent task for T2602: Investigate how to upgrade the schema of the Cassandra storage: T1892: Cassandra as a storage backend.

Apr 16 2021, 1:36 PM · Storage manager

Apr 15 2021

vlorentz placed T3018: Allow querying raw_extrinsic_metadata by hash in swh-storage up for grabs.

Apr 15 2021, 3:17 PM · Storage manager, Extrinsic metadata

vlorentz added a parent task for T2564: migrate existing revisions metadata extra_headers to actual extra_headers field: T3090: Make loaders not rely on the 'metadata' column of the 'revision' table.

Apr 15 2021, 3:15 PM · System administration, Storage manager

vlorentz closed T3090: Make loaders not rely on the 'metadata' column of the 'revision' table, a subtask of T3089: Remove the 'metadata' column of the 'revision' table, as Resolved.

Apr 15 2021, 3:15 PM · Storage manager, Archive content

vlorentz closed T3142: Make loaders write to the ExtId storage, a subtask of T3143: Migrate revision metadata to extid in the storage, as Resolved.

Apr 15 2021, 3:15 PM · System administration, Storage manager, Core Loader

Apr 14 2021

KShivendu closed T2316: Align row deduplication of all _add endpoints on release_add as Resolved.

Apr 14 2021, 5:59 PM · Easy hack, Storage manager

Apr 12 2021

olasd updated the task description for T3245: List all the objects that should be impacted by a given takedown request.

Apr 12 2021, 4:24 PM · Storage manager

olasd changed the status of T3245: List all the objects that should be impacted by a given takedown request from Open to Work in Progress.

Apr 12 2021, 4:24 PM · Storage manager

Apr 9 2021

anlambert added a comment to T3145: Docs : Postgres DB schema missing.

Schema image is now properly displayed: https://docs.softwareheritage.org/devel/swh-storage/sql-storage.html#sql-storage

Apr 9 2021, 3:17 PM · Storage manager, Documentation

ardumont closed T3145: Docs : Postgres DB schema missing as Resolved.

Apr 9 2021, 2:23 PM · Storage manager, Documentation

ardumont added a comment to T3145: Docs : Postgres DB schema missing.

Thanks @faux @KShivendu @anlambert, team work ;)

Apr 9 2021, 2:23 PM · Storage manager, Documentation

ardumont merged T3227: DB Schema link broken in docs under swh-storage. into T3145: Docs : Postgres DB schema missing.

Apr 9 2021, 2:22 PM · Storage manager, Documentation

Apr 6 2021

vlorentz merged task T3185: Migrate extrinsic metadata from 'revision' to 'raw_extrinsic_metadata' tables into T2513: Copy metadata on revisions to the extrinsic metadata storage.

Apr 6 2021, 5:14 PM · System administration, Storage manager

vlorentz added a comment to T3143: Migrate revision metadata to extid in the storage.

If you remember the crash times (.zsh_history?), we could find a range of candidate SWHIDs...

Apr 6 2021, 5:12 PM · System administration, Storage manager, Core Loader

olasd closed T3143: Migrate revision metadata to extid in the storage as Resolved.

The migration script has now run to completion (took around a week).

Apr 6 2021, 4:53 PM · System administration, Storage manager, Core Loader

olasd added a revision to T3143: Migrate revision metadata to extid in the storage: D5430: Add sha512 as a valid field in dsc metadata.

Apr 6 2021, 4:48 PM · System administration, Storage manager, Core Loader

vlorentz added a parent task for T3089: Remove the 'metadata' column of the 'revision' table: T3201: Mirror: unsupported Unicode escape sequence.

Apr 6 2021, 2:20 PM · Storage manager, Archive content

vlorentz added a comment to T1487: Add a public API endpoint to retrieve a set of files with a given name.

@KShivendu The linked script is a start. As it is, it requires direct access to the DB, so you need to create abstractions for it in swh-storage and swh-web.

Apr 6 2021, 12:50 PM · Easy hack, Storage manager, Object storage

vlorentz closed T1377: in-memory storage: compute all counters as Resolved.

ok, thanks. It's actually tested in test_stat_counters in swh-storage/swh/storage/tests/storage_tests.py, which is used to test all four classes.

Apr 6 2021, 12:47 PM · Easy hack, Storage manager

Apr 5 2021

KShivendu added a comment to T1487: Add a public API endpoint to retrieve a set of files with a given name.

Hi guys. Any pointers on where to start?

Apr 5 2021, 1:57 PM · Easy hack, Storage manager, Object storage

KShivendu added a comment to T1377: in-memory storage: compute all counters.

I might be wrong, but I think it has been completed. Check these out:

Apr 5 2021, 12:24 PM · Easy hack, Storage manager

Apr 3 2021

vlorentz closed T2290: Implement origin_metadata endpoints in swh/storage/cassandra/ as Resolved.

No longer relevant

Apr 3 2021, 9:06 AM · Easy hack, Storage manager

Apr 1 2021

vlorentz updated the task description for T1892: Cassandra as a storage backend.

Apr 1 2021, 11:48 AM · meta-task, Storage manager

vlorentz added a subtask for T1117: Origin search is *slow* when you look for very common words: T2590: Finish the indexer -> swh-search pipeline.

Apr 1 2021, 10:51 AM · Web app, Storage manager

Mar 30 2021

olasd changed the status of T3143: Migrate revision metadata to extid in the storage from Open to Work in Progress.

Mar 30 2021, 7:43 PM · System administration, Storage manager, Core Loader

olasd added a comment to T3143: Migrate revision metadata to extid in the storage.

I've deployed the extid schema changes on all storages, and I've started the migration script on getty.

Mar 30 2021, 7:42 PM · System administration, Storage manager, Core Loader

vsellier added a project to T3143: Migrate revision metadata to extid in the storage: System administration.

Mar 30 2021, 5:26 PM · System administration, Storage manager, Core Loader

vlorentz added a project to T3143: Migrate revision metadata to extid in the storage: Storage manager.

Mar 30 2021, 4:57 PM · System administration, Storage manager, Core Loader

Mar 29 2021

vlorentz renamed T3185: Migrate extrinsic metadata from 'revision' to 'raw_extrinsic_metadata' tables from Migrate extrinsic metadata to Migrate extrinsic metadata from 'revision' to 'raw_extrinsic_metadata' tables.

Mar 29 2021, 4:06 PM · System administration, Storage manager

vlorentz triaged T3185: Migrate extrinsic metadata from 'revision' to 'raw_extrinsic_metadata' tables as Normal priority.

Mar 29 2021, 4:05 PM · System administration, Storage manager

Mar 25 2021

vlorentz closed T1910: Redesign origin search using a dedicated component (swh-search), a subtask of T1117: Origin search is *slow* when you look for very common words, as Resolved.

Mar 25 2021, 11:16 AM · Web app, Storage manager

vlorentz closed T1910: Redesign origin search using a dedicated component (swh-search) as Resolved.

Mar 25 2021, 11:16 AM · Archive search, Storage manager

vlorentz closed T1910: Redesign origin search using a dedicated component (swh-search), a subtask of T1892: Cassandra as a storage backend, as Resolved.

Mar 25 2021, 11:16 AM · meta-task, Storage manager

Mar 23 2021

vlorentz added a comment to T2686: Use hashes for all kafka keys.

(and we should keep the origin topic; we already have an ExtSWHID for origins anyway)

Mar 23 2021, 2:55 PM · Data Model, Storage manager

olasd added a comment to T2686: Use hashes for all kafka keys.

The following objects remain:

Mar 23 2021, 2:47 PM · Data Model, Storage manager

vlorentz closed T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects, a subtask of T2668: Package loaders should write extrinsic metadata on directories instead of revisions/releases, as Resolved.

Mar 23 2021, 2:33 PM · Package Loader, Storage manager, Extrinsic metadata

vlorentz closed T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects, a subtask of T2686: Use hashes for all kafka keys, as Resolved.

Mar 23 2021, 2:33 PM · Data Model, Storage manager

vlorentz closed T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects as Resolved.

Mar 23 2021, 2:33 PM · Data Model, Storage manager, Extrinsic metadata

vlorentz closed T3017: Use hashes as keys in swh.journal.objects.raw_extrinsic_metadata, a subtask of T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects, as Resolved.

Mar 23 2021, 2:33 PM · Data Model, Storage manager, Extrinsic metadata

vlorentz closed T3017: Use hashes as keys in swh.journal.objects.raw_extrinsic_metadata as Resolved.

Mar 23 2021, 2:33 PM · Data Model, Storage manager, Extrinsic metadata

vlorentz closed T3020: Add an "index" for raw_extrinsic_metadata.id in swh.storage.cassandra, a subtask of T3022: Deduplicate RawExtrinsicMetadata by hash instead of a subset of their fields, as Resolved.

Mar 23 2021, 2:32 PM · Storage manager, Extrinsic metadata

vlorentz closed T3020: Add an "index" for raw_extrinsic_metadata.id in swh.storage.cassandra, a subtask of T3018: Allow querying raw_extrinsic_metadata by hash in swh-storage, as Resolved.

Mar 23 2021, 2:32 PM · Storage manager, Extrinsic metadata

vlorentz closed T3020: Add an "index" for raw_extrinsic_metadata.id in swh.storage.cassandra as Resolved.

Mar 23 2021, 2:32 PM · Storage manager, Extrinsic metadata

olasd closed T3019: Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql, a subtask of T3018: Allow querying raw_extrinsic_metadata by hash in swh-storage, as Resolved.

Mar 23 2021, 2:31 PM · Storage manager, Extrinsic metadata

olasd closed T3019: Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql as Resolved.

After a lot of back and forth, and the release of swh.model v2.3.0 and swh.storage v0.26.0, this is now all done and deployed in staging and production.

Mar 23 2021, 2:31 PM · Storage manager, Extrinsic metadata

olasd closed T3019: Add an index for raw_extrinsic_metadata.id in swh.storage.postgresql, a subtask of T3022: Deduplicate RawExtrinsicMetadata by hash instead of a subset of their fields, as Resolved.

Mar 23 2021, 2:31 PM · Storage manager, Extrinsic metadata

olasd closed T3022: Deduplicate RawExtrinsicMetadata by hash instead of a subset of their fields, a subtask of T2703: Use intrinsic identifiers/hashes for RawExtrinsicMetadata objects, as Resolved.

Mar 23 2021, 2:25 PM · Data Model, Storage manager, Extrinsic metadata

olasd closed T3022: Deduplicate RawExtrinsicMetadata by hash instead of a subset of their fields as Resolved.

After the release of swh.model v2, this is now done.

Mar 23 2021, 2:25 PM · Storage manager, Extrinsic metadata

Mar 19 2021

vlorentz triaged T3135: Improve integrity of ingested content as Normal priority.

Mar 19 2021, 4:23 PM · Storage manager, Roadmap 2021, meta-task

Mar 17 2021

KShivendu updated the task description for T3145: Docs : Postgres DB schema missing.

Mar 17 2021, 8:56 AM · Storage manager, Documentation

KShivendu triaged T3145: Docs : Postgres DB schema missing as Normal priority.

Mar 17 2021, 8:46 AM · Storage manager, Documentation

Mar 15 2021

rdicosmo added a subtask for T3135: Improve integrity of ingested content: T399: (Re-)Compute data checksums before insertion.

Mar 15 2021, 8:48 PM · Storage manager, Roadmap 2021, meta-task

rdicosmo added a parent task for T399: (Re-)Compute data checksums before insertion: T3135: Improve integrity of ingested content.

Mar 15 2021, 8:48 PM · Storage manager

rdicosmo created T3135: Improve integrity of ingested content.

Mar 15 2021, 8:47 PM · Storage manager, Roadmap 2021, meta-task

rdicosmo added a comment to T3092: Define the requirements for an on-premise Cassandra cluster.

Let's organise a call next week to explore the options, including the new opportunities of testing that emerged recently.

Mar 15 2021, 1:57 PM · System administration, Storage manager

vlorentz added a comment to T3092: Define the requirements for an on-premise Cassandra cluster.

@rdicosmo I have not, good idea. While they are probably too expensive to use as the main storage instead of SSDs (either via a regular FS or by using a PMem-aware Cassandra fork), we could use them in addition to the above requirements.

Mar 15 2021, 1:48 PM · System administration, Storage manager

rdicosmo added a comment to T3092: Define the requirements for an on-premise Cassandra cluster.

Did you consider PMem (and other configurations for Intel Optane memory) in your discussion? It offers a very interesting price/performance ratio.
There are machines on Grid5000 available to test this technology if needed.

Mar 15 2021, 1:21 PM · System administration, Storage manager

vlorentz added a parent task for T3089: Remove the 'metadata' column of the 'revision' table: T2471: NPM package angular-ts-manage fails to be properly loaded.

Mar 15 2021, 12:32 PM · Storage manager, Archive content

vlorentz closed T3092: Define the requirements for an on-premise Cassandra cluster as Resolved.

Mar 15 2021, 11:34 AM · System administration, Storage manager

vlorentz closed T3092: Define the requirements for an on-premise Cassandra cluster, a subtask of T3091: Order hardware for an on-premise Cassandra cluster, as Resolved.

Mar 15 2021, 11:34 AM · System administration, Storage manager