Page MenuHomeSoftware Heritage

[WIP] Add content_metadata_{add,get}.
AbandonedPublic

Authored by vlorentz on Jun 8 2020, 6:51 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

This is a generic implementation, so adding other types of objects should be easy.

Tests will fail because it's not implemented for in-mem or cassandra yet.

Test Plan

failing

Diff Detail

Repository
rDSTO Storage manager
Branch
artifact-metadata
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 13115
Build 20007: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 20006: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D3247 (id=11506)

Rebasing onto 7eb44d412b...

Current branch diff-target is up to date.
Changes applied before test
commit 640848e8490f32dd181fdb3e400ad59bb2c1dcbb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 8 18:50:17 2020 +0200

    [WIP] Add content_metadata_{add,get}.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/248/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/248/console

This looks reasonable ;)

Though I'll definitely need a second look :D

swh/storage/db.py
1182

since you're removing id (index 0) and metadata (index -1) from the cols, maybe assert those are what we think they are (to avoid future breaking change on that).

(We did that for the origin visit status)

swh/storage/sql/30-swh-schema.sql
40

I don't really know what's more sensible here, being more restrictive or not.

I would have made it less open at first, something like:
'^swh:[0-9]:[a-z]{3}:.*' ?

  • Before we reach version number 9, we may have some time (we are 1 today ;)
  • [a-z]{3} because so far, it's only {snp, rev, dir, cnt, rel}. We could also just limit to those values, maybe.

So that may make us think when we want to upgrade the swhid.
(Then again, i might be thinking backward here ¯\_(ツ)_/¯)

428–429

heh, if we had the origin id as swhid, we would have been symmetric here :D

swh/storage/tests/test_storage.py
3179

"""metadata which are not bytes should raise""" or something

I'm a bit puzzled by this diff w.r.t this assertion in T2306 "this metadata store should be outside of the main graph storage" . T2306 is related with this diff, right?

I had the impression we wanted all this to live in a dedicated storage rather than in the main one. Did I miss something?

In any case it's a good starting point, nice job.

swh/storage/extrinsic_metadata.py
23

Not sure I find this way of writing the dict easier to read than writing it as a literal, but I am mainly concerned why we need yet another model-like declaration here.

What does this carry that is not declared in the data model? Shouldn't this be put there (in the model) instead?

[edit] Just reread the spec (D3154) and I do better understand this part. However, I wonder is this notion of context should not be better described in swh.model .

swh/storage/extrinsic_metadata.py
23

I'd rather keep it in swh-storage for now, and move it in swh-model later if need be.

swh/storage/sql/30-swh-schema.sql
40

[a-z]{3} because so far, it's only {snp, rev, dir, cnt, rel}.

No. We have no idea what the format of SWHID v2 will be.

I'm a bit puzzled by this diff w.r.t this assertion in T2306 "this metadata store should be outside of the main graph storage" . T2306 is related with this diff, right?

I had the impression we wanted all this to live in a dedicated storage rather than in the main one. Did I miss something?

This meant that the metadata store should be in separate, unlinked, tables, so we can move it out/split it out if we need to. E.g. no new metadata attributes on current objects.

In the latest iteration of that conversation (ISTR with @zack and @ardumont), we concluded that we don't have a good, separate, storage backend solution for this metadata, and that it would make sense to put it within the main database, in separate tables (so we get replication, backups, ... integrated with the current setup for "free"). We can move this out / shard this separately when we finalize the migration off of postgres.

  • rebase
  • Simplify code by using a single table for metadata of all object types.

Build has FAILED

Patch application report for D3247 (id=11652)

Rebasing onto 8f1ac4cb13...

Current branch diff-target is up to date.
Changes applied before test
commit 4540e48ae64960bb2b83850b33ec11bb4b3c7828
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 15 17:17:36 2020 +0200

    Simplify code by using a single table for metadata of all object types.

commit b68fbd296c5e09ab1fcb3233af79d4fe8f83de88
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 8 18:50:17 2020 +0200

    [WIP] Add content_metadata_{add,get}.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/282/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/282/console

  • Rebase
  • Add content_metadata_{add,get} to the cassandra backend.

Build has FAILED

Patch application report for D3247 (id=11725)

Rebasing onto 822d96b580...

Current branch diff-target is up to date.
Changes applied before test
commit cbe008ca1c1fbbecc6902cd63502dec97c3db713
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 18 12:23:10 2020 +0200

    Add content_metadata_{add,get} to the cassandra backend.

commit f0d3089809de5d4a38b6fa5c4398b20fc51f6029
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 15 17:17:36 2020 +0200

    Simplify code by using a single table for metadata of all object types.

commit cc9a2edcd9f2be0ced57bf6ffc13d3293b24d50b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 8 18:50:17 2020 +0200

    [WIP] Add content_metadata_{add,get}.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/306/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/306/console

  • weed out accidental reverts in cql.py
  • fix extra context key in pg's origin_metadata_get output

Build has FAILED

Patch application report for D3247 (id=11807)

Rebasing onto 2d497ff07c...

First, rewinding head to replay your work on top of it...
Applying: [WIP] add content_metadata for cassandra and postgresql
Changes applied before test
commit b516bd947537cbe452b140647bbc8f67fdf050e6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 16:17:39 2020 +0200

    [WIP] add content_metadata for cassandra and postgresql

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/330/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/330/console

  • [WIP] add content_metadata for in-mem

Build has FAILED

Patch application report for D3247 (id=11824)

Rebasing onto 2d497ff07c...

First, rewinding head to replay your work on top of it...
Applying: [WIP] add content_metadata for cassandra and postgresql
Applying: [WIP] add content_metadata for in-mem
Changes applied before test
commit 3db984acfc7d6d747d6e33470582fb797fd37a5c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 19:43:23 2020 +0200

    [WIP] add content_metadata for in-mem

commit 8bde7ba388b35060d7517a79615853e7f8ddf9c2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 16:17:39 2020 +0200

    [WIP] add content_metadata for cassandra and postgresql

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/334/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/334/console

  • Add 'type' column
  • Fix signatures
  • bump schema version

Build has FAILED

Patch application report for D3247 (id=11895)

Rebasing onto c6e6f33916...

First, rewinding head to replay your work on top of it...
Applying: [WIP] add content_metadata for cassandra and postgresql
Using index info to reconstruct a base tree...
M	swh/storage/cassandra/cql.py
M	swh/storage/cassandra/schema.py
M	swh/storage/cassandra/storage.py
M	swh/storage/db.py
M	swh/storage/sql/30-swh-schema.sql
M	swh/storage/sql/60-swh-indexes.sql
M	swh/storage/storage.py
M	swh/storage/tests/test_storage.py
Falling back to patching base and 3-way merge...
Auto-merging swh/storage/tests/test_storage.py
Auto-merging swh/storage/storage.py
CONFLICT (content): Merge conflict in swh/storage/storage.py
Auto-merging swh/storage/sql/60-swh-indexes.sql
Auto-merging swh/storage/sql/30-swh-schema.sql
Auto-merging swh/storage/db.py
Auto-merging swh/storage/cassandra/storage.py
Auto-merging swh/storage/cassandra/schema.py
Auto-merging swh/storage/cassandra/cql.py
Patch failed at 0001 [WIP] add content_metadata for cassandra and postgresql

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto c6e6f33916...

Already up to date.
Changes applied before test
commit 84f000545c050edb7f70092147975fd2d571c712
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 25 14:40:12 2020 +0200

    bump schema version

commit b14ad6a33629a17dc6b3b0e50b384322f4d4af3b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 25 14:39:33 2020 +0200

    Fix signatures

commit a9201fde444ee4a1cb302def5a0753c447d84e3e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 25 14:13:22 2020 +0200

    Add 'type' column

commit f1e304f7ec4a85e06c9742fa78995f9d7e5e38e1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 19:43:23 2020 +0200

    [WIP] add content_metadata for in-mem

commit 71dfb8dc2fab777d6017fed02e58737c12f26846
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 16:17:39 2020 +0200

    [WIP] add content_metadata for cassandra and postgresql

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/354/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/354/console

Build is green

Patch application report for D3247 (id=11900)

Rebasing onto 89e9dae604...

Current branch diff-target is up to date.
Changes applied before test
commit dc4e8bc0291bad473ccbb80ad959d4e04714f9a7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 25 14:40:12 2020 +0200

    bump schema version

commit acc3b4210446203230f7f6881bd8be6300bcfbd1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 25 14:39:33 2020 +0200

    Fix signatures

commit cfa820131c5caef47931dbde449143ab56c2d0ea
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jun 25 14:13:22 2020 +0200

    Add 'type' column

commit 1b1400a7e7fbc3750602ae136812741d5556986c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 19:43:23 2020 +0200

    [WIP] add content_metadata for in-mem

commit 54c17186e4686e23f5a43594fc1c81a9871041ba
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jun 22 16:17:39 2020 +0200

    [WIP] add content_metadata for cassandra and postgresql

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/356/ for more details.

Replaced by linked diffs