Page MenuHomeSoftware Heritage

Implement extrinsic origin metadata specification.
ClosedPublic

Authored by vlorentz on Apr 9 2020, 12:37 PM.

Details

Diff Detail

Repository
rDSTO Storage manager
Branch
impl-origin-metadata-spec
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11746
Build 17806: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 17805: arc lint + arc unit

Unit TestsFailed

TimeTest
4 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_buffer::test_buffering_proxy_storage_clear
sample_data = {'authority': [{'metadata': {'location': 'France'}, 'type': 'deposit-client', 'url': 'http:///hal/inria'}], 'content':...384>, 'target': b'12345678901234567890', 'type': 'dir'}], 'id': b'4\x013B2S1\x000\xf51\xe62\xa73\xff7\xc3\xa90'}], ...} def test_buffering_proxy_storage_clear(sample_data):
708 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_content_add
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7f7462b86550> swh_storage = <swh.storage.validate.ValidatingProxyStorage object at 0x7f746062b940>
637 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_content_add_from_generator
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7f746033b080> swh_storage = <swh.storage.validate.ValidatingProxyStorage object at 0x7f746028ff60>
600 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_content_add_from_lazy_content
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7f7450655550> swh_storage = <swh.storage.validate.ValidatingProxyStorage object at 0x7f74507ddcf8>
486 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_directory_add_from_generator
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7f741059a358> swh_storage = <swh.storage.validate.ValidatingProxyStorage object at 0x7f7410674cc0>
View Full Test Results (19 Failed · 830 Passed · 18 Skipped)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sql/upgrades/148.sql
93 ↗(On Diff #10802)

I'm commenting here that the changes above should be all identified in swh-deposit before landing.
Because swh-deposit uses the origin_metadata table.

swh/storage/cassandra/cql.py
801

I'm not sure I remember correctly, so I'll ask @ardumont to join.
But the name metadata_fetcher is specific to extrinsic metadata, while the tool table is used for the indexers.

Would that be a change that will affect indexers?

I see that this is defined in a Cassandra file, for which I do not know the scope.

Could you do a test for a registry and a forge, for the same origin?
I can give you real life examples, if needed..

I can't test the diff now, but I read it and added some comments..

I could try testing it later on this week...

swh/storage/cassandra/schema.py
193

Why do we need authority_type in the PRIMARY KEY? isn't the url enough ?

197

every typo?

swh/storage/interface.py
1151

and?

1170

is the and at the end intentional?

1241–1242

is it a fixed set?

swh/storage/tests/storage_data.py
330

above the type is only deposit

331

this looks like a wrong url (I don't know if this is a good or bad thing for a test)
and this could an example where name and url can change:
here hal is the authority but the hal-instance is hal-inria.
I'm not sure this is something that we can/want to differentiate .

swh/storage/cassandra/cql.py
801

The tool table used by the indexer are in their dedicated storage db.
This tool table stayed for the deposit.
Only 1 entry is referenced in the archive table (the deposit).

vlorentz marked 9 inline comments as done.

apply @moranegg's comments

sql/upgrades/148.sql
37 ↗(On Diff #10802)

the name is not needed as an identifier, and everything else can go in the metadata if we need it, but I don't think we do.

swh/storage/cassandra/schema.py
193

I don't know if it's needed, but that's how we wrote the specification.

swh/storage/interface.py
1170

no, that's a typo

1241–1242

According to the spec, yes. We can extend it later anyway.

swh/storage/tests/storage_data.py
330

indeed

331

It should be a valid URI in practice; but in tests it doesn't really matter. Changed to a more obvious url.

Build is green

Patch application report for D2988 (id=10832)

Rebasing onto bca643acab...

First, rewinding head to replay your work on top of it...
Applying: Add a wrapper to manage a sorted list.
Applying: Implement extrinsic origin metadata specification.
Changes applied before test
commit f12240bf1531751ab4af443906c8ba83d4e3b978
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

commit e9369b031669ae77ee627ee38ac12dda32b7a3b4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 24 17:22:09 2020 +0100

    Add a wrapper to manage a sorted list.
    
    For now this is only used by sorted_sha1, but we'll need it for
    origin_metadata soon.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/111/ for more details.

sql/upgrades/148.sql
93 ↗(On Diff #10802)

Yes, and now thanks to your review and remarks on D3045, there is even a test which checks those are fine.
So when this lands and get released, it should be noticed by the deposit loader ci if we forget about this \m/

  • Rebase
  • Add test with a different authority

Build is green

Patch application report for D2988 (id=10843)

Rebasing onto bca643acab...

First, rewinding head to replay your work on top of it...
Applying: Add a wrapper to manage a sorted list.
Applying: Implement extrinsic origin metadata specification.
Changes applied before test
commit 73cef28bb09655e7f4e3bc5c7974f79d66162181
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

commit 0c38297e5b67c1f020d742797d3df82a7c7ce7d7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 24 17:22:09 2020 +0100

    Add a wrapper to manage a sorted list.
    
    For now this is only used by sorted_sha1, but we'll need it for
    origin_metadata soon.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/113/ for more details.

swh/storage/cassandra/storage.py
1040

don't we have some conversion to do on that field?

(in metadata_fetcher_get below, you do a json.loads so ¯\_(ツ)_/¯)

swh/storage/cassandra/storage.py
1040

origin metadata are blobs (because we get them from random sources), metadata fetcher metadatas are jsonb (because we write them at SWH)

swh/storage/in_memory.py
101

Missing a rebase on latest master

That's D2987, ain't it?

Build is green

Patch application report for D2988 (id=10847)

Rebasing onto fe56005555...

Current branch diff-target is up to date.
Changes applied before test
commit 0acfafc733995d858c5dc6e9c9fafffca49a5f52
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/116/ for more details.

swh/storage/storage.py
1157

nitpick: you could move the check just after the get.
There is no point in fetching the fetcher if the authority_id is not resolved.

1211
row = db.metadata_fetcher_get(name, version, cur)
1229
row = db.metadata_authority_get(type, url, cur)
swh/storage/tests/test_retry.py
449

can you please merge the string, that's black doing some joke on us?

Looks good to me.

Was a bit hard to read the all lot of it though ;)

This revision is now accepted and ready to land.Apr 23 2020, 2:35 PM

Don't forget about the missing cur though ;)

@ardumont After this diff lands as is, the pattern in loader with the extrinsic field, needs to change terminology to authority and fetcher.
(instead of provider and tool, to stay coherent)

This is why I wanted to have a last chat with @zack regarding the alignment of the terminology.

Also @vlorentz, can we identify all the modules that are affected by this modification and terminology?

@ardumont After this diff lands as is, the pattern in leader with extrinsic, needs to change terminology to authority and fetcher.

I did not grasp what you meant ;)
Can you please reformulate?

This is why I wanted to have a last chat with @zack regarding the alignment of the terminology.

sure thing.

Also @vlotentz, can we identify all the modules that are affected by this modification and terminology?

typo @vlorentz ^

Also @vlotentz, can we identify all the modules that are affected by this modification and terminology?

Only swh-loader-core.

Build is green

Patch application report for D2988 (id=10855)

Rebasing onto fe56005555...

Current branch diff-target is up to date.
Changes applied before test
commit 3dc9a2f52938ea20c4fd3b4078f7cb6fd1ac5c59
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/117/ for more details.

I have updated my first comment with maybe a more clear and without typos sentence.
Sorry about that.

I'm referring to the loader-tar and the new pattern of including extrinsic metadata to the revision.
Which needs to be coherent in terms of terminology.
So we need to discuss that.

@ardumont After this diff lands as is, the pattern in loader with the extrinsic field, needs to change terminology to authority and fetcher.
(instead of provider and tool, to stay coherent)

Yes, indeed. When this lands and the storage get tagged.
Thus why i mention earlier the loader-core's deposit test will notice it and fail at that point.
That will be our cue to migrate that calls to storage accordingly.

Which makes me realize, @vlorentz don't we need an extra sql statement to migrate the data?

Which makes me realize, @vlorentz don't we need an extra sql statement to migrate the data?

What data?

Which makes me realize, @vlorentz don't we need an extra sql statement to migrate the data?

What data?

The ones present in the table you are renaming, dropping columns, etc... ?
(i was talking about the pg storage heh ;)

There's the migration.

And this:

-- migrate metadata_jsonb (a jsonb) to metadata (a bytea) with an external process

is because I don't think there is a postgresql function to do it.

Build is green

Patch application report for D2988 (id=11114)

Rebasing onto b0b767b91c...

First, rewinding head to replay your work on top of it...
Applying: Implement extrinsic origin metadata specification.
Using index info to reconstruct a base tree...
M	docs/extrinsic-metadata-specification.rst
M	swh/storage/cassandra/cql.py
M	swh/storage/cassandra/schema.py
M	swh/storage/cassandra/storage.py
M	swh/storage/db.py
M	swh/storage/in_memory.py
M	swh/storage/sql/30-swh-schema.sql
M	swh/storage/sql/60-swh-indexes.sql
M	swh/storage/storage.py
M	swh/storage/tests/conftest.py
M	swh/storage/tests/storage_data.py
M	swh/storage/tests/test_retry.py
M	swh/storage/tests/test_storage.py
Falling back to patching base and 3-way merge...
Auto-merging swh/storage/tests/test_storage.py
Auto-merging swh/storage/tests/test_retry.py
CONFLICT (content): Merge conflict in swh/storage/tests/test_retry.py
Auto-merging swh/storage/tests/storage_data.py
Auto-merging swh/storage/tests/conftest.py
Auto-merging swh/storage/storage.py
Auto-merging swh/storage/sql/60-swh-indexes.sql
Auto-merging swh/storage/sql/30-swh-schema.sql
CONFLICT (content): Merge conflict in swh/storage/sql/30-swh-schema.sql
Auto-merging swh/storage/in_memory.py
Auto-merging swh/storage/db.py
Auto-merging swh/storage/cassandra/storage.py
Auto-merging swh/storage/cassandra/schema.py
CONFLICT (content): Merge conflict in swh/storage/cassandra/schema.py
Auto-merging swh/storage/cassandra/cql.py
Auto-merging docs/extrinsic-metadata-specification.rst
Patch failed at 0001 Implement extrinsic origin metadata specification.

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto b0b767b91c...

Already up to date.
Changes applied before test
commit 5b7825b2f6506ece6f6b7fc5b582ea9cbd9f7ad3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/161/ for more details.

There's the migration.

And this:

-- migrate metadata_jsonb (a jsonb) to metadata (a bytea) with an external process

is because I don't think there is a postgresql function to do it.

convert_to(metadata_jsonb::text, 'utf-8'); should turn the jsonb field metadata_jsonb into a utf-8 encoded bytea version of this string.

Build is green

Patch application report for D2988 (id=11227)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit ab45de192802e92ec24845ac95c87464475a0b91
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/177/ for more details.

For the sake of the migration to amend 149.sql [1]

[1] P675

sql/upgrades/149.sql
74 ↗(On Diff #11227)

This currently fails on staging with latest 148.

Trying out olasd's proposition works:

update origin_metadata set metadata=convert_to(metadata_jsonb::text, 'utf-8');
96 ↗(On Diff #11227)

Currently failing with:

psql:149.sql:100: ERROR:  function name "swh_origin_metadata_get_by_origin" is not unique

As indeed, there are 2 functions matching this:

swh=> \df swh_origin_metadata_get_by_origin
                                             List of functions
 Schema |               Name                |        Result data type         | Argument data types | Type
--------+-----------------------------------+---------------------------------+---------------------+------
 public | swh_origin_metadata_get_by_origin | SETOF origin_metadata_signature | origin integer      | func
 public | swh_origin_metadata_get_by_origin | SETOF origin_metadata_signature | origin text         | func
(2 rows)

Same for the next one swh_origin_metadata_get_by_provider_type:

swh=> \df swh_origin_metadata_get_by_provider_type
                                                        List of functions
 Schema |                   Name                   |        Result data type         |         Argument data types         | Type
--------+------------------------------------------+---------------------------------+-------------------------------------+------
 public | swh_origin_metadata_get_by_provider_type | SETOF origin_metadata_signature | origin integer, type text           | func
 public | swh_origin_metadata_get_by_provider_type | SETOF origin_metadata_signature | origin_url text, provider_type text | func
(2 rows)

For the drop instruction to work, we need to provide the function signatures.

Assuming we want to drop all functions, that means:

drop function swh_origin_metadata_get_by_origin(text);
drop function swh_origin_metadata_get_by_provider_type(text, text);

drop function swh_origin_metadata_get_by_origin(int);
drop function swh_origin_metadata_get_by_provider_type(int, text);

Note: If we need to drop all functions as well, the func.sql file needs to be
updated with those other functions to be dropped.

This revision now requires changes to proceed.May 18 2020, 4:12 PM

apply @ardumont's changes to the migration.

Build is green

Patch application report for D2988 (id=11233)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit eb41db1ee9fc615d674d89c46c35518993293856
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/181/ for more details.

This revision is now accepted and ready to land.May 18 2020, 5:13 PM

fixup of the rebase: alter index metadata_provider_type_url to change its name, instead of creating metadata_authority_type_url which is a duplicate.

Build is green

Patch application report for D2988 (id=11236)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit e645e637095152070399f927f02b927d70e1d11c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/183/ for more details.