Page MenuHomeSoftware Heritage

Implement extrinsic origin metadata specification.
ClosedPublic

Authored by vlorentz on Apr 9 2020, 12:37 PM.

Details

Diff Detail

Repository
rDSTO Storage manager
Branch
impl-origin-metadata-spec
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11951
Build 18118: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 18117: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
moranegg added inline comments.Apr 21 2020, 6:21 PM
sql/upgrades/148.sql
93

I'm commenting here that the changes above should be all identified in swh-deposit before landing.
Because swh-deposit uses the origin_metadata table.

swh/storage/cassandra/cql.py
802

I'm not sure I remember correctly, so I'll ask @ardumont to join.
But the name metadata_fetcher is specific to extrinsic metadata, while the tool table is used for the indexers.

Would that be a change that will affect indexers?

I see that this is defined in a Cassandra file, for which I do not know the scope.

Could you do a test for a registry and a forge, for the same origin?
I can give you real life examples, if needed..

I can't test the diff now, but I read it and added some comments..

I could try testing it later on this week...

swh/storage/cassandra/schema.py
194

Why do we need authority_type in the PRIMARY KEY? isn't the url enough ?

198

every typo?

swh/storage/interface.py
1152

and?

1171

is the and at the end intentional?

1241

is it a fixed set?

swh/storage/tests/storage_data.py
330

above the type is only deposit

331

this looks like a wrong url (I don't know if this is a good or bad thing for a test)
and this could an example where name and url can change:
here hal is the authority but the hal-instance is hal-inria.
I'm not sure this is something that we can/want to differentiate .

ardumont added inline comments.Apr 22 2020, 2:12 PM
swh/storage/cassandra/cql.py
802

The tool table used by the indexer are in their dedicated storage db.
This tool table stayed for the deposit.
Only 1 entry is referenced in the archive table (the deposit).

vlorentz updated this revision to Diff 10832.Apr 23 2020, 10:41 AM
vlorentz marked 9 inline comments as done.

apply @moranegg's comments

sql/upgrades/148.sql
37

the name is not needed as an identifier, and everything else can go in the metadata if we need it, but I don't think we do.

swh/storage/cassandra/schema.py
194

I don't know if it's needed, but that's how we wrote the specification.

swh/storage/interface.py
1171

no, that's a typo

1241

According to the spec, yes. We can extend it later anyway.

swh/storage/tests/storage_data.py
330

indeed

331

It should be a valid URI in practice; but in tests it doesn't really matter. Changed to a more obvious url.

Build is green

Patch application report for D2988 (id=10832)

Rebasing onto bca643acab...

First, rewinding head to replay your work on top of it...
Applying: Add a wrapper to manage a sorted list.
Applying: Implement extrinsic origin metadata specification.
Changes applied before test
commit f12240bf1531751ab4af443906c8ba83d4e3b978
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

commit e9369b031669ae77ee627ee38ac12dda32b7a3b4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 24 17:22:09 2020 +0100

    Add a wrapper to manage a sorted list.
    
    For now this is only used by sorted_sha1, but we'll need it for
    origin_metadata soon.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/111/ for more details.

ardumont added inline comments.Apr 23 2020, 12:15 PM
sql/upgrades/148.sql
93

Yes, and now thanks to your review and remarks on D3045, there is even a test which checks those are fine.
So when this lands and get released, it should be noticed by the deposit loader ci if we forget about this \m/

vlorentz updated this revision to Diff 10843.Apr 23 2020, 12:50 PM
  • Rebase
  • Add test with a different authority

Build is green

Patch application report for D2988 (id=10843)

Rebasing onto bca643acab...

First, rewinding head to replay your work on top of it...
Applying: Add a wrapper to manage a sorted list.
Applying: Implement extrinsic origin metadata specification.
Changes applied before test
commit 73cef28bb09655e7f4e3bc5c7974f79d66162181
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

commit 0c38297e5b67c1f020d742797d3df82a7c7ce7d7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 24 17:22:09 2020 +0100

    Add a wrapper to manage a sorted list.
    
    For now this is only used by sorted_sha1, but we'll need it for
    origin_metadata soon.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/113/ for more details.

ardumont added inline comments.Apr 23 2020, 1:31 PM
swh/storage/cassandra/storage.py
1020

don't we have some conversion to do on that field?

(in metadata_fetcher_get below, you do a json.loads so ¯\_(ツ)_/¯)

vlorentz added inline comments.Apr 23 2020, 1:37 PM
swh/storage/cassandra/storage.py
1020

origin metadata are blobs (because we get them from random sources), metadata fetcher metadatas are jsonb (because we write them at SWH)

ardumont added inline comments.Apr 23 2020, 2:03 PM
swh/storage/in_memory.py
102

Missing a rebase on latest master

That's D2987, ain't it?

vlorentz marked an inline comment as done.Apr 23 2020, 2:06 PM

Build is green

Patch application report for D2988 (id=10847)

Rebasing onto fe56005555...

Current branch diff-target is up to date.
Changes applied before test
commit 0acfafc733995d858c5dc6e9c9fafffca49a5f52
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/116/ for more details.

ardumont added inline comments.Apr 23 2020, 2:23 PM
swh/storage/storage.py
1159

nitpick: you could move the check just after the get.
There is no point in fetching the fetcher if the authority_id is not resolved.

1213
row = db.metadata_fetcher_get(name, version, cur)
1231
row = db.metadata_authority_get(type, url, cur)
ardumont added inline comments.Apr 23 2020, 2:34 PM
swh/storage/tests/test_retry.py
551

can you please merge the string, that's black doing some joke on us?

ardumont accepted this revision.Apr 23 2020, 2:35 PM

Looks good to me.

Was a bit hard to read the all lot of it though ;)

This revision is now accepted and ready to land.Apr 23 2020, 2:35 PM

Don't forget about the missing cur though ;)

moranegg added a subscriber: zack.EditedApr 23 2020, 2:59 PM

@ardumont After this diff lands as is, the pattern in loader with the extrinsic field, needs to change terminology to authority and fetcher.
(instead of provider and tool, to stay coherent)

This is why I wanted to have a last chat with @zack regarding the alignment of the terminology.

Also @vlorentz, can we identify all the modules that are affected by this modification and terminology?

vlorentz updated this revision to Diff 10855.Apr 23 2020, 3:03 PM

Apply @ardumont's comments

@ardumont After this diff lands as is, the pattern in leader with extrinsic, needs to change terminology to authority and fetcher.

I did not grasp what you meant ;)
Can you please reformulate?

This is why I wanted to have a last chat with @zack regarding the alignment of the terminology.

sure thing.

Also @vlotentz, can we identify all the modules that are affected by this modification and terminology?

typo @vlorentz ^

Also @vlotentz, can we identify all the modules that are affected by this modification and terminology?

Only swh-loader-core.

Build is green

Patch application report for D2988 (id=10855)

Rebasing onto fe56005555...

Current branch diff-target is up to date.
Changes applied before test
commit 3dc9a2f52938ea20c4fd3b4078f7cb6fd1ac5c59
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/117/ for more details.

I have updated my first comment with maybe a more clear and without typos sentence.
Sorry about that.

I'm referring to the loader-tar and the new pattern of including extrinsic metadata to the revision.
Which needs to be coherent in terms of terminology.
So we need to discuss that.

ardumont added a comment.EditedApr 23 2020, 3:13 PM

@ardumont After this diff lands as is, the pattern in loader with the extrinsic field, needs to change terminology to authority and fetcher.
(instead of provider and tool, to stay coherent)

Yes, indeed. When this lands and the storage get tagged.
Thus why i mention earlier the loader-core's deposit test will notice it and fail at that point.
That will be our cue to migrate that calls to storage accordingly.

Which makes me realize, @vlorentz don't we need an extra sql statement to migrate the data?

Which makes me realize, @vlorentz don't we need an extra sql statement to migrate the data?

What data?

ardumont added a comment.EditedApr 24 2020, 10:23 AM

Which makes me realize, @vlorentz don't we need an extra sql statement to migrate the data?

What data?

The ones present in the table you are renaming, dropping columns, etc... ?
(i was talking about the pg storage heh ;)

There's the migration.

And this:

-- migrate metadata_jsonb (a jsonb) to metadata (a bytea) with an external process

is because I don't think there is a postgresql function to do it.

vlorentz updated this revision to Diff 11114.May 6 2020, 1:45 PM

debug migration script.

Build is green

Patch application report for D2988 (id=11114)

Rebasing onto b0b767b91c...

First, rewinding head to replay your work on top of it...
Applying: Implement extrinsic origin metadata specification.
Using index info to reconstruct a base tree...
M	docs/extrinsic-metadata-specification.rst
M	swh/storage/cassandra/cql.py
M	swh/storage/cassandra/schema.py
M	swh/storage/cassandra/storage.py
M	swh/storage/db.py
M	swh/storage/in_memory.py
M	swh/storage/sql/30-swh-schema.sql
M	swh/storage/sql/60-swh-indexes.sql
M	swh/storage/storage.py
M	swh/storage/tests/conftest.py
M	swh/storage/tests/storage_data.py
M	swh/storage/tests/test_retry.py
M	swh/storage/tests/test_storage.py
Falling back to patching base and 3-way merge...
Auto-merging swh/storage/tests/test_storage.py
Auto-merging swh/storage/tests/test_retry.py
CONFLICT (content): Merge conflict in swh/storage/tests/test_retry.py
Auto-merging swh/storage/tests/storage_data.py
Auto-merging swh/storage/tests/conftest.py
Auto-merging swh/storage/storage.py
Auto-merging swh/storage/sql/60-swh-indexes.sql
Auto-merging swh/storage/sql/30-swh-schema.sql
CONFLICT (content): Merge conflict in swh/storage/sql/30-swh-schema.sql
Auto-merging swh/storage/in_memory.py
Auto-merging swh/storage/db.py
Auto-merging swh/storage/cassandra/storage.py
Auto-merging swh/storage/cassandra/schema.py
CONFLICT (content): Merge conflict in swh/storage/cassandra/schema.py
Auto-merging swh/storage/cassandra/cql.py
Auto-merging docs/extrinsic-metadata-specification.rst
Patch failed at 0001 Implement extrinsic origin metadata specification.

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto b0b767b91c...

Already up to date.
Changes applied before test
commit 5b7825b2f6506ece6f6b7fc5b582ea9cbd9f7ad3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/161/ for more details.

olasd added a subscriber: olasd.May 15 2020, 10:08 AM

There's the migration.

And this:

-- migrate metadata_jsonb (a jsonb) to metadata (a bytea) with an external process

is because I don't think there is a postgresql function to do it.

convert_to(metadata_jsonb::text, 'utf-8'); should turn the jsonb field metadata_jsonb into a utf-8 encoded bytea version of this string.

vlorentz updated this revision to Diff 11227.May 18 2020, 3:38 PM

big rebase

Build is green

Patch application report for D2988 (id=11227)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit ab45de192802e92ec24845ac95c87464475a0b91
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/177/ for more details.

ardumont requested changes to this revision.May 18 2020, 4:12 PM

For the sake of the migration to amend 149.sql [1]

[1] P675

sql/upgrades/149.sql
74 ↗(On Diff #11227)

This currently fails on staging with latest 148.

Trying out olasd's proposition works:

update origin_metadata set metadata=convert_to(metadata_jsonb::text, 'utf-8');
96 ↗(On Diff #11227)

Currently failing with:

psql:149.sql:100: ERROR:  function name "swh_origin_metadata_get_by_origin" is not unique

As indeed, there are 2 functions matching this:

swh=> \df swh_origin_metadata_get_by_origin
                                             List of functions
 Schema |               Name                |        Result data type         | Argument data types | Type
--------+-----------------------------------+---------------------------------+---------------------+------
 public | swh_origin_metadata_get_by_origin | SETOF origin_metadata_signature | origin integer      | func
 public | swh_origin_metadata_get_by_origin | SETOF origin_metadata_signature | origin text         | func
(2 rows)

Same for the next one swh_origin_metadata_get_by_provider_type:

swh=> \df swh_origin_metadata_get_by_provider_type
                                                        List of functions
 Schema |                   Name                   |        Result data type         |         Argument data types         | Type
--------+------------------------------------------+---------------------------------+-------------------------------------+------
 public | swh_origin_metadata_get_by_provider_type | SETOF origin_metadata_signature | origin integer, type text           | func
 public | swh_origin_metadata_get_by_provider_type | SETOF origin_metadata_signature | origin_url text, provider_type text | func
(2 rows)

For the drop instruction to work, we need to provide the function signatures.

Assuming we want to drop all functions, that means:

drop function swh_origin_metadata_get_by_origin(text);
drop function swh_origin_metadata_get_by_provider_type(text, text);

drop function swh_origin_metadata_get_by_origin(int);
drop function swh_origin_metadata_get_by_provider_type(int, text);

Note: If we need to drop all functions as well, the func.sql file needs to be
updated with those other functions to be dropped.

This revision now requires changes to proceed.May 18 2020, 4:12 PM
vlorentz updated this revision to Diff 11233.May 18 2020, 4:40 PM

apply @ardumont's changes to the migration.

Build is green

Patch application report for D2988 (id=11233)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit eb41db1ee9fc615d674d89c46c35518993293856
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/181/ for more details.

ardumont accepted this revision.May 18 2020, 5:13 PM
This revision is now accepted and ready to land.May 18 2020, 5:13 PM
vlorentz updated this revision to Diff 11236.May 18 2020, 6:26 PM

fixup of the rebase: alter index metadata_provider_type_url to change its name, instead of creating metadata_authority_type_url which is a duplicate.

Build is green

Patch application report for D2988 (id=11236)

Rebasing onto 6d24ed721a...

Current branch diff-target is up to date.
Changes applied before test
commit e645e637095152070399f927f02b927d70e1d11c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Apr 9 11:46:21 2020 +0200

    Implement extrinsic origin metadata specification.
    
    https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/183/ for more details.

oh right, missed it ;)

ardumont accepted this revision.May 18 2020, 8:35 PM
This revision was automatically updated to reflect the committed changes.