Page MenuHomeSoftware Heritage

Added content_metadata logic to the storage
ClosedPublic

Authored by moranegg on Jun 27 2017, 4:24 PM.

Details

Summary

Keeping the indexed content_metadata in the storage with a content_metadata table.
Possibility to add with and without duplications and getting the content_metadata
with tool information.
works only for NPM context also because tool configuration.
References T733

TODO:

  • add json schema in json/ (should we keep CodeMeta schema as is or put PURl to it for the translated_metadata_property?)
Test Plan
  • content_metadata_missing
  • content_metadata_add skipping duplicate
  • content_metadata_add dropping duplicate
  • content_metadata_get
  • db_to_metadata conveter

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Added content_metadata logic to the storage

Summary:
- added tables content_metadata, origin_metadata and origin_metadata_history to db schema
- added constraints for content_metadata in indexes.sql
- added functions in swh-func.sql, db.py, storage.py
(references T733 items 1-4)

TODO:
- write constraints for origin_metadata and origin_metadata_history
- add tests
- add tool data in swh-data.sql
- add json schema in json/
moranegg retitled this revision from Added tables content_metadata, origin_metadata and origin_metadata_history to db schema to Added content_metadata logic to the storage.Jun 29 2017, 3:08 PM
moranegg edited the summary of this revision. (Show Details)
moranegg edited reviewers, added: ardumont; removed: Reviewers.
moranegg edited the summary of this revision. (Show Details)

Updating D219: Added content_metadata logic to the storage with tests

moranegg edited the test plan for this revision. (Show Details)
moranegg removed a reviewer: Reviewers.

Except for tidbits (cf. comments inline), this sounds good.

I would prefer that we remove the origin_metadata* schema alteration for another diff though and keep this diff indexer related.

sql/swh-data.sql
879

Please, remove unneeded comments.

sql/swh-schema.sql
515

Can we keep this diff only indexer related?

Personally, I would delay the origin_metadata* table insertion up until we need it (this is not used here :).

swh/storage/storage.py
1609

That's the conversion i told you about that i think is redundant.
Even if metadatas holds more keys than needed, you will only copy the ones mentioned line 1612 below.

As now you have tests, you can remove the conversion and check that tests are still green :)

swh/storage/tests/test_storage.py
3165

metadata did change... :)

moranegg marked 3 inline comments as done.
moranegg edited the summary of this revision. (Show Details)
moranegg edited the test plan for this revision. (Show Details)

update D219 to keep only content_metadata modifications

sql/swh-data.sql
879

it's commented because I don't know how to insert a new tool with same name and version with different config, should I name the tools differently ?

sql/swh-schema.sql
515

yes sure, i wrote it because i wanted comments on the table's design, but it's better to have a new diff for this

swh/storage/storage.py
1609

so if i erase the entire db.copy_to or if i write metadatas as is in the function, the tests don't pass..
but to tell you the truth i'm not sure how to write it differently

swh/storage/tests/test_storage.py
3165

lol :-D

sql/swh-data.sql
879

You should not have any issue with that anymore. I fixed that :)
Simply specify the same name, same version and different configuration as another value.

If you want to be convinced, take a look line 873 and 876.
Those are the same tool with different configuration.

swh/storage/storage.py
1609

so if i erase the entire db.copy_to or if i write metadatas as is in the function, the tests don't pass..

Yes, that could not work.
The copy_to needs to stay, that copies the data we want to insert in a temporary table with sensibly the same schema as the original table. And then the call to db.content_metadata_add_from_temp use that temporary table and its content to insert in the original table.

I was talking about the conversion done within that call (the generator call within parenthesis) can be skipped.
This would give:

db.copy_to(metadatas, 'tmp_content_metadata', ['id', 'translated_metadata', 'indexer_configuration_id'], cur)
moranegg marked 2 inline comments as done.

Updating D219: Added content_metadata logic to the storage

sql/swh-data.sql
879

i'm convinced :-)

but the maven tool replaced the npm tool when i inserted the data, so i had doubts
anyway maven is still not ready at the indexer end, so i'll come back to this when it's ready

swh/storage/storage.py
1609

so strange,
i tried metadatas as is yesterday and it didn't pass the test
now it does.. HOOORAY

This revision is now accepted and ready to land.Jul 12 2017, 6:07 PM
This revision was automatically updated to reflect the committed changes.