Page MenuHomeSoftware Heritage

Add the origin intrinsic metadata storage database.
ClosedPublic

Authored by vlorentz on Oct 22 2018, 10:41 AM.

Details

Summary

Add tests for adding duplicates.

This Diff partially replaces D537.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-metadata-storage
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1715
Build 2061: arc lint + arc unit

Event Timeline

  • Remove duplicate functions.
zack requested changes to this revision.Oct 22 2018, 11:13 AM
zack added a subscriber: zack.
zack added inline comments.
swh/indexer/sql/30-swh-schema.sql
136–140

the grammar of this comment is weird, perhaps something is missing?

Also, we don't care they are translated metadata, that notion should just disappear from the DB and its documentation :-))
Other comments following this one has the same issue.

swh/indexer/sql/40-swh-func.sql
349

Do we really need this? The temporary table + COPY trick is something we use in heavy-duty tables, and in particular Merkle DAG nodes, because we add massive amounts of data there, which are potentially duplicated a lot across parallel workers. Metadata are much less heavy-duty than that, and I suppose that simple inserts (with proper ON CONFLICT handling) would be enough. Or am I missing something?

OTOH if it makes the coder more consistent with everything else, I'm not strongly opposed to having this. Just keep in mind that all these temporary tables will make it harder to migrate to a non-SQL implementation of the data model the day we come to that.

swh/indexer/storage/__init__.py
530

this docstring should probably mention metadata somewhere

538

is this datatype incomplete or still undecided upon?

fwiw, I think it should be a dict, to keep the Python API as Pythonic as possible. The implementation should then translate it to whatever is wanted by the DB

545

empty metadata should be mapped to an empty dictionary/json, no?

This revision now requires changes to proceed.Oct 22 2018, 11:13 AM
swh/indexer/sql/30-swh-schema.sql
129

the comment here is not correct.
You should delete it or change it to the PK you are using.
In this diff the PK is <origin_id, tool_id> if you are sticking with that

swh/indexer/sql/40-swh-func.sql
422

Question about update politics:
when conflict update is true, you update origin no matter what
when conflict_update is false you don't update if there is an origin indexed with tool x
but if the tool is different it creates a new entry?

swh/indexer/sql/60-swh-indexes.sql
59

The PK is <origin_id, tool_id> in this diff.
Is that what you decided to go with?
I thought it was only origin_id, but maybe it's not a good choice.

swh/indexer/tests/storage/test_storage.py
1327

type is no longer in the minimal set.
check all type properties in tests

vlorentz marked 7 inline comments as done.
  • Apply @zacc's comments.
swh/indexer/sql/40-swh-func.sql
349

I copied what was done for revision metadata. I discussed that issue with @ardumont and he agrees this should be removed. Once this Diff is accepted, I'll create a new one to remove temporary tables.

swh/indexer/storage/__init__.py
530

What do you mean?

538

Copy-paste from another method. Fixed.

545

Copy-paste from another method, it actually does not make sense here. Comment removed.

swh/indexer/sql/40-swh-func.sql
349

Yes, it's consistent with how we add stuff in the indexer storage.
We can diverge from this though.

swh/indexer/storage/converters.py
125

For my information, is this more pythonic to do it that way instead of the previous one?

swh/indexer/storage/db.py
92

Adding a docstring here now would be a sensible thing to do.

swh/indexer/tests/storage/test_storage.py
1505

v2 is not dropped here, i think the comment needs some update (or drop it? ;)

1515

I'm not sure why you revision_metadata_add in all tests since you don't check back what you inserted.
I might be needed a second pass ;)

vlorentz marked 4 inline comments as done.
swh/indexer/sql/40-swh-func.sql
422

but if the tool is different it creates a new entry?

Yes, that's because the indexer_configuration_id is part of the PK.

swh/indexer/sql/60-swh-indexes.sql
59

Is that what you decided to go with?

Yes

I thought it was only origin_id, but maybe it's not a good choice.

We didn't actually decide upon it. I figured there's no harm in making it into the PK, in case we want to progressively switch to a new tool with a different format someday.

vlorentz marked 3 inline comments as done.
swh/indexer/storage/converters.py
125

I changed the semantic to keep all keys that are not explicitly handled, eg. metadata vs translated_metadata.

swh/indexer/tests/storage/test_storage.py
1515

Required because of the FK.

swh/indexer/sql/40-swh-func.sql
349

Well, i missed to check back the old context prior to saying anything.
We no longer use temporary tables for reading (to avoid issues with replication on backend used for reading for the webapp).
But for writing, we still use the pattern.

If we agree it's enough to use simpler patterns here (at least for revision), we can indeed change that.
The tests are already drafted so that should not be much of a change ;)

422

Clarifying the initial attempt:
conflict_update to true means update on conflict (in effect, overwrite the old value with the new one)
conflict_update to false means do nothing (in effect, drop the new value).

And a conflict is raised only when an entry already exists for the same object (here origin) and the same tool (up to its version).

Now, answering you ;)

when conflict update is true, you update origin no matter what

only if a conflict exist, yes.
If no conflict, we add the new entry.

when conflict_update is false you don't update if there is an origin indexed with tool x

yes

but if the tool is different it creates a new entry

yes

swh/indexer/sql/60-swh-indexes.sql
59

If you remove that key pair, we can no longer have multiple values for 1 origin (1 value per origin and tool).
It's mostly the same for the other tables (cf. content_mimetype for example).

swh/indexer/storage/__init__.py
530

you're not adding "generic metadata" to storage, you're adding "*origin* metadata" to it

538

still, the type of "metadata" argument is not jsonb, is it? (as that is not a python data type)
I guess it's a dict instead?

vlorentz added inline comments.
swh/indexer/storage/__init__.py
538

Yes, indeed.

vlorentz marked an inline comment as done.
  • Fix doc.
swh/indexer/sql/60-swh-indexes.sql
59

Yes but for the origin_intrinsic_metadata the question is more delicate because the tool referenced and used is the one used with the RevisionIndexer
for the first brainstorming we had, an origin had only one set of intrinsic metadata calculated on the last seen revision on HEAD branch with the most recent tool used
@vlorentz has now changed this to have <origin_id, tool_id> as PK

swh/indexer/storage/__init__.py
530

i would add that it is intrinsic metadata- metadata that was detected and retrieved from the content itself
origin metadata as noted somewhere before is metadata found at the origin while listing/loading/depositing/etc..

swh/indexer/sql/60-swh-indexes.sql
59

We didn't actually decide upon it.

You may not have discussed it but that was part of the initial design (or at least, it came to be at some point ;)

I figured there's no harm in making it into the PK

Actually, you must, that's part of the indexer stack ;)

in case we want to progressively switch to a new tool with a different format someday.

Indeed, that's an example.
I'm not sure we can .oO(~> yes, i think but on different machines).
But we could want to run multiple instances of the same indexer with different configuration (different tool).

This revision is now accepted and ready to land.Oct 23 2018, 3:34 PM
This revision was automatically updated to reflect the committed changes.