Page MenuHomeSoftware Heritage

Add the origin intrinsic metadata storage database.
ClosedPublic

Authored by vlorentz on Oct 22 2018, 10:41 AM.

Details

Summary

Add tests for adding duplicates.

This Diff partially replaces D537.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-metadata-storage
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1711
Build 2057: arc lint + arc unit

Event Timeline

  • Remove duplicate functions.
zack requested changes to this revision.Oct 22 2018, 11:13 AM
zack added a subscriber: zack.
zack added inline comments.
swh/indexer/sql/30-swh-schema.sql
136

the grammar of this comment is weird, perhaps something is missing?

Also, we don't care they are translated metadata, that notion should just disappear from the DB and its documentation :-))
Other comments following this one has the same issue.

swh/indexer/sql/40-swh-func.sql
350

Do we really need this? The temporary table + COPY trick is something we use in heavy-duty tables, and in particular Merkle DAG nodes, because we add massive amounts of data there, which are potentially duplicated a lot across parallel workers. Metadata are much less heavy-duty than that, and I suppose that simple inserts (with proper ON CONFLICT handling) would be enough. Or am I missing something?

OTOH if it makes the coder more consistent with everything else, I'm not strongly opposed to having this. Just keep in mind that all these temporary tables will make it harder to migrate to a non-SQL implementation of the data model the day we come to that.

swh/indexer/storage/__init__.py
533

this docstring should probably mention metadata somewhere

541

is this datatype incomplete or still undecided upon?

fwiw, I think it should be a dict, to keep the Python API as Pythonic as possible. The implementation should then translate it to whatever is wanted by the DB

548

empty metadata should be mapped to an empty dictionary/json, no?

This revision now requires changes to proceed.Oct 22 2018, 11:13 AM
swh/indexer/sql/30-swh-schema.sql
130

the comment here is not correct.
You should delete it or change it to the PK you are using.
In this diff the PK is <origin_id, tool_id> if you are sticking with that

swh/indexer/sql/40-swh-func.sql
423

Question about update politics:
when conflict update is true, you update origin no matter what
when conflict_update is false you don't update if there is an origin indexed with tool x
but if the tool is different it creates a new entry?

swh/indexer/sql/60-swh-indexes.sql
60

The PK is <origin_id, tool_id> in this diff.
Is that what you decided to go with?
I thought it was only origin_id, but maybe it's not a good choice.

swh/indexer/tests/storage/test_storage.py
1331

type is no longer in the minimal set.
check all type properties in tests

vlorentz marked 7 inline comments as done.
  • Apply @zacc's comments.
swh/indexer/sql/40-swh-func.sql
350

I copied what was done for revision metadata. I discussed that issue with @ardumont and he agrees this should be removed. Once this Diff is accepted, I'll create a new one to remove temporary tables.

swh/indexer/storage/__init__.py
533

What do you mean?

541

Copy-paste from another method. Fixed.

548

Copy-paste from another method, it actually does not make sense here. Comment removed.

swh/indexer/sql/40-swh-func.sql
350

Yes, it's consistent with how we add stuff in the indexer storage.
We can diverge from this though.

swh/indexer/storage/converters.py
126

For my information, is this more pythonic to do it that way instead of the previous one?

swh/indexer/storage/db.py
92

Adding a docstring here now would be a sensible thing to do.

swh/indexer/tests/storage/test_storage.py
1510

v2 is not dropped here, i think the comment needs some update (or drop it? ;)

1520

I'm not sure why you revision_metadata_add in all tests since you don't check back what you inserted.
I might be needed a second pass ;)

vlorentz marked 4 inline comments as done.
swh/indexer/sql/40-swh-func.sql
423

but if the tool is different it creates a new entry?

Yes, that's because the indexer_configuration_id is part of the PK.

swh/indexer/sql/60-swh-indexes.sql
60

Is that what you decided to go with?

Yes

I thought it was only origin_id, but maybe it's not a good choice.

We didn't actually decide upon it. I figured there's no harm in making it into the PK, in case we want to progressively switch to a new tool with a different format someday.

vlorentz marked 3 inline comments as done.
swh/indexer/storage/converters.py
126

I changed the semantic to keep all keys that are not explicitly handled, eg. metadata vs translated_metadata.

swh/indexer/tests/storage/test_storage.py
1520

Required because of the FK.

swh/indexer/sql/40-swh-func.sql
350

Well, i missed to check back the old context prior to saying anything.
We no longer use temporary tables for reading (to avoid issues with replication on backend used for reading for the webapp).
But for writing, we still use the pattern.

If we agree it's enough to use simpler patterns here (at least for revision), we can indeed change that.
The tests are already drafted so that should not be much of a change ;)

423

Clarifying the initial attempt:
conflict_update to true means update on conflict (in effect, overwrite the old value with the new one)
conflict_update to false means do nothing (in effect, drop the new value).

And a conflict is raised only when an entry already exists for the same object (here origin) and the same tool (up to its version).

Now, answering you ;)

when conflict update is true, you update origin no matter what

only if a conflict exist, yes.
If no conflict, we add the new entry.

when conflict_update is false you don't update if there is an origin indexed with tool x

yes

but if the tool is different it creates a new entry

yes

swh/indexer/sql/60-swh-indexes.sql
60

If you remove that key pair, we can no longer have multiple values for 1 origin (1 value per origin and tool).
It's mostly the same for the other tables (cf. content_mimetype for example).

swh/indexer/storage/__init__.py
533

you're not adding "generic metadata" to storage, you're adding "*origin* metadata" to it

541

still, the type of "metadata" argument is not jsonb, is it? (as that is not a python data type)
I guess it's a dict instead?

vlorentz added inline comments.
swh/indexer/storage/__init__.py
541

Yes, indeed.

vlorentz marked an inline comment as done.
  • Fix doc.
swh/indexer/sql/60-swh-indexes.sql
60

Yes but for the origin_intrinsic_metadata the question is more delicate because the tool referenced and used is the one used with the RevisionIndexer
for the first brainstorming we had, an origin had only one set of intrinsic metadata calculated on the last seen revision on HEAD branch with the most recent tool used
@vlorentz has now changed this to have <origin_id, tool_id> as PK

swh/indexer/storage/__init__.py
533

i would add that it is intrinsic metadata- metadata that was detected and retrieved from the content itself
origin metadata as noted somewhere before is metadata found at the origin while listing/loading/depositing/etc..

swh/indexer/sql/60-swh-indexes.sql
60

We didn't actually decide upon it.

You may not have discussed it but that was part of the initial design (or at least, it came to be at some point ;)

I figured there's no harm in making it into the PK

Actually, you must, that's part of the indexer stack ;)

in case we want to progressively switch to a new tool with a different format someday.

Indeed, that's an example.
I'm not sure we can .oO(~> yes, i think but on different machines).
But we could want to run multiple instances of the same indexer with different configuration (different tool).

This revision is now accepted and ready to land.Oct 23 2018, 3:34 PM
This revision was automatically updated to reflect the committed changes.