⚙ D557 Add the origin intrinsic metadata storage database.

vlorentz created this revision.Oct 22 2018, 10:41 AM

Herald added a reviewer: Reviewers. · View Herald TranscriptOct 22 2018, 10:41 AM

Harbormaster completed remote builds in B1710: Diff 1712.Oct 22 2018, 10:41 AM

vlorentz added reviewers: ardumont, moranegg.Oct 22 2018, 10:42 AM

Remove duplicate functions.

Harbormaster completed remote builds in B1711: Diff 1713.Oct 22 2018, 10:44 AM

zack requested changes to this revision.Oct 22 2018, 11:13 AM

zack added a subscriber: zack.

zack added inline comments.

swh/indexer/sql/30-swh-schema.sql
136–140	the grammar of this comment is weird, perhaps something is missing? Also, we don't care they are translated metadata, that notion should just disappear from the DB and its documentation :-)) Other comments following this one has the same issue.
swh/indexer/sql/40-swh-func.sql
349	Do we really need this? The temporary table + COPY trick is something we use in heavy-duty tables, and in particular Merkle DAG nodes, because we add massive amounts of data there, which are potentially duplicated a lot across parallel workers. Metadata are much less heavy-duty than that, and I suppose that simple inserts (with proper ON CONFLICT handling) would be enough. Or am I missing something? OTOH if it makes the coder more consistent with everything else, I'm not strongly opposed to having this. Just keep in mind that all these temporary tables will make it harder to migrate to a non-SQL implementation of the data model the day we come to that.
swh/indexer/storage/__init__.py
530	this docstring should probably mention metadata somewhere
538	is this datatype incomplete or still undecided upon? fwiw, I think it should be a dict, to keep the Python API as Pythonic as possible. The implementation should then translate it to whatever is wanted by the DB
545	empty metadata should be mapped to an empty dictionary/json, no?

This revision now requires changes to proceed.Oct 22 2018, 11:13 AM

vlorentz mentioned this in D537: Origin metadata pipeline..Oct 22 2018, 11:43 AM

moranegg added inline comments.Oct 22 2018, 11:48 AM

swh/indexer/sql/30-swh-schema.sql
129	the comment here is not correct. You should delete it or change it to the PK you are using. In this diff the PK is <origin_id, tool_id> if you are sticking with that
swh/indexer/sql/40-swh-func.sql
422	Question about update politics: when `conflict update` is true, you update origin no matter what when `conflict_update` is false you don't update if there is an origin indexed with tool x but if the tool is different it creates a new entry?
swh/indexer/sql/60-swh-indexes.sql
59	The PK is <origin_id, tool_id> in this diff. Is that what you decided to go with? I thought it was only origin_id, but maybe it's not a good choice.
swh/indexer/tests/storage/test_storage.py
1326	type is no longer in the minimal set. check all `type` properties in tests

Apply @zacc's comments.

Harbormaster completed remote builds in B1714: Diff 1716.Oct 22 2018, 11:51 AM

vlorentz added inline comments.Oct 22 2018, 11:51 AM

swh/indexer/sql/40-swh-func.sql
349	I copied what was done for revision metadata. I discussed that issue with @ardumont and he agrees this should be removed. Once this Diff is accepted, I'll create a new one to remove temporary tables.
swh/indexer/storage/__init__.py
530	What do you mean?
538	Copy-paste from another method. Fixed.
545	Copy-paste from another method, it actually does not make sense here. Comment removed.

ardumont added inline comments.Oct 22 2018, 11:56 AM

swh/indexer/sql/40-swh-func.sql
349	Yes, it's consistent with how we add stuff in the indexer storage. We can diverge from this though.
swh/indexer/storage/converters.py
125	For my information, is this more pythonic to do it that way instead of the previous one?
swh/indexer/storage/db.py
92	Adding a docstring here now would be a sensible thing to do.
swh/indexer/tests/storage/test_storage.py
1504	v2 is not dropped here, i think the comment needs some update (or drop it? ;)
1514	I'm not sure why you `revision_metadata_add` in all tests since you don't check back what you inserted. I might be needed a second pass ;)

Apply @moranegg's comments.

Harbormaster completed remote builds in B1715: Diff 1717.Oct 22 2018, 11:56 AM

vlorentz added inline comments.Oct 22 2018, 11:56 AM

swh/indexer/sql/40-swh-func.sql
422	but if the tool is different it creates a new entry? Yes, that's because the `indexer_configuration_id` is part of the PK.
swh/indexer/sql/60-swh-indexes.sql
59	Is that what you decided to go with? Yes I thought it was only origin_id, but maybe it's not a good choice. We didn't actually decide upon it. I figured there's no harm in making it into the PK, in case we want to progressively switch to a new tool with a different format someday.

Apply @ardumont's comment.

Harbormaster completed remote builds in B1716: Diff 1718.Oct 22 2018, 12:02 PM

vlorentz added inline comments.Oct 22 2018, 12:06 PM

swh/indexer/storage/converters.py
125	I changed the semantic to keep all keys that are not explicitly handled, eg. `metadata` vs `translated_metadata`.
swh/indexer/tests/storage/test_storage.py
1514	Required because of the FK.

ardumont added inline comments.Oct 22 2018, 12:13 PM

swh/indexer/sql/40-swh-func.sql
349	Well, i missed to check back the old context prior to saying anything. We no longer use temporary tables for reading (to avoid issues with replication on backend used for reading for the webapp). But for writing, we still use the pattern. If we agree it's enough to use simpler patterns here (at least for revision), we can indeed change that. The tests are already drafted so that should not be much of a change ;)
422	Clarifying the initial attempt: `conflict_update` to true means update on conflict (in effect, overwrite the old value with the new one) `conflict_update` to false means do nothing (in effect, drop the new value). And a conflict is raised only when an entry already exists for the same object (here origin) and the same tool (up to its version). Now, answering you ;) when conflict update is true, you update origin no matter what only if a conflict exist, yes. If no conflict, we add the new entry. when conflict_update is false you don't update if there is an origin indexed with tool x yes but if the tool is different it creates a new entry yes
swh/indexer/sql/60-swh-indexes.sql
59	If you remove that key pair, we can no longer have multiple values for 1 origin (1 value per origin and tool). It's mostly the same for the other tables (cf. content_mimetype for example).

zack added inline comments.Oct 22 2018, 12:45 PM

swh/indexer/storage/__init__.py
530	you're not adding "generic metadata" to storage, you're adding "origin metadata" to it
538	still, the type of "metadata" argument is not jsonb, is it? (as that is not a python data type) I guess it's a dict instead?

vlorentz marked 3 inline comments as done.Oct 22 2018, 1:22 PM

vlorentz added inline comments.

swh/indexer/storage/__init__.py
538	Yes, indeed.

Fix doc.

Harbormaster completed remote builds in B1717: Diff 1719.Oct 22 2018, 1:22 PM

moranegg added inline comments.Oct 22 2018, 2:31 PM

swh/indexer/sql/60-swh-indexes.sql
59	Yes but for the origin_intrinsic_metadata the question is more delicate because the tool referenced and used is the one used with the RevisionIndexer for the first brainstorming we had, an origin had only one set of intrinsic metadata calculated on the last seen revision on HEAD branch with the most recent tool used @vlorentz has now changed this to have <origin_id, tool_id> as PK
swh/indexer/storage/__init__.py
530	i would add that it is intrinsic metadata- metadata that was detected and retrieved from the content itself origin metadata as noted somewhere before is metadata found at the origin while listing/loading/depositing/etc..

ardumont added inline comments.Oct 22 2018, 2:36 PM

swh/indexer/sql/60-swh-indexes.sql
59	We didn't actually decide upon it. You may not have discussed it but that was part of the initial design (or at least, it came to be at some point ;) I figured there's no harm in making it into the PK Actually, you must, that's part of the indexer stack ;) in case we want to progressively switch to a new tool with a different format someday. Indeed, that's an example. I'm not sure we can .oO(~> yes, i think but on different machines). But we could want to run multiple instances of the same indexer with different configuration (different tool).