the grammar of this comment is weird, perhaps something is missing?
Also, we don't care they are translated metadata, that notion should just disappear from the DB and its documentation :-))
Other comments following this one has the same issue.
Do we really need this? The temporary table + COPY trick is something we use in heavy-duty tables, and in particular Merkle DAG nodes, because we add massive amounts of data there, which are potentially duplicated a lot across parallel workers. Metadata are much less heavy-duty than that, and I suppose that simple inserts (with proper ON CONFLICT handling) would be enough. Or am I missing something?
OTOH if it makes the coder more consistent with everything else, I'm not strongly opposed to having this. Just keep in mind that all these temporary tables will make it harder to migrate to a non-SQL implementation of the data model the day we come to that.
is this datatype incomplete or still undecided upon?
fwiw, I think it should be a dict, to keep the Python API as Pythonic as possible. The implementation should then translate it to whatever is wanted by the DB
the comment here is not correct.
You should delete it or change it to the PK you are using.
In this diff the PK is <origin_id, tool_id> if you are sticking with that
Question about update politics:
when conflict update is true, you update origin no matter what
when conflict_update is false you don't update if there is an origin indexed with tool x
but if the tool is different it creates a new entry?
I copied what was done for revision metadata. I discussed that issue with @ardumont and he agrees this should be removed. Once this Diff is accepted, I'll create a new one to remove temporary tables.
I thought it was only origin_id, but maybe it's not a good choice.
We didn't actually decide upon it. I figured there's no harm in making it into the PK, in case we want to progressively switch to a new tool with a different format someday.
Well, i missed to check back the old context prior to saying anything.
We no longer use temporary tables for reading (to avoid issues with replication on backend used for reading for the webapp).
But for writing, we still use the pattern.
If we agree it's enough to use simpler patterns here (at least for revision), we can indeed change that.
The tests are already drafted so that should not be much of a change ;)
Clarifying the initial attempt: conflict_update to true means update on conflict (in effect, overwrite the old value with the new one) conflict_update to false means do nothing (in effect, drop the new value).
And a conflict is raised only when an entry already exists for the same object (here origin) and the same tool (up to its version).
Now, answering you ;)
when conflict update is true, you update origin no matter what
only if a conflict exist, yes.
If no conflict, we add the new entry.
when conflict_update is false you don't update if there is an origin indexed with tool x
yes
but if the tool is different it creates a new entry
If you remove that key pair, we can no longer have multiple values for 1 origin (1 value per origin and tool).
It's mostly the same for the other tables (cf. content_mimetype for example).
Yes but for the origin_intrinsic_metadata the question is more delicate because the tool referenced and used is the one used with the RevisionIndexer
for the first brainstorming we had, an origin had only one set of intrinsic metadata calculated on the last seen revision on HEAD branch with the most recent tool used @vlorentz has now changed this to have <origin_id, tool_id> as PK
i would add that it is intrinsic metadata- metadata that was detected and retrieved from the content itself
origin metadata as noted somewhere before is metadata found at the origin while listing/loading/depositing/etc..
You may not have discussed it but that was part of the initial design (or at least, it came to be at some point ;)
I figured there's no harm in making it into the PK
Actually, you must, that's part of the indexer stack ;)
in case we want to progressively switch to a new tool with a different format someday.
Indeed, that's an example.
I'm not sure we can .oO(~> yes, i think but on different machines).
But we could want to run multiple instances of the same indexer with different configuration (different tool).