Paste P168

Working on origin_metadata, external_metadata and revision_metadata tables

Authored by moranegg on Jul 11 2017, 11:59 AM.
-- Metadata discovered during a listing, loading, deposit or external_catalog of an origin.
-- Also provides a translation to a defined JSON schema, produced by a translation tool (indexer_configuration_id).
create table origin_metadata(
  id bigserial primary key, -- PK object identifier
  origin_id bigint not null references origin(id),
  date timestamptz not null,
  provenance text not null, -- TODO use an enum (?)
  raw_metadata jsonb not null,
  translated_metadata jsonb,
  indexer_configuration_id bigint -- tool used for the translation
);
-- NOTES:
-- translation_date is not needed because we wish to translate on the fly
-- having origin_metadata and origin_metadata_history tables is redundant and inefficient

Event Timeline

This is for tomorrow's discussion about the entity table vs the origin_metadata table.
I also added a draft of the external_metadata table so we can keep it in mind as well.

moranegg changed the title of this paste from Working on origin_metadata tabe to Working on origin_metadata, external_metadata and revision_metadata tables. Jul 12 2017, 11:40 AM
moranegg edited the content of this paste.

New day, new question:

Should we keep more than one translation of origin_metadata?
If so, we should split the table above into two tables:

  • origin_metadata
  • origin_metadata_translation
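
A minimal sketch of what that split could look like, assuming PostgreSQL and the column names from the paste above. The columns of the second table and its composite key are assumptions for illustration (the key follows the suggestion made in the discussion below), not an agreed schema. The reference to indexer_configuration(id) also assumes that table exposes an id primary key.

```sql
-- Sketch only: splits the single table from the paste into two.

-- Raw metadata, stored once per discovery event.
create table origin_metadata(
  id           bigserial primary key,
  origin_id    bigint not null references origin(id),
  date         timestamptz not null,
  provenance   text not null,
  raw_metadata jsonb not null
);

-- One row per (raw metadata entry, translation tool).
create table origin_metadata_translation(
  origin_metadata_id       bigint not null references origin_metadata(id),
  indexer_configuration_id bigint not null references indexer_configuration(id), -- assumed target
  translated_metadata      jsonb not null,
  primary key (origin_metadata_id, indexer_configuration_id)
);

-- Fetch the translations produced by one tool (1 is a hypothetical tool id).
select om.origin_id, t.translated_metadata
  from origin_metadata om
  join origin_metadata_translation t on t.origin_metadata_id = om.id
 where t.indexer_configuration_id = 1;
```

With this layout the raw metadata is never duplicated, and re-translating with another tool only adds a row to the second table.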

Here is the discussion on devel:

15:49:27  morane_ | keeping raw_metadata found with lister/loader/deposit/external_catalog
15:49:57  morane_ | and translating the raw_metadata into the CodeMeta vocabulary
15:50:37  morane_ | with a tool that we keep in indexer_configuration and reference by indexer_configuration_id
15:51:12  morane_ | should this id be part of the PK of the table?
15:51:23    olasd | no
15:52:09  morane_ | I thought so too, but i'm keeping it as PK in content_metadata and revision_metadata
15:52:25  morane_ | so i'm searching for the why not
15:53:22    olasd | considering your schema, what would be the point?
15:53:51    olasd | as far as I can tell you can have however many origin_metadata entries you want per origin
15:54:12  morane_ | yes right
15:54:18    olasd | which is not the case for content_metadata, which is keyed using the content_id
15:54:22    olasd | or should be in any case
15:54:50  morane_ | so the PK is the object's id (object= origin_metadata)
15:55:16  morane_ | but the raw_metadata it contains can be translated by different tools
15:55:35  morane_ | where all the rest of the information stays the same
15:56:43  morane_ | (easier to keep only object id as PK anyway, cause we can translate directly when captured or with a delay)
15:58:14    olasd | if the raw_metadata is the same and gets translated several times, then (considering we want to keep the data normalized) you should make an ancillary table for the translated metadata entries
15:58:49    olasd | and _that_ ancillary table can be keyed with the pair (origin_metadata_id, indexer_configuration_id)
15:58:55    olasd | does that make sense?
15:59:47  morane_ | it does, i thought of that but it seemed like adding another metadata table to the mix
16:00:11    olasd | well
16:00:33    olasd | do we really want to keep several different translations for the same raw metadata
16:00:40    olasd | that's the main question, I believe
16:04:58  morane_ | i think we don't, because the most recent translation should be the most accurate translation
16:05:11  morane_ | so we shouldn't break into 2 tables
16:05:52  morane_ | and just update the tool used when updating translation (but this is against keeping everything at any time)
16:06:35  morane_ | on the other hand we can reproduce the same result with an older version for example... ahhhh i don't know
16:09:00 ardumont | > because the most recent translation should be the most accurate: why? how do you determine what's the most accurate?
16:11:00  morane_ | haha you are right
16:12:23  morane_ | i imagine that with a new tool or newer version we improve the translation
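
For comparison, the single-table option raised in the discussion (keep one translation and update the tool reference when re-translating) could be sketched as follows, assuming PostgreSQL and the origin_metadata table from the paste. The ids and the JSON payload are placeholders, not real data.

```sql
-- Sketch only: re-translate one raw_metadata entry in place.
update origin_metadata
   set translated_metadata      = '{"name": "example"}'::jsonb, -- placeholder output of the new tool
       indexer_configuration_id = 2                              -- hypothetical id of the new tool
 where id = 42;                                                   -- hypothetical origin_metadata row
```

This overwrites the previous translation, so the output of the older tool is no longer stored and can only be recovered by running that tool again on raw_metadata. That is the trade-off the discussion leaves open.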