
Add an overview of the metadata workflow
Closed, Public

Authored by vlorentz on Sep 7 2021, 1:46 PM.

Diff Detail

Repository
rDDOC Development documentation
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert added a subscriber: anlambert.
anlambert added inline comments.
docs/architecture/metadata.rst
9

Shouldn't it be plural here? "These metadata are partitioned ..."

This revision is now accepted and ready to land. (Sep 7 2021, 3:04 PM)
docs/architecture/metadata.rst
9

"(meta)data" is uncountable, so it's usually singular.

jayeshv added inline comments.
docs/architecture/metadata.rst
48

Does an indexer work only on stored/archived metadata? Couldn't we have an indexer that stores the number of forks or stars of a GitHub repo?

docs/architecture/metadata.rst
48

No; indexers are not supposed to access the internet. Instead, we'll need to add some sort of loader that fetches and stores API responses, and indexers would then read the stored responses.
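
A minimal sketch of the separation described in this reply, assuming an in-memory store. `ResponseStore`, `load_repo_stats`, and `index_repo_stats` are hypothetical names chosen for illustration, not part of any Software Heritage API, and the fetch step is injected as a function so the example needs no network access:

```python
import json
from typing import Callable, Dict


class ResponseStore:
    """Hypothetical in-memory stand-in for archived API responses."""

    def __init__(self) -> None:
        self._raw: Dict[str, bytes] = {}

    def put(self, origin_url: str, raw: bytes) -> None:
        self._raw[origin_url] = raw

    def get(self, origin_url: str) -> bytes:
        return self._raw[origin_url]


def load_repo_stats(
    store: ResponseStore, origin_url: str, fetch: Callable[[str], bytes]
) -> None:
    """Loader side: the only component allowed to reach external services."""
    store.put(origin_url, fetch(origin_url))


def index_repo_stats(store: ResponseStore, origin_url: str) -> Dict[str, int]:
    """Indexer side: reads only the stored response, never the network."""
    doc = json.loads(store.get(origin_url))
    return {"stars": doc["stargazers_count"], "forks": doc["forks_count"]}


def fake_fetch(url: str) -> bytes:
    # Assumed payload shape for the example; a real loader would call the forge API.
    return json.dumps({"stargazers_count": 12, "forks_count": 3}).encode()


store = ResponseStore()
load_repo_stats(store, "https://example.org/repo", fake_fetch)
stats = index_repo_stats(store, "https://example.org/repo")
```

Because the indexer depends only on the store, it could be re-run later on the same archived response without contacting the forge again.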

Good idea to add this documentation!
I have added a few comments.

docs/architecture/metadata.rst
7

Intrinsic metadata is part of the code.
I think we should focus on what metadata is, not on what it is not.

|swh| calls "metadata" the information it collects and extracts that describes, and provides additional information on, the source code itself.
13

Swap "collected" and "extracted".

I propose reserving "collect" for the action of gathering from external resources.

17

Here I would suggest not defining this with a negative statement:

:term:`extrinsic metadata`, which is collected or deposited from external sources.
71

I would put this section before the metadata mining section, as an introduction to it.

74

Delete "as we saw above".

75

The raw metadata is the authentic piece of metadata while the indexed metadata is a processed version, where the raw metadata is translated to a uniform vocabulary.

Both intrinsic and extrinsic metadata can be indexed and translated.

77

Drop "and is not bug free".

79

It's not only because of bugs: we keep both because the information differs (for example, we don't translate all properties).
We might also choose a different vocabulary in the future; we weren't sure CodeMeta was the best option in the first place.
The choice is about overall robustness, not just about dealing with bugs.

docs/architecture/metadata.rst
77

why?

moranegg added inline comments.
docs/architecture/metadata.rst
77

Because of the explanation in the next comment.
Nothing is bug free :-)
You can keep it, but this is not the main reason for keeping raw metadata.

This revision now requires changes to proceed. (Sep 8 2021, 10:13 AM)
docs/architecture/metadata.rst
61

Add:

The raw metadata is the authentic piece of metadata while the indexed metadata is a processed version, where the raw metadata is translated to a uniform vocabulary.

Both intrinsic and extrinsic metadata can be indexed and translated.
66

Add:

By keeping the raw metadata, we retain the ability to recompute the metadata in the future with other vocabularies. Furthermore, if we did not store the raw metadata, this would mean bugs in indexers....

(continue with the sentence already in the text)
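
The robustness argument in this thread can be sketched as follows, assuming raw metadata is stored as unmodified bytes and translation is a pure function over it. The two mapping tables and the name `translate` are illustrative assumptions, not the actual indexer mappings:

```python
import json
from typing import Dict

# Raw metadata kept exactly as found (here, a made-up package.json).
RAW_PACKAGE_JSON = (
    b'{"name": "demo", "version": "1.2.0", "license": "MIT", "private": true}'
)

# Hypothetical mapping to a CodeMeta-like vocabulary; "private" is not covered.
CODEMETA_LIKE = {"name": "name", "version": "version", "license": "license"}

# Hypothetical alternative vocabulary that might be chosen later.
OTHER_VOCAB = {"name": "title", "version": "release", "private": "isPrivate"}


def translate(raw: bytes, mapping: Dict[str, str]) -> Dict[str, object]:
    """Translate raw metadata into a target vocabulary, dropping unmapped keys."""
    doc = json.loads(raw)
    return {mapping[key]: value for key, value in doc.items() if key in mapping}


indexed_v1 = translate(RAW_PACKAGE_JSON, CODEMETA_LIKE)
indexed_v2 = translate(RAW_PACKAGE_JSON, OTHER_VOCAB)
```

In this sketch the first translation drops `private`, so the second result could not be derived from `indexed_v1` alone; it is only recoverable because the raw bytes were kept.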

vlorentz marked 2 inline comments as done.

add the two paragraphs

This revision is now accepted and ready to land. (Sep 13 2021, 4:15 PM)
This revision was automatically updated to reflect the committed changes.