docs/architecture/metadata.rst
48	No; indexers are not supposed to access the internet. Instead, we'll need to add some sort of loader that fetches and stores API responses, and indexers would read the stored responses.

Good idea to add this documentation!
I have added a few comments.

docs/architecture/metadata.rst
7	Intrinsic metadata is part of the code. I think we should focus on what metadata is, and not on what it is not. |swh| calls "metadata" the information it collects and extracts that describes, and provides additional information on, the source code itself.
13	Switch "collected" with "extracted". I propose to keep "collect" for the action of gathering from external resources.
17	Here I would suggest not defining it with a negative statement: :term:`extrinsic metadata`, which is collected or deposited from external sources.
71	I would put this section before the metadata mining section, as an introduction to it.
74	delete `as we saw above`
75	The raw metadata is the authentic piece of metadata, while the indexed metadata is a processed version, where the raw metadata is translated to a uniform vocabulary. Both intrinsic and extrinsic metadata can be indexed and translated.
77	drop `and is not bug free`
79	It's not only because of bugs. We keep both because the information is different: we don't translate all properties, etc. Also, we might choose a different vocabulary in the future; we weren't sure CodeMeta was the best option in the first place. It is a choice that is more about overall robustness than just dealing with bugs.

vlorentz added inline comments. Sep 7 2021, 5:10 PM

docs/architecture/metadata.rst
77	why?

moranegg requested changes to this revision. Sep 8 2021, 10:13 AM

moranegg added inline comments.

docs/architecture/metadata.rst
77	Because of the explanation in the next comment. Nothing is bug-free :-) You can keep it, but this is not the main reason for keeping raw metadata.

This revision now requires changes to proceed. Sep 8 2021, 10:13 AM

vlorentz marked 9 inline comments as done. Sep 8 2021, 11:48 PM

apply @moranegg's comments

Harbormaster completed remote builds in B23450: Diff 22500. Sep 8 2021, 11:48 PM

moranegg added inline comments. Sep 9 2021, 9:48 AM

docs/architecture/metadata.rst

Add:

The raw metadata is the authentic piece of metadata, while the indexed metadata is a processed version, where the raw metadata is translated to a uniform vocabulary.

Both intrinsic and extrinsic metadata can be indexed and translated.

Add:

By keeping the raw metadata, we ensure that the metadata can be re-computed in the future with other vocabularies. Furthermore, if we did not store the raw metadata, this would mean bugs in indexers....

(continue with the sentence in text)
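
For illustration only, the re-computation argument in the suggested text can be sketched as follows. This is a minimal sketch, not the indexer's actual API: `translate`, `raw_store`, both mapping tables, and the prefixed property labels are hypothetical names invented for the example.

```python
# Hypothetical mappings from a raw, package.json-like record to two vocabularies.
# Neither table reflects the real translation tables used by the indexers.
CODEMETA_MAPPING = {"name": "codemeta:name", "license": "codemeta:license"}
OTHER_MAPPING = {"name": "schema:name", "author": "schema:author"}


def translate(raw: dict, mapping: dict) -> dict:
    """Rename the properties the mapping knows about and drop the others."""
    return {mapping[key]: value for key, value in raw.items() if key in mapping}


# Raw metadata is stored untouched, keyed by a made-up object identifier.
raw_store = {
    "rev-1": {
        "name": "example",
        "license": "MIT",
        "author": "Jane",
        "scripts": {"test": "pytest"},
    }
}

# Indexed metadata is a processed, lossy view of the raw metadata.
indexed = {key: translate(raw, CODEMETA_MAPPING) for key, raw in raw_store.items()}

# Because the raw records were kept, the index can be rebuilt later with
# another vocabulary, recovering properties the first mapping ignored.
reindexed = {key: translate(raw, OTHER_MAPPING) for key, raw in raw_store.items()}
```

The first translation drops `author` and `scripts`; only the stored raw record makes it possible to pick `author` up again with a different mapping.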