Details

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

douardda requested changes to this revision. Nov 30 2018, 4:43 PM

douardda added inline comments.

docs/metadata_workflow.rst
7–9 ↗	(On Diff #2336)	This part of the description is unclear w.r.t. the sequence diagram. How is this "deduplication" implemented?

This revision now requires changes to proceed. Nov 30 2018, 4:43 PM

here is the generated png in case someone wants to have a look

In D747#15707, @douardda wrote:

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

docs/metadata_workflow.rst
7–9 ↗	(On Diff #2336)	When the "alt" block is not executed. The work is deduplicated, not the data itself.
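
For readers without the diagram at hand, the "work is deduplicated" point can be sketched roughly as below. This is only an illustration, not the indexer's actual code: `InMemoryIndexerStorage`, `index_contents`, the `extract` callback, the row layout, and the `content_metadata_add` name are assumptions made for the example; only `content_metadata_get` is named in this review.

```python
from typing import Callable, Dict, Iterable, List


class InMemoryIndexerStorage:
    """Hypothetical stand-in for the "Indexer Storage" box of the diagram."""

    def __init__(self) -> None:
        self._metadata: Dict[str, dict] = {}

    def content_metadata_get(self, ids: Iterable[str]) -> List[dict]:
        # One row per id that already has metadata; unknown ids are skipped.
        return [
            {"id": id_, "metadata": self._metadata[id_]}
            for id_ in ids
            if id_ in self._metadata
        ]

    def content_metadata_add(self, rows: Iterable[dict]) -> None:
        # Assumed name for the upload call made inside the "alt" block.
        for row in rows:
            self._metadata[row["id"]] = row["metadata"]


def index_contents(
    storage: InMemoryIndexerStorage,
    ids: Iterable[str],
    extract: Callable[[str], dict],
) -> List[str]:
    """Extract metadata only for ids the storage does not know yet.

    Returns the ids that were actually processed.
    """
    unique_ids = list(dict.fromkeys(ids))  # drop repeats, keep order
    known = {row["id"] for row in storage.content_metadata_get(unique_ids)}
    missing = [id_ for id_ in unique_ids if id_ not in known]
    if missing:
        # The "alt" block: it only runs when there is new work to do.
        storage.content_metadata_add(
            [{"id": id_, "metadata": extract(id_)} for id_ in missing]
        )
    return missing
```

Both the lookup and the upload go to the same storage object here, which is the consistency the first review comment asked for. Running `index_contents` a second time with the same ids skips extraction entirely, so what is saved is the work; nothing in the sketch merges or compares the metadata values themselves.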

Add Makefile target.

Fix target of content_metadata_get.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/104/ for more details.

Harbormaster completed remote builds in B2759: Diff 2378. Nov 30 2018, 5:08 PM

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/105/ for more details.

Harbormaster completed remote builds in B2760: Diff 2379. Nov 30 2018, 5:11 PM

zack requested changes to this revision. Dec 1 2018, 3:43 PM

zack added subscribers: moranegg, zack.

zack added inline comments.

docs/index.rst
15	nitpick: we usually use dashes as filename separators for doc files; please favor that over the underscore here.
docs/metadata_workflow.rst
4 ↗	(On Diff #2379)	We should clarify which kind of metadata we are talking about here. In the past with @moranegg we have agreed on the following terminology:

- intrinsic metadata: those shipped as part of the source code artifacts that we ingest into the archive
- extrinsic metadata: those available out-of-band w.r.t. the above scenario (e.g., available on the forge / distribution platform, but not distributed as source code artifacts)

You should consider starting this document documenting this distinction. Failing that (e.g., because we want to document the distinction properly elsewhere), you should at least stick to the terminology of intrinsic metadata in the documentation, because it's the workflow about them that you are documenting, not the other one.

This revision now requires changes to proceed. Dec 1 2018, 3:43 PM

vlorentz marked an inline comment as done. Dec 1 2018, 3:49 PM

vlorentz added inline comments.

docs/metadata_workflow.rst
4 ↗	(On Diff #2379)	Good point, will do.

Rename metadata_workflow.rst -> metadata-workflow.rst
Mention this doc is about intrinsic metadata only, for now.

vlorentz added a parent revision: D760: Document {intrinsic,extrinsic} metadata.. Dec 3 2018, 11:08 AM

vlorentz marked an inline comment as done.

vlorentz added a task: T1384: Document indexer architecture / metadata pipeline. Dec 3 2018, 3:14 PM

here is the generated png in case someone wants to have a look

Thanks!

Build was aborted

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/109/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/109/console

Harbormaster failed remote builds in B2773: Diff 2388! Dec 3 2018, 3:56 PM

Explain in text what each metadata indexer does.

Build was aborted

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/110/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/110/console

Harbormaster failed remote builds in B2797: Diff 2399! Dec 4 2018, 12:02 PM

vlorentz retitled this revision from Start documenting the metadata workflow. to Document the metadata workflow.. Dec 5 2018, 10:12 AM

vlorentz edited the summary of this revision.

Nice work.

There are some typos to fix.

docs/metadata-workflow.rst
22	`scheduled manually`
38	`, then extracts`
41	`known`
44	`contents`

docs/images/tasks-metadata-indexers.uml
47	In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

pretty sure what the thanks means but hey ;)

Explicitly, content_metadata_get is a call from the indexer_storage API. So: IDX_REV_META->>IDX_STORAGE