
Document the metadata workflow.
Closed, Public

Authored by vlorentz on Nov 30 2018, 10:31 AM.

Diff Detail

Repository: rDCIDX Metadata indexer
Branch: doc-metadata-workflow-1
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 2773
Build 3483: tox-on-jenkins (Jenkins)
Build 3482: arc lint + arc unit

Event Timeline

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

douardda added inline comments.
docs/metadata_workflow.rst
7–9 ↗(On Diff #2336)

This part of the description is unclear w.r.t. the sequence diagram. How is this "deduplication" implemented?

This revision now requires changes to proceed. Nov 30 2018, 4:43 PM

here is the generated png in case someone wants to have a look

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

docs/metadata_workflow.rst
7–9 ↗(On Diff #2336)

When the "alt" block is not executed.

The work is deduplicated, not the data itself.
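
To make "the work is deduplicated" concrete, here is a minimal sketch. It assumes a hypothetical in-memory stand-in for the indexer storage: `FakeIndexerStorage`, `index_contents`, and the `content_metadata_add` name are illustrative assumptions, and only `content_metadata_get` is named in this review. The idea is that the extraction step (the "alt" block) runs only for contents that have no stored metadata yet.

```python
from typing import Callable, Dict, Iterable, List


class FakeIndexerStorage:
    """Hypothetical in-memory stand-in for the indexer storage."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def content_metadata_get(self, ids: Iterable[str]) -> List[dict]:
        # Return only the rows that already exist for the requested ids.
        return [self._rows[i] for i in ids if i in self._rows]

    def content_metadata_add(self, rows: Iterable[dict]) -> None:
        # Assumed name for the write side; not confirmed by this thread.
        for row in rows:
            self._rows[row["id"]] = row


def index_contents(
    storage: FakeIndexerStorage,
    ids: Iterable[str],
    extract: Callable[[str], dict],
) -> List[str]:
    """Run `extract` only for ids without stored metadata.

    Returns the ids that were actually processed.
    """
    ids = list(ids)
    known = {row["id"] for row in storage.content_metadata_get(ids)}
    missing = [i for i in ids if i not in known]
    if missing:  # corresponds to the "alt" block of the diagram
        storage.content_metadata_add(
            {"id": i, "metadata": extract(i)} for i in missing
        )
    return missing
```

Calling `index_contents` twice with the same ids triggers extraction only the first time; the second call finds the rows through `content_metadata_get` and skips the work.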

  • Fix target of content_metadata_get.
zack requested changes to this revision. Dec 1 2018, 3:43 PM
zack added subscribers: moranegg, zack.
zack added inline comments.
docs/index.rst
15

nitpick: we usually use dashes as the filename separator for doc files; please favor that over the underscore here

docs/metadata_workflow.rst
4 ↗(On Diff #2379)

We should clarify which kind of metadata we are talking about here. In the past with @moranegg we have agreed on the following terminology:

  • intrinsic metadata: those shipped as part of the source code artifacts that we ingest into the archive
  • extrinsic metadata: those available out-of-band w.r.t. the above scenario (e.g., available on the forge / distribution platform, but not distributed as source code artifacts)

You should consider starting this document by documenting this distinction.

Failing that (e.g., because we want to document the distinction properly elsewhere), you should at least stick to the terminology of intrinsic metadata in the documentation, because it is the workflow for intrinsic metadata that you are documenting, not the one for extrinsic metadata.
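
As an illustration of the terminology above, a minimal sketch of how the two kinds could be told apart in code. `MetadataKind` and `metadata_kind` are hypothetical names introduced here, not part of any existing module; the only criterion used is the one stated in the definitions, namely whether the metadata is shipped inside the ingested source code artifacts.

```python
from enum import Enum


class MetadataKind(Enum):
    # Shipped as part of the source code artifacts ingested into the archive.
    INTRINSIC = "intrinsic"
    # Available out-of-band, e.g. on the forge or distribution platform.
    EXTRINSIC = "extrinsic"


def metadata_kind(shipped_in_source_artifact: bool) -> MetadataKind:
    """Classify metadata by where it comes from (hypothetical helper)."""
    if shipped_in_source_artifact:
        return MetadataKind.INTRINSIC
    return MetadataKind.EXTRINSIC
```

Under this sketch, a metadata file found inside an ingested repository would be classified as intrinsic, while a project description available only on the forge would be extrinsic.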

This revision now requires changes to proceed. Dec 1 2018, 3:43 PM
vlorentz added inline comments.
docs/metadata_workflow.rst
4 ↗(On Diff #2379)

Good point, will do.

  • Rename metadata_workflow.rst -> metadata-workflow.rst
  • Mention this doc is about intrinsic metadata only, for now.

here is the generated png in case someone wants to have a look

Thanks!

  • Explain in text what each metadata indexer does.
vlorentz retitled this revision from "Start documenting the metadata workflow." to "Document the metadata workflow." Dec 5 2018, 10:12 AM
vlorentz edited the summary of this revision.

Nice work.

There are some typos to fix.

docs/metadata-workflow.rst
22

scheduled manually

38

, then extracts

41

known

44

contents

docs/images/tasks-metadata-indexers.uml
47

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

pretty sure what the thanks means but hey ;)

Explicitly, content_metadata_get is a call from the indexer_storage API.

So:

IDX_REV_META->>IDX_STORAGE
This revision now requires changes to proceed. Dec 6 2018, 11:52 AM
vlorentz added inline comments.
docs/images/tasks-metadata-indexers.uml
47

Did you comment on an old version of that Diff?

ardumont added inline comments.
docs/images/tasks-metadata-indexers.uml
47

Apparently so

This revision is now accepted and ready to land. Jan 21 2019, 2:44 PM
This revision was automatically updated to reflect the committed changes.