Document the metadata workflow.
Needs Review (Public)

Authored by vlorentz on Fri, Nov 30, 10:31 AM.

Details

Summary

Related: D746

Diff Detail

Repository: rDCIDX Object indexer
Branch: doc-metadata-workflow-1
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 2896
Build 3665: tox-on-jenkins (Jenkins)
Build 3664: arc lint + arc unit

Event Timeline

vlorentz created this revision. Fri, Nov 30, 10:31 AM

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

douardda requested changes to this revision. Fri, Nov 30, 4:43 PM
douardda added inline comments.
docs/metadata_workflow.rst
7–9 ↗(On Diff #2336)

This part of the description is unclear w.r.t. the sequence diagram. How is this "deduplication" implemented?

This revision now requires changes to proceed. Fri, Nov 30, 4:43 PM

here is the generated png in case someone wants to have a look

vlorentz marked an inline comment as done. Fri, Nov 30, 4:46 PM

> In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

docs/metadata_workflow.rst
7–9 ↗(On Diff #2336)

When the "alt" block is not executed.

The work is deduplicated, not the data itself.
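
To make "the work is deduplicated, not the data itself" concrete, here is a minimal sketch of the idea, not the actual indexer code. `InMemoryIndexerStorage`, `index_content`, and `content_metadata_add` are hypothetical names introduced for illustration, and `content_metadata_get` is simplified to take a single id. The point is only that the expensive extraction (the "alt" block) runs when the lookup finds nothing, and is skipped otherwise.

```python
from typing import Callable, Dict, Optional


class InMemoryIndexerStorage:
    """Hypothetical stand-in for the indexer storage (not the real API)."""

    def __init__(self) -> None:
        self._metadata: Dict[str, dict] = {}

    def content_metadata_get(self, content_id: str) -> Optional[dict]:
        # Simplified signature (assumption): one id in, one record or None out.
        return self._metadata.get(content_id)

    def content_metadata_add(self, content_id: str, metadata: dict) -> None:
        # Hypothetical write method used only by this sketch.
        self._metadata[content_id] = metadata


def index_content(
    storage: InMemoryIndexerStorage,
    content_id: str,
    extract: Callable[[str], dict],
) -> dict:
    """Return metadata for content_id, extracting it only if not yet indexed."""
    existing = storage.content_metadata_get(content_id)
    if existing is not None:
        # The "alt" block is skipped: no extraction work is repeated.
        return existing
    # The "alt" block: do the work once and record the result.
    metadata = extract(content_id)
    storage.content_metadata_add(content_id, metadata)
    return metadata
```

Calling `index_content` twice with the same id invokes `extract` only the first time; the second call returns the stored result.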

vlorentz updated this revision to Diff 2378. Fri, Nov 30, 5:05 PM
  • Add Makefile target.
vlorentz updated this revision to Diff 2379. Fri, Nov 30, 5:07 PM
  • Fix target of content_metadata_get.
zack requested changes to this revision. Sat, Dec 1, 3:43 PM
zack added subscribers: moranegg, zack.
zack added inline comments.
docs/index.rst
16

nitpick: we usually use dashes as the filename separator for doc files; please favor that over the underscore here

docs/metadata_workflow.rst
4 ↗(On Diff #2379)

We should clarify which kind of metadata we are talking about here. In the past with @moranegg we have agreed on the following terminology:

  • intrinsic metadata: those shipped as part of the source code artifacts that we ingest into the archive
  • extrinsic metadata: those available out-of-band w.r.t. the above scenario (e.g., available on the forge / distribution platform, but not distributed as source code artifacts)

You should consider starting this document by documenting this distinction.

Failing that (e.g., because we want to document the distinction properly elsewhere), you should at least stick to the terminology of intrinsic metadata in the documentation, because it's the workflow about them that you are documenting, not the other one.
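
As an illustration of the terminology only, the distinction can be sketched as a small classification. `MetadataKind`, `MetadataSource`, and `classify` are hypothetical names made up for this example; they are not part of the indexer, and the two sample sources are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class MetadataKind(Enum):
    INTRINSIC = "intrinsic"
    EXTRINSIC = "extrinsic"


@dataclass(frozen=True)
class MetadataSource:
    """Hypothetical description of where a piece of metadata was found."""

    description: str
    shipped_in_source_artifact: bool


def classify(source: MetadataSource) -> MetadataKind:
    # Intrinsic: shipped as part of the ingested source code artifacts.
    # Extrinsic: available out-of-band (e.g., on the forge), not in the artifacts.
    if source.shipped_in_source_artifact:
        return MetadataKind.INTRINSIC
    return MetadataKind.EXTRINSIC


in_tree = MetadataSource("metadata file inside the ingested source tree", True)
on_forge = MetadataSource("project description shown on the forge", False)
kinds = [classify(in_tree), classify(on_forge)]
```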

This revision now requires changes to proceed. Sat, Dec 1, 3:43 PM
vlorentz marked an inline comment as done. Sat, Dec 1, 3:49 PM
vlorentz added inline comments.
docs/metadata_workflow.rst
4 ↗(On Diff #2379)

Good point, will do.

vlorentz updated this revision to Diff 2388. Mon, Dec 3, 11:08 AM
  • Rename metadata_workflow.rst -> metadata-workflow.rst
  • Mention this doc is about intrinsic metadata only, for now.
vlorentz marked an inline comment as done.

> here is the generated png in case someone wants to have a look

Thanks!

vlorentz updated this revision to Diff 2399. Mon, Dec 3, 5:25 PM
  • Explain in text what each metadata indexer does.
vlorentz retitled this revision from "Start documenting the metadata workflow." to "Document the metadata workflow." Wed, Dec 5, 10:12 AM
vlorentz edited the summary of this revision.

Nice work.

There are some typos to fix.

docs/metadata-workflow.rst
23

scheduled manually

39

, then extracts

42

known

45

contents

ardumont requested changes to this revision. Thu, Dec 6, 11:52 AM
docs/images/tasks-metadata-indexers.uml
48

>> In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"
>
> Indeed, thanks.

pretty sure what the thanks means but hey ;)

Explicitly, content_metadata_get is a call from the indexer_storage API.

So:

IDX_REV_META->>IDX_STORAGE
This revision now requires changes to proceed. Thu, Dec 6, 11:52 AM
vlorentz updated this revision to Diff 2474. Thu, Dec 6, 2:37 PM
  • Fix typos.
vlorentz marked 2 inline comments as done. Thu, Dec 6, 2:41 PM
vlorentz added inline comments.
docs/images/tasks-metadata-indexers.uml
48

Did you comment on an old version of that Diff?

ardumont accepted this revision. Thu, Dec 6, 10:00 PM
ardumont added inline comments.
docs/images/tasks-metadata-indexers.uml
48

Apparently so