
Document the metadata workflow.
Closed, Public

Authored by vlorentz on Nov 30 2018, 10:31 AM.

Diff Detail

Repository
rDCIDX Object indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. Nov 30 2018, 10:31 AM

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

douardda requested changes to this revision. Nov 30 2018, 4:43 PM
douardda added inline comments.
docs/metadata_workflow.rst
7–9 (On Diff #2336)

This part of the description is unclear with respect to the sequence diagram. How is this "deduplication" implemented?

This revision now requires changes to proceed. Nov 30 2018, 4:43 PM

here is the generated png in case someone wants to have a look

vlorentz marked an inline comment as done. Nov 30 2018, 4:46 PM

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

docs/metadata_workflow.rst
7–9 (On Diff #2336)

When the "alt" block is not executed.

The work is deduplicated, not the data itself.
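A minimal sketch of what "deduplicating the work" could look like, assuming an in-memory stand-in for the indexer storage. The class `InMemoryIndexerStorage`, the method `add_metadata`, the function `index_contents`, and the `extract` callback are hypothetical names used only for illustration; `content_metadata_get` borrows its name from this thread, but its signature here is simplified and is not the real API.

```python
from typing import Callable, Dict, Iterable, List


class InMemoryIndexerStorage:
    """Hypothetical stand-in for the indexer storage (not the real API)."""

    def __init__(self) -> None:
        self._metadata: Dict[str, dict] = {}

    def content_metadata_get(self, ids: Iterable[str]) -> Dict[str, dict]:
        # Return only the entries that were already indexed.
        return {i: self._metadata[i] for i in ids if i in self._metadata}

    def add_metadata(self, entries: Dict[str, dict]) -> None:
        # Assumed write method; the name is illustrative.
        self._metadata.update(entries)


def index_contents(
    storage: InMemoryIndexerStorage,
    content_ids: List[str],
    extract: Callable[[str], dict],
) -> Dict[str, dict]:
    """Return metadata for all ids, extracting only what is not yet stored."""
    known = storage.content_metadata_get(content_ids)
    missing = [i for i in content_ids if i not in known]
    if missing:
        # Corresponds to the "alt" block: runs only when something is missing.
        new_entries = {i: extract(i) for i in missing}
        storage.add_metadata(new_entries)
        known.update(new_entries)
    return known
```

Calling `index_contents` twice with the same ids invokes `extract` only during the first call; the second call is served entirely from storage, so the extraction work is skipped while the stored data is left as it is.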

vlorentz updated this revision to Diff 2378. Nov 30 2018, 5:05 PM
  • Add Makefile target.
vlorentz updated this revision to Diff 2379. Nov 30 2018, 5:07 PM
  • Fix target of content_metadata_get.
zack requested changes to this revision. Dec 1 2018, 3:43 PM
zack added subscribers: moranegg, zack.
zack added inline comments.
docs/index.rst
16

nitpick: we usually use dashes as the filename separator for doc files; please favor that over the underscore here

docs/metadata_workflow.rst
4 (On Diff #2379)

We should clarify which kind of metadata we are talking about here. In the past, @moranegg and I agreed on the following terminology:

  • intrinsic metadata: those shipped as part of the source code artifacts that we ingest into the archive
  • extrinsic metadata: those available out-of-band w.r.t. the above scenario (e.g., available on the forge / distribution platform, but not distributed as source code artifacts)

You should consider starting this document by documenting this distinction.

Failing that (e.g., because we want to document the distinction properly elsewhere), you should at least stick to the term "intrinsic metadata" in the documentation, because it is their workflow that you are documenting, not that of extrinsic metadata.
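As a rough illustration of the intrinsic/extrinsic distinction, here is a small sketch. The names `MetadataKind`, `MetadataSource`, and `classify`, as well as the two example sources in the usage note, are assumptions made for illustration and do not come from the codebase.

```python
from dataclasses import dataclass
from enum import Enum


class MetadataKind(Enum):
    INTRINSIC = "intrinsic"
    EXTRINSIC = "extrinsic"


@dataclass(frozen=True)
class MetadataSource:
    """Hypothetical description of where a piece of metadata comes from."""

    description: str
    # True when the metadata ships inside the ingested source code artifact.
    shipped_in_source_artifact: bool


def classify(source: MetadataSource) -> MetadataKind:
    """Intrinsic if shipped with the source code, extrinsic otherwise."""
    if source.shipped_in_source_artifact:
        return MetadataKind.INTRINSIC
    return MetadataKind.EXTRINSIC
```

For instance, a manifest file committed in a repository would be built as `MetadataSource("manifest file in the repository", True)` and classified as intrinsic, while a project description shown only on a forge page would be `MetadataSource("description on the forge page", False)` and classified as extrinsic.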

This revision now requires changes to proceed. Dec 1 2018, 3:43 PM
vlorentz marked an inline comment as done. Dec 1 2018, 3:49 PM
vlorentz added inline comments.
docs/metadata_workflow.rst
4 (On Diff #2379)

Good point, will do.

vlorentz updated this revision to Diff 2388. Dec 3 2018, 11:08 AM
  • Rename metadata_workflow.rst -> metadata-workflow.rst
  • Mention this doc is about intrinsic metadata only, for now.
vlorentz marked an inline comment as done.

here is the generated png in case someone wants to have a look

Thanks!

vlorentz updated this revision to Diff 2399. Dec 3 2018, 5:25 PM
  • Explain in text what each metadata indexer does.
vlorentz retitled this revision from "Start documenting the metadata workflow." to "Document the metadata workflow." Dec 5 2018, 10:12 AM
vlorentz edited the summary of this revision.

Nice work.

There are some typos to fix.

docs/metadata-workflow.rst
23

scheduled manually

39

, then extracts

42

known

45

contents

ardumont requested changes to this revision. Dec 6 2018, 11:52 AM
docs/images/tasks-metadata-indexers.uml
48

In your sequence diagram, it looks strange that you try to retrieve existing metadata from the "Graph Storage", but you upload newly created metadata (in the alt box) only to the "Indexer Storage"

Indeed, thanks.

Not quite sure what the "thanks" means, but hey ;)

Explicitly, content_metadata_get is a call from the indexer_storage API.

So:

IDX_REV_META->>IDX_STORAGE
This revision now requires changes to proceed. Dec 6 2018, 11:52 AM
vlorentz updated this revision to Diff 2474. Dec 6 2018, 2:37 PM
  • Fix typos.
vlorentz marked 2 inline comments as done. Dec 6 2018, 2:41 PM
vlorentz added inline comments.
docs/images/tasks-metadata-indexers.uml
48

Did you comment on an old version of that Diff?

ardumont accepted this revision. Dec 6 2018, 10:00 PM
ardumont added inline comments.
docs/images/tasks-metadata-indexers.uml
48

Apparently so

zack accepted this revision. Jan 21 2019, 2:14 PM
douardda accepted this revision. Jan 21 2019, 2:44 PM
This revision is now accepted and ready to land. Jan 21 2019, 2:44 PM
vlorentz updated this revision to Diff 3120. Jan 21 2019, 2:47 PM
  • Rebase/squash
vlorentz updated this revision to Diff 3121. Jan 21 2019, 2:49 PM
  • Rebase
This revision was automatically updated to reflect the committed changes.
Harbormaster failed remote builds in B3646: Diff 3121!