Add a documentation/specification of the journal messages formats
Closed, Public

Authored by douardda on Nov 30 2020, 5:15 PM.

Details

Summary

Closes T2818.

The whole metadata-related part is still missing, and the rest probably has several errors.

Diff Detail

Repository
rDDOC Development documentation
Branch
master
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 17769
Build 27466: arc lint + arc unit

Event Timeline

docs/journal.rst
77

this! really?

Note: I have not done any proofreading yet, so it is probably full of typos and errors...

ok overall.

But we will have to make sure this stays in sync with swh.model.model somehow.

your code blocks have inconsistent indent (some have 1 space, some have 2, some have a mix)

docs/journal.rst
30–31

Each value. Kafka messages are made of a key (which is either a msgpacked bytestring or a msgpacked dict, usually) which we never read, and a value.

38–42

should mention it's using the type extension thingy of msgpack.

49–50

What is this on the rhs? That's neither the serialized value nor the deserialized one.

52–54

Should explain why there are two types (ie. they need to be precisely recorded for git, but not for visits)

63–64

s/null/zero/

77

oh boy

77

s/which/of which/

80

8601

96–97

origin/visits/statuses are not part of the merkle dag (yet?)

Call them objects instead.

129

s/with this loader type/with a loader/

to avoid confusion about how unique it is

184

s/is/if/

should mention it's not part of the id computation, for consistency with revisions

218

"and values a dict of:"

252

note it's not an exhaustive list

255

s/metadate/metadata/

416–423

Oh, and you need to add something to explain anonymized topics

douardda marked an inline comment as not done. Dec 1 2020, 11:59 AM

Oh, and you need to add something to explain anonymized topics

yes this is TODO, but planned.

douardda added inline comments.
docs/journal.rst
49–50

The extension types in msgpack are, in the end, a pair [ID, payload], with the payload being a byte array. These are representations of that.
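To make the [ID, payload] idea concrete, here is a minimal stdlib-only sketch of how a large integer could be turned into such a pair and back. The extension ID `BIGINT_EXT_CODE` and the function names are hypothetical (the actual ID and byte layout used by the journal serializer are not stated in this discussion), and the big-endian signed layout is an assumption.

```python
from typing import Tuple

# Hypothetical extension type ID; the value really used by the journal
# serializer is not given in this review.
BIGINT_EXT_CODE = 1


def encode_bigint(value: int) -> Tuple[int, bytes]:
    """Return an (ID, payload) pair for an integer of arbitrary size."""
    # One extra bit is reserved for the sign, then round up to whole bytes.
    length = (value.bit_length() + 8) // 8
    payload = value.to_bytes(length, "big", signed=True)
    return (BIGINT_EXT_CODE, payload)


def decode_bigint(pair: Tuple[int, bytes]) -> int:
    """Inverse of encode_bigint; rejects pairs with another extension ID."""
    code, payload = pair
    if code != BIGINT_EXT_CODE:
        raise ValueError(f"unexpected extension ID: {code}")
    return int.from_bytes(payload, "big", signed=True)
```

With this layout, an (ID, payload) pair shown in the docs is what the decoder sees after msgpack has stripped its own framing, not the raw bytes on the wire.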

96–97

indeed you're right

252

it is as of today, but is expected to be amended in the future

416–423

oh right it's better indeed thanks

typos and (some) fixes reported by vlorentz

moved origin-related docs into a dedicated section (not under the DAG section).

Update the documentation for (WIP) extended type based datetime encoding

see D4655

forgot to save before commit...

improve the doc

  • document anonymized/privileged topics
  • properly describe the msgpack Timestamp
  • document metadata topics
douardda retitled this revision from [WIP] Add a documentation/specification of the journal messages formats to Add a documentation/specification of the journal messages formats. Dec 4 2020, 3:59 PM
docs/journal.rst
49–50

Wouldn't it be clearer to just write the serialized value, instead of a 2-tuple of a byte and an array of bytes?

docs/journal.rst
49–50

this is no longer the case: we now use "regular" msgpack.Timestamp (or is it something else?)

docs/journal.rst
49–50

It is; my comment was initially on the "big integer" serialization
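For reference, the payload layout of the msgpack Timestamp extension (type -1 in the msgpack specification) can be sketched with the standard `struct` module alone. The function names below are illustrative, not part of any library, and this only builds the extension payload, without the msgpack ext header around it.

```python
import struct
from typing import Tuple

TIMESTAMP_EXT_CODE = -1  # reserved for timestamps by the msgpack specification


def pack_timestamp_payload(seconds: int, nanoseconds: int = 0) -> bytes:
    """Build the ext payload, choosing the 32, 64 or 96 bit form."""
    if not 0 <= nanoseconds < 10**9:
        raise ValueError("nanoseconds must be in [0, 10**9)")
    if seconds >> 34 == 0:
        # Seconds fit in 34 unsigned bits: nanoseconds go in the upper 30 bits.
        data64 = (nanoseconds << 34) | seconds
        if data64 & 0xFFFFFFFF00000000 == 0:
            return struct.pack(">I", data64)  # timestamp 32
        return struct.pack(">Q", data64)  # timestamp 64
    # Negative or very large seconds: 32 bit nanoseconds + signed 64 bit seconds.
    return struct.pack(">Iq", nanoseconds, seconds)  # timestamp 96


def unpack_timestamp_payload(payload: bytes) -> Tuple[int, int]:
    """Return (seconds, nanoseconds) from an ext payload."""
    if len(payload) == 4:
        return struct.unpack(">I", payload)[0], 0
    if len(payload) == 8:
        data64 = struct.unpack(">Q", payload)[0]
        return data64 & 0x3FFFFFFFF, data64 >> 34
    if len(payload) == 12:
        nanoseconds, seconds = struct.unpack(">Iq", payload)
        return seconds, nanoseconds
    raise ValueError("invalid timestamp payload length")
```

The payload only carries seconds and nanoseconds since the epoch, so any timezone offset would have to be recorded separately if it matters.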

fix and improve the long int encoding examples

zack added a subscriber: zack.

LGTM in general.

I've noted in comments some suggestions for improvements, mostly minor, along with some restructuring suggestions.

Thanks!

docs/journal.rst
3–4

question: is this the spec of the journal as a whole (assuming that's a thing), or "just" the specification of the serialization format for messages in the journal?
(If it's the latter, maybe the title should mention "serialization", or something such)

6–10

Here we discuss both the swh.model objects and "other stuff", such as indexer data. In the rest of the intro we seem to assume that all messages in the journal correspond to swh.model objects. Is that true also for indexer data, or are they represented as Python objects from other modules? In the latter case there is an inconsistency that would be nice to fix.

24

Reaching this point, I'm really missing the big list of topics, because all the talk about topics and topic groups is very abstract for me. Can we add the full list here, or is it too long? Anyway, it's needed somewhere; so an alternative would be an in-document link here pointing to where the full list is.
(If duplication is a concern, I suspect it can be added [ab]using sphinx TOC generation)

39–41

just call it "full version" or "complete version".

"privileged" is an operational aspect, rather than data aspect, and it has a weird connotation here

44

minor: expand "dict" to "dictionary", "dict" is very pythonesque :)

63

typo: "dictionnaty"

68

instead of "most values ... but", I suggest "All values ... except"

73–74

The Integer and Datetime sections steal the thunder a bit here. We're in the "Kafka message format" section, and I want to read about that as soon as possible, not low-level details about integer and datetime serialization.

So, concrete proposal: move these two to a later section, maybe even an appendix, and reference that section from wherever is needed.

144–145

"core" will probably be an unclear term for the reader.

Just say that these are topics for the various types of objects stored in the Software Heritage Merkle DAG. (You can say either here or in the intro that the archive data model is a Merkle DAG, with the good ref.)

168

indentation seems weird: it looks like we should line break after the first { here
same in other json examples
(it's of course a minor point)

475–476

I think we have doc explaining what "extrinsic metadata" are, it would be nice to link to it from here

This revision is now accepted and ready to land. Dec 11 2020, 11:37 AM
docs/journal.rst
3–4

This aims to be the spec of the journal as a whole (but the latter mainly consists of the message format).

6–10

It's something that needs to be clarified indeed.

24

sure we can add a list, I'll give it a try

39–41

ok

49–50

oops

73–74

well there is really not much to say about the "Kafka message format" other than "a dict encoded via msgpack". Then the content of each dict is described in each topic description. I can give your proposal a try, we'll see.

docs/journal.rst
6–10

For now all topics are related to objects from swh-model, but we expect to also put indexer objects in there soon, so I have added something about it here.

rework the specs according to zack's comments (hopefully) + rebase