Page MenuHomeSoftware Heritage

binary (de)serialiazer for more compact PID<->int maps
ClosedPublic

Authored by zack on Sep 3 2019, 6:08 PM.

Diff Detail

Repository
rDGRPH Compressed graph representation
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

seirl requested changes to this revision.Sep 3 2019, 6:14 PM
seirl added inline comments.
swh/graph/pid.py
54

str_to_bytes probably

This revision now requires changes to proceed.Sep 3 2019, 6:14 PM
  • fix typo in function name cross-ref
zack marked an inline comment as done.Sep 3 2019, 6:20 PM
This revision was not accepted when it landed; it landed in state Needs Review.Sep 3 2019, 6:22 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.

(it's only closed in a branch, not in master)

requirements.txt: add new dep on swh.model

This revision was not accepted when it landed; it landed in state Needs Review.Sep 3 2019, 6:44 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
  • binary (de)serialiazer for more compact PID<->int maps
  • fix typo in function name cross-ref
  • requirements.txt: add new dep on swh.model
haltode added inline comments.
swh/graph/pid.py
23–28

Anticipating the Java part, it would be best to follow the ordering already defined in https://forge.softwareheritage.org/source/swh-graph/browse/master/java/server/src/main/java/org/softwareheritage/graph/Node.java (or at least make sure in the future Java part diff that the order is the same in both Java/Python code).

  • change PID order to be alphabetic and match current Java implementation
zack marked an inline comment as done.Sep 4 2019, 7:59 AM
zack added inline comments.
swh/graph/pid.py
23–28

Ah, good point. Not much the Java compatibility (we're gonna change the serialization on the Java side anyway), but the fact the previous one I used was not alphabetic on textual PIDs. Having that is a good property, because we already have alpha-sorted PID lists, e.g., in the nodes file. Being able to easily translate those files to binary serializations is a feature that we want to keep.

Diff updated to fix this.

zack marked an inline comment as done.
  • fix binary serialization tests after PID ordering change
  • pid2int: add type checking on the key
  • add restore script from textual to binary maps

Please see how other SWH tools implement their CLI. There should be a file named swh/graph/cli.py that extends the cli from swh.core, and an entrypoint declared in setup.py

swh/graph/dump_map.py
25 ↗(On Diff #6553)

these should be two different commands instead.

swh/graph/pid.py
42

Is it useful to include namespace version? (ie. will we ever mix two different namespace versions in the same file?)

54

There's a function in swh.model.identifiers to parse pids.

72–73

There's also a function for this.

91–92

(self.length, remainder) = divmod(self.size, record_size)

289

The error should probably show the original pos, not len(self) + <original pos>. It might give surprising errors otherwise.

zack planned changes to this revision.Sep 5 2019, 3:46 PM
zack marked 4 inline comments as done.
  • update to fix (most of) @vlorentz review remarks
swh/graph/dump_map.py
25 ↗(On Diff #6553)

I'm not sure I agree. We're going to have a bunch of maps (we already have a 3rd one, although it's not supported here yet). One command per map type would not be convenient. Also, while currently the type has to be specified, it's possible in the future it will be auto-detected, which would make a single dump command more convenient.

swh/graph/pid.py
42

Unlikely in these maps, but the serialization is meant to be general purpose. As such, it should contain the same information that exist in a PID, which includes the version number.

I can imagine in the future using this serialization format elsewhere, where you cannot necessarily expect uniformity.

zack planned changes to this revision.Sep 5 2019, 4:20 PM

(Work on redoing the cli.py module right is still pending.)

swh/graph/pid.py
42

If it's general purpose, shouldn't it be in swh.model.identifiers?

zack added inline comments.
swh/graph/pid.py
42

I've raised this on #swh-devel the other day (although only @olasd participated). The consensus was that it could go there eventually, but until it's needed there is no point in doing so. I'm personally fine with waiting for a second user to promote this as something general purposes. But if there is the desire from other to move it there right now, I wouldn't mind doing so either.

  • integrate cli with swh.core.cli
  • binary (de)serialiazer for more compact PID<->int maps
  • fix typo in function name cross-ref
  • requirements.txt: add new dep on swh.model
  • change PID order to be alphabetic and match current Java implementation
  • fix binary serialization tests after PID ordering change
  • pid2int: add type checking on the key
  • add restore script from textual to binary maps
  • update to fix (most of) @vlorentz review remarks
  • serialization: integrate map dump/restore commands into CLI

Please see how other SWH tools implement their CLI. There should be a file named swh/graph/cli.py that extends the cli from swh.core, and an entrypoint declared in setup.py

Done in this version.

The new commands are: swh graph map {dump,restore}, replacing the previous standalone top-level executables.
The former cli.py main is now available as swh graph api-client (cc: @haltode).

  • binary (de)serialiazer for more compact PID<->int maps
  • fix typo in function name cross-ref
  • requirements.txt: add new dep on swh.model
  • change PID order to be alphabetic and match current Java implementation
  • fix binary serialization tests after PID ordering change
  • pid2int: add type checking on the key
  • add restore script from textual to binary maps
  • update to fix (most of) @vlorentz review remarks
  • serialization: integrate map dump/restore commands into CLI
This revision is now accepted and ready to land.Sep 6 2019, 2:51 PM

binary (de)serialiazer for more compact PID<->int maps

This revision was landed with ongoing or failed builds.Sep 6 2019, 2:54 PM
This revision was automatically updated to reflect the committed changes.