
Add support for large negative integers in msgpack encoding
ClosedPublic

Authored by olasd on Nov 23 2020, 10:46 PM.

Details

Summary

This is to support some of the metadata imported from npm package.jsons in
revisions.
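
For context: msgpack's native integer types only cover the range [-2^63, 2^64 - 1], so anything outside that range needs a custom encoding; this is presumably what the npm metadata runs into. A minimal illustration with msgpack-python:

```python
import msgpack

msgpack.packb(2**64 - 1)     # largest native msgpack integer (uint64): packs fine
msgpack.packb(-(2**63))      # smallest native msgpack integer (int64): packs fine
msgpack.packb(-(2**63) - 1)  # out of range: raises OverflowError
```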

Test Plan

tox test added

Diff Detail

Repository
rDCORE Foundations and core functionalities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4568 (id=16224)

Rebasing onto e0e55b71cc...

Current branch diff-target is up to date.
Changes applied before test
commit 38728c614169faa3bbf19045abe74cee0369d853
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Nov 23 22:43:15 2020 +0100

    Add support for large negative integers in msgpack encoding
    
    This is to support some of the metadata imported from npm package.jsons in
    revisions.

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/166/ for more details.

ardumont added a subscriber: ardumont.

lgtm


I've got questions... about those functions in general.

Any particular reason for having those encoders/decoders hard-coded in the main encoding/decoding functions?
(I know you are only extending the current behavior here to solve the problem at hand, but I mean long term.)

Or can we envision refactoring those into the realm of the extra_encoders/decoders at some point?
Then encode_types becomes a simple for loop calling the encoders.
(Ideally, something equivalent could happen on the decoding side.)
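
A hypothetical sketch of the refactor being suggested, assuming a registry shaped like type -> (tag, encoder); the registry layout and the tagged-dict wrapper are illustrative guesses, not the actual swh.core structures:

```python
from typing import Any, Callable, Dict, Tuple, Type

# Illustrative registry: maps a Python type to a tag name and an encoder
# returning something natively serializable.
EXTRA_ENCODERS: Dict[Type, Tuple[str, Callable[[Any], Any]]] = {
    set: ("set", list),
    complex: ("complex", lambda c: [c.real, c.imag]),
}

def encode_types(obj: Any) -> Any:
    # The "simple for loop calling the encoders" from the comment above.
    for type_, (tag, encoder) in EXTRA_ENCODERS.items():
        if isinstance(obj, type_):
            return {"swhtype": tag, "d": encoder(obj)}
    return obj
```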

This revision is now accepted and ready to land. Nov 24 2020, 9:49 AM
douardda added a subscriber: douardda.

Bit sad indeed, but LGTM.

Why don't you pass `signed=True` to `int.to_bytes`/`int.from_bytes`, and add "one eighth of a byte" to the length (i.e. replace `if rem: length += 1` with `length += 1`) when encoding?

In D4568, @douardda wrote:

Why don't you pass `signed=True` to `int.to_bytes`/`int.from_bytes`, and add "one eighth of a byte" to the length (i.e. replace `if rem: length += 1` with `length += 1`) when encoding?

Because that's not backwards-compatible with existing serialized data.
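
To make the incompatibility concrete, here is a sketch assuming the length is derived from `bit_length()` as the quoted `if rem: length += 1` snippet suggests: with signed=True, any non-negative value whose top byte is full would need an extra sign byte, so its serialized bytes would change.

```python
n = 255
length = (n.bit_length() + 7) // 8          # 1 byte in the current unsigned scheme

n.to_bytes(length, "big")                   # b'\xff' -- the existing serialized form
# signed=True needs room for the sign bit: to_bytes(length, ..., signed=True)
# raises OverflowError here, hence the unconditional length += 1:
n.to_bytes(length + 1, "big", signed=True)  # b'\x00\xff' -- a different byte string
```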

Move comment back to the right place

Build is green

Patch application report for D4568 (id=16237)

Rebasing onto e0e55b71cc...

Current branch diff-target is up to date.
Changes applied before test
commit f269e2d3d0f4853a8a94f8eaca7d3373b1ebf092
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Nov 23 22:43:15 2020 +0100

    Add support for large negative integers in msgpack encoding
    
    This is to support some of the metadata imported from npm package.jsons in
    revisions.

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/167/ for more details.

In D4568, @ardumont wrote:

I've got questions... about those functions in general.

Any particular reason for having those encoders/decoders hard-coded in the main encoding/decoding functions?
(I know you are only extending the current behavior here to solve the problem at hand, but I mean long term.)

Or can we envision refactoring those into the realm of the extra_encoders/decoders at some point?
Then encode_types becomes a simple for loop calling the encoders.
(Ideally, something equivalent could happen on the decoding side.)

Yeah, here's my reasoning so far:

extra_encoders/decoders are (originally) for stuff that can't be represented in either JSON or msgpack, and they generate compatible representations in both serializations.

The extra bits for msgpack, as well as the bytes serialization for JSON, are for base types that are supported by one format but not the other; the idea was to have a baseline compact representation of these types, while other types would be handled through the more verbose, but pluggable, mechanism.

The fact that the bytes serialization was put in the main ENCODERS/DECODERS is, I believe, a mistake: it'll never be used for msgpack (it should probably just be enabled in the JSON encoder/decoder stack).

Honestly, I'm not sure this reasoning holds much water, but that's where we are currently.

Of course, most of this could also be alleviated by making the revision metadata field bytes instead of arbitrary JSON, or by finally dropping it! But we're not there yet :-)
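
For illustration, here is the contrast being described, with both wire formats invented for the sketch (the ExtType code 43 and the tagged-dict keys are placeholders, not the real swh.core formats):

```python
import msgpack

n = -(2**70)  # too big for msgpack's native int64

# "Baseline compact" route: a dedicated msgpack extension type carrying
# the two's-complement bytes of the integer.
length = (n.bit_length() + 7) // 8 + 1  # +1 byte so the sign bit fits
compact = msgpack.packb(msgpack.ExtType(43, n.to_bytes(length, "big", signed=True)))

# "Verbose but pluggable" route: a self-describing tagged dict.
verbose = msgpack.packb({"swhtype": "int", "d": str(n)})

assert len(compact) < len(verbose)  # the extension form is noticeably smaller
```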

In D4568#114334, @olasd wrote:

Why don't you pass `signed=True` to `int.to_bytes`/`int.from_bytes`, and add "one eighth of a byte" to the length (i.e. replace `if rem: length += 1` with `length += 1`) when encoding?

Because that's not backwards-compatible with existing serialized data.

Ah yes, I forgot about that.