Page MenuHomeSoftware Heritage

SWHID parsing: simplify and deduplicate validation logic
ClosedPublic

Authored by zack on Dec 19 2020, 1:51 PM.

Details

Summary

Before this change there was a lot of overlap between parse_swhid() and the
attrs-based validators in the SWHID class. Also, the validation implementation
in parse_swhid() was done by hand.

With this change the coarse-grained validation done by parse_swhid() is now
delegated to a regex. The semantic validation of SWHIDs is left to attrs
validators. The regex is also exposed as a module attribute, to be used by
client code that want to syntactically validate SWHIDs without necessarily
instantiate SWHID classes (we have several other modules doing that already,
and they are using slightly different hand-made regexs, which isn't great).

As part of this change we also clean up the use of ValidationError exceptions,
systematically passing the problematic parts of SWHID as arguments, and uniform
error messages.

This change also brings some speed up in SWHID parsing. On a benchmark parsing
~30 M valid SWHIDs, the previous implementation took ~3:06 minutes, the new one
~2:50 minutes, or a ~9% speedup.

Closes T2788

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4771 (id=16886)

Rebasing onto 76b744e08f...

Current branch diff-target is up to date.
Changes applied before test
commit a1804e15b6a1f17c0bc6c62ba8eca1cf9b591bf2
Author: Stefano Zacchiroli <zack@upsilon.cc>
Date:   Sat Dec 19 11:54:59 2020 +0100

    SWHID parsing: simplify and deduplicate validation logic
    
    Before this change there was a lot of overlap between parse_swhid() and the
    attrs-based validators in the SWHID class. Also, the validation implementation
    in parse_swhid() was done by hand.
    
    With this change the coarse-grained validation done by parse_swhid() is now
    delegated to a regex. The semantic validation of SWHIDs is left to attrs
    validators. The regex is also exposed as a module attribute, to be used by
    client code that want to syntactically validate SWHIDs without necessarily
    instantiate SWHID classes (we have several other modules doing that already,
    and they are using slightly different hand-made regexs, which isn't great).
    
    As part of this change we also clean up the use of ValidationError exceptions,
    systematically passing the problematic parts of SWHID as arguments, and uniform
    error messages.
    
    This change also brings some speed up in SWHID parsing. On a benchmark parsing
    ~30 M valid SWHIDs, the previous implementation took ~3:06 minutes, the new one
    ~2:50 minutes, or a ~9% speedup.
    
    Closes T2788

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/187/ for more details.

Nice!

btw, you don't have to expose SWHID_RE_RAW directly, it's available as SWHID_RE.pattern

swh/model/identifiers.py
33

caps

40

^ and $ are redundant if you only use SWHID_RE.match.

774–777

from None

This revision now requires changes to proceed.Dec 21 2020, 2:54 PM
@vlorentz wrote:

btw, you don't have to expose SWHID_RE_RAW directly, it's available as SWHID_RE.pattern

So you can directly compile the regexes inline in one definition.

swh/model/identifiers.py
40

I would say: if you use match, only the caret is redundant, but if you use fullmatch then both are

swh/model/identifiers.py
40

oh, indeed

zack marked 5 inline comments as done.Dec 30 2020, 1:24 PM

Build is green

Patch application report for D4771 (id=16979)

Rebasing onto 76b744e08f...

Current branch diff-target is up to date.
Changes applied before test
commit 574685052348bfc6ed28570f06b9cc4302dfde27
Author: Stefano Zacchiroli <zack@upsilon.cc>
Date:   Sat Dec 19 11:54:59 2020 +0100

    SWHID parsing: simplify and deduplicate validation logic
    
    Before this change there was a lot of overlap between parse_swhid() and the
    attrs-based validators in the SWHID class. Also, the validation implementation
    in parse_swhid() was done by hand.
    
    With this change the coarse-grained validation done by parse_swhid() is now
    delegated to a regex. The semantic validation of SWHIDs is left to attrs
    validators. The regex is also exposed as a module attribute, to be used by
    client code that want to syntactically validate SWHIDs without necessarily
    instantiate SWHID classes (we have several other modules doing that already,
    and they are using slightly different hand-made regexs, which isn't great).
    
    As part of this change we also clean up the use of ValidationError exceptions,
    systematically passing the problematic parts of SWHID as arguments, and uniform
    error messages.
    
    This change also brings some speed up in SWHID parsing. On a benchmark parsing
    ~30 M valid SWHIDs, the previous implementation took ~3:06 minutes, the new one
    ~2:50 minutes, or a ~9% speedup.
    
    Closes T2788

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/188/ for more details.

This revision is now accepted and ready to land.Dec 30 2020, 2:48 PM