storage: Strip null characters from metadata documents
ClosedPublic

Authored by vlorentz on Aug 10 2022, 10:45 AM.

Details

Summary

Null characters cause PostgreSQL to crash because it does not allow them in text fields.

They seem to be present only by accident in source documents,
so stripping them does not really affect the quality of the metadata.

Resolves T4277.
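
For illustration only, here is a minimal sketch of what stripping null characters from a JSON-like metadata document could look like. It is not the code from this diff: the helper name `strip_nul` and the choice to recurse through dicts and lists are assumptions made for the example.

```python
from typing import Any

# The zero character (U+0000), which PostgreSQL text fields do not accept.
NUL = "\x00"


def strip_nul(value: Any) -> Any:
    """Return a copy of `value` with NUL characters removed from every string.

    Hypothetical helper: walks dicts, lists and tuples recursively and leaves
    any other type (numbers, booleans, None) untouched.
    """
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, dict):
        # Keys can be strings too, so clean them as well as the values.
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples come back as lists, as they would after a JSON round trip.
        return [strip_nul(item) for item in value]
    return value


if __name__ == "__main__":
    document = {"name": "foo\x00bar", "keywords": ["a\x00", "b"], "version": 1}
    print(strip_nul(document))
```

Removing the characters outright, rather than escaping them, matches the reasoning in the summary: they carry no meaning in the source documents, so nothing of value is lost.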

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8229 (id=29685)

Rebasing onto 5313be86b3...

Current branch diff-target is up to date.
Changes applied before test
commit bb9082a6e5b95085ada61a917a2547f7d0a5c5e2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Aug 10 10:44:31 2022 +0200

    storage: Strip null characters from metadata documents
    
    They cause postgresql to crash because it does not allow them in text fields.
    
    They are seemingly only present accidentally in source documents;
    so stripping them does not really impact the quality of metadata.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/413/ for more details.

anlambert added a subscriber: anlambert.

Looks good to me.

swh/indexer/storage/__init__.py
52

s/NUL/NULL/

This revision is now accepted and ready to land. Aug 10 2022, 11:52 AM
swh/indexer/storage/__init__.py
52

nah, NUL is the name of the zero byte/character in ASCII; NULL is the name inherited from C for zero pointers.

Unicode doesn't have a name for the zero character, so I used ASCII's.

swh/indexer/storage/__init__.py
52

hah, actually Unicode calls it NULL but allows NUL as an alias: https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt

swh/indexer/storage/__init__.py
52

Ack