
storage: Strip null characters from metadata documents
Closed, Public

Authored by vlorentz on Aug 10 2022, 10:45 AM.

Details

Summary

Null characters cause PostgreSQL to crash because it does not allow them in text fields.

They seem to appear only accidentally in source documents,
so stripping them does not meaningfully affect the quality of the metadata.

Resolves T4277.
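
For illustration, stripping of this kind can be sketched as below. This is a minimal sketch, not the patch under review: the helper name `strip_nul` and the sample document are hypothetical, and it assumes the metadata document is a JSON-like structure of dicts, lists, and strings.

```python
import json
from typing import Any


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from every string in a JSON-like value.

    Hypothetical helper for illustration only; not the code from this diff.
    """
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings in JSON documents, so they are cleaned as well.
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    # Numbers, booleans and None cannot contain a NUL character.
    return value


# Made-up sample document with stray NUL characters.
doc = {"name": "exam\x00ple", "keywords": ["a\x00", "b"], "version": 1}
cleaned = strip_nul(doc)
serialized = json.dumps(cleaned)
```

Cleaning the value before serialization means the stored text never contains the `\u0000` escape that a text column would reject.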

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 30738
Build 48059: Phabricator diff pipeline on jenkins
Build 48058: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8229 (id=29685)

Rebasing onto 5313be86b3...

Current branch diff-target is up to date.
Changes applied before test
commit bb9082a6e5b95085ada61a917a2547f7d0a5c5e2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Aug 10 10:44:31 2022 +0200

    storage: Strip null characters from metadata documents
    
    They cause postgresql to crash because it does not allow them in text fields.
    
    They are seemingly only present accidentally in source documents;
    so stripping them does not really impact the quality of metadata.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/413/ for more details.

anlambert added a subscriber: anlambert.

Looks good to me.

swh/indexer/storage/__init__.py
53

s/NUL/NULL/

This revision is now accepted and ready to land. Aug 10 2022, 11:52 AM
swh/indexer/storage/__init__.py
53

nah, NUL is the name of the zero byte/character in ASCII; NULL is the name inherited from C for zero pointers.

Unicode doesn't have a name for the zero character, so I used ASCII's.

swh/indexer/storage/__init__.py
53

hah, actually Unicode calls it NULL but allows NUL as an alias: https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt

swh/indexer/storage/__init__.py
53

Ack