
replayer crashes on invalid directory entry name (b'gitter/gitter.xml')
Open, High, Public

Description

https://sentry.softwareheritage.org/organizations/swh/issues/104802/?referrer=phabricator_plugin

ValueError: b'gitter/gitter.xml' is not a valid directory entry name.
(14 additional frame(s) were not displayed)
...
  File "swh/model/model.py", line 1195, in from_dict
    entries=tuple(
  File "swh/model/model.py", line 1196, in <genexpr>
    DirectoryEntry.from_dict(entry) for entry in d.pop("entries")
  File "swh/model/model.py", line 352, in from_dict
    return cls(**d)
  File "<attrs generated init swh.model.model.DirectoryEntry>", line 7, in __init__
    Implementation of Software Heritage's data model
  File "swh/model/model.py", line 1163, in check_name
    raise ValueError(f"{value!r} is not a valid directory entry name.")
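The crash comes from swh.model's entry-name validation: a directory entry name containing a slash would escape its directory, so it is rejected at construction time. A simplified sketch of that check (the real implementation is `check_name` in swh/model/model.py, which also rejects other invalid names):

```python
# Simplified sketch of swh.model's directory entry name validation
# (hypothetical reduction; the real check_name rejects more cases).
def check_name(name: bytes) -> None:
    # A slash would make the entry refer outside its own directory.
    if b"/" in name:
        raise ValueError(f"{name!r} is not a valid directory entry name.")

check_name(b"gitter_gitter.xml")  # fine: no slash

try:
    check_name(b"gitter/gitter.xml")
except ValueError as e:
    print(e)  # → b'gitter/gitter.xml' is not a valid directory entry name.
```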


Event Timeline

I tried to add a workaround in the backfiller, but it is incredibly hard to do properly, especially as the entries are disordered, so raw_manifest needs to be fixed in two different ways.

I think it would be best to fix this once and for all in both the database and kafka. Here is a script to fix the three problematic trees:

from swh.model.hashutil import hash_to_bytes
from swh.model.model import Directory, DirectoryEntry
from swh.storage import get_storage

d1 = Directory(
    entries=(
        DirectoryEntry(
            name=b"gitter_gitter.xml",  # changed
            type="file",
            target=hash_to_bytes("ec4afaa35ce19f6f93133149cbbf316832007d6e"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_script.py",  # changed
            type="file",
            target=hash_to_bytes("1dd3ec83942bbc04deee7fc6be8b9c6e703d57f9"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"README.markdown",
            type="file",
            target=hash_to_bytes("582d63d6b7a82afa139eef8371bf5e90e9555651"),
            perms=0o100755,
        ),
    ),
    raw_manifest=(
        b"tree 132\x00"
        b"100775 README.markdown\x00X-c\xd6\xb7\xa8*\xfa\x13\x9e\xef\x83q\xbf^\x90\xe9UVQ"
        b"100775 gitter/gitter.xml\x00\xecJ\xfa\xa3\\\xe1\x9fo\x93\x131I\xcb\xbf1h2\x00}n"
        b"100775 gitter/script.py\x00\x1d\xd3\xec\x83\x94+\xbc\x04\xde\xee\x7f\xc6\xbe\x8b\x9cnp=W\xf9"
    ),
)

assert d1.id == hash_to_bytes("379405713938d0e89fe10d1ff43bdee87e06e088")

d2 = Directory(
    entries=(
        DirectoryEntry(
            name=b"AUTHORS",
            type="file",
            target=hash_to_bytes("7fde9841768149bb19884eff75edca01e104b181"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"COPYING",
            type="file",
            target=hash_to_bytes("d50a11d64fa528fc76b38192d18c053fe82241da"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_gitter.xml",  # changed
            type="file",
            target=hash_to_bytes("ec4afaa35ce19f6f93133149cbbf316832007d6e"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_script.py",  # changed
            type="file",
            target=hash_to_bytes("1dd3ec83942bbc04deee7fc6be8b9c6e703d57f9"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"README.markdown",
            type="file",
            target=hash_to_bytes("582d63d6b7a82afa139eef8371bf5e90e9555651"),
            perms=0o100755,
        ),
    ),
    raw_manifest=(
        b"tree 202\x00"
        b"100775 AUTHORS\x00\x7f\xde\x98Av\x81I\xbb\x19\x88N\xffu\xed\xca\x01\xe1\x04\xb1\x81"
        b'100775 COPYING\x00\xd5\n\x11\xd6O\xa5(\xfcv\xb3\x81\x92\xd1\x8c\x05?\xe8"A\xda'
        b"100775 README.markdown\x00X-c\xd6\xb7\xa8*\xfa\x13\x9e\xef\x83q\xbf^\x90\xe9UVQ"
        b"100775 gitter/gitter.xml\x00\xecJ\xfa\xa3\\\xe1\x9fo\x93\x131I\xcb\xbf1h2\x00}n"
        b"100775 gitter/script.py\x00\x1d\xd3\xec\x83\x94+\xbc\x04\xde\xee\x7f\xc6\xbe\x8b\x9cnp=W\xf9"
    ),
)

assert d2.id == hash_to_bytes("04aa01e3bc24766135a29a9c6cfcc6a5eec50410")


d3 = Directory(
    entries=(
        DirectoryEntry(
            name=b"gitter_gitter.xml",  # changed
            type="file",
            target=hash_to_bytes("8512d68b66565426de430f06a96a2efedfc3623f"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_NEWS",  # changed
            type="file",
            target=hash_to_bytes("0bfeb2d36499ae4c330cffdcc2fe0f9ca30d2818"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_COPYING",  # changed
            type="file",
            target=hash_to_bytes("d50a11d64fa528fc76b38192d18c053fe82241da"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"AUTHORS",
            type="file",
            target=hash_to_bytes("7fde9841768149bb19884eff75edca01e104b181"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"COPYING",
            type="file",
            target=hash_to_bytes("d50a11d64fa528fc76b38192d18c053fe82241da"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_script.py",  # changed
            type="file",
            target=hash_to_bytes("ff242b0acecc36271cc8a5848cdab57c91e488a8"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_AUTHORS",  # changed
            type="file",
            target=hash_to_bytes("7fde9841768149bb19884eff75edca01e104b181"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"README.markdown",
            type="file",
            target=hash_to_bytes("582d63d6b7a82afa139eef8371bf5e90e9555651"),
            perms=0o100755,
        ),
    ),
    raw_manifest=(
        b"tree 325\x00"
        b"100775 AUTHORS\x00\x7f\xde\x98Av\x81I\xbb\x19\x88N\xffu\xed\xca\x01\xe1\x04\xb1\x81"
        b'100775 COPYING\x00\xd5\n\x11\xd6O\xa5(\xfcv\xb3\x81\x92\xd1\x8c\x05?\xe8"A\xda'
        b"100775 README.markdown\x00X-c\xd6\xb7\xa8*\xfa\x13\x9e\xef\x83q\xbf^\x90\xe9UVQ"
        b"100775 gitter/AUTHORS\x00\x7f\xde\x98Av\x81I\xbb\x19\x88N\xffu\xed\xca\x01\xe1\x04\xb1\x81"
        b'100775 gitter/COPYING\x00\xd5\n\x11\xd6O\xa5(\xfcv\xb3\x81\x92\xd1\x8c\x05?\xe8"A\xda'
        b"100775 gitter/NEWS\x00\x0b\xfe\xb2\xd3d\x99\xaeL3\x0c\xff\xdc\xc2\xfe\x0f\x9c\xa3\r(\x18"
        b"100775 gitter/gitter.xml\x00\x85\x12\xd6\x8bfVT&\xdeC\x0f\x06\xa9j.\xfe\xdf\xc3b?"
        b"100775 gitter/script.py\x00\xff$+\n\xce\xcc6'\x1c\xc8\xa5\x84\x8c\xda\xb5|\x91\xe4\x88\xa8"
    ),
)

assert d3.id == hash_to_bytes("efe077bea80139cea754a83ac3d6fc773276745f")


storage = get_storage(
    "postgresql",
    db="service=swh",  # XXX: change me
    objstorage={"cls": "memory"},
)

# Pretend these directories are missing so directory_add rewrites them.
storage.directory_missing = lambda *args, **kwargs: [d1.id, d2.id, d3.id]

print(storage.directory_add([d1, d2, d3]))

This script changes entry names to use an underscore instead of a slash, and adds raw manifests built from the original objects (extracted manually from https://github.com/jkingok/sl4a-scripts).
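This works because, when a raw_manifest is set, the directory's id is the sha1 of those manifest bytes (the complete git tree object, "tree <len>\0" header included), so renaming the in-model entries does not change the id. A quick self-contained check against the first tree:

```python
import hashlib

# Reconstruct d1's raw_manifest from the hex hashes used in the script;
# its sha1 should be the id asserted for d1 above.
raw_manifest = (
    b"tree 132\x00"
    b"100775 README.markdown\x00"
    + bytes.fromhex("582d63d6b7a82afa139eef8371bf5e90e9555651")
    + b"100775 gitter/gitter.xml\x00"
    + bytes.fromhex("ec4afaa35ce19f6f93133149cbbf316832007d6e")
    + b"100775 gitter/script.py\x00"
    + bytes.fromhex("1dd3ec83942bbc04deee7fc6be8b9c6e703d57f9")
)
print(hashlib.sha1(raw_manifest).hexdigest())
# → 379405713938d0e89fe10d1ff43bdee87e06e088
```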

Someone with write access needs to DELETE the three objects from the database, edit the db="service=swh" line (marked # XXX: change me), and run the script against the production DB.

Many thanks to @seirl for doing the inventory in P1358.

vlorentz changed the visibility from "All Users" to "Public (No Login Required)". (Oct 25 2022, 5:57 PM)

Do you actually want to keep these objects? This would be inconsistent with the fixed loader behavior that would just reject those objects, and not load the repository at all.

Holes are bad. And I just opened a diff to make the git loader apply the same transformation, as @olasd made the same comment: D8776

Oh yeah, I was thinking of just removing the entire project, but your solution also works.

@vlorentz I'm running the following adaptation to your script:

import os
import pprint

import swh.storage.postgresql.storage
from swh.model.hashutil import hash_to_bytes
from swh.model.model import Directory, DirectoryEntry
from swh.storage import get_storage
from swh.storage.api.server import load_and_check_config

# [...]

config_path = os.environ.get("SWH_CONFIG_FILENAME")
config = load_and_check_config(config_path)

storage = get_storage(**config["storage"])
assert isinstance(storage, swh.storage.postgresql.storage.Storage)

db = storage.get_db()
with db.transaction() as cur:
    cur.execute("DELETE from directory where id in %s", ((d1.id, d2.id, d3.id),))
    print(cur.statusmessage)
    assert cur.statusmessage == "DELETE 3", "Deleted the wrong number of directories"

    ret = storage.directory_add([d1, d2, d3], db=db, cur=cur)
    print(ret)
    assert ret == {"directory:add": 3}, "Wrong number of directories added"
    for d in [d1, d2, d3]:
        print(f"Checking directory {d.id.hex()}")
        assert storage.directory_get_raw_manifest([d.id], db=db, cur=cur) == {
            d.id: d.raw_manifest,
        }, f"Raw manifest for {d.id.hex()} NOK"

        result = storage.directory_get_entries(d.id, db=db, cur=cur)
        assert result, f"Directory {d.id.hex()} not found"

        pprint.pprint(sorted(result.results, key=lambda e: e.name))

        assert sorted(result.results, key=lambda e: e.name) == sorted(
            d.entries, key=lambda e: e.name
        ), f"Entries do not match for {d.id.hex()}"

    input("Commit?")

Seems to work fine on a scratch local DB.

Pulling the config from saam will also write the fixed directories to kafka at no additional cost.

WDYT?

This is now done: the objects are fixed in the production DB and kafka.

Of course, we still need to adapt the replayer to skip these broken objects now, because kafka compaction will only happen "at some point"
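Until compaction drops the stale messages, the replayer has to tolerate them rather than crash. A hedged sketch of that skip logic (illustrative only; `to_model` stands in for the swh.model from_dict call, and the function names are not the actual replayer API):

```python
def replay_batch(messages, to_model):
    """Deserialize each message, skipping ones that fail model validation
    (e.g. stale pre-fix directory versions still present in the topic)."""
    objects, skipped = [], []
    for msg in messages:
        try:
            objects.append(to_model(msg))
        except ValueError:
            skipped.append(msg)
    return objects, skipped

# Toy deserializer mimicking check_name, for demonstration only.
def to_model(msg):
    for name in msg["entries"]:
        if b"/" in name:
            raise ValueError(f"{name!r} is not a valid directory entry name.")
    return msg

good = {"entries": [b"README.markdown"]}
bad = {"entries": [b"gitter/gitter.xml"]}
objects, skipped = replay_batch([good, bad], to_model)
print(len(objects), len(skipped))  # → 1 1
```

In the real replayer the skipped messages would be logged with their offsets so they can be checked against the fixed ids.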

To force kafka compaction to run I've done the following:

olasd@getty:~$ /opt/kafka/bin/kafka-configs.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --topic swh.journal.objects.directory --alter --add-config min.cleanable.dirty.ratio=0.1,min.compaction.lag.ms=1200000,max.compaction.lag.ms=2678400000
Completed updating config for topic swh.journal.objects.directory.
olasd@getty:~$ for nthreads in 2 4 8; do for broker in 1 2 3 4; do /opt/kafka/bin/kafka-configs.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --broker $broker --alter --add-config log.cleaner.threads=$nthreads; done; done
Completed updating config for broker 1.
Completed updating config for broker 2.
Completed updating config for broker 3.
Completed updating config for broker 4.
Completed updating config for broker 1.
Completed updating config for broker 2.
Completed updating config for broker 3.
Completed updating config for broker 4.
Completed updating config for broker 1.
Completed updating config for broker 2.
Completed updating config for broker 3.
Completed updating config for broker 4.

This forces a compaction to run at most every 31 days (and it did indeed trigger a compaction for the topic).
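For reference, converting the lag values applied above into human units (pure arithmetic, no kafka interaction):

```python
# Values from the kafka-configs.sh invocation above.
min_lag_ms = 1_200_000      # min.compaction.lag.ms
max_lag_ms = 2_678_400_000  # max.compaction.lag.ms

print(min_lag_ms // 60_000)      # → 20 (minutes a record must age before it may be compacted)
print(max_lag_ms // 86_400_000)  # → 31 (days after which compaction of dirty segments is forced)
```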