
replayer crashes on invalid directory entry name (b'gitter/gitter.xml')
Open, High, Public

Description

https://sentry.softwareheritage.org/organizations/swh/issues/104802/?referrer=phabricator_plugin

ValueError: b'gitter/gitter.xml' is not a valid directory entry name.
(14 additional frame(s) were not displayed)
...
  File "swh/model/model.py", line 1195, in from_dict
    entries=tuple(
  File "swh/model/model.py", line 1196, in <genexpr>
    DirectoryEntry.from_dict(entry) for entry in d.pop("entries")
  File "swh/model/model.py", line 352, in from_dict
    return cls(**d)
  File "<attrs generated init swh.model.model.DirectoryEntry>", line 7, in __init__
    Implementation of Software Heritage's data model
  File "swh/model/model.py", line 1163, in check_name
    raise ValueError(f"{value!r} is not a valid directory entry name.")
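The crash comes from swh.model's entry-name validation: a directory entry name containing a slash would escape its directory, so it is rejected at construction time. A simplified sketch of that check (the real implementation is `check_name` in swh/model/model.py, which also rejects other invalid names):

```python
# Simplified sketch of swh.model's directory entry name validation
# (hypothetical reduction; the real check_name rejects more cases).
def check_name(name: bytes) -> None:
    # A slash would make the entry refer outside its own directory.
    if b"/" in name:
        raise ValueError(f"{name!r} is not a valid directory entry name.")

check_name(b"gitter_gitter.xml")  # fine: no slash

try:
    check_name(b"gitter/gitter.xml")
except ValueError as e:
    print(e)  # → b'gitter/gitter.xml' is not a valid directory entry name.
```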


Event Timeline

I tried to add a workaround in the backfiller, but it is incredibly hard to do properly, especially as the entries are disordered, so raw_manifest needs to be fixed in two different ways.

I think it would be best to fix this once and for all in both the database and kafka. Here is a script to fix the three problematic trees:

from swh.model.hashutil import hash_to_bytes
from swh.model.model import Directory, DirectoryEntry
from swh.storage import get_storage

d1 = Directory(
    entries=(
        DirectoryEntry(
            name=b"gitter_gitter.xml",  # changed
            type="file",
            target=hash_to_bytes("ec4afaa35ce19f6f93133149cbbf316832007d6e"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_script.py",  # changed
            type="file",
            target=hash_to_bytes("1dd3ec83942bbc04deee7fc6be8b9c6e703d57f9"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"README.markdown",
            type="file",
            target=hash_to_bytes("582d63d6b7a82afa139eef8371bf5e90e9555651"),
            perms=0o100755,
        ),
    ),
    raw_manifest=(
        b"tree 132\x00"
        b"100775 README.markdown\x00X-c\xd6\xb7\xa8*\xfa\x13\x9e\xef\x83q\xbf^\x90\xe9UVQ"
        b"100775 gitter/gitter.xml\x00\xecJ\xfa\xa3\\\xe1\x9fo\x93\x131I\xcb\xbf1h2\x00}n"
        b"100775 gitter/script.py\x00\x1d\xd3\xec\x83\x94+\xbc\x04\xde\xee\x7f\xc6\xbe\x8b\x9cnp=W\xf9"
    ),
)

assert d1.id == hash_to_bytes("379405713938d0e89fe10d1ff43bdee87e06e088")

d2 = Directory(
    entries=(
        DirectoryEntry(
            name=b"AUTHORS",
            type="file",
            target=hash_to_bytes("7fde9841768149bb19884eff75edca01e104b181"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"COPYING",
            type="file",
            target=hash_to_bytes("d50a11d64fa528fc76b38192d18c053fe82241da"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_gitter.xml",  # changed
            type="file",
            target=hash_to_bytes("ec4afaa35ce19f6f93133149cbbf316832007d6e"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_script.py",  # changed
            type="file",
            target=hash_to_bytes("1dd3ec83942bbc04deee7fc6be8b9c6e703d57f9"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"README.markdown",
            type="file",
            target=hash_to_bytes("582d63d6b7a82afa139eef8371bf5e90e9555651"),
            perms=0o100755,
        ),
    ),
    raw_manifest=(
        b"tree 202\x00"
        b"100775 AUTHORS\x00\x7f\xde\x98Av\x81I\xbb\x19\x88N\xffu\xed\xca\x01\xe1\x04\xb1\x81"
        b'100775 COPYING\x00\xd5\n\x11\xd6O\xa5(\xfcv\xb3\x81\x92\xd1\x8c\x05?\xe8"A\xda'
        b"100775 README.markdown\x00X-c\xd6\xb7\xa8*\xfa\x13\x9e\xef\x83q\xbf^\x90\xe9UVQ"
        b"100775 gitter/gitter.xml\x00\xecJ\xfa\xa3\\\xe1\x9fo\x93\x131I\xcb\xbf1h2\x00}n"
        b"100775 gitter/script.py\x00\x1d\xd3\xec\x83\x94+\xbc\x04\xde\xee\x7f\xc6\xbe\x8b\x9cnp=W\xf9"
    ),
)

assert d2.id == hash_to_bytes("04aa01e3bc24766135a29a9c6cfcc6a5eec50410")


d3 = Directory(
    entries=(
        DirectoryEntry(
            name=b"gitter_gitter.xml",  # changed
            type="file",
            target=hash_to_bytes("8512d68b66565426de430f06a96a2efedfc3623f"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_NEWS",  # changed
            type="file",
            target=hash_to_bytes("0bfeb2d36499ae4c330cffdcc2fe0f9ca30d2818"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_COPYING",  # changed
            type="file",
            target=hash_to_bytes("d50a11d64fa528fc76b38192d18c053fe82241da"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"AUTHORS",
            type="file",
            target=hash_to_bytes("7fde9841768149bb19884eff75edca01e104b181"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"COPYING",
            type="file",
            target=hash_to_bytes("d50a11d64fa528fc76b38192d18c053fe82241da"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_script.py",  # changed
            type="file",
            target=hash_to_bytes("ff242b0acecc36271cc8a5848cdab57c91e488a8"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"gitter_AUTHORS",  # changed
            type="file",
            target=hash_to_bytes("7fde9841768149bb19884eff75edca01e104b181"),
            perms=0o100755,
        ),
        DirectoryEntry(
            name=b"README.markdown",
            type="file",
            target=hash_to_bytes("582d63d6b7a82afa139eef8371bf5e90e9555651"),
            perms=0o100755,
        ),
    ),
    raw_manifest=(
        b"tree 325\x00"
        b"100775 AUTHORS\x00\x7f\xde\x98Av\x81I\xbb\x19\x88N\xffu\xed\xca\x01\xe1\x04\xb1\x81"
        b'100775 COPYING\x00\xd5\n\x11\xd6O\xa5(\xfcv\xb3\x81\x92\xd1\x8c\x05?\xe8"A\xda'
        b"100775 README.markdown\x00X-c\xd6\xb7\xa8*\xfa\x13\x9e\xef\x83q\xbf^\x90\xe9UVQ"
        b"100775 gitter/AUTHORS\x00\x7f\xde\x98Av\x81I\xbb\x19\x88N\xffu\xed\xca\x01\xe1\x04\xb1\x81"
        b'100775 gitter/COPYING\x00\xd5\n\x11\xd6O\xa5(\xfcv\xb3\x81\x92\xd1\x8c\x05?\xe8"A\xda'
        b"100775 gitter/NEWS\x00\x0b\xfe\xb2\xd3d\x99\xaeL3\x0c\xff\xdc\xc2\xfe\x0f\x9c\xa3\r(\x18"
        b"100775 gitter/gitter.xml\x00\x85\x12\xd6\x8bfVT&\xdeC\x0f\x06\xa9j.\xfe\xdf\xc3b?"
        b"100775 gitter/script.py\x00\xff$+\n\xce\xcc6'\x1c\xc8\xa5\x84\x8c\xda\xb5|\x91\xe4\x88\xa8"
    ),
)

assert d3.id == hash_to_bytes("efe077bea80139cea754a83ac3d6fc773276745f")


storage = get_storage(
    "postgresql",
    db="service=swh",  # XXX: change me
    objstorage={"cls": "memory"},
)

# Pretend these directories are missing so directory_add rewrites them.
storage.directory_missing = lambda *args, **kwargs: [d1.id, d2.id, d3.id]

print(storage.directory_add([d1, d2, d3]))

This script changes entry names to use an underscore instead of a slash, and adds raw manifests built from the original objects (extracted manually from https://github.com/jkingok/sl4a-scripts).
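This works because, when a raw_manifest is set, the directory's id is the sha1 of those manifest bytes (the complete git tree object, "tree <len>\0" header included), so renaming the in-model entries does not change the id. A quick self-contained check against the first tree:

```python
import hashlib

# Reconstruct d1's raw_manifest from the hex hashes used in the script;
# its sha1 should be the id asserted for d1 above.
raw_manifest = (
    b"tree 132\x00"
    b"100775 README.markdown\x00"
    + bytes.fromhex("582d63d6b7a82afa139eef8371bf5e90e9555651")
    + b"100775 gitter/gitter.xml\x00"
    + bytes.fromhex("ec4afaa35ce19f6f93133149cbbf316832007d6e")
    + b"100775 gitter/script.py\x00"
    + bytes.fromhex("1dd3ec83942bbc04deee7fc6be8b9c6e703d57f9")
)
print(hashlib.sha1(raw_manifest).hexdigest())
# → 379405713938d0e89fe10d1ff43bdee87e06e088
```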

Someone with write access needs to DELETE the three objects from the database, edit the db="service=swh" line (marked # XXX: change me), and run the script against the production DB.

Many thanks to @seirl for doing the inventory in P1358.

vlorentz changed the visibility from "All Users" to "Public (No Login Required)". (Oct 25 2022, 5:57 PM)

Do you actually want to keep these objects? This would be inconsistent with the fixed loader behavior that would just reject those objects, and not load the repository at all.

Holes are bad. And I just opened a diff to make the git loader apply the same transformation, as @olasd made the same comment: D8776

Oh yeah, I was thinking of just removing the entire project, but your solution also works.

@vlorentz I'm running the following adaptation to your script:

import os
import pprint

import swh.storage.postgresql.storage
from swh.model.hashutil import hash_to_bytes
from swh.model.model import Directory, DirectoryEntry
from swh.storage import get_storage
from swh.storage.api.server import load_and_check_config

# [...]

config_path = os.environ.get("SWH_CONFIG_FILENAME")
config = load_and_check_config(config_path)

storage = get_storage(**config["storage"])
assert isinstance(storage, swh.storage.postgresql.storage.Storage)

db = storage.get_db()
with db.transaction() as cur:
    cur.execute("DELETE from directory where id in %s", ((d1.id, d2.id, d3.id),))
    print(cur.statusmessage)
    assert cur.statusmessage == "DELETE 3", "Deleted the wrong number of directories"

    ret = storage.directory_add([d1, d2, d3], db=db, cur=cur)
    print(ret)
    assert ret == {"directory:add": 3}, "Wrong number of directories added"
    for d in [d1, d2, d3]:
        print(f"Checking directory {d.id.hex()}")
        assert storage.directory_get_raw_manifest([d.id], db=db, cur=cur) == {
            d.id: d.raw_manifest,
        }, f"Raw manifest for {d.id.hex()} NOK"

        result = storage.directory_get_entries(d.id, db=db, cur=cur)
        assert result, f"Directory {d.id.hex()} not found"

        pprint.pprint(sorted(result.results, key=lambda e: e.name))

        assert sorted(result.results, key=lambda e: e.name) == sorted(
            d.entries, key=lambda e: e.name
        ), f"Entries do not match for {d.id.hex()}"

    input("Commit?")

Seems to work fine on a scratch local DB.

Pulling the config from saam will also write the fixed directories to kafka at no additional cost.

WDYT?

This is now done: the objects are fixed in the production DB and kafka.

Of course, we still need to adapt the replayer to skip these broken objects now, because kafka compaction will only happen "at some point"
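Until compaction drops the stale messages, the replayer has to tolerate them rather than crash. A hedged sketch of that skip logic (illustrative only; `to_model` stands in for the swh.model from_dict call, and the function names are not the actual replayer API):

```python
def replay_batch(messages, to_model):
    """Deserialize each message, skipping ones that fail model validation
    (e.g. stale pre-fix directory versions still present in the topic)."""
    objects, skipped = [], []
    for msg in messages:
        try:
            objects.append(to_model(msg))
        except ValueError:
            skipped.append(msg)
    return objects, skipped

# Toy deserializer mimicking check_name, for demonstration only.
def to_model(msg):
    for name in msg["entries"]:
        if b"/" in name:
            raise ValueError(f"{name!r} is not a valid directory entry name.")
    return msg

good = {"entries": [b"README.markdown"]}
bad = {"entries": [b"gitter/gitter.xml"]}
objects, skipped = replay_batch([good, bad], to_model)
print(len(objects), len(skipped))  # → 1 1
```

In the real replayer the skipped messages would be logged with their offsets so they can be checked against the fixed ids.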

To force kafka compaction to run I've done the following:

olasd@getty:~$ /opt/kafka/bin/kafka-configs.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --topic swh.journal.objects.directory --alter --add-config min.cleanable.dirty.ratio=0.1,min.compaction.lag.ms=1200000,max.compaction.lag.ms=2678400000
Completed updating config for topic swh.journal.objects.directory.
olasd@getty:~$ for nthreads in 2 4 8; do for broker in 1 2 3 4; do /opt/kafka/bin/kafka-configs.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --broker $broker --alter --add-config log.cleaner.threads=$nthreads; done; done
Completed updating config for broker 1.
Completed updating config for broker 2.
Completed updating config for broker 3.
Completed updating config for broker 4.
Completed updating config for broker 1.
Completed updating config for broker 2.
Completed updating config for broker 3.
Completed updating config for broker 4.
Completed updating config for broker 1.
Completed updating config for broker 2.
Completed updating config for broker 3.
Completed updating config for broker 4.

This forces a compaction to run at most every 31 days (and it did indeed trigger a compaction for the topic).
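For reference, converting the lag values applied above into human units (pure arithmetic, no kafka interaction):

```python
# Values from the kafka-configs.sh invocation above.
min_lag_ms = 1_200_000      # min.compaction.lag.ms
max_lag_ms = 2_678_400_000  # max.compaction.lag.ms

print(min_lag_ms // 60_000)      # → 20 (minutes a record must age before it may be compacted)
print(max_lag_ms // 86_400_000)  # → 31 (days after which compaction of dirty segments is forced)
```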