
indexer.metadata: Warn and skip incomplete entries from the journal
Closed, Public

Authored by ardumont on Jul 29 2022, 11:47 AM.

Details

Reviewers
vlorentz
Group Reviewers
Reviewers
Maniphest Tasks
Restricted Maniphest Task
Commits
rDCIDXf31f1f5b3426: indexer.metadata: Warn and skip incomplete entries from the journal
Summary

Detected through T4406

According to Sentry, entries like the second item in [1] can appear in the journal, and they are not currently handled.
This diff is an attempt to unstick the extrinsic metadata journal client by warning about such entries and skipping them.

Related to T4412

[1]

{
raw_extrinsic_metadata: [
{
authority: [Filtered], 
discovery_date: datetime.datetime(2021, 10, 18, 12, 43, 22, tzinfo=datetime.timezone.utc), 
fetcher: {"metadata":"{}","name":"'swh.loader.package.npm.loader.NpmLoader'","version":"'0.25.0'"}, 
format: 'replicate-npm-package-json', 
id: b'\x10u\xb6\xb8\xd7\xfe\xb1\x92^\xa7\xa30\xa5^\xf4\xa6\x8dpF\r', 
metadata: b'{"name": "@arkecosystem/platform-sdk-lsk", "description": "Cross-Platform Utilities for ARK Applications", "version": "0.9.362", "contributors": [], "license": "MIT", "main": "dist/index", "types": "dist/index", "scripts": {"build": "yarn clean && tsc", "build:watch": "yarn build -w", "build:docs": "typedoc --out docs src", "clean": "rimraf .coverage dist tmp", "test": "jest", "test:watch": "jest --watchAll", "coverage:report": "codecov", "publish": "yarn build && yarn npm publish --access public --tol..., 
origin: 'https://www.npmjs.com/package/@arkecosystem/platform-sdk-lsk', 
revision: 'swh:1:rev:77436d8960cc5c3997535a03711e34ca63cfd97b', 
target: 'swh:1:dir:18fd0df245b17e8552e32cb950db1d1837063fc2'
}, 
{
authority: [Filtered], 
discovery_date: datetime.datetime(2021, 10, 18, 12, 43, 22, tzinfo=datetime.timezone.utc)
}
]
}
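
The behaviour this diff aims for (warn about incomplete entries and skip them rather than fail on them) could be sketched roughly as follows. This is only an illustration, not the code in swh/indexer/metadata.py: the function name `filter_complete_entries` and the `REQUIRED_KEYS` set are hypothetical, and the key set is an assumption based on the fields visible in the complete entry above.

```python
import logging
from typing import Any, Dict, Iterable, List

logger = logging.getLogger(__name__)

# Assumption for this sketch: the keys a usable entry must carry.
REQUIRED_KEYS = frozenset(
    {"authority", "discovery_date", "fetcher", "format", "id", "metadata", "target"}
)


def filter_complete_entries(entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the entries that carry every required key.

    Incomplete entries are logged at WARNING level and skipped.
    """
    complete = []
    for entry in entries:
        # Iterating a dict yields its keys, so this is the set of absent keys.
        missing = sorted(REQUIRED_KEYS.difference(entry))
        if missing:
            logger.warning(
                "Skipping incomplete entry (missing keys: %s): %r",
                ", ".join(missing),
                entry,
            )
            continue
        complete.append(entry)
    return complete
```

With the two items from [1], such a filter would keep the first entry and drop the second, which only has authority and discovery_date.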

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont edited the summary of this revision.

Build is green

Patch application report for D8165 (id=29487)

Rebasing onto 29cfbd25ca...

Current branch diff-target is up to date.
Changes applied before test
commit 180601e26fd8594c6527789a7d5640ec2470daa1
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 29 11:46:06 2022 +0200

    indexer.metadata: Warn and skip incomplete entries from the journal
    
    Detected through T4406
    
    Related to T4412

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/389/ for more details.

> {
> authority: [Filtered], 
> discovery_date: datetime.datetime(2021, 10, 18, 12, 43, 22, tzinfo=datetime.timezone.utc)
> }

wtf? that's clearly not a valid REMD dict, how did that end up in the journal?

swh/indexer/metadata.py, line 79

The format string uses two placeholders (%s and %r) but is given only a single argument.
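
The inline remark concerns a logging call whose format string has two placeholders while only one argument is supplied. Line 79 itself is not shown in this thread, so the snippet below is only an assumed illustration of the pattern; the logger name, message text, and function name are hypothetical.

```python
import logging

logger = logging.getLogger("remd_sketch")


def warn_incomplete(reason: str, entry: dict) -> None:
    # Two placeholders (%s and %r) need two arguments. With only one,
    # the logging module reports a formatting error instead of the message.
    logger.warning("Skipping entry (%s): %r", reason, entry)
```

Passing the values as separate arguments, rather than pre-formatting the string, also lets the logging module defer interpolation until a handler actually emits the record.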

> wtf? that's clearly not a valid REMD dict, how did that end up in the journal?

no idea.

Could you open a task so I can investigate this and tombstone existing messages like this?

In the meantime, this will be fine, I guess

This revision is now accepted and ready to land. Jul 29 2022, 1:35 PM

Build is green

Patch application report for D8165 (id=29488)

Rebasing onto 29cfbd25ca...

Current branch diff-target is up to date.
Changes applied before test
commit f31f1f5b34264026826e4b15a41290179f9c1892
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 29 11:46:06 2022 +0200

    indexer.metadata: Warn and skip incomplete entries from the journal
    
    Detected through T4406
    
    Related to T4412

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/390/ for more details.

> Could you open a task so I can investigate this and tombstone existing messages like this?
>
> In the meantime, this will be fine, I guess

Yes, I opened T4412, which is linked here, for this.

Actually, the object causing the issue is:

{authority: {metadata: {}, type: 'forge', url: 'https://pypi.org/'}, discovery_date: datetime.datetime(2020, 11, 30, 22, 41, 37, 627239, tzinfo=datetime.timezone.utc), fetcher: {metadata: {}, name: 'swh.loader.package.pypi.loader.PyPILoader', version: '0.15.0'}, format: 'pypi-project-json', metadata: b'{"comment_text": "", "digests": {"md5": "dd8ef4e12995d12f2c8d95b9a6cfd677", "sha256": "aaab2adbe8fd5489110308a6d1ee428ded77f876423cad89a73949e0026d1c7e"}, "downloads": -1, "filename": "cloudknot-0.2.1.tar.gz", "has_sig": false, "md5_digest": "dd8ef4e12995d12f2c8d95b9a6cfd677", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 73133, "upload_time": "2017-12-02T22:02:11", "upload_time_iso_8601": "2017-12-02T22:02:11.049284Z", "url": "https://files.pythonhosted.org/packa..., origin: 'https://pypi.org/project/cloudknot/', revision: 'swh:1:rev:1da412dcd10cb71f8691dacacb51e5c900da0c2b', target: 'swh:1:dir:536907383a30b30f343780f383663f6e66cf43e2', type: 'directory'}

Only id is missing; the other keys are present.
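
One way to double-check which keys are absent is to compare the key set of the failing object with that of a complete entry. The sketch below does this using only the key names visible in the two dumps in this revision (values are left out, since only the keys matter here); the helper name `diff_keys` is hypothetical.

```python
from typing import Iterable, Set, Tuple


def diff_keys(reference: Iterable[str], candidate: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (keys missing from candidate, keys only present in candidate)."""
    ref, cand = set(reference), set(candidate)
    return ref - cand, cand - ref


# Keys of the complete npm entry shown in the summary.
complete_keys = [
    "authority", "discovery_date", "fetcher", "format", "id",
    "metadata", "origin", "revision", "target",
]
# Keys of the PyPI object quoted above.
failing_keys = [
    "authority", "discovery_date", "fetcher", "format",
    "metadata", "origin", "revision", "target", "type",
]

missing, extra = diff_keys(complete_keys, failing_keys)
print(sorted(missing), sorted(extra))
```

For these two key lists the helper reports id as the only missing key, and type as a key that the complete entry does not have.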