Revisions and Commits
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T4599 Github descriptions are not used to search origins
Migrated | gitlab-migration | T4614 Deploy swh-search v0.16.4
Event Timeline
Bisecting, it seems the metadata is lost somewhere between the indexer storage:
softwareheritage-indexer=> select id, metadata, mappings from origin_extrinsic_metadata where id='https://github.com/nodejs-packages/settings-gateway';
id       | https://github.com/nodejs-packages/settings-gateway
metadata | {"name": "nodejs-packages/settings-gateway", "type": "https://forgefed.org/ns#Repository", "license": "https://spdx.org/licenses/MIT", "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "schema:dateCreated": "2020-06-03T19:50:03Z", "schema:downloadUrl": "https://api.github.com/repos/nodejs-packages/settings-gateway/{archive_format}{/ref}", "schema:dateModified": "2020-06-03T20:41:12Z", "codemeta:issueTracker": "https://api.github.com/repos/nodejs-packages/settings-gateway/issues{/number}", "schema:codeRepository": "https://github.com/nodejs-packages/settings-gateway", "https://forgefed.org/ns#forks": {"type": "https://www.w3.org/ns/activitystreams#OrderedCollection", "https://www.w3.org/ns/activitystreams#totalItems": 0}, "https://www.w3.org/ns/activitystreams#likes": {"type": "https://www.w3.org/ns/activitystreams#Collection", "https://www.w3.org/ns/activitystreams#totalItems": 0}, "https://www.w3.org/ns/activitystreams#followers": {"type": "https://www.w3.org/ns/activitystreams#Collection", "https://www.w3.org/ns/activitystreams#totalItems": 0}}
mappings | {github}
and swh-search:
>>> from swh.search import get_search; s = get_search("remote", url="http://moma.internal.softwareheritage.org:5010")
>>> s.origin_get("https://github.com/nodejs-packages/settings-gateway")
{'sha1': '8acc09d7081e31724973c8e971037b4cadac0340', 'url': 'https://github.com/nodejs-packages/settings-gateway', 'visit_types': ['git'], 'last_eventful_visit_date': '2022-09-16T22:41:12.169370+00:00', 'snapshot_id': '14fc014e30901e0aa7057a21b95a6c62d03ad10d', 'nb_visits': 1, 'has_visits': True, 'last_visit_date': '2022-09-16T22:41:12.169370+00:00'}
I just checked with another project, and it is indeed in the journal:
>>> client = get_journal_client(cls="kafka", **{**conf["journal"], "group_id": "swh-vlorentz-test-idx-origin-ext-metadata2", "prefix": "swh.journal.indexed", "object_types": ["origin_extrinsic_metadata"]})
>>> with tqdm.tqdm() as pbar:
...     client.process(print)
...
{'origin_extrinsic_metadata': [{'tool': {'id': 110502138, 'name': 'swh-metadata-detector', 'version': '0.0.2', 'configuration': {}}, 'id': 'https://github.com/JacobiClark/design-resources-for-developers', 'metadata': {'@context': 'https://doi.org/10.5063/schema/codemeta-2.0', 'type': 'https://forgefed.org/ns#Repository', 'schema:codeRepository': 'https://github.com/JacobiClark/design-resources-for-developers', 'schema:dateCreated': '2020-06-13T21:45:45Z', 'schema:dateModified': '2020-06-13T21:45:47Z', 'description': 'Curated list of design and UI resources from stock photos, web templates, CSS frameworks, UI libraries, tools and much more', 'schema:downloadUrl': 'https://api.github.com/repos/JacobiClark/design-resources-for-developers/{archive_format}{/ref}', 'license': 'https://spdx.org/licenses/MIT', 'name': 'JacobiClark/design-resources-for-developers', 'codemeta:issueTracker': 'https://api.github.com/repos/JacobiClark/design-resources-for-developers/issues{/number}', 'https://forgefed.org/ns#forks': {'type': 'https://www.w3.org/ns/activitystreams#OrderedCollection', 'https://www.w3.org/ns/activitystreams#totalItems': 0}, 'https://www.w3.org/ns/activitystreams#followers': {'type': 'https://www.w3.org/ns/activitystreams#Collection', 'https://www.w3.org/ns/activitystreams#totalItems': 0}, 'https://www.w3.org/ns/activitystreams#likes': {'type': 'https://www.w3.org/ns/activitystreams#Collection', 'https://www.w3.org/ns/activitystreams#totalItems': 0}}, 'from_remd_id': b'<\xd4\xd4l\xa6\x1f#\xf8\xa9\xf5Q\x90A\xe7\xd0\xc3!\xbb\xa6\xf5', 'mappings': ['github']}, ...
(I already ran the swh-vlorentz-test-idx-origin-ext-metadata2 consumer group up to the last message, so this message was produced today)
but metadata is not visible from swh-search:
>>> from swh.search import get_search; s = get_search("remote", url="http://moma.internal.softwareheritage.org:5010")
>>> s.origin_get("https://github.com/JacobiClark/design-resources-for-developers")
{'sha1': 'aec9c77413aee668ec89894620aad31158c4f81d', 'url': 'https://github.com/JacobiClark/design-resources-for-developers', 'visit_types': ['git'], 'last_eventful_visit_date': '2022-10-18T12:16:37.273014+00:00', 'snapshot_id': '615a2f4b8900a2a9acba8a49b6af025a93f5f657', 'nb_visits': 1, 'has_visits': True, 'last_visit_date': '2022-10-18T12:16:37.273014+00:00'}
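To make the comparison between the two stages repeatable, here is a minimal sketch of a check that a piece of metadata text survived into a document. The helper `contains_text` is hypothetical (it is not part of swh-search); the two dicts are trimmed copies of the journal message and of the `origin_get` result shown above.

```python
def contains_text(value, needle):
    """Return True if `needle` occurs in any string nested inside `value`."""
    if isinstance(value, str):
        return needle in value
    if isinstance(value, dict):
        # Look at both keys and values of nested mappings.
        return any(
            contains_text(key, needle) or contains_text(item, needle)
            for key, item in value.items()
        )
    if isinstance(value, (list, tuple)):
        return any(contains_text(item, needle) for item in value)
    return False


# Trimmed copy of the metadata seen in the journal message.
journal_metadata = {
    "name": "JacobiClark/design-resources-for-developers",
    "description": "Curated list of design and UI resources from stock photos, "
    "web templates, CSS frameworks, UI libraries, tools and much more",
}

# Trimmed copy of the document returned by swh-search for the same origin.
search_doc = {
    "url": "https://github.com/JacobiClark/design-resources-for-developers",
    "visit_types": ["git"],
    "nb_visits": 1,
    "has_visits": True,
}

in_journal = contains_text(journal_metadata, "Curated list")
in_search = contains_text(search_doc, "Curated list")
```

With this data, the description is found in the journal message but not in the search document, which matches the observation above.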
But Grafana shows no lag from swh-search's journal client: https://grafana.softwareheritage.org/goto/cPEEDUSVk?orgId=1 (corroborated by CMAK: http://getty.internal.softwareheritage.org:9000/clusters/rocquencourt/consumers/swh.search.journal_client.indexed-v0.11/topic/swh.journal.indexed.origin_intrinsic_metadata/type/KF)
Therefore, it seems swh-search's journal client reads the objects, but either discards them or writes them to the wrong Elasticsearch index.
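A toy model of why "no lag" and "no data" can coexist: consumer lag only measures how far the committed offset is behind the end of the topic, not whether each message was stored. Everything below is a self-contained illustration; `consume` and `dropping_handler` are hypothetical names and do not reflect the actual swh-search journal client code.

```python
def consume(messages, handler):
    """Toy consumer: the offset advances for every message read,
    whether or not the handler produces a document to store."""
    offset = 0
    stored = []
    for message in messages:
        doc = handler(message)
        if doc is not None:
            stored.append(doc)
        offset += 1
    return offset, stored


def dropping_handler(message):
    # Hypothetical bug: the handler silently ignores the object.
    return None


topic = [
    {
        "id": "https://github.com/JacobiClark/design-resources-for-developers",
        "mappings": ["github"],
    }
]

offset, stored = consume(topic, dropping_handler)
lag = len(topic) - offset  # what a lag dashboard would report
```

Here `lag` is zero even though `stored` is empty, so a flat lag graph does not rule out messages being dropped or misrouted.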
Now depends on https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4658 instead
swh-web uses swh-search as a glorified PostgreSQL index: for every result returned by swh-search, it pulls the corresponding row from origin_intrinsic_metadata in the indexer database, which means it ignores extrinsic metadata.
I will fix this by removing the query to swh-indexer when searching for origins, because we only care about the URL in this case.
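A minimal sketch of the difference between the two approaches, using in-memory stand-ins. The function names, the `example.org` origin and the shape of the rows are assumptions made for illustration; this is not the swh-web code.

```python
# Hypothetical search hits, as swh-search might return them (URL only).
SEARCH_HITS = [
    {"url": "https://github.com/JacobiClark/design-resources-for-developers"},
    {"url": "https://example.org/has-intrinsic-metadata"},
]

# Hypothetical contents of origin_intrinsic_metadata, keyed by origin URL.
# The GitHub origin only has extrinsic metadata, so it has no row here.
INTRINSIC_ROWS = {
    "https://example.org/has-intrinsic-metadata": {"name": "demo"},
}


def search_with_indexer_lookup(hits, intrinsic_rows):
    """Previous approach as described above: one indexer lookup per hit."""
    results = []
    for hit in hits:
        # Only the intrinsic table is consulted; extrinsic metadata is never read.
        row = intrinsic_rows.get(hit["url"])
        results.append({"url": hit["url"], "metadata": row})
    return results


def search_urls_only(hits):
    """Proposed approach: rely on swh-search and return only the origin URLs."""
    return [hit["url"] for hit in hits]


old_results = search_with_indexer_lookup(SEARCH_HITS, INTRINSIC_ROWS)
new_results = search_urls_only(SEARCH_HITS)
```

In this model the first origin comes back with no metadata from the lookup even though swh-search matched it, while the URL-only version returns both origins without touching the indexer database.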