
Github descriptions are not used to search origins
Closed, Migrated

Event Timeline

vlorentz lowered the priority of this task from High to Normal.
vlorentz created this task.
ardumont changed the status of subtask T4614: Deploy swh-search v0.16.4 from Open to Work in Progress. Oct 17 2022, 4:57 PM

Bisecting, it seems the metadata is lost somewhere between the indexer storage:

softwareheritage-indexer=> select id, metadata, mappings from origin_extrinsic_metadata where id='https://github.com/nodejs-packages/settings-gateway';
id       | https://github.com/nodejs-packages/settings-gateway
metadata | {"name": "nodejs-packages/settings-gateway", "type": "https://forgefed.org/ns#Repository", "license": "https://spdx.org/licenses/MIT", "@context": "https://doi.org/10.5063/schema/codemeta-2.0", "schema:dateCreated": "2020-06-03T19:50:03Z", "schema:downloadUrl": "https://api.github.com/repos/nodejs-packages/settings-gateway/{archive_format}{/ref}", "schema:dateModified": "2020-06-03T20:41:12Z", "codemeta:issueTracker": "https://api.github.com/repos/nodejs-packages/settings-gateway/issues{/number}", "schema:codeRepository": "https://github.com/nodejs-packages/settings-gateway", "https://forgefed.org/ns#forks": {"type": "https://www.w3.org/ns/activitystreams#OrderedCollection", "https://www.w3.org/ns/activitystreams#totalItems": 0}, "https://www.w3.org/ns/activitystreams#likes": {"type": "https://www.w3.org/ns/activitystreams#Collection", "https://www.w3.org/ns/activitystreams#totalItems": 0}, "https://www.w3.org/ns/activitystreams#followers": {"type": "https://www.w3.org/ns/activitystreams#Collection", "https://www.w3.org/ns/activitystreams#totalItems": 0}}
mappings | {github}

and swh-search:

>>> from swh.search import get_search; s = get_search("remote", url="http://moma.internal.softwareheritage.org:5010")
>>> s.origin_get("https://github.com/nodejs-packages/settings-gateway")
{'sha1': '8acc09d7081e31724973c8e971037b4cadac0340', 'url': 'https://github.com/nodejs-packages/settings-gateway', 'visit_types': ['git'], 'last_eventful_visit_date': '2022-09-16T22:41:12.169370+00:00', 'snapshot_id': '14fc014e30901e0aa7057a21b95a6c62d03ad10d', 'nb_visits': 1, 'has_visits': True, 'last_visit_date': '2022-09-16T22:41:12.169370+00:00'}

I just checked with another project, and it is indeed in the journal:

>>> client = get_journal_client(cls="kafka", **{**conf["journal"], "group_id": "swh-vlorentz-test-idx-origin-ext-metadata2", "prefix": "swh.journal.indexed", "object_types": ["origin_extrinsic_metadata"]})
>>> with tqdm.tqdm() as pbar:
...   client.process(print)
...
{'origin_extrinsic_metadata': [{'tool': {'id': 110502138, 'name': 'swh-metadata-detector', 'version': '0.0.2', 'configuration': {}}, 'id': 'https://github.com/JacobiClark/design-resources-for-developers', 'metadata': {'@context': 'https://doi.org/10.5063/schema/codemeta-2.0', 'type': 'https://forgefed.org/ns#Repository', 'schema:codeRepository': 'https://github.com/JacobiClark/design-resources-for-developers', 'schema:dateCreated': '2020-06-13T21:45:45Z', 'schema:dateModified': '2020-06-13T21:45:47Z', 'description': 'Curated list of design and UI resources from stock photos, web templates, CSS frameworks, UI libraries, tools and much more', 'schema:downloadUrl': 'https://api.github.com/repos/JacobiClark/design-resources-for-developers/{archive_format}{/ref}', 'license': 'https://spdx.org/licenses/MIT', 'name': 'JacobiClark/design-resources-for-developers', 'codemeta:issueTracker': 'https://api.github.com/repos/JacobiClark/design-resources-for-developers/issues{/number}', 'https://forgefed.org/ns#forks': {'type': 'https://www.w3.org/ns/activitystreams#OrderedCollection', 'https://www.w3.org/ns/activitystreams#totalItems': 0}, 'https://www.w3.org/ns/activitystreams#followers': {'type': 'https://www.w3.org/ns/activitystreams#Collection', 'https://www.w3.org/ns/activitystreams#totalItems': 0}, 'https://www.w3.org/ns/activitystreams#likes': {'type': 'https://www.w3.org/ns/activitystreams#Collection', 'https://www.w3.org/ns/activitystreams#totalItems': 0}}, 'from_remd_id': b'<\xd4\xd4l\xa6\x1f#\xf8\xa9\xf5Q\x90A\xe7\xd0\xc3!\xbb\xa6\xf5', 'mappings': ['github']}, ...

(I had already run the swh-vlorentz-test-idx-origin-ext-metadata2 consumer group up to the last message, so this message was produced today.)

But the metadata is not visible from swh-search:

>>> from swh.search import get_search; s = get_search("remote", url="http://moma.internal.softwareheritage.org:5010")
>>> s.origin_get("https://github.com/JacobiClark/design-resources-for-developers")
{'sha1': 'aec9c77413aee668ec89894620aad31158c4f81d', 'url': 'https://github.com/JacobiClark/design-resources-for-developers', 'visit_types': ['git'], 'last_eventful_visit_date': '2022-10-18T12:16:37.273014+00:00', 'snapshot_id': '615a2f4b8900a2a9acba8a49b6af025a93f5f657', 'nb_visits': 1, 'has_visits': True, 'last_visit_date': '2022-10-18T12:16:37.273014+00:00'}

However, Grafana shows no lag on swh-search's journal client: https://grafana.softwareheritage.org/goto/cPEEDUSVk?orgId=1 (corroborated by CMAK: http://getty.internal.softwareheritage.org:9000/clusters/rocquencourt/consumers/swh.search.journal_client.indexed-v0.11/topic/swh.journal.indexed.origin_intrinsic_metadata/type/KF).

Therefore, it seems swh-search's journal client reads the objects, but either discards them or writes them to the wrong Elasticsearch index.

swh-web uses swh-search as a glorified PostgreSQL index: for every result returned by swh-search, it pulls the corresponding row from origin_intrinsic_metadata in the indexer database, which means it ignores extrinsic metadata.
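
To illustrate why this loses matches, here is a minimal sketch using plain dicts rather than the real swh-web or swh-indexer APIs. The names `search_hits`, `intrinsic_rows` and `enrich_with_intrinsic` are hypothetical, as is the second origin URL, and dropping hits that have no intrinsic row is an assumption about the behaviour, not something checked against the swh-web code:

```python
def enrich_with_intrinsic(search_hits, intrinsic_rows):
    """Join search hits with intrinsic metadata rows, keyed by origin URL.

    Assumption: a hit with no row in the intrinsic table is dropped.
    """
    results = []
    for hit in search_hits:
        row = intrinsic_rows.get(hit["url"])
        if row is None:
            # Origin only has extrinsic metadata: nothing to pull, so it is lost.
            continue
        results.append({"url": hit["url"], "metadata": row})
    return results


# Hypothetical data: both origins match the search query, but only one
# of them has a row in origin_intrinsic_metadata.
search_hits = [
    {"url": "https://github.com/nodejs-packages/settings-gateway"},
    {"url": "https://example.org/has-intrinsic-metadata"},
]
intrinsic_rows = {
    "https://example.org/has-intrinsic-metadata": {"name": "has-intrinsic-metadata"},
}

results = enrich_with_intrinsic(search_hits, intrinsic_rows)
print([r["url"] for r in results])
```

Under these assumptions, an origin that matches only through its extrinsic (GitHub) metadata never reaches the final result list, even if swh-search returned it.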

I will fix this by removing the query to swh-indexer when searching for origins, because we only care about the URL in this case.
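
A rough sketch of the intended shape of that fix, again with plain dicts instead of the real swh-web code (`origins_from_search` and `search_hits` are hypothetical names): the results are built from what swh-search returns, with no lookup in the indexer database.

```python
def origins_from_search(search_hits):
    """Build origin search results from swh-search hits alone.

    Only the URL is kept, so no indexer query is needed and origins
    matched through extrinsic metadata are not filtered out.
    """
    return [{"url": hit["url"]} for hit in search_hits]


# Hypothetical hits, shaped like a subset of what origin_get returns above.
search_hits = [
    {"url": "https://github.com/nodejs-packages/settings-gateway", "visit_types": ["git"]},
    {"url": "https://github.com/JacobiClark/design-resources-for-developers", "visit_types": ["git"]},
]

origins = origins_from_search(search_hits)
print(origins)
```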