
Look into triple-stores suitable as swh-search backends
Closed, Migrated

Description

Currently we rely on Elasticsearch: we embed JSON-LD documents and treat them as regular JSON when searching.

This causes issues like T2876 and T4396 because ES expects some sort of strict schema, which goes against JSON-LD's design.
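To make the schema problem concrete, here is a minimal sketch in plain Python (no Elasticsearch involved). It assumes the issues come from JSON-LD allowing the same property to hold a string in one document and an object in another; the sample documents and the helper names `json_kind` and `find_mapping_conflicts` are hypothetical, and the tasks above may involve other cases as well.

```python
def json_kind(value):
    """Return a coarse kind for a JSON value, roughly as a strict mapping would see it."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Treat an array like its elements; a mixed array yields several kinds.
        kinds = {json_kind(item) for item in value}
        return "|".join(sorted(kinds)) if kinds else "empty"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if value is None:
        return "null"
    return "string"


def find_mapping_conflicts(documents):
    """Return the top-level keys that appear with more than one kind of value."""
    seen = {}
    for doc in documents:
        for key, value in doc.items():
            seen.setdefault(key, set()).add(json_kind(value))
    return {key: kinds for key, kinds in seen.items() if len(kinds) > 1}


# Two hypothetical JSON-LD documents: both are valid, but "author" is a plain
# string in the first and a node object in the second.
docs = [
    {"@context": "https://schema.org", "name": "foo", "author": "Jane Doe"},
    {
        "@context": "https://schema.org",
        "name": "bar",
        "author": {"@type": "Person", "name": "John Doe"},
    },
]

conflicts = find_mapping_conflicts(docs)
print(conflicts)
```

A field that shows up in `conflicts` is one where a single fixed field type cannot accommodate both documents as they are.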

Plus, it does not support any sort of relation between documents, which we would need in order to run queries across related projects (e.g. dependency graphs, fork graphs, "related software").

Therefore, I would like to try using a proper triple-store. Virtuoso in particular looks promising, as it supports both SPARQL and full-text search.
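As a rough illustration of the kind of relation query meant here, the following is a toy in-memory triple store in plain Python. It is only a sketch: the `TripleStore` class, the `forkOf` and `dependsOn` predicates, and the `ex:` identifiers are all made up for the example and do not come from swh-search or any real vocabulary.

```python
from collections import deque


class TripleStore:
    """Toy store of (subject, predicate, object) triples, for illustration only."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return [
            t
            for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]

    def reachable(self, start, predicate):
        """Return every node reachable from start via one or more predicate edges."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _, _, obj in self.match(s=node, p=predicate):
                if obj not in seen:
                    seen.add(obj)
                    queue.append(obj)
        return seen


store = TripleStore()
store.add("ex:b", "forkOf", "ex:a")       # b is a fork of a
store.add("ex:c", "forkOf", "ex:b")       # c is a fork of b
store.add("ex:a", "dependsOn", "ex:lib")  # a depends on lib

# Walk the fork graph upwards from c, across two "documents".
upstreams = store.reachable("ex:c", "forkOf")
print(sorted(upstreams))
```

In a real triple-store, this traversal would typically be expressed as a SPARQL 1.1 property path (something like `ex:c :forkOf+ ?upstream`) rather than hand-written code.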

Event Timeline

vlorentz triaged this task as Normal priority. (Aug 16 2022, 10:24 AM)
vlorentz created this task.
vlorentz updated the task description.
vlorentz renamed this task from "Look into Virtuoso as a new backend for swh-search" to "Look into triple-stores suitable as swh-search backends". (Aug 16 2022, 2:51 PM)

Another option I am considering, if the others do not scale: repurposing swh-graph (especially now that it supports edge labels) as a triple store. We would then need to either reimplement SPARQL on top of it, or combine https://github.com/SoftwareHeritage/swh-graph-tinkerpop with TinkerPop's sparql-gremlin (https://tinkerpop.apache.org/docs/current/reference/#sparql-gremlin).

Actually the open-source version of Virtuoso is unmaintained, and I couldn't figure out how to use it. Blazegraph and Apache Jena look promising too, though.

@KShivendu pointed out that Neo4j may be a strong option too. It does not natively support SPARQL, but we could add SPARQL as a layer on top of the TinkerPop API.