Page MenuHomeSoftware Heritage

python bindings for compressed graph access
Closed, MigratedEdits Locked

Description

We want to have some basic python bindings for accessing the compressed graph representation produced by WebGraph.

Edge-by-edge traversal round-tripping between python and java wouldn't be great performance-wise (yet still useful in various cases), but bindings for entire visits returning integer node identifiers would be quite usable as an alternative to relying on the REST API.

The current best way for doing python<->java bindings seems to be Py4J.

Event Timeline

zack triaged this task as Low priority.Jul 6 2019, 12:21 PM
zack created this task.

Here's a first stab at an API for the py4j bindings that would be nice to use.

Note that it relies on the idea that we will have a single interface definition (as an ABC) for any kind of swh-graph accessor, and that we will uniform existing ones (e.g., the swh-graph Python REST client) to use it. I haven't replicated in the API the interface of each method, but as a first stab we should move here the current interface of the Python REST client.

cc: @seirl for feedback on the API, as he plans to use it.

from abc import ABC, abstractmethod
from enum import Enum

from typing import Set, Union


class GraphFeatures(Enum):
    """graph features selectable at graph loading timeout

    """
    FORWARD_GRAPH = 1
    BACKWARD_GRAPH = 2
    NODE_TYPE_MAP = 3


# graph features loaded by default
DEFAULT_FEATURES: Set[GraphFeatures] = {
    GraphFeatures.FORWARD_GRAPH,   # standard Merkle DAG
    GraphFeatures.BACKWARD_GRAPH,  # transposed Merkle DAG
    GraphFeatures.NODE_TYPE_MAP,   # map: node id (int) -> node type
}


class Graph(ABC):
    """SWH Merkle DAG navigation interface

    common API implemented by all graph abstractions, including REST clients,
    Java bindings, etc.

    """

    @abstractmethod
    def neighbors(...):
        pass

    @abstractmethod
    def visit(...):
        pass

    @abstractmethod
    def walk(...):
        pass

    @abstractmethod
    def leaves(...):
        pass


def load(basepath: Union[str, bytes],
         features: Set[GraphFeatures] = DEFAULT_FEATURES) -> Graph:
    pass

Sounds good. No 'node' type then, we just use IDs? Maybe a Node type would allow to do stuff like neighbors() directly on the Node instance?

This looks a bit "low level" to my taste, but no strong opinion whatsoever.

No objection from me on adding a more abstract node type. It would be a nicer API and, given it's gonna be on the python side only, it won't have any impact on perf.

Something else entirely: streaming. It might be worth to think now at a streaming API (and possibly make it the only API), to be able to return huge node lists/paths without having to load them entirely in memory.

That would pave the way to, maybe, implement the REST API service in Python, on top of the bindings.

seirl claimed this task.