
Native hadoop libraries during graph compression
Closed, Migrated

Description

In the early steps of the graph compression pipeline, we have a pretty intensive workload in which the ORC graph dataset is read multiple times to write the intermediary compressed representation. By default, the pure Java implementation of the ORC readers is used, which is slower than the native libraries.

To use the native libraries instead, one should:

a. download a Hadoop release here: https://hadoop.apache.org/releases.html
b. extract it somewhere, e.g. /opt/hadoop-3.2.3
c. pass -Djava.library.path=/opt/hadoop-3.2.3/lib/native to the Java command line
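For illustration, a minimal sketch (in Python, since the swh graph CLI drives the Java pipeline) of how the compression step could forward that option to the JVM; the Hadoop path, jar and class name below are placeholders, not the actual invocation:

  import subprocess

  # Placeholder location -- an assumption for illustration only.
  HADOOP_NATIVE = "/opt/hadoop-3.2.3/lib/native"

  def run_compression_step(jar: str, main_class: str, *args: str) -> None:
      # Forward the native library path to the JVM so the native Hadoop/ORC
      # readers are picked up instead of the pure Java implementation.
      cmd = [
          "java",
          f"-Djava.library.path={HADOOP_NATIVE}",
          "-cp", jar,
          main_class,
          *args,
      ]
      subprocess.run(cmd, check=True)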

I am not sure what would be the best way of doing that. I see a few options:

  1. We ask users to do a. and b. themselves, then give the path to the hadoop home in the configuration file of swh graph compress (How To Make Users Cry).
  2. We automatically download the libraries from the web at runtime and put them in a tmpdir (How To Make Package Managers Cry).
  3. We package Hadoop in the Debian repos of swh and add it as a dependency of swh-graph (How To Make Sysadmins Cry).

Thoughts?


Event Timeline

seirl triaged this task as Normal priority. May 16 2022, 6:29 PM
seirl created this task.

  4.a. properly declare this in the maven dependencies of swh.graph
  4.b. ensure the container image generation pipeline and container entrypoint script properly handle this extra dependency and argument

(4. doesn't work because libhadoop.so isn't packaged in maven)

So, as was mentioned during the IRC discussion, one of the possible ways forward is to:

  • load the native libraries from a ~/.cache/swh/graph/.../ subdirectory if it exists
  • add a swh graph utility command (swh graph fetch-native-hadoop-libraries or somesuch) to fetch and extract the native libraries into that directory

We would then be able to use this utility command in the Dockerfile that would be used to deploy the swh.graph compression pipeline.
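To make the proposal concrete, here is a rough sketch of what such a fetch command could do; the release URL, cache layout and function name are assumptions and would have to be pinned down in the actual implementation:

  import pathlib
  import tarfile
  import urllib.request

  # Assumed defaults -- the real mirror URL and cache layout are still to be decided.
  HADOOP_URL = "https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz"
  CACHE_DIR = pathlib.Path.home() / ".cache" / "swh" / "graph" / "hadoop-native"

  def fetch_native_hadoop_libraries(url: str = HADOOP_URL,
                                    dest: pathlib.Path = CACHE_DIR) -> pathlib.Path:
      """Download a Hadoop release and extract only lib/native/ into the cache."""
      dest.mkdir(parents=True, exist_ok=True)
      archive, _ = urllib.request.urlretrieve(url)
      with tarfile.open(archive) as tar:
          native = [m for m in tar.getmembers() if "/lib/native/" in m.name]
          tar.extractall(path=dest, members=native)
      return dest

The compression pipeline (and the Dockerfile entrypoint) would then check whether that directory exists and, if so, add the corresponding -Djava.library.path option to the java command line.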