In the early steps of the graph compression pipeline, we have a pretty intensive workload in which the ORC graph dataset is read multiple times to write the intermediate compressed representation. By default, the pure Java implementation of the ORC readers is used, which is slower than the native libraries.
To use the native libraries instead, one should:
a. download a Hadoop release here: https://hadoop.apache.org/releases.html
b. extract it somewhere, e.g. /opt/hadoop-3.2.3
c. pass -Djava.library.path=/opt/hadoop-3.2.3/lib/native to the Java command line
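Whichever route we pick, it is easy to check that the flag is actually honored: Hadoop reports this through NativeCodeLoader. A minimal sketch, assuming hadoop-common is already on the classpath:

```java
import org.apache.hadoop.util.NativeCodeLoader;

public class CheckNative {
    public static void main(String[] args) {
        // isNativeCodeLoaded() is true only if libhadoop was actually found
        // on java.library.path and loaded; false means we silently fell back
        // to the pure Java readers.
        System.out.println("native hadoop loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
    }
}
```

(The extracted release also ships a "hadoop checknative" command that prints the same information from the shell.)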
I am not sure what would be the best way of doing that. I see a few options:
- We ask users to do a. and b. themselves, then give the path to the Hadoop home in the configuration file of swh graph compress, roughly as sketched after this list (How To Make Users Cry).
- We automatically download the libraries from the web at runtime and put them in a tmpdir (How To Make Package Managers Cry).
- We package Hadoop in the Debian repos of swh and add it as a dependency of swh-graph (How To Make Sysadmins Cry).
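To make the first option concrete, here is a rough sketch of what the wiring could look like. Everything project-specific here is invented for illustration: the compress.properties file name, the hadoop.home key, and the entry-point class are not actual swh-graph names.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.Properties;

public class CompressLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical config file edited by the user; expects e.g.
        // "hadoop.home=/opt/hadoop-3.2.3".
        Properties conf = new Properties();
        try (FileInputStream in = new FileInputStream("compress.properties")) {
            conf.load(in);
        }
        String hadoopHome = conf.getProperty("hadoop.home");
        if (hadoopHome == null) {
            throw new IllegalStateException("hadoop.home not set in compress.properties");
        }
        // Re-launch the JVM with lib/native on java.library.path so the
        // native ORC/Hadoop readers get picked up.
        ProcessBuilder pb = new ProcessBuilder(
                "java",
                "-Djava.library.path=" + Path.of(hadoopHome, "lib", "native"),
                "-cp", System.getProperty("java.class.path"),
                "org.example.CompressStep"); // placeholder for the real entry point
        pb.inheritIO(); // pass stdout/stderr through to the parent
        System.exit(pb.start().waitFor());
    }
}
```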
Thoughts?