diff --git a/reports/node_mapping/node_mapping.tex b/reports/node_mapping/node_mapping.tex
index ab831e8..cd7e81c 100644
--- a/reports/node_mapping/node_mapping.tex
+++ b/reports/node_mapping/node_mapping.tex
@@ -1,155 +1,203 @@
\documentclass[11pt,a4paper]{article}
\usepackage[a4paper,left=2cm,right=2cm,top=2.5cm,bottom=2.5cm]{geometry}
\usepackage[english]{babel}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{minted}
\usepackage{parskip}
\usepackage{siunitx}
\usepackage{tikz}

\title{Google Summer of Code 2019}
\author{Thibault Allançon}
\date{June 6, 2019}

\newcommand{\mmap}{\mintinline{text}{mmap}}

\begin{document}

\maketitle

Software Heritage refers to its graph nodes using persistent string
identifiers\footnote{\url{https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html}}.
During the compression, WebGraph maps those labels to long identifiers. We need
an efficient bidirectional map between these labels and the internal node ids.

\section{Additional information}

Example of a SWH string identifier:

\begin{minted}{text}
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
\end{minted}

The first step of the compression process is to map the input strings to long
identifiers using the \mintinline{text}{GOVMinimalPerfectHashFunction} from the
Sux4J\footnote{\url{http://sux4j.di.unimi.it/}} framework. The generated
\mintinline{text}{.mph} function is a \textbf{minimal perfect hash function}
(mapping $n$ keys to $n$ consecutive integers with no collisions).

Later during the compression process, the nodes are re-ordered (to be precise:
in the order of a BFS traversal), so the MPH mapping is shuffled.

\begin{figure}[H]
    \centering
    \begin{tikzpicture}[auto, node distance=6cm]
        \node[label=below:{(string)}] (input) {\mintinline{text}{.nodes.csv}};
        \node[right of=input, label=below:{(long)}] (mph) {\mintinline{text}{.mph}};
        \node[right of=mph, label=below:{(long)}] (bfs) {\mintinline{text}{.order}};
        \draw[->] (input) edge node{MPH function} (mph);
        \draw[->] (mph) edge node{BFS ordering} (bfs);
    \end{tikzpicture}
    \caption{End-to-end node mapping}
\end{figure}

We want to create a bidirectional map to efficiently retrieve the corresponding
strings or longs. Using naive hash maps is not possible due to memory usage (we
-have around 20B nodes), hence we must use a disk-based solution.
+have around 10B nodes), hence we must use a disk-based solution.

-\section{Solution 1: \mmap{}}
+\newpage
+
+\section{Solution 1: Simple file}

Idea: dump the two mappings into huge files and load them using the \mmap{}
system call to avoid storing them in memory.

The map from long to string only needs to store the SWH id, because the MPH is
a perfect mapping: we can store on line $i$ the SWH persistent identifier of
node $i$, and simply read the $i$-th line to retrieve it.

The map from string to long needs to store both the SWH id and the node id on
the same line. To efficiently get the corresponding node id, we can store the
lines ordered by SWH id and binary search to find the mapping.
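+
+As an illustration, here is a minimal sketch of the two lookups. This is not
+the actual implementation: the class name, file layout and record encoding are
+made up for the example. It assumes the long to string file stores one
+50-character SWH id per line (line $i$ for node $i$), and the string to long
+file stores fixed-width lines of the form ``SWH id, space, zero-padded node
+id'', sorted by SWH id.
+
+\begin{small}
+\begin{minted}{java}
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.nio.charset.StandardCharsets;
+
+public class NodeIdMapSketch {
+    private static final int SWH_ID_LENGTH = 50;   // "swh:1:cnt:" + 40 hex chars
+    private static final int NODE_ID_LENGTH = 20;  // zero-padded decimal long
+    // "<swh id> <node id>\n"
+    private static final int RECORD_SIZE = SWH_ID_LENGTH + 1 + NODE_ID_LENGTH + 1;
+
+    // long -> string: node i is simply the i-th fixed-width line.
+    public static String getSwhId(RandomAccessFile longToString, long node)
+            throws IOException {
+        byte[] buf = new byte[SWH_ID_LENGTH];
+        longToString.seek(node * (SWH_ID_LENGTH + 1));
+        longToString.readFully(buf);
+        return new String(buf, StandardCharsets.US_ASCII);
+    }
+
+    // string -> long: binary search in the file sorted by SWH id.
+    public static long getNodeId(RandomAccessFile stringToLong, String swhId)
+            throws IOException {
+        byte[] buf = new byte[RECORD_SIZE];
+        long low = 0;
+        long high = stringToLong.length() / RECORD_SIZE - 1;
+        while (low <= high) {
+            long mid = (low + high) >>> 1;
+            stringToLong.seek(mid * RECORD_SIZE);
+            stringToLong.readFully(buf);
+            String line = new String(buf, StandardCharsets.US_ASCII);
+            int cmp = line.substring(0, SWH_ID_LENGTH).compareTo(swhId);
+            if (cmp == 0) {
+                return Long.parseLong(
+                        line.substring(SWH_ID_LENGTH + 1, RECORD_SIZE - 1));
+            } else if (cmp < 0) {
+                low = mid + 1;
+            } else {
+                high = mid - 1;
+            }
+        }
+        return -1; // unknown SWH id
+    }
+}
+\end{minted}
+\end{small}
+
+With fixed-width records, neither direction needs an in-memory index: the long
+to string lookup is a single seek, and the string to long lookup does
+$O(\log n)$ seeks.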

-The main problem with this solution is technical since Java has a maximum size
-on \mmap{}-ed files of ~2GB. To overcome the limitation and \mmap{} very large
-files, there is the \mintinline{text}{ByteBufferInputStream} class from the
+
+\subsection{Java I/O API}
+
+Java offers many ways to write to a file; three were considered and timed:
+
+\begin{small}
+\begin{minted}{java}
+// Version 1
+try (Writer writer = new BufferedWriter(new OutputStreamWriter(
+        new FileOutputStream("output.txt"), "utf-8"))) {
+    writer.write("test");
+}
+
+// Version 2
+try (Writer writer = new BufferedWriter(new FileWriter("output.txt"))) {
+    writer.write("test");
+}
+
+// Version 3
+try (FileWriter writer = new FileWriter("output.txt")) {
+    writer.write("test");
+}
+\end{minted}
+\end{small}
+
+String ids are sorted in alphabetical order, so when reading them the long to
+string map needs to write data at random positions in the output file. One way
+to turn this into a sequential write is to accumulate the data in an array and
+dump the array sequentially into the file once the reading process is done.
+
+\subsection{\mmap{}}
+
+Java has a maximum size of 2GB on \mmap{}-ed files. To overcome this limitation
+and \mmap{} very large files, there is the
+\mintinline{text}{ByteBufferInputStream} class from the
dsiutils\footnote{\url{http://dsiutils.di.unimi.it/}} package. But this only
applies to reading from files, not writing to them. I had to create the
-\mintinline{text}{ByteBufferOutputStream} to allow writing on huge \mmap{}-ed
-files.
+\mintinline{text}{ByteBufferOutputStream} class to allow writing on huge
+\mmap{}-ed files.
+
+\subsection{Timings}
+
+Experiments on the simple file solution were done on a VM with 1TB of RAM and
+40 vCPUs using the \mintinline{text}{rev_to_rev} dataset (1B nodes and 1B
+edges).
+
+\begin{center}
+    \begin{tabular}{@{} l *5r @{}}
+        \toprule
+        \multicolumn{1}{c}{} &
+        \textbf{I/O v1} & \textbf{I/O v2} &
+        \textbf{I/O v3} &
+        \textbf{random \mmap{}} & \textbf{sequential \mmap{}} \\
+        \midrule
+        \texttt{Dump time}
+            & 4990s & \textbf{4890s} & 5350s & 9290s & 5240s \\
+        \bottomrule
+    \end{tabular}
+\end{center}

-Note: I first started with the \mmap{} solution and got early results that
-allowed me to compare with the DB-based solution, before going full in with the
-\mmap{} one and coding the \mintinline{text}{ByteBufferOutputStream}.

\section{Solution 2: Databases}

Another approach would be to simply use a disk-based database; two were
considered in my tests: RocksDB and HaloDB.

In both cases, I tried optimizing the database options to fit our needs and get
the best timings.

On the technical side, the code needed is simpler and more generic than the
-custom \mmap{} solution. However, this can only be a feasible solution if map
+custom file solution. However, this can only be a feasible solution if map
retrieval/building time overheads are small enough to scale to the entire
graph.

\subsection{RocksDB}

RocksDB is a popular and mature persistent key-value database developed by
Facebook. It is coded in C++ but provides a Java API.

The code used to dump and load the mappings can be found in the
\mintinline{text}{NodeIdMapRocksDB.java} file.

RocksDB has \textbf{many} parameters to tune (see their tuning
guide\footnote{\url{https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide}}).
I tried the most obvious ones, but being completely new to this software, the
benchmarks are probably not the best possible.
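+
+To give an idea of the code involved, here is a minimal sketch of a
+RocksDB-based string to long map. This is not the actual
+\mintinline{text}{NodeIdMapRocksDB.java}: the database path and the value
+encoding are made up for the example, and the options are left close to their
+defaults.
+
+\begin{small}
+\begin{minted}{java}
+import java.nio.charset.StandardCharsets;
+import org.rocksdb.Options;
+import org.rocksdb.RocksDB;
+import org.rocksdb.RocksDBException;
+
+public class RocksDbMapSketch {
+    public static void main(String[] args) throws RocksDBException {
+        RocksDB.loadLibrary();
+        try (Options options = new Options().setCreateIfMissing(true);
+             RocksDB db = RocksDB.open(options, "string_to_long.db")) {
+            byte[] key = "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
+                    .getBytes(StandardCharsets.US_ASCII);
+
+            // dump: one (SWH id -> node id) entry per node
+            db.put(key, Long.toString(42).getBytes(StandardCharsets.US_ASCII));
+
+            // load/get: retrieve the node id of a given SWH id
+            byte[] value = db.get(key);
+            long nodeId = Long.parseLong(
+                    new String(value, StandardCharsets.US_ASCII));
+            System.out.println(nodeId);
+        }
+    }
+}
+\end{minted}
+\end{small}
+
+The reverse map (node id to SWH id) can be stored in the same way, either in a
+second database or under a key prefix in the same one.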

\subsection{HaloDB}

HaloDB is a relatively new key-value store developed by Yahoo. It caught my
attention because it is written in Java and leaves out features of RocksDB that
we don't need (like range scans) in order to get better performance.

The code used to dump and load the mappings can be found in the
\mintinline{text}{NodeIdMapHaloDB.java} file.

-\newpage
-
\section{Results}

Experiments were done on a VM with 700GB of RAM and 72 vCPUs using the
\mintinline{text}{rev_to_rev} dataset (1B nodes and 1B edges).

\begin{center}
    \begin{tabular}{@{} l *3r @{}}
        \toprule
        \multicolumn{1}{c}{} &
-        \textbf{\mmap{}} & \textbf{RocksDB} &
+        \textbf{file solution} & \textbf{RocksDB} &
        \textbf{HaloDB} \\
        \midrule
        \texttt{dump}
-            & 1h24min & estimated 4h & estimated 8h \\
+            & $<$ 1h & estimated 4h & estimated 8h \\
        \texttt{load}
            & 0.003s & & \\
        \texttt{get node id}
            & $\sim$900µs & & \\
        \texttt{get swh id}
            & $\sim$50µs & & \\
        \texttt{RAM usage (dump)}
            & 130GB & & \\
        \texttt{RAM usage (load)}
            & 5GB & & \\
        \texttt{Disk space}
            & 108GB & & \\
        \bottomrule
    \end{tabular}
\end{center}

Note: \texttt{RAM usage} includes the in-memory compressed graph.

\end{document}