#+title: swh-git-loader - Specification (draft)
#+author: swh team
#+source: https://intranet.softwareheritage.org/index.php/Swh_git_loader

The Software Heritage Git Loader is a tool and a library that walks a local Git repository and injects into the SWH dataset all contained files that were not known before.

* License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See the top-level LICENSE file for the full text of the GNU General Public License along with this program.

* Dependencies

- python3
- python3-sqlalchemy
- python3-psycopg2
- python3-pygit2

* Requirements

** Functional

- input: a bare Git repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the path on the filesystem that contains the corresponding content (AKA the file cache)
- input (optional): a set of SHA1s of Git commits that have already been seen in the past (AKA the Git commit cache)
- output: an augmented SWH dataset, where all files present in all blobs referenced by any Git object have been added

*** algo

Sketch of the (naive) algorithm that the Git loader should execute:

#+begin_src pseudo
for each ref in the repo
    for each commit referenced by the commit graph starting at that ref
        if we have a git commit cache and the commit is in there:
            stop treating the current commit sub-graph
        for each tree referenced by the commit
            for each blob referenced by the tree
                compute the SHA256 checksum of the blob
                lookup the checksum in the file cache
                if it is not there
                    add the file to the dataset on the filesystem
                    add the file to the file cache,
                        pointing to the file path on the filesystem
#+end_src

** Non-functional

- implementation language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables

** File-system storage

Given a file whose SHA256 is:

b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

it will be stored at:

STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

* Configuration

Create a configuration file in *~/.cache/sgloader.ini*:

#+begin_src ini
[main]
dataset_dir = dataset
log_dir = log
# see http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls
db_url = postgres:///swhgitloader
#+end_src

* Run

** Parse a repository

#+begin_src sh
PYTHONPATH=$PYTHONPATH:`pwd` ./bin/sgloader --repo-path /path/to/repo createdb
#+end_src

** Clean data

#+begin_src sh
PYTHONPATH=$PYTHONPATH:`pwd` ./bin/sgloader --repo-path /path/to/repo dropdb
#+end_src

* TODO Improvements

- [X] Push to a remote git repository
- [X] Serialize the blob's data, not the blob's size.
- [X] Logging in Python? How to see the log?
- [ ] Drop sqlalchemy and use psycopg2
- [ ] A small unit test to determine the slight difference in commit count
- [ ] Improve modularization (a file module? a hash computation module?)
- [ ] Of course, add unit tests!

* Implementation

** Expected results

After each run (for comparison purposes):

#+begin_src sql
swhgitloader=> select count(*) from object_cache where type = 0; -- commit
 count
-------
  1731
(1 row)

swhgitloader=> select count(*) from object_cache where type = 1; -- tree
 count
-------
  2819
(1 row)

swhgitloader=> select count(*) from file_cache;
 count
-------
  2944
(1 row)
#+end_src

** sqlalchemy

Lost my scratch buffer; need to replay this.
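The commit-cache cutoff in the algorithm sketch above can be illustrated with a toy, in-memory commit graph. This is only a sketch: in the real loader pygit2 supplies actual Git objects, and the names =walk_commits= and =parents= here are hypothetical.

#+begin_src python
# Toy illustration of the Git commit cache cutoff from the algorithm sketch:
# the walk prunes at any commit already in the cache, so previously loaded
# history is never re-traversed. A plain dict stands in for the commit graph,
# mapping commit id -> list of parent ids (hypothetical names).

def walk_commits(refs, parents, commit_cache):
    """Yield unseen commits reachable from refs, pruning at cached ones."""
    seen = set()
    stack = list(refs)
    while stack:
        commit = stack.pop()
        if commit in seen or commit in commit_cache:
            continue  # stop treating the current commit sub-graph
        seen.add(commit)
        yield commit
        stack.extend(parents.get(commit, []))

# linear history: c3 -> c2 -> c1
parents = {"c3": ["c2"], "c2": ["c1"], "c1": []}

# first run: empty cache, all commits are walked
print(sorted(walk_commits(["c3"], parents, set())))   # ['c1', 'c2', 'c3']

# second run: c2 already cached, so its sub-graph (c1) is skipped too
print(sorted(walk_commits(["c3"], parents, {"c2"})))  # ['c3']
#+end_src

The cutoff is what makes repeated runs cheap: only commits added since the previous run are traversed.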
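The sharding scheme from the File-system storage section above can be sketched in Python: the first four byte pairs of the SHA256 hex digest become nested directories, and the remainder is the file name. The helper name =content_path= is hypothetical.

#+begin_src python
import os

def content_path(storage_root, sha256_hex):
    """Map a SHA256 hex digest to its on-disk path under storage_root:
    first four byte pairs as directories, the rest as the file name."""
    dirs = [sha256_hex[i:i + 2] for i in range(0, 8, 2)]  # e.g. b5, bb, 9d, 80
    return os.path.join(storage_root, *dirs, sha256_hex[8:])

print(content_path(
    "STORAGE_ROOT",
    "b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c"))
# STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
#+end_src

Sharding over four directory levels keeps any single directory small even for millions of files.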
** psycopg2

With one connection opened and closed for each db access:

#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [17:38:56]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  85.82s user 23.53s system 19% cpu 9:16.00 total
#+end_src

With one connection kept open for the whole computation:

#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [18:02:27]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  39.45s user 8.02s system 50% cpu 1:34.08 total
#+end_src

After removing an unneeded check:

#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [18:07:59]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  30.74s user 6.19s system 48% cpu 1:16.32 total
#+end_src

Using the file cache:

#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [20:48:31]
$ time make cleandb run REPO_PATH=~/repo/perso/dot-files FLAG=-v
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run REPO_PATH=~/repo/perso/dot-files FLAG=-v  17.68s user 2.23s system 33% cpu 59.404 total
#+end_src
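The speed-up measured above comes from holding one database connection open for the whole run instead of reconnecting for each cache access. A minimal sketch of that design follows; the =FileCache= class is a hypothetical name, and sqlite3 stands in for Postgres/psycopg2 here only so the example is self-contained and runnable.

#+begin_src python
import sqlite3

class FileCache:
    """File cache holding a single open connection for the whole run,
    instead of opening and closing one per lookup/insert (the change
    behind the timing difference measured above)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)  # opened once, reused throughout
        self.conn.execute(
            "create table if not exists file_cache"
            " (sha256 text primary key, path text)")

    def lookup(self, sha256):
        """Return the stored path for sha256, or None if not cached."""
        row = self.conn.execute(
            "select path from file_cache where sha256 = ?",
            (sha256,)).fetchone()
        return row[0] if row else None

    def add(self, sha256, path):
        """Record that the content with this sha256 lives at path."""
        self.conn.execute(
            "insert or ignore into file_cache (sha256, path) values (?, ?)",
            (sha256, path))
        self.conn.commit()

    def close(self):
        self.conn.close()

cache = FileCache()
cache.add("deadbeef", "dataset/de/ad/be/ef")
print(cache.lookup("deadbeef"))  # dataset/de/ad/be/ef
cache.close()
#+end_src

The same shape applies with psycopg2: create the connection once at loader start-up, pass it to the cache object, and close it when the run finishes.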