The Software Heritage Git Loader is a tool and a library that walks a local Git repository and injects into the SWH dataset all contained files that were not previously known.

License
=======

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See top-level LICENSE file for the full text of the GNU General Public License along with this program.

Dependencies
============

Runtime
-------

- python3
- python3-psycopg2
- python3-pygit2

Test
----

- python3-nose

Requirements
============

Functional
----------

- input: a bare Git repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the path on the filesystem that contains the corresponding content (AKA the file cache)
- input (optional): a set of SHA1s of Git commits that have already been seen in the past (AKA the Git commit cache)
- output: an augmented SWH dataset, to which every file (blob) referenced by any Git object has been added

### algo

Sketch of the (naive) algorithm that the Git loader should execute:

``` {.pseudo}
for each ref in the repo
    for each commit referenced by the commit graph starting at that ref
        if we have a git commit cache and the commit is in there:
            stop treating the current commit sub-graph
        for each tree referenced by the commit
            for each blob referenced by the tree
                compute the SHA256 checksum of the blob
                lookup the checksum in the file cache
                if it is not there
                    add the file to the dataset on the filesystem
                    add the file to the file cache, pointing to the file path on the filesystem
```

Non-functional
--------------

- implementation
  language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables

File-system storage
-------------------

Given a file with a SHA256 of

b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

it will be stored at

STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

Configuration
=============

Create a configuration file in **\~/.config/sgloader.ini**:

``` {.ini}
[main]

# where to store blob on disk
file_content_storage_dir = swh-git-loader/file-content-storage

# where to store commit/tree on disk
object_content_storage_dir = swh-git-loader/object-content-storage

# Where to store the logs
log_dir = swh-git-loader/log

# url access to db
# http://initd.org/psycopg/docs/module.html#psycopg2.connect
db_url = dbname=swhgitloader

# activate the compression for each blob object
#blob_compression = true

# compute folder's depth on disk aa/bb/cc/dd
#folder_depth=4
```

Note:

- [DB url DSL](http://initd.org/psycopg/docs/module.html#psycopg2.connect)
- the configuration file can be changed in the CLI with the flag `-c` or `--config-file`

Run
===

Environment initialization
--------------------------

``` {.bash}
export PYTHONPATH=`pwd`:$PYTHONPATH
```

Help
----

``` {.bash}
bin/sgloader --help
```

Parse a repository from a clean slate
-------------------------------------

Clean and initialize the model, then parse the Git repository:

``` {.bash}
bin/sgloader cleandb
bin/sgloader initdb
bin/sgloader load /path/to/git/repo
```

For ease:

``` {.bash}
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
```

Parse an existing repository
----------------------------

``` {.bash}
bin/sgloader load /path/to/git/repo
```

Clean data
----------

``` {.bash}
bin/sgloader cleandb
```

For ease:

``` {.bash}
make cleandb
```

Init data
---------

``` {.bash}
bin/sgloader initdb
```

Log
===

Activating the debug mode (flag `-v` or `--verbose`) will log more information in the following
format, made of three fields, where:

the first field is the action:

- walk: walk a tree or a reference
- skip: skip an already saved/visited object or an unknown object (e.g. a submodule commit)
- store: save an object in the db (file or object) and in content (file or object) storage
- initialize: initialize the db
- clean: clean the db's data

the second field is the object type:

- tree
- commit
- blob
- reference
- submodule-commit: a commit from a submodule
- unknown-action: an unknown action from swhgitloader's CLI

the third field identifies the object:

- sha1: git or swh's sha1
- name: object name
- path: object's content storage path
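
As an illustration of the layout described under "File-system storage", the following is a minimal sketch of how a content's path could be derived from its SHA256. The function name `content_path` and its parameters are hypothetical and are not taken from the loader's code; the default depth of 4 mirrors the commented `folder_depth=4` setting in the sample configuration.

```python
from pathlib import PurePosixPath


def content_path(storage_root, hex_digest, folder_depth=4):
    """Return the sharded storage path for a content's hex digest.

    Hypothetical helper: the first `folder_depth` pairs of hex digits
    become nested directories, the remaining digits the file name.
    """
    if len(hex_digest) <= 2 * folder_depth:
        raise ValueError("digest too short for the requested folder depth")
    # Split the leading hex digits into two-character directory names.
    dirs = [hex_digest[2 * i:2 * i + 2] for i in range(folder_depth)]
    # Whatever is left after the directory prefix is the file name.
    filename = hex_digest[2 * folder_depth:]
    return PurePosixPath(storage_root, *dirs, filename)
```

Called with `"STORAGE_ROOT"` and the digest from the "File-system storage" section, this yields the path given there. Sharding by digest prefix keeps any single directory from accumulating too many entries.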
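
The naive algorithm sketched in the "Functional" requirements can also be written as runnable Python. This is only a sketch under strong assumptions: the repository, the file cache and the commit cache are modelled as plain in-memory dictionaries and sets rather than pygit2 objects and Postgres tables, and every name below (`load_repo` and its parameters) is hypothetical.

```python
import hashlib


def load_repo(refs, parents, commit_trees, tree_blobs, blobs,
              file_cache, commit_cache=None):
    """Record every blob whose SHA256 is not yet in `file_cache`.

    In-memory stand-ins (hypothetical, for illustration only):
      refs         -- {ref name: commit id}
      parents      -- {commit id: [parent commit ids]}
      commit_trees -- {commit id: [tree ids]}
      tree_blobs   -- {tree id: [blob ids]}
      blobs        -- {blob id: bytes}
      file_cache   -- {sha256 hex digest: storage path}, updated in place
      commit_cache -- optional set of commit ids seen in earlier runs
    Returns the list of digests added during this run.
    """
    added = []
    visited = set()
    for start in refs.values():
        pending = [start]
        while pending:
            commit = pending.pop()
            if commit in visited:
                continue
            visited.add(commit)
            # A cached commit means its sub-graph was handled before:
            # do not process it and do not follow its parents.
            if commit_cache is not None and commit in commit_cache:
                continue
            for tree in commit_trees[commit]:
                for blob_id in tree_blobs[tree]:
                    digest = hashlib.sha256(blobs[blob_id]).hexdigest()
                    if digest not in file_cache:
                        # Stand-in for writing the file to disk.
                        file_cache[digest] = "STORAGE_ROOT/" + digest
                        added.append(digest)
            pending.extend(parents.get(commit, []))
    return added
```

Running the function twice against the same `file_cache` adds nothing the second time, which is the deduplication the requirements describe.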