The Software Heritage Git Loader is a tool and a library to walk a local Git repository and inject into the SWH dataset all contained files that weren't known before.

License
=======

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

See the top-level LICENSE file for the full text of the GNU General Public License along with this program.

Dependencies
============

Runtime
-------

- python3
- python3-psycopg2
- python3-pygit2

Test
----

- python3-nose

Requirements
============

Functional
----------

- input: a Git bare repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the path on the filesystem that contains the corresponding content (AKA, the file cache)
- input (optional): a set of SHA1 of Git commits that have already been seen in the past (AKA, the Git commit cache)
- output: an augmented SWH dataset, where all files present in all blobs referenced by any Git object have been added

### algo

Sketch of the (naive) algorithm that the Git loader should execute:

``` {.pseudo}
for each ref in the repo
  for each commit referenced by the commit graph starting at that ref
    if we have a git commit cache and the commit is in there:
      stop treating the current commit sub-graph
    for each tree referenced by the commit
      for each blob referenced by the tree
        compute the SHA256 checksum of the blob
        lookup the checksum in the file cache
        if it is not there
          add the file to the dataset on the filesystem
          add the file to the file cache, pointing to the file path on the filesystem
```

Non-functional
--------------

- implementation language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables

File-system storage
-------------------

Given a file with SHA256 of

b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

it will be stored at

STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

Configuration
=============

swh-git-loader depends on some tools; here are the configuration files for those:

swh-db-manager
--------------

This is a tool in charge of db management (cleanup data, bootstrap model, etc.).

Create a configuration file in **\~/.config/db-manager.ini**:

``` {.ini}
[main]

# Where to store the logs
log_dir = swh-git-loader/log

# url access to db
db_url = dbname=swhgitloader
```

See the DB url DSL link in the swh-git-loader note below for the db url's schema.

swh-git-loader
--------------

Create a configuration file in **\~/.config/swh/git-loader.ini**:

``` {.ini}
[main]

# Where to store the logs
log_dir = /tmp/swh-git-loader/log

# url access to api's backend
backend_url = http://localhost:5000
```

Note:

- [DB url DSL](http://initd.org/psycopg/docs/module.html#psycopg2.connect)
- the configuration file can be changed in the CLI with the flag `-c` or `--config-file`

swh-backend
-----------

Backend API.

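
The backend configuration includes a content storage directory, so the layout from the File-system storage section is relevant here. As an illustration of that layout, combined with the file-cache check from the algorithm sketch, here is a minimal Python sketch. The names `content_path` and `store_blob`, the `depth` parameter, and the plain dict standing in for the Postgres-backed file cache are assumptions made for this example, not the project's actual API.

```python
import hashlib
import os


def content_path(storage_root, sha256_hex, depth=4):
    """Map a SHA256 hex digest to its path under storage_root.

    Assumption: `depth` two-character directories, then the rest of
    the digest as the file name (depth=4 matches the documented example).
    """
    dirs = [sha256_hex[i:i + 2] for i in range(0, 2 * depth, 2)]
    return os.path.join(storage_root, *dirs, sha256_hex[2 * depth:])


def store_blob(storage_root, data, file_cache):
    """Store blob bytes unless their SHA256 is already in file_cache.

    file_cache is a plain dict (SHA256 -> path) used here in place of
    the Postgres table. Returns (sha256_hex, newly_stored).
    """
    sha256_hex = hashlib.sha256(data).hexdigest()
    if sha256_hex in file_cache:
        return sha256_hex, False  # already known: nothing to write
    path = content_path(storage_root, sha256_hex)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    file_cache[sha256_hex] = path  # remember where the content lives
    return sha256_hex, True
```

Calling `store_blob` twice with the same bytes writes the file only once; the second call finds the checksum in the cache and returns without touching the disk.
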
Create a configuration file in **\~/.config/swh/back.ini**:

``` {.ini}
[main]

# where to store blob on disk
content_storage_dir = /tmp/swh-git-loader/content-storage

# Where to store the logs
log_dir = swh-git-loader/log

# url access to db: dbname= (port= user= pass=)
db_url = dbname=swhgitloader

# activate the compression for each vcs stored object
# storage_compression = true

# compute folder's depth on disk aa/bb/cc/dd
# folder_depth = 2

# Debugger (for dev only)
debug = true
```

See the DB url DSL link in the swh-git-loader note above for the db url's schema.

### Tryouts

PUT on commits:

``` {.bash}
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:04:40]
$ curl -i -XPUT -H'application/json' -d 'date=1' http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 204 NO CONTENT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:04:44 GMT

# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:05]
$ curl -i -XPUT -H'application/json' -d 'date=1' http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 18
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:12:19 GMT

Successful update!%

# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:19]
$ curl http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
{
  "sha1": "52745df6dd5dc46ee476a8be155ab049994f714e"
}%
```

GET/PUT on blob:

``` {.bash}
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:24]
$ curl -i http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 404 NOT FOUND
Content-Type: text/html; charset=utf-8
Content-Length: 10
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:12:33 GMT

Not found!%

# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:33]
$ curl -i -XPUT -H'application/json' -d'git-sha1=456' -d'size=10' http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 204 NO CONTENT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:13:47 GMT

# tony at corellia in ~/work/inria/org/antelink on git:master x [14:13:47]
$ curl http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
{
  "sha1": "52745df6dd5dc46ee476a8be155ab049994f714e"
}
```

Run
===

Environment initialization
--------------------------

``` {.bash}
export PYTHONPATH=`pwd`:$PYTHONPATH
```

Help
----

``` {.bash}
bin/swh-git-loader --help
bin/swh-db-manager --help
```

Parse a repository from a clean slate
-------------------------------------

Clean and initialize the model, then parse the Git repository:

``` {.bash}
bin/swh-db-manager cleandb
bin/swh-db-manager initdb
bin/swh-git-loader load /path/to/git/repo
```

For ease:

``` {.bash}
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
```

Parse an existing repository
----------------------------

``` {.bash}
bin/swh-git-loader load /path/to/git/repo
```

Clean data
----------

``` {.bash}
bin/swh-db-manager cleandb
```

For ease:

``` {.bash}
make cleandb
```

Init data
---------

``` {.bash}
bin/swh-db-manager initdb
```

Log
===

Format
------

Activating the debug mode (flag `-v` or `--verbose`) will log more information. Each entry consists of three fields, in this order: an action, an object type, and an identifier.

The action is one of:

- walk: walk a tree or a reference
- skip: skip an already saved/visited object or unknown object (e.g. commit submodule)
- store: save an object in db (file or object) and content (file or object) storage
- initialize: initialize the db
- clean: clean the db's data

The object type is one of:

- tree
- commit
- blob
- reference
- submodule-commit: a commit from a submodule
- unknown-action: an unknown action from swhgitloader's cli

The identifier is one of:

- sha1: git or swh's sha1
- name: object name
- path: object's content storage path

Folder
------

The different tools can be configured in their respective .ini files.
By default, they log inside the swh-git-loader/log folder.
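
The .ini files shown in the Configuration section share a `[main]` section with a `log_dir` key. As a rough sketch of how such a file could be read with Python's standard `configparser`, the helper below returns the configured log folder and falls back to `swh-git-loader/log` when the key is absent. The helper name `read_log_dir` and the fallback handling are assumptions for illustration; the tools' actual loading code may differ.

```python
import configparser

# Default log folder mentioned in this README.
DEFAULT_LOG_DIR = "swh-git-loader/log"


def read_log_dir(ini_text):
    """Return log_dir from the [main] section, or the default.

    Hypothetical helper: takes the .ini content as a string so the
    example does not depend on any file being present on disk.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # fallback covers both a missing section and a missing key
    return parser.get("main", "log_dir", fallback=DEFAULT_LOG_DIR)


if __name__ == "__main__":
    sample = """
[main]
# Where to store the logs
log_dir = /tmp/swh-git-loader/log
backend_url = http://localhost:5000
"""
    print(read_log_dir(sample))
```

To read a real file instead of a string, `parser.read(path)` can replace `parser.read_string(ini_text)`.
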