#+title: swh-git-loader - Specification (draft)
#+author: swh team
#+source: https://intranet.softwareheritage.org/index.php/Swh_git_loader
The Software Heritage Git Loader is both a tool and a library. It walks a local Git repository and injects into the SWH dataset all the files it contains that were not previously known.
* License
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
See the top-level LICENSE file for the full text of the GNU General Public
License.
* Dependencies
** Runtime
- python3
- python3-psycopg2
- python3-pygit2
** Test
- python3-nose
* Requirements
** Functional
- input: a Git bare repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the filesystem path that contains the corresponding content (AKA the file cache)
- input (optional): a set of SHA1s of Git commits that have already been seen in the past (AKA the Git commit cache)
- output: an augmented SWH dataset, to which every file contained in a blob referenced by any Git object has been added
*** Algorithm
Sketch of the (naive) algorithm that the Git loader should execute:
#+begin_src pseudo
for each ref in the repo
  for each commit referenced by the commit graph starting at that ref
    if we have a git commit cache and the commit is in there: stop treating the current commit sub-graph
    for each tree referenced by the commit
      for each blob referenced by the tree
        compute the SHA256 checksum of the blob
        lookup the checksum in the file cache
        if it is not there
          add the file to the dataset on the filesystem
          add the file to the file cache, pointing to the file path on the filesystem
#+end_src
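The pseudocode above can be sketched in Python as follows. This is only an illustration, not the loader's actual code: the repository is modelled with plain dictionaries instead of pygit2 objects, and the names =load_repo=, =iter_blobs= and =store= are hypothetical. It also assumes that visited commits are added to the commit cache, which the pseudocode does not state.

#+begin_src python
import hashlib


def iter_blobs(trees, tree_id):
    """Yield the ids of all blobs reachable from a tree, recursively."""
    for kind, object_id in trees[tree_id]:
        if kind == "tree":
            yield from iter_blobs(trees, object_id)
        else:
            yield object_id


def load_repo(refs, commits, trees, blobs, commit_cache, file_cache, store):
    """Walk every ref and store the blobs that the file cache does not know.

    refs:    {ref name: commit id}
    commits: {commit id: (tree id, [parent commit ids])}
    trees:   {tree id: [("blob" or "tree", object id), ...]}
    blobs:   {blob id: bytes}
    store:   callable(checksum, data) -> path of the stored file
    """
    for start in refs.values():
        stack = [start]
        while stack:
            commit_id = stack.pop()
            if commit_id in commit_cache:
                continue  # already seen: skip this commit sub-graph
            commit_cache.add(commit_id)  # assumption: remember visited commits
            tree_id, parents = commits[commit_id]
            stack.extend(parents)
            for blob_id in iter_blobs(trees, tree_id):
                data = blobs[blob_id]
                checksum = hashlib.sha256(data).hexdigest()
                if checksum not in file_cache:
                    file_cache[checksum] = store(checksum, data)


# Toy repository: two commits sharing one blob (all ids are made up).
refs = {"refs/heads/master": "c2"}
commits = {"c1": ("t1", []), "c2": ("t2", ["c1"])}
trees = {
    "t1": [("blob", "b1")],
    "t2": [("blob", "b1"), ("tree", "t3")],
    "t3": [("blob", "b2")],
}
blobs = {"b1": b"foo\n", "b2": b"bar\n"}

seen_commits = set()
file_cache = {}
load_repo(refs, commits, trees, blobs, seen_commits, file_cache,
          store=lambda checksum, data: "dataset/" + checksum)
#+end_src

The shared blob is stored only once because its checksum is already in the file cache when the second commit is visited.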
** Non-functional
- implementation language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables
** File-system storage
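A file whose SHA256 is =b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c=
is stored at =STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c=:
the first four bytes of the hex digest (two characters each) become nested directories, and the remaining characters form the file name.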
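A minimal sketch of this path computation, assuming the layout generalizes from the single example above. The function name =storage_path= and its =depth= and =width= parameters are hypothetical.

#+begin_src python
import os


def storage_path(storage_root, sha256_hex, depth=4, width=2):
    """Return the on-disk path for a content identified by its SHA256.

    The first depth * width hex characters are split into nested
    directories; the rest of the digest is used as the file name.
    """
    prefix_len = depth * width
    parts = [sha256_hex[i:i + width] for i in range(0, prefix_len, width)]
    return os.path.join(storage_root, *parts, sha256_hex[prefix_len:])


example_path = storage_path(
    "STORAGE_ROOT",
    "b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c",
)
#+end_src

Splitting the digest into sub-directories keeps the number of entries per directory small, which is the usual motivation for this kind of layout.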
* Configuration
Create a configuration file at *~/.config/sgloader.ini*:
#+begin_src ini
[main]
dataset_dir = dataset
log_dir = log
db_url = dbname=swhgitloader user=tony
#+end_src
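For illustration, such a file can be read with the standard =configparser= module. This is a sketch under assumptions, not necessarily how sgloader reads its configuration: =parse_config= and =read_config= are hypothetical helpers. The =db_url= value looks like a libpq connection string, which =psycopg2.connect= accepts as its =dsn= argument.

#+begin_src python
import configparser
import os

SAMPLE = """\
[main]
dataset_dir = dataset
log_dir = log
db_url = dbname=swhgitloader user=tony
"""


def parse_config(text):
    """Parse the ini text and return the [main] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["main"])


def read_config(path="~/.config/sgloader.ini"):
    """Read the configuration file from disk, expanding ~ in the path."""
    with open(os.path.expanduser(path)) as f:
        return parse_config(f.read())


conf = parse_config(SAMPLE)
#+end_src

Only the first ~=~ on a line separates the key from its value, so the ~=~ signs inside =db_url= are kept as part of the value.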
* Run
** Parse a repository
#+begin_src sh
PYTHONPATH=$PYTHONPATH:`pwd` ./bin/sgloader --repo-path /path/to/repo createdb
#+end_src
Or, as a shortcut, using make:
#+begin_src sh
make cleandb run REPO_PATH=/path/to/repo
#+end_src
** Clean data
#+begin_src sh
PYTHONPATH=$PYTHONPATH:`pwd` ./bin/sgloader dropdb
#+end_src
Or, as a shortcut, using make:
#+begin_src sh
make cleandb
#+end_src
* IN-PROGRESS Improvements
- [X] Push on remote git repository
- [X] Serialize blob's data and not blob's size.
- [X] Logging in python? How to see the log?
- [X] Replace sqlalchemy dao layer with psycopg2
- [X] Improve sgloader cli interface
- [ ] Use sha1 instead of sha256
- [ ] Serialize sha1 as bytes
- [ ] A unit test to determine the slight difference in commit number
- [ ] Of course, add unit tests!
- [ ] Improve modularization (a file module? a hash computation module?)
* Implementation
** Expected results
Given the repository https://github.com/ardumont/dot-files.git,
each run should produce the following results (kept here for comparison purposes):
#+begin_src sh
swhgitloader=> select count(*) from object_cache where type = 0; -- commit
 count
-------
  1744
(1 row)

swhgitloader=> select count(*) from object_cache where type = 1; -- tree
 count
-------
  2839
(1 row)

swhgitloader=> select count(*) from file_cache;
 count
-------
  2958
(1 row)
#+end_src
** sqlalchemy
An ORM framework. Timing of a full run with it:
#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master o [10:35:08]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 161.05s user 10.82s system 76% cpu 3:46.01 total
#+end_src
** psycopg2
A simple db client.
First implementation, with one open/close for each db access:
#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [17:38:56]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 85.82s user 23.53s system 19% cpu 9:16.00 total
#+end_src
With a single connection kept open for the whole computation:
#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [18:02:27]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` ./bin/sgloader -v cleandb
PYTHONPATH=`pwd` ./bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 39.45s user 8.02s system 50% cpu 1:34.08 total
#+end_src
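The single-connection approach can be sketched as below. To stay runnable without a database server, this sketch uses the standard =sqlite3= module as a stand-in for psycopg2, which is not what the loader does. The =FileCache= class and the column names are hypothetical; only the =file_cache= table name comes from this document.

#+begin_src python
import sqlite3


class FileCache:
    """Hypothetical cache wrapper that reuses one connection for every access."""

    def __init__(self, connection):
        # The connection is opened once by the caller and shared by all calls.
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS file_cache "
            "(sha256 TEXT PRIMARY KEY, path TEXT)"
        )

    def add(self, sha256, path):
        # INSERT OR IGNORE is SQLite syntax; PostgreSQL would need
        # INSERT ... ON CONFLICT DO NOTHING, and psycopg2 uses %s placeholders.
        self.conn.execute(
            "INSERT OR IGNORE INTO file_cache VALUES (?, ?)", (sha256, path)
        )

    def __contains__(self, sha256):
        row = self.conn.execute(
            "SELECT 1 FROM file_cache WHERE sha256 = ?", (sha256,)
        ).fetchone()
        return row is not None


conn = sqlite3.connect(":memory:")  # opened once for the whole run
cache = FileCache(conn)
cache.add("b5bb9d80", "dataset/b5/bb/9d/80")  # dummy, shortened values
found = "b5bb9d80" in cache
missing = "deadbeef" in cache
conn.commit()
conn.close()  # closed once, at the end
#+end_src

Opening and closing a connection for each lookup repeats the connection setup cost thousands of times, which is consistent with the drop in wall-clock time between the two runs above.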
After cleaning up the algorithm (removing unneeded checks, using the file cache, ...):
#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [10:42:03]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` bin/sgloader -v cleandb
PYTHONPATH=`pwd` bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 15.90s user 2.08s system 31% cpu 56.879 total
#+end_src
After dropping the unnecessary byte decoding before serializing to disk:
#+begin_src sh
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [12:36:10]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
PYTHONPATH=`pwd` bin/sgloader -v cleandb
PYTHONPATH=`pwd` bin/sgloader -v --repo-path ~/repo/perso/dot-files initdb
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 14.67s user 1.64s system 30% cpu 54.303 total
#+end_src
