The Software Heritage Git Loader is a tool and a library to walk a local
Git repository and inject into the SWH dataset all contained files that
weren't known before.
License
=======
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public
License along with this program.
Dependencies
============
Runtime
-------
- python3
- python3-psycopg2
- python3-pygit2
Test
----
- python3-nose
Requirements
============
Functional
----------
- input: a bare Git repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to
the path on the filesystem that contains the corresponding content
(AKA, the file cache)
- input (optional): a set of SHA1 of Git commits that have already
been seen in the past (AKA, the Git commit cache)
- output: an augmented SWH dataset, where all files present in all
blobs referenced by any Git object have been added
### Algorithm
Sketch of the (naive) algorithm that the Git loader should execute:
``` {.pseudo}
for each ref in the repo
    for each commit referenced by the commit graph starting at that ref
        if we have a git commit cache and the commit is in there:
            stop treating the current commit sub-graph
        for each tree referenced by the commit
            for each blob referenced by the tree
                compute the SHA256 checksum of the blob
                lookup the checksum in the file cache
                if it is not there
                    add the file to the dataset on the filesystem
                    add the file to the file cache, pointing to the file path on the filesystem
```
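The pseudo-code above can be tried out on a toy, in-memory model of a repository. The sketch below is only an illustration: it does not use pygit2, the dictionary layout and the `load` function are hypothetical, and the stored path is a placeholder string. It also remembers commits visited during the current run, which the pseudo-code does not require.

```python
import hashlib


def load(refs, commits, trees, blobs, file_cache, commit_cache=None):
    """Walk a toy repository and return the SHA256 of the newly added blobs.

    refs:         {ref name: commit id}
    commits:      {commit id: {"tree": tree id, "parents": [commit id, ...]}}
    trees:        {tree id: {"blobs": [blob id, ...], "trees": [tree id, ...]}}
    blobs:        {blob id: bytes}
    file_cache:   {sha256 hex digest: storage path}, updated in place
    commit_cache: optional set of commit ids already seen in the past
    """
    seen_commits = set(commit_cache or ())
    added = []
    for start in refs.values():
        pending = [start]
        while pending:
            commit_id = pending.pop()
            if commit_id in seen_commits:
                continue  # stop treating this commit sub-graph
            seen_commits.add(commit_id)
            commit = commits[commit_id]
            pending.extend(commit["parents"])
            tree_stack = [commit["tree"]]
            while tree_stack:
                tree = trees[tree_stack.pop()]
                tree_stack.extend(tree["trees"])
                for blob_id in tree["blobs"]:
                    digest = hashlib.sha256(blobs[blob_id]).hexdigest()
                    if digest not in file_cache:
                        # Placeholder path, not the real storage layout.
                        file_cache[digest] = "STORAGE_ROOT/" + digest
                        added.append(digest)
    return added
```

Running `load` a second time with the same file cache adds nothing, and passing the head commit in `commit_cache` skips the whole walk.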
Non-functional
--------------
- implementation language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables
File-system storage
-------------------
Given a file with SHA256 of
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c, it will
be stored at
STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
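A minimal sketch of how such a path could be derived from the checksum. The function name `storage_path` and its `depth` parameter are hypothetical; a depth of 4 is assumed here only because it matches the example above.

```python
import os


def storage_path(storage_root, sha256_hex, depth=4):
    """Use the first `depth` byte pairs of the digest as nested folders."""
    folders = [sha256_hex[2 * i:2 * i + 2] for i in range(depth)]
    # The remainder of the digest becomes the file name.
    return os.path.join(storage_root, *folders, sha256_hex[2 * depth:])
```

For the checksum above, `storage_path("STORAGE_ROOT", ...)` yields the same `b5/bb/9d/80/...` layout.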
Configuration
=============
swh-git-loader depends on a few tools. Their configuration files are
described below.
swh-db-manager
--------------
This tool is in charge of db management (clean up data, bootstrap the
model, etc.).
Create a configuration file in **\~/.config/db-manager.ini**:
``` {.ini}
[main]
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db
db_url = dbname=swhgitloader
```
See <http://initd.org/psycopg/docs/module.html#psycopg2.connect> for the
db url's schema
swh-git-loader
--------------
Create a configuration file in **\~/.config/swh/git-loader.ini**:
``` {.ini}
[main]
# Where to store the logs
log_dir = /tmp/swh-git-loader/log
# url access to api's backend
backend_url = http://localhost:5000
```
Note:
- [DB url
DSL](http://initd.org/psycopg/docs/module.html#psycopg2.connect)
- the configuration file can be changed on the command line with the
flag `-c <config-filepath>` or `--config-file <config-filepath>`
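To illustrate how the configuration file and the `-c`/`--config-file` flag fit together, here is a sketch based on the standard `argparse` and `configparser` modules. It is an assumption about how such a CLI could be wired, not the loader's actual code; `parse_args`, `read_conf` and `DEFAULT_CONF` are hypothetical names.

```python
import argparse
import configparser
import os

# Default location taken from the section above.
DEFAULT_CONF = "~/.config/swh/git-loader.ini"


def parse_args(argv=None):
    """Parse the command line, accepting -c/--config-file."""
    parser = argparse.ArgumentParser(prog="swh-git-loader")
    parser.add_argument("-c", "--config-file", default=DEFAULT_CONF,
                        help="path to the configuration file")
    return parser.parse_args(argv)


def read_conf(path):
    """Return the [main] section of the ini file as a plain dict."""
    config = configparser.ConfigParser()
    with open(os.path.expanduser(path)) as f:
        config.read_file(f)
    return dict(config["main"])
```

With the file shown above, `read_conf` would return a dict holding the `log_dir` and `backend_url` entries.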
swh-backend
-----------
Backend api.
Create a configuration file in **\~/.config/swh/back.ini**:
``` {.ini}
[main]
# where to store blob on disk
content_storage_dir = /tmp/swh-git-loader/content-storage
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db: dbname=<dbname> (host=<host> port=<port> user=<user> password=<password>)
db_url = dbname=swhgitloader
# activate the compression for each vcs stored object
# storage_compression = true
# compute folder's depth on disk aa/bb/cc/dd
# folder_depth = 2
# Debugger (for dev only)
debug = true
```
See <http://initd.org/psycopg/docs/module.html#psycopg2.connect> for the
db url's schema
### Tryouts
PUT on commits:
``` {.bash}
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:04:40]
$ curl -i -XPUT -H'application/json' -d 'date=1' http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 204 NO CONTENT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:04:44 GMT
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:05]
$ curl -i -XPUT -H'application/json' -d 'date=1' http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 18
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:12:19 GMT
Successful update!%
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:19]
$ curl http://localhost:5000/commits/52745df6dd5dc46ee476a8be155ab049994f714e
{
"sha1": "52745df6dd5dc46ee476a8be155ab049994f714e"
}%
```
GET/PUT on blob:
``` {.bash}
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:24]
$ curl -i http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 404 NOT FOUND
Content-Type: text/html; charset=utf-8
Content-Length: 10
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:12:33 GMT
Not found!%
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:12:33]
$ curl -i -XPUT -H'application/json' -d'git-sha1=456' -d'size=10' http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
HTTP/1.0 204 NO CONTENT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Server: Werkzeug/0.9.6 Python/3.4.3+
Date: Thu, 18 Jun 2015 12:13:47 GMT
# tony at corellia in ~/work/inria/org/antelink on git:master x [14:13:47]
$ curl http://localhost:5000/blobs/52745df6dd5dc46ee476a8be155ab049994f714e
{
"sha1": "52745df6dd5dc46ee476a8be155ab049994f714e"
}
```
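The same requests can be prepared from Python with the standard library. This is a sketch under assumptions: `build_put` and `BACKEND_URL` are hypothetical names, the fields are sent form-encoded as in the curl calls above, and the request is only built, not sent, since sending needs a running backend.

```python
import urllib.parse
import urllib.request

# Same value as backend_url in git-loader.ini.
BACKEND_URL = "http://localhost:5000"


def build_put(resource, sha1, fields):
    """Build (without sending) a form-encoded PUT on /<resource>/<sha1>."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    url = "{}/{}/{}".format(BACKEND_URL, resource, sha1)
    return urllib.request.Request(url, data=data, method="PUT")


# Sending the request requires the backend to be up, for example:
# with urllib.request.urlopen(build_put("commits", sha1, {"date": 1})) as resp:
#     print(resp.status)
```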
Run
===
Environment initialization
--------------------------
``` {.bash}
export PYTHONPATH=`pwd`:$PYTHONPATH
```
Help
----
``` {.bash}
bin/swh-git-loader --help
bin/swh-db-manager --help
```
Parse a repository from a clean slate
-------------------------------------
Clean and initialize the model, then parse the Git repository:
``` {.bash}
bin/swh-db-manager cleandb
bin/swh-db-manager initdb
bin/swh-git-loader load /path/to/git/repo
```
Or, using the Makefile shortcut:
``` {.bash}
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
```
Parse an existing repository
----------------------------
``` {.bash}
bin/swh-git-loader load /path/to/git/repo
```
Clean data
----------
``` {.bash}
bin/swh-db-manager cleandb
```
Or, using the Makefile shortcut:
``` {.bash}
make cleandb
```
Init data
---------
``` {.bash}
bin/swh-db-manager initdb
```
Log
===
Format
------
Activating the debug mode (flag `-v` or `--verbose`) will log more
information in the following format: \<action-verb\> \<nature-object\>
\<sha1-name-or-path\>
where \<action-verb\> is one of:
- walk: walk a tree or a reference
- skip: skip an already saved/visited object or an unknown object
(e.g. a submodule commit)
- store: save an object in the db (file or object) and in the content
(file or object) storage
- initialize: initialize the db
- clean: clean the db's data
\<nature-object\> is one of:
- tree
- commit
- blob
- reference
- submodule-commit: a commit from a submodule
- unknown-action: an unknown action from swhgitloader's cli
\<sha1-name-or-path\> is one of:
- sha1: git or swh's sha1
- name: object name
- path: object's content storage path
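As an illustration of the format, the sketch below splits one such log entry into its three fields. It assumes the fields are separated by single spaces, that the verbs appear in lowercase, and that no timestamp or level prefix precedes them; none of this is confirmed by the source, and `parse_log_line` and `LogEntry` are hypothetical names.

```python
from collections import namedtuple

LogEntry = namedtuple("LogEntry", "action nature target")

# Action verbs listed in the format description.
ACTIONS = {"walk", "skip", "store", "initialize", "clean"}


def parse_log_line(line):
    """Split '<action-verb> <nature-object> <sha1-name-or-path>'."""
    # maxsplit=2 keeps any spaces inside the last field (e.g. a path).
    action, nature, target = line.strip().split(" ", 2)
    if action not in ACTIONS:
        raise ValueError("unknown action verb: " + action)
    return LogEntry(action, nature, target)
```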
Folder
------
The different tools can be configured in their respective .ini files.
By default, they log inside the swh-git-loader/log folder.