The Software Heritage Git Loader is a tool and a library to walk a local
Git repository and inject into the SWH dataset all contained files that
weren't known before.
License
=======
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public
License along with this program.
Dependencies
============
Runtime
-------
- python3
- python3-psycopg2
- python3-pygit2
Test
----
- python3-nose
Requirements
============
Functional
----------
- input: a bare Git repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the
  path on the filesystem that contains the corresponding content (AKA the
  file cache)
- input (optional): a set of SHA1s of Git commits that have already
  been seen in the past (AKA the Git commit cache)
- output: an augmented SWH dataset, to which all files present in all
  blobs referenced by any Git object have been added
### Algorithm
Sketch of the (naive) algorithm that the Git loader should execute:
``` {.pseudo}
for each ref in the repo
    for each commit referenced by the commit graph starting at that ref
        if we have a git commit cache and the commit is in there:
            stop treating the current commit sub-graph
        for each tree referenced by the commit
            for each blob referenced by the tree
                compute the SHA256 checksum of the blob
                lookup the checksum in the file cache
                if it is not there
                    add the file to the dataset on the filesystem
                    add the file to the file cache, pointing to the file path on the filesystem
```
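The pseudocode above can be tried out on a toy object graph. The following is only a minimal sketch, not the loader's implementation: the function name `load_repository`, the in-memory dictionaries standing in for the repository, and the placeholder storage path are all assumptions made for illustration. It also marks commits visited during the current run as seen, which the pseudocode does not spell out.

```python
import hashlib


def load_repository(refs, commits, trees, blobs, file_cache, commit_cache=None):
    """Walk a toy object graph; return the checksums of newly added files.

    Hypothetical in-memory model (not the loader's real data structures):
      refs:         ref name  -> commit id
      commits:      commit id -> (root tree id, [parent commit ids])
      trees:        tree id   -> [(kind, object id)], kind is "tree" or "blob"
      blobs:        blob id   -> bytes
      file_cache:   dict, SHA256 hex digest -> storage path (updated in place)
      commit_cache: optional set of commit ids already seen in the past
    """
    seen = set(commit_cache or ())
    added = []
    for start in refs.values():
        pending = [start]
        while pending:
            commit_id = pending.pop()
            if commit_id in seen:
                # Known commit: stop treating this commit sub-graph.
                continue
            seen.add(commit_id)
            root_tree, parents = commits[commit_id]
            pending.extend(parents)
            tree_stack = [root_tree]
            while tree_stack:
                for kind, oid in trees[tree_stack.pop()]:
                    if kind == "tree":
                        tree_stack.append(oid)
                        continue
                    checksum = hashlib.sha256(blobs[oid]).hexdigest()
                    if checksum not in file_cache:
                        # Stand-in for writing the file to disk and
                        # recording its path in the file cache.
                        file_cache[checksum] = "STORAGE_ROOT/" + checksum
                        added.append(checksum)
    return added
```

Running it twice against the same `file_cache` adds nothing the second time, and passing the ref's commit in `commit_cache` skips the walk entirely.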
Non-functional
--------------
- implementation language, Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables
File-system storage
-------------------
Given a file with a SHA256 of
`b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c`, it will
be stored at
`STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c`.
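The layout above can be sketched as a small helper. This is an illustration under assumptions, not the loader's code: the function name `storage_path` and its parameters are hypothetical, and the default depth of 4 is inferred from the example path (four two-character directory levels).

```python
import os


def storage_path(storage_root, sha256_hex, folder_depth=4):
    """Return the on-disk path for a file, given its SHA256 hex digest.

    The first `folder_depth` pairs of hex characters become nested
    directories; the rest of the digest is the file name.
    """
    prefix_len = 2 * folder_depth
    dirs = [sha256_hex[i:i + 2] for i in range(0, prefix_len, 2)]
    return os.path.join(storage_root, *dirs, sha256_hex[prefix_len:])
```

For example, calling `storage_path("STORAGE_ROOT", checksum)` with the checksum above yields the path shown in this section (with the platform's path separator).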
Configuration
=============
Create a configuration file in **\~/.config/sgloader.ini**:
``` {.ini}
[main]
# where to store blob on disk
file_content_storage_dir = swh-git-loader/file-content-storage
# where to store commit/tree on disk
object_content_storage_dir = swh-git-loader/object-content-storage
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db
# http://initd.org/psycopg/docs/module.html#psycopg2.connect
db_url = dbname=swhgitloader
# activate the compression for each blob object
#blob_compression = true
# compute folder's depth on disk aa/bb/cc/dd
#folder_depth=4
```
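For reference, a file in this format can be read with Python's standard `configparser`. This is a minimal sketch, not the loader's own parsing code: the function name `read_settings` is hypothetical, and the fallback values for the two commented-out options (`False` and `4`) are assumptions.

```python
import configparser

# Same content as the sample above, comments trimmed.
SAMPLE = """\
[main]
file_content_storage_dir = swh-git-loader/file-content-storage
object_content_storage_dir = swh-git-loader/object-content-storage
log_dir = swh-git-loader/log
db_url = dbname=swhgitloader
#blob_compression = true
#folder_depth=4
"""


def read_settings(text):
    """Parse the [main] section into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    main = parser["main"]
    return {
        "file_content_storage_dir": main["file_content_storage_dir"],
        "object_content_storage_dir": main["object_content_storage_dir"],
        "log_dir": main["log_dir"],
        # Only the first "=" separates key and value, so the DSN stays intact.
        "db_url": main["db_url"],
        # Assumed defaults for the optional, commented-out settings.
        "blob_compression": main.getboolean("blob_compression", fallback=False),
        "folder_depth": main.getint("folder_depth", fallback=4),
    }


settings = read_settings(SAMPLE)
```

The resulting `settings["db_url"]` is the string that would be handed to `psycopg2.connect`.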
Note:
- [DB URL
  DSL](http://initd.org/psycopg/docs/module.html#psycopg2.connect)
- a different configuration file can be passed on the command line with
  the flag `-c <config-filepath>` or `--config-file <config-filepath>`
Run
===
Environment initialization
--------------------------
``` {.bash}
export PYTHONPATH=`pwd`:$PYTHONPATH
```
Help
----
``` {.bash}
bin/sgloader --help
```
Parse a repository from a clean slate
-------------------------------------
Clean and initialize the data model, then parse the Git repository:
``` {.bash}
bin/sgloader cleandb
bin/sgloader initdb
bin/sgloader load /path/to/git/repo
```
For convenience, via make:
``` {.bash}
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
```
Parse an existing repository
----------------------------
``` {.bash}
bin/sgloader load /path/to/git/repo
```
Clean data
----------
``` {.bash}
bin/sgloader cleandb
```
For convenience, via make:
``` {.bash}
make cleandb
```
Init data
---------
``` {.bash}
bin/sgloader initdb
```
Log
===
Activating debug mode (flag `-v` or `--verbose`) logs more
information, in the following format:
`<action-verb> <nature-object> <sha1-name-or-path>`

where `<action-verb>` is one of:
- walk: walk a tree or a reference
- skip: skip an already saved/visited object or an unknown object (e.g.
  a submodule commit)
- store: save an object in the db (file or object) and in the content
  (file or object) storage
- initialize: initialize the db
- clean: clean the db's data

`<nature-object>` is one of:
- tree
- commit
- blob
- reference
- submodule-commit: a commit from a submodule
- unknown-action: an unknown action from swhgitloader's CLI

`<sha1-name-or-path>` is one of:
- sha1: Git or SWH's sha1
- name: object name
- path: object's content storage path
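A message in this three-field format can be split back into its parts as follows. This is a sketch under assumptions: it expects only the message itself (any timestamp or log-level prefix already removed), and the names `LogMessage` and `parse_log_message` are hypothetical, not part of the loader.

```python
from collections import namedtuple

# action = <action-verb>, nature = <nature-object>,
# target = <sha1-name-or-path>
LogMessage = namedtuple("LogMessage", "action nature target")


def parse_log_message(message):
    """Split a debug message into its three space-separated fields.

    Only the first two spaces act as separators, so a path containing
    spaces is kept whole in the last field.
    """
    parts = message.strip().split(maxsplit=2)
    if len(parts) != 3:
        raise ValueError("expected 3 fields, got %d: %r" % (len(parts), message))
    return LogMessage(*parts)
```

For a made-up message such as `walk reference refs/heads/master`, the result has `action == "walk"`, `nature == "reference"` and `target == "refs/heads/master"`.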