Page Menu
Home
Software Heritage
Search
Configure Global Search
Log In
Files
F9124122
README
No One
Temporary
Actions
View File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Award Token
Flag For Later
Size
16 KB
Subscribers
None
README
View Options
The Software Heritage Git Loader is a tool and a library to walk a local
Git repository and inject into the SWH dataset all contained files that
weren't known before.
License
=======
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public
License along with this program.
Dependencies
============
Runtime
-------
- python3
- python3-psycopg2
- python3-pygit2
Test
----
- python3-nose
- Requirements
Functional
----------
- input: a Git bare repository available locally, on the filesystem
- input (optional): a table mapping SHA256 of individual files to path
on the filesystem that contain the corresponding content (AKA, the
file cache)
- input (optional): a set of SHA1 of Git commits that have already
been seen in the past (AKA, the Git commit cache)
- output: an augmented SWH dataset, where all files present in all
blobs referenced by any Git object, have been added
### algo
Sketch of the (naive) algorithm that the Git loader should execute
``` {.pseudo}
for each ref in the repo
for each commit referenced by the commit graph starting at that ref
if we have a git commit cache and the commit is in there: stop treating the current commit sub-graph
for each tree referenced by the commit
for each blob referenced by the tree
compute the SHA256 checksum of the blob
lookup the checksum in the file cache
if it is not there
add the file to the dataset on the filesystem
add the file to the file cache, pointing to the file path on the filesystem
```
Non-functional
--------------
- implementation language, Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit
- cache: implemented as Postgres tables
File-system storage
-------------------
Given a file with SHA256 of
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c It will
be stored at
STORAGE~ROOT~/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
Configuration
=============
Create a configuration file in **\~/.config/sgloader.ini**:
``` {.ini}
[main]
# where to store blob on disk
file_content_storage_dir = swh-git-loader/file-content-storage
# where to store commit/tree on disk
object_content_storage_dir = swh-git-loader/object-content-storage
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db
# http://initd.org/psycopg/docs/module.html#psycopg2.connect
db_url = dbname=swhgitloader
# activate the compression for each blob object
#blob_compression = true
# compute folder's depth on disk aa/bb/cc/dd
#folder_depth=4
```
Note:
- [DB url
DSL](http://initd.org/psycopg/docs/module.html#psycopg2.connect)
- the configuration file can be changed in the CLI with the flag \`-c
\<config-filepath\>\` or \`--config-file \<config-filepath\>\`
- Run
Environment initialization
--------------------------
``` {.bash}
export PYTHONPATH=`pwd`:$PYTHONPATH
```
Help
----
``` {.bash}
bin/sgloader --help
```
Parse a repository from a clean slate
-------------------------------------
Clean and initialize the model then parse the repository git:
``` {.bash}
bin/sgloader cleandb
bin/sgloader initdb
bin/sgloader load /path/to/git/repo
```
For ease:
``` {.bash}
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
```
Parse an existing repository
----------------------------
``` {.bash}
bin/sgloader load /path/to/git/repo
```
Clean data
----------
``` {.bash}
bin/sgloader cleandb
```
For ease:
``` {.bash}
make cleandb
```
Init data
---------
``` {.bash}
bin/sgloader initdb
```
Log
===
Activating the debug mode (flag \`-v\` or \`--verbose\` will log more
information in the following format: \<action-verb\> \<nature-object\>
\<sha1-name-or-path\>
where: \<action-verb\>
- walk walk a tree
- skip skip an already saved/visited object or unknown object (e.g.
commit submodule)
- store save an object in db (file or object) and content (file or
object) storage
- inject serialize on disk an object
- initialize Initialize the db
- clean Clean the db's data
\<nature-object\>
- tree
- commit
- blob
- reference
- submodule-commit A commit from a submodule
\<sha1-name-or-path\>
- sha1 git or swh's sha1
- name object name
- path object's content storage path
- Miscellaneous
Performance
-----------
This is not perf test per say. It's runs on a given machine.
Spec
----
cat /proc/cpuinfo:
``` {.bash}
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3100.195
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3099.992
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3099.992
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3100.093
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
```
Expected results
----------------
Given a specific repository <https://github.com/ardumont/dot-files.git>
Here is the expected result for each run (as per comparison purposes):
``` {.bash}
swhgitloader=> select count(*) from object_cache where type = 0; -- commit
count
-------
1744
(1 row)
swhgitloader=> select count(*) from object_cache where type = 1; -- tree
count
-------
2839
(1 row)
swhgitloader=> select count(*) from file_cache;
count
-------
2958
(1 row)
```
sqlalchemy
----------
ORM framework.
``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master o [10:35:08]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 161.05s user 10.82s system 76% cpu 3:46.01 total
```
psycopg2
--------
A simple db client.
First implementation, with one open/close for each db access:
``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [17:38:56]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 85.82s user 23.53s system 19% cpu 9:16.00 total
```
With one opened connection during all the computation:
``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [18:02:27]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 39.45s user 8.02s system 50% cpu 1:34.08 total
```
Sanitize the algorithm (remove unneeded check, use the file cache, ...)
:
``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [10:42:03]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 15.90s user 2.08s system 31% cpu 56.879 total
```
No need for byte decoding before serializing on disk:
``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [12:36:10]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files 14.67s user 1.64s system 30% cpu 54.303 total
```
Sample
------
repo url
-------- ------------------------------------------------
linux <https://github.com/torvalds/linux.git>
gcc <https://gcc.gnu.org/git/?p=gcc.git;a=summary>
pygit2 <https://github.com/libgit2/pygit2.git>
Filemode investigation
----------------------
git -
<https://github.com/git/git/blob/398dd4bd039680ba98497fbedffa415a43583c16/vcs-svn/repo_tree.h#L6-L9>:
``` {.c}
#define REPO_MODE_DIR 0040000
#define REPO_MODE_BLB 0100644
#define REPO_MODE_EXE 0100755
#define REPO_MODE_LNK 0120000
```
pygit2 -
<https://github.com/libgit2/pygit2/blob/d63c2d4fd7e45d99364b4d2ccc6a4dafc9b51705/src/pygit2.c#L211-L221>:
``` {.c}
ADD_CONSTANT_INT(m, GIT_OBJ_ANY)
ADD_CONSTANT_INT(m, GIT_OBJ_COMMIT)
ADD_CONSTANT_INT(m, GIT_OBJ_TREE)
ADD_CONSTANT_INT(m, GIT_OBJ_BLOB)
ADD_CONSTANT_INT(m, GIT_OBJ_TAG)
/* Valid modes for index and tree entries. */
ADD_CONSTANT_INT(m, GIT_FILEMODE_TREE)
ADD_CONSTANT_INT(m, GIT_FILEMODE_BLOB)
ADD_CONSTANT_INT(m, GIT_FILEMODE_BLOB_EXECUTABLE)
ADD_CONSTANT_INT(m, GIT_FILEMODE_LINK)
ADD_CONSTANT_INT(m, GIT_FILEMODE_COMMIT)
```
pygit2 -
<https://github.com/libgit2/pygit2/blob/c099655fc034c3be63017d0a3e112ea10928464a/src/tree.c#L52-L58>:
``` {.c}
PyDoc_STRVAR(TreeEntry_filemode__doc__, "Filemode.");
PyObject *
TreeEntry_filemode__get__(TreeEntry *self)
{
return PyLong_FromLong(git_tree_entry_filemode(self->entry));
}
```
pygit2 -
<https://github.com/libgit2/pygit2/blob/50a70086bfc72922b63a6e842582021a2bad0b24/src/utils.h#L49>:
``` {.c}
#define PyLong_FromLong PyInt_FromLong
```
From doc <https://docs.python.org/2/c-api/int.html>:
``` {.txt}
PyObject* PyInt_FromLong(long ival)
Return value: New reference.
Create a new integer object with a value of ival.
The current implementation keeps an array of integer objects for all integers between -5 and 256, when you
create an int in that range you actually just get back a reference to the existing object. So it should be
possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)
```
libgit2 -
<https://github.com/libgit2/libgit2/blob/623fbd93f1a7538df0c9a433df68f87bbd58b803/src/tree.c#L239-L241>:
``` {.c}
git_filemode_t git_tree_entry_filemode(const git_tree_entry *entry)
{
return normalize_filemode(entry->attr);
}
```
libgit2 -
<https://github.com/libgit2/libgit2/blob/623fbd93f1a7538df0c9a433df68f87bbd58b803/src/tree.c#L31-L51>:
``` {.c}
GIT_INLINE(git_filemode_t) normalize_filemode(git_filemode_t filemode)
{
/* Tree bits set, but it's not a commit */
if (GIT_MODE_TYPE(filemode) == GIT_FILEMODE_TREE)
return GIT_FILEMODE_TREE;
/* If any of the x bits are set */
if (GIT_PERMS_IS_EXEC(filemode))
return GIT_FILEMODE_BLOB_EXECUTABLE;
/* 16XXXX means commit */
if (GIT_MODE_TYPE(filemode) == GIT_FILEMODE_COMMIT)
return GIT_FILEMODE_COMMIT;
/* 12XXXX means commit */
if (GIT_MODE_TYPE(filemode) == GIT_FILEMODE_LINK)
return GIT_FILEMODE_LINK;
/* Otherwise, return a blob */
return GIT_FILEMODE_BLOB;
}
```
libgit2 -
<https://github.com/libgit2/libgit2/blob/f85a9c2767b43f35904bf39858488a4b7bc304e8/src/common.h#L13-L18>:
``` {.c}
/** Declare a function as always inlined. */
#if defined(_MSC_VER)
# define GIT_INLINE(type) static __inline type
#else
# define GIT_INLINE(type) static inline type
#endif
```
libgit2 -
<https://github.com/libgit2/libgit2/blob/d24a5312d8ab6d3cdb259e450ec9f1e2e6f3399d/src/fileops.h#L243-L250>:
``` {.c}
#define GIT_PERMS_IS_EXEC(MODE) (((MODE) & 0111) != 0)
#define GIT_PERMS_CANONICAL(MODE) (GIT_PERMS_IS_EXEC(MODE) ? 0755 : 0644)
#define GIT_PERMS_FOR_WRITE(MODE) (GIT_PERMS_IS_EXEC(MODE) ? 0777 : 0666)
#define GIT_MODE_PERMS_MASK 0777
#define GIT_MODE_TYPE_MASK 0170000
#define GIT_MODE_TYPE(MODE) ((MODE) & GIT_MODE_TYPE_MASK)
#define GIT_MODE_ISBLOB(MODE) (GIT_MODE_TYPE(MODE) == GIT_MODE_TYPE(GIT_FILEMODE_BLOB))
```
libgit2 -
<https://github.com/libgit2/libgit2/blob/c5c5cdb106d012d132475d9156923857f8d302fc/include/git2/types.h#L204-L212>:
``` {.c}
/** Valid modes for index and tree entries. */
typedef enum {
GIT_FILEMODE_UNREADABLE = 0000000,
GIT_FILEMODE_TREE = 0040000,
GIT_FILEMODE_BLOB = 0100644,
GIT_FILEMODE_BLOB_EXECUTABLE = 0100755,
GIT_FILEMODE_LINK = 0120000,
GIT_FILEMODE_COMMIT = 0160000,
} git_filemode_t;
```
File Metadata
Details
Attached
Mime Type
text/plain
Expires
Sat, Jun 21, 6:42 PM (2 w, 15 h ago)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
3433615
Attached To
rDLDG Git loader
Event Timeline
Log In to Comment