The Software Heritage Git Loader is a tool and a library to walk a local
Git repository and inject into the SWH dataset all contained files that
weren't known before.

License
=======

This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

See top-level LICENSE file for the full text of the GNU General Public
License along with this program.

Dependencies
============

Runtime
-------

- python3
- python3-psycopg2
- python3-pygit2

Test
----

- python3-nose

Requirements
============

Functional
----------

- input: a Git bare repository available locally, on the filesystem
- input (optional): a table mapping the SHA256 of individual files to the path on the filesystem that contains the corresponding content (AKA the file cache)
- input (optional): a set of SHA1 of Git commits that have already been seen in the past (AKA the Git commit cache)
- output: an augmented SWH dataset, where all files present in all blobs referenced by any Git object have been added

### algo

Sketch of the (naive) algorithm that the Git loader should execute:

``` {.pseudo}
for each ref in the repo
  for each commit referenced by the commit graph starting at that ref
    if we have a git commit cache and the commit is in there:
      stop treating the current commit sub-graph
    for each tree referenced by the commit
      for each blob referenced by the tree
        compute the SHA256 checksum of the blob
        lookup the checksum in the file cache
        if it is not there
          add the file to the dataset on the filesystem
          add the file to the file cache, pointing to the file path on the filesystem
```

Non-functional
--------------

- implementation language: Python3
- coding guidelines: conform to PEP8
- Git access: via libgit2/pygit2
- cache: implemented as Postgres tables

File-system storage
-------------------

Given a file with SHA256 of
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

It will be stored at
STORAGE_ROOT/b5/bb/9d/80/14a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c

Configuration
=============

Create a configuration file in **\~/.config/sgloader.ini**:

``` {.ini}
[main]
# where to store blob on disk
file_content_storage_dir = swh-git-loader/file-content-storage
# where to store commit/tree on disk
object_content_storage_dir = swh-git-loader/object-content-storage
# Where to store the logs
log_dir = swh-git-loader/log
# url access to db
# http://initd.org/psycopg/docs/module.html#psycopg2.connect
db_url = dbname=swhgitloader
# activate the compression for each blob object
#blob_compression = true
# compute folder's depth on disk aa/bb/cc/dd
#folder_depth=4
```

Note:

- [DB url DSL](http://initd.org/psycopg/docs/module.html#psycopg2.connect)
- the configuration file location can be changed on the command line with the flag `-c` or `--config-file`

Run
===

Environment initialization
--------------------------

``` {.bash}
export PYTHONPATH=`pwd`:$PYTHONPATH
```

Help
----

``` {.bash}
bin/sgloader --help
```

Parse a repository from a clean slate
-------------------------------------

Clean and initialize the model, then parse the Git repository:

``` {.bash}
bin/sgloader cleandb
bin/sgloader initdb
bin/sgloader load /path/to/git/repo
```

For ease:

``` {.bash}
make cleandb initdb clean-and-run REPO_PATH=/path/to/git/repo
```

Parse an existing repository
----------------------------

``` {.bash}
bin/sgloader load /path/to/git/repo
```

Clean data
----------

``` {.bash}
bin/sgloader cleandb
```

For ease:

``` {.bash}
make cleandb
```

Init data
---------

``` {.bash}
bin/sgloader initdb
```

Log
===

Activating the debug mode (flag `-v` or `--verbose`) will log more information in the following
format, made of three fields:

The first field is the action:

- walk: walk a tree
- skip: skip an already saved/visited object or an unknown object (e.g. commit submodule)
- store: save an object in db (file or object) and in content (file or object) storage
- inject: serialize an object on disk
- initialize: initialize the db
- clean: clean the db's data

The second field is the object type:

- tree
- commit
- blob
- reference
- submodule-commit: a commit from a submodule

The third field identifies the object:

- sha1: Git's or SWH's sha1
- name: object name
- path: object's content storage path

Miscellaneous
=============

Performance
-----------

This is not a performance test per se; these are runs on a given machine.

Spec
----

cat /proc/cpuinfo:

``` {.bash}
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3100.195
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3099.992
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3099.992
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz
stepping : 4
microcode : 0x16
cpu MHz : 3100.093
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap
bogomips : 5187.99
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
```

Expected results
----------------

Given a specific repository, here is the expected result for each run (for comparison purposes):

``` {.bash}
swhgitloader=> select count(*) from object_cache where type = 0; -- commit
 count
-------
  1744
(1 row)

swhgitloader=> select count(*) from object_cache where type = 1; -- tree
 count
-------
  2839
(1 row)

swhgitloader=> select count(*) from file_cache;
 count
-------
  2958
(1 row)
```

sqlalchemy
----------

ORM framework.

``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master o [10:35:08]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  161.05s user 10.82s system 76% cpu 3:46.01 total
```

psycopg2
--------

A simple db client.
First implementation, with one open/close for each db access:

``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [17:38:56]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  85.82s user 23.53s system 19% cpu 9:16.00 total
```

With one connection kept open during the whole computation:

``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [18:02:27]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  39.45s user 8.02s system 50% cpu 1:34.08 total
```

Sanitize the algorithm (remove unneeded checks, use the file cache, ...):

``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:psycopg2-tryout x [10:42:03]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  15.90s user 2.08s system 31% cpu 56.879 total
```

No need for byte decoding before serializing on disk:

``` {.bash}
# tony at corellia in ~/work/inria/repo/swh-git-loader on git:master x [12:36:10]
$ time make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files
rm -rf ./log
rm -rf ./dataset/
mkdir -p log dataset
bin/sgloader -v cleandb
bin/sgloader -v initdb
bin/sgloader -v load ~/repo/perso/dot-files
make cleandb run FLAG=-v REPO_PATH=~/repo/perso/dot-files  14.67s user 1.64s system 30% cpu 54.303 total
```

Sample
------

  repo     url
  -------- ------------------------------------------------
  linux
  gcc
  pygit2

Filemode investigation
----------------------

git:

``` {.c}
#define REPO_MODE_DIR 0040000
#define REPO_MODE_BLB 0100644
#define REPO_MODE_EXE 0100755
#define REPO_MODE_LNK 0120000
```

pygit2:

``` {.c}
ADD_CONSTANT_INT(m, GIT_OBJ_ANY)
ADD_CONSTANT_INT(m, GIT_OBJ_COMMIT)
ADD_CONSTANT_INT(m, GIT_OBJ_TREE)
ADD_CONSTANT_INT(m, GIT_OBJ_BLOB)
ADD_CONSTANT_INT(m, GIT_OBJ_TAG)
/* Valid modes for index and tree entries. */
ADD_CONSTANT_INT(m, GIT_FILEMODE_TREE)
ADD_CONSTANT_INT(m, GIT_FILEMODE_BLOB)
ADD_CONSTANT_INT(m, GIT_FILEMODE_BLOB_EXECUTABLE)
ADD_CONSTANT_INT(m, GIT_FILEMODE_LINK)
ADD_CONSTANT_INT(m, GIT_FILEMODE_COMMIT)
```

pygit2:

``` {.c}
PyDoc_STRVAR(TreeEntry_filemode__doc__, "Filemode.");

PyObject *
TreeEntry_filemode__get__(TreeEntry *self)
{
    return PyLong_FromLong(git_tree_entry_filemode(self->entry));
}
```

pygit2:

``` {.c}
#define PyLong_FromLong PyInt_FromLong
```

From the doc:

``` {.txt}
PyObject* PyInt_FromLong(long ival)
    Return value: New reference.

    Create a new integer object with a value of ival.

    The current implementation keeps an array of integer objects for all
    integers between -5 and 256, when you create an int in that range you
    actually just get back a reference to the existing object. So it should be
    possible to change the value of 1. I suspect the behaviour of Python in
    this case is undefined. :-)
```

libgit2:

``` {.c}
git_filemode_t git_tree_entry_filemode(const git_tree_entry *entry)
{
    return normalize_filemode(entry->attr);
}
```

libgit2:

``` {.c}
GIT_INLINE(git_filemode_t) normalize_filemode(git_filemode_t filemode)
{
    /* Tree bits set, but it's not a commit */
    if (GIT_MODE_TYPE(filemode) == GIT_FILEMODE_TREE)
        return GIT_FILEMODE_TREE;

    /* If any of the x bits are set */
    if (GIT_PERMS_IS_EXEC(filemode))
        return GIT_FILEMODE_BLOB_EXECUTABLE;

    /* 16XXXX means commit */
    if (GIT_MODE_TYPE(filemode) == GIT_FILEMODE_COMMIT)
        return GIT_FILEMODE_COMMIT;

    /* 12XXXX means symlink */
    if (GIT_MODE_TYPE(filemode) == GIT_FILEMODE_LINK)
        return GIT_FILEMODE_LINK;

    /* Otherwise, return a blob */
    return GIT_FILEMODE_BLOB;
}
```

libgit2:

``` {.c}
/** Declare a function as always inlined. */
#if defined(_MSC_VER)
# define GIT_INLINE(type) static __inline type
#else
# define GIT_INLINE(type) static inline type
#endif
```

libgit2:

``` {.c}
#define GIT_PERMS_IS_EXEC(MODE) (((MODE) & 0111) != 0)
#define GIT_PERMS_CANONICAL(MODE) (GIT_PERMS_IS_EXEC(MODE) ? 0755 : 0644)
#define GIT_PERMS_FOR_WRITE(MODE) (GIT_PERMS_IS_EXEC(MODE) ? 0777 : 0666)

#define GIT_MODE_PERMS_MASK 0777
#define GIT_MODE_TYPE_MASK 0170000
#define GIT_MODE_TYPE(MODE) ((MODE) & GIT_MODE_TYPE_MASK)
#define GIT_MODE_ISBLOB(MODE) (GIT_MODE_TYPE(MODE) == GIT_MODE_TYPE(GIT_FILEMODE_BLOB))
```

libgit2:

``` {.c}
/** Valid modes for index and tree entries. */
typedef enum {
    GIT_FILEMODE_UNREADABLE = 0000000,
    GIT_FILEMODE_TREE = 0040000,
    GIT_FILEMODE_BLOB = 0100644,
    GIT_FILEMODE_BLOB_EXECUTABLE = 0100755,
    GIT_FILEMODE_LINK = 0120000,
    GIT_FILEMODE_COMMIT = 0160000,
} git_filemode_t;
```
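
To see what values a tree entry's filemode can take on the Python side, the libgit2 normalization quoted above can be mirrored in a few lines of Python. This is only a sketch for experimenting with raw mode values: the constants are copied from the libgit2 enum and macros shown in this section, and the helper names (`mode_type`, `perms_is_exec`, `normalize_filemode`) are illustrative re-declarations, not functions exposed by pygit2.

```python
# Values copied from the libgit2 enum quoted above (octal literals).
GIT_FILEMODE_UNREADABLE = 0o000000
GIT_FILEMODE_TREE = 0o040000
GIT_FILEMODE_BLOB = 0o100644
GIT_FILEMODE_BLOB_EXECUTABLE = 0o100755
GIT_FILEMODE_LINK = 0o120000
GIT_FILEMODE_COMMIT = 0o160000

# Mask from the libgit2 macros quoted above.
GIT_MODE_TYPE_MASK = 0o170000


def mode_type(mode):
    """Keep only the object-type bits of a raw mode (GIT_MODE_TYPE)."""
    return mode & GIT_MODE_TYPE_MASK


def perms_is_exec(mode):
    """True if any of the x bits is set (GIT_PERMS_IS_EXEC)."""
    return (mode & 0o111) != 0


def normalize_filemode(mode):
    """Hypothetical Python port of libgit2's normalize_filemode."""
    # Tree bits set, but it's not a commit
    if mode_type(mode) == GIT_FILEMODE_TREE:
        return GIT_FILEMODE_TREE
    # Any executable bit set
    if perms_is_exec(mode):
        return GIT_FILEMODE_BLOB_EXECUTABLE
    # 16XXXX means commit (submodule entry)
    if mode_type(mode) == GIT_FILEMODE_COMMIT:
        return GIT_FILEMODE_COMMIT
    # 12XXXX means symlink
    if mode_type(mode) == GIT_FILEMODE_LINK:
        return GIT_FILEMODE_LINK
    # Otherwise, a regular blob
    return GIT_FILEMODE_BLOB
```

Because of this normalization, only five distinct values should come back for readable entries, so a loader can compare a filemode against the five constants directly instead of inspecting permission bits itself.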