diff --git a/PKG-INFO b/PKG-INFO
index f4e2782..0607540 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,106 +1,106 @@
Metadata-Version: 2.1
Name: swh.loader.git
-Version: 1.0.1
+Version: 1.1.0
Summary: Software Heritage git loader
Home-page: https://forge.softwareheritage.org/diffusion/DLDG/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-git
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-loader-git/
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: testing
License-File: LICENSE
License-File: AUTHORS
swh-loader-git
==============
The Software Heritage Git Loader is a tool and a library to walk a local
Git repository and inject into the SWH dataset all contained files that
weren't known before.
The main entry points are
- :class:`swh.loader.git.loader.GitLoader` for the main loader which ingests a remote git
repository's contents.
- :class:`swh.loader.git.from_disk.GitLoaderFromDisk` which ingests a local git clone
repository.
- :class:`swh.loader.git.loader.GitLoaderFromArchive` which ingests a git repository
wrapped in an archive.
License
-------
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public
License along with this program.
Dependencies
------------
### Runtime
- python3
- python3-dulwich
- python3-retrying
- python3-swh.core
- python3-swh.model
- python3-swh.storage
- python3-swh.scheduler
### Test
- python3-nose
Requirements
------------
- implementation language, Python3
- coding guidelines: conform to PEP8
- Git access: via dulwich
CLI Run
----------
You can run the loader from a remote origin (*loader*) or from an origin on disk
(*from_disk*) directly by calling:
```
swh loader -C <config-file> run git <git-repository-url>
```
or, for a repository on disk, with `git_disk` in place of `git`.
## Configuration sample
/tmp/git.yml:
```
storage:
  cls: remote
  args:
    url: http://localhost:5002/
```
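As a minimal sketch of how the CLI invocation and the configuration sample fit together, the Python helpers below write the sample configuration to a file and assemble the argument list for `swh loader -C <config-file> run git <git-repository-url>`. The helper names (`write_config`, `build_command`) are hypothetical and not part of swh.loader.git, and the nesting of the YAML keys is an assumption about the sample's intended layout.

```python
from pathlib import Path
from typing import List

# Sample configuration from the README; the nesting shown here is assumed.
CONFIG_TEXT = """\
storage:
  cls: remote
  args:
    url: http://localhost:5002/
"""

# Loader names mentioned in the README: remote origin and origin on disk.
KNOWN_LOADERS = ("git", "git_disk")


def write_config(path: Path) -> Path:
    """Write the sample configuration to `path` and return it (hypothetical helper)."""
    path.write_text(CONFIG_TEXT)
    return path


def build_command(config_file: str, origin_url: str, loader: str = "git") -> List[str]:
    """Assemble the argv list for the `swh loader ... run` call (hypothetical helper)."""
    if loader not in KNOWN_LOADERS:
        raise ValueError(f"unknown loader {loader!r}, expected one of {KNOWN_LOADERS}")
    return ["swh", "loader", "-C", str(config_file), "run", loader, origin_url]
```

The resulting list can be passed to `subprocess.run` on a machine where the `swh` command is installed. Building the command as a list rather than a string avoids shell quoting problems with unusual repository URLs or paths.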
diff --git a/debian/changelog b/debian/changelog
index 2bc1f3d..0e6a0bf 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,696 +1,698 @@
-swh-loader-git (1.0.1-1~swh1~bpo10+1) buster-swh; urgency=medium
+swh-loader-git (1.1.0-1~swh1) unstable-swh; urgency=medium
- * Rebuild for buster-swh
+ * New upstream release 1.1.0 - (tagged by Antoine Lambert
+ <anlambert@softwareheritage.org> on 2021-09-28 15:10:11 +0200)
+ * Upstream changes: - version 1.1.0
- -- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 22 Sep 2021 07:55:20 +0000
+ -- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 28 Sep 2021 13:14:10 +0000
swh-loader-git (1.0.1-1~swh1) unstable-swh; urgency=medium
* New upstream release 1.0.1 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2021-09-22 09:51:39
+0200)
* Upstream changes: - v1.0.1 - Fix tests for Dulwich < 0.20.22
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 22 Sep 2021 07:54:07 +0000
swh-loader-git (1.0.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 1.0.0 - (tagged by Valentin Lorentz
<vlorentz@softwareheritage.org> on 2021-09-17 12:28:36 +0200)
* Upstream changes: - v1.0.0 - * from_disk: Do not drop tags
with missing tagger or date - * Migrate to pytest-style tests
- * converters: Recompute hashes and check they match the originals
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 17 Sep 2021 10:31:49 +0000
swh-loader-git (0.10.1-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.10.1 - (tagged by Valentin Lorentz
<vlorentz@softwareheritage.org> on 2021-08-03 10:37:43 +0200)
* Upstream changes: - v0.10.1 - * from_disk: Improve error
logging - * converters: Preserve GPG signatures on releases -
* Do not exclude falsy git objects from being added.
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 03 Aug 2021 08:46:18 +0000
swh-loader-git (0.10.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.10.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2021-06-08 10:23:41
+0200)
* Upstream changes: - v0.10.0 - Spool large packfiles to disk
instead of consuming tons of memory
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 08 Jun 2021 08:27:09 +0000
swh-loader-git (0.9.1-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.9.1 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2021-04-16 15:36:06
+0200)
* Upstream changes: - v0.9.1 - Fix Pack File too big error
formatting - Rename 'git_metadata' to 'extra_headers'
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 16 Apr 2021 13:39:09 +0000
swh-loader-git (0.9.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.9.0 - (tagged by Nicolas Dandrimont
<nicolas@dandrimont.eu> on 2021-02-25 18:44:39 +0100)
* Upstream changes: - Release swh.loader.git 0.9.0 -
Throughput and memory usage improvements
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 25 Feb 2021 17:52:31 +0000
swh-loader-git (0.8.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.8.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2021-02-17 15:07:34
+0100)
* Upstream changes: - v0.8.0 - Rework loader instantiation
logic according to loader core api
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 17 Feb 2021 14:10:13 +0000
swh-loader-git (0.7.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.7.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2021-02-12 17:31:15
+0100)
* Upstream changes: - v0.7.0 - loader.git: Mark visit status
as not_found or failed when relevant - loader.git: Explicit the
failure test cases
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 12 Feb 2021 16:33:18 +0000
swh-loader-git (0.6.0-1~swh2) unstable-swh; urgency=medium
* Bump dependencies
-- Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Wed, 03 Feb 2021 14:46:51 +0100
swh-loader-git (0.6.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.6.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2021-02-03 14:31:51
+0100)
* Upstream changes: - v0.6.0 - Adapt
origin_get_latest_visit_status according to latest api change -
tox.ini: Add swh.core[testing] requirement - from_disk: Fix mypy
error with dulwich >= 0.20.13
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 03 Feb 2021 13:34:21 +0000
swh-loader-git (0.5.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.5.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-11-13 10:34:02
+0100)
* Upstream changes: - v0.5.0 - loader.git.from_disk: Register
loader in `swh loader run` cli
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 13 Nov 2020 09:34:28 +0000
swh-loader-git (0.4.1-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.4.1 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-10-02 17:07:58
+0200)
* Upstream changes: - v0.4.1 - git.loader*: Open configuration
passing from constructor - tox.ini: pin black to the pre-commit
version (19.10b0) to avoid flip-flops
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 02 Oct 2020 15:08:42 +0000
swh-loader-git (0.4.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.4.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-10-02 14:19:17
+0200)
* Upstream changes: - v0.4.0 - git.loader: Migrate away from
SWHConfig mixin - Drop vcversioner from setup.py (superseded by
setuptools-scm) - tests: Don't check the number of created
'person' objects. - python: Reorder imports with isort - pre-
commit: Add isort hook and configuration - pre-commit: Update
flake8 hook configuration - Tell pytest not to recurse in
dotdirs. - tests: Replace calls to snapshot_get with
snapshot_get_all_branches.
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 02 Oct 2020 12:19:50 +0000
swh-loader-git (0.3.6-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.6 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-08-06 16:49:04
+0200)
* Upstream changes: - v0.3.6 - Adapt code according to storage
signature
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 06 Aug 2020 14:50:22 +0000
swh-loader-git (0.3.5-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.5 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-07-28 07:49:13
+0200)
* Upstream changes: - v0.3.5 - loader: Update
swh.storage.origin_get call to latest api change
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 28 Jul 2020 05:51:16 +0000
swh-loader-git (0.3.4-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.4 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-07-26 05:55:34
+0200)
* Upstream changes: - v0.3.4 - setup.py: Migrate from
vcversioner to setuptools-scm - MANIFEST: Add missing conftest
requirement
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Sun, 26 Jul 2020 03:57:36 +0000
swh-loader-git (0.3.3-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.3 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-07-17 15:29:21
+0200)
* Upstream changes: - v0.3.3 - tests: Reuse pytest fixtures
from swh.loader.core - tests: Check against snapshot model
object
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 17 Jul 2020 13:30:55 +0000
swh-loader-git (0.3.2-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.2 - (tagged by David Douard
<david.douard@sdfa3.org> on 2020-07-16 10:46:20 +0200)
* Upstream changes: - v0.3.2
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 16 Jul 2020 08:49:36 +0000
swh-loader-git (0.3.1-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.1 - (tagged by Antoine Lambert
<antoine.lambert@inria.fr> on 2020-07-08 16:37:09 +0200)
* Upstream changes: - version 0.3.1
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 08 Jul 2020 14:39:48 +0000
swh-loader-git (0.3.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.3.0 - (tagged by David Douard
<david.douard@sdfa3.org> on 2020-07-08 15:35:34 +0200)
* Upstream changes: - v0.3.0
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 08 Jul 2020 13:38:18 +0000
swh-loader-git (0.2.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.2.0 - (tagged by Antoine R. Dumont
(@ardumont) <ardumont@softwareheritage.org> on 2020-06-23 15:14:13
+0200)
* Upstream changes: - v0.2.0 - loader: Read snapshot out of
the last origin visit status - tests: Use
assert_last_visit_matches - Adapt visit_date type from string to
datetime
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 23 Jun 2020 13:15:36 +0000
swh-loader-git (0.1.2-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.1.2 - (tagged by Antoine Lambert
<antoine.lambert@inria.fr> on 2020-06-03 14:54:36 +0200)
* Upstream changes: - version 0.1.2
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 03 Jun 2020 12:57:56 +0000
swh-loader-git (0.1.1-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.1.1 - (tagged by Antoine Lambert
<antoine.lambert@inria.fr> on 2020-06-02 17:44:44 +0200)
* Upstream changes: - version 0.1.1
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 02 Jun 2020 15:50:08 +0000
swh-loader-git (0.1.0-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.1.0 - (tagged by Nicolas Dandrimont
<nicolas@dandrimont.eu> on 2020-05-29 10:33:12 +0200)
* Upstream changes: - Release swh.loader.git v0.1.0 - Use the
previous snapshot instead of any object from the archive to do -
incremental loads - Merge branch filtering behavior between the
local and remote loaders - Add default target branch for
symbolic references
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 29 May 2020 08:37:48 +0000
swh-loader-git (0.0.60-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.60 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2020-04-15 11:52:55
+0200)
* Upstream changes: - v0.0.60 - git.loader: fix failing origin
visit update step - Add a pyproject.toml file to target py37 for
black - Enable black
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 15 Apr 2020 10:05:51 +0000
swh-loader-git (0.0.59-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.59 - (tagged by Antoine Lambert
<antoine.lambert@inria.fr> on 2020-04-06 11:59:27 +0200)
* Upstream changes: - version 0.0.59
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Mon, 06 Apr 2020 10:04:59 +0000
swh-loader-git (0.0.58-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.58 - (tagged by Valentin Lorentz
<vlorentz@softwareheritage.org> on 2020-03-02 11:25:43 +0100)
* Upstream changes: - v0.0.58 - * Use origin_visit_get_latest
instead of snapshot_get_latest. - * Use swh-model objects
instead of dicts.
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Mon, 02 Mar 2020 10:28:37 +0000
swh-loader-git (0.0.57-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.57 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2020-02-07 03:32:49
+0100)
* Upstream changes: - v0.0.57 - loaders: Remove content size
computation during conversion
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Fri, 07 Feb 2020 02:45:56 +0000
swh-loader-git (0.0.56-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.56 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2020-01-28 13:24:24
+0100)
* Upstream changes: - v0.0.56 - git.loader: Migrate from
UnbufferedLoader to DVCSLoader
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 28 Jan 2020 12:27:02 +0000
swh-loader-git (0.0.55-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.55 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2019-12-12 14:41:10
+0100)
* Upstream changes: - v0.0.55 - loader: Bump dependency on
loader-core
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 12 Dec 2019 13:44:22 +0000
swh-loader-git (0.0.54-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.54 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2019-12-12 11:43:50
+0100)
* Upstream changes: - v0.0.54 - tasks: Enforce kwargs use in
task message
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 12 Dec 2019 10:46:28 +0000
swh-loader-git (0.0.53-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.53 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2019-12-10 11:24:30
+0100)
* Upstream changes: - v0.0.53 - tasks: Unify message format
with other loaders - tasks: Use celery's shared_task decorator
- tests: Migrate to pytest-mock's fixture - loader.git: Register
git worker - tasks: Rename task according to production -
git: Unify loaders constructor - Fix a typo reported by
codespell - Add a pre-commit config file - Migrate tox.ini
to extras = xxx instead of deps = .[testing] - De-specify
testenv:py3 - Drop version constraint on pytest < 4 -
Include all requirements in MANIFEST.in - Add support for
symbolic references
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 10 Dec 2019 10:27:32 +0000
swh-loader-git (0.0.52-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.52 - (tagged by Stefano Zacchiroli
<zack@upsilon.cc> on 2019-10-10 12:07:05 +0200)
* Upstream changes: - v0.0.52 - (brown paper bag release) -
* MANIFEST.in: ship py.typed
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 10 Oct 2019 10:12:13 +0000
swh-loader-git (0.0.51-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.51 - (tagged by Stefano Zacchiroli
<zack@upsilon.cc> on 2019-10-10 11:59:01 +0200)
* Upstream changes: - v0.0.51 - * tox.ini: Fix py3 environment
to use packaged tests - * typing: minimal changes to make a no-
op mypy run pass - * test_from_disk.py: avoid shadowing base
classes in tests
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Thu, 10 Oct 2019 10:02:08 +0000
swh-loader-git (0.0.50-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.50 - (tagged by Antoine Lambert
<antoine.lambert@inria.fr> on 2019-09-03 13:07:54 +0200)
* Upstream changes: - version 0.0.50
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Tue, 03 Sep 2019 11:13:19 +0000
swh-loader-git (0.0.49-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.49 - (tagged by Valentin Lorentz
<vlorentz@softwareheritage.org> on 2019-06-12 15:05:10 +0200)
* Upstream changes: - Use origin URLs instead of numeric ids.
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 19 Jun 2019 10:28:05 +0000
swh-loader-git (0.0.48-1~swh1) unstable-swh; urgency=medium
* New upstream release 0.0.48 - (tagged by Antoine R. Dumont
(@ardumont) <antoine.romain.dumont@gmail.com> on 2019-01-30 11:18:55
+0100)
* Upstream changes: - v0.0.48 - Bump dependency on swh-
scheduler 0.0.39 - Rewrite celery tasks as a decorated function
-- Software Heritage autobuilder (on jenkins-debian1) <jenkins@jenkins-debian1.internal.softwareheritage.org> Wed, 30 Jan 2019 10:22:22 +0000
swh-loader-git (0.0.43-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.43
* Support the new paginated snapshot branch fetching functions
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 18 Oct 2018 18:49:26 +0200
swh-loader-git (0.0.42-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.42
* Fix critical bug in incremental loading
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 11 Oct 2018 17:19:07 +0200
swh-loader-git (0.0.41-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.41
* Use explicit keyword argument for base_url in the load task
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 11 Oct 2018 16:26:27 +0200
swh-loader-git (0.0.40-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.40
* Improve python packaging
* Make the loader more robust against holes in the history caused by
* buggy imports
* Allow ignoring the history to make a full load
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 09 Oct 2018 16:28:14 +0200
swh-loader-git (0.0.39-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.39
* Avoid walking the history of large git repos, which takes a long
time
* Really save packfiles
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 20 Sep 2018 17:22:17 +0200
swh-loader-git (0.0.38-1~swh1) unstable-swh; urgency=medium
* v0.0.38
* Improve origin_visit initialization step
* Properly sandbox the prepare statement so that if it breaks, we can
* update appropriately the visit with the correct status
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Wed, 07 Mar 2018 11:39:30 +0100
swh-loader-git (0.0.37-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.37
* Remove spurious debug print
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 06 Feb 2018 16:00:40 +0100
swh-loader-git (0.0.36-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.36
* Update to use snapshots instead of occurrences
* Use dulwich get_transport_and_path rather than hardcode the tcp
transport
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 06 Feb 2018 14:42:36 +0100
swh-loader-git (0.0.35-1~swh1) unstable-swh; urgency=medium
* v0.0.35
* swh.loader.git.loader: Warn when object is corrupted and continue
* swh.loader.git.loader: Add structured data to the log message
regarding skipping objects
* swh.loader.git.loader: Force further checks on objects
* swh.loader.git.loader: Unify reading object from the repository
* swh.loader.git.loader: Warn when object malformed and continue
* swh.loader.git.loader: Trap missing object id and continue
* swh.loader.git.base: Reuse swh.loader.core base loader
* swh.loader.git.converters: Fix release time conversion issue when no
date provided
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Mon, 18 Dec 2017 12:08:01 +0100
swh-loader-git (0.0.34-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git version 0.0.34
* Update packaging runes
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 12 Oct 2017 20:12:11 +0200
swh-loader-git (0.0.33-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.33
* make the updater's parent commit cache more useful
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 15 Sep 2017 18:45:41 +0200
swh-loader-git (0.0.32-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git 0.0.32
* Update tasks to new swh.scheduler API
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Mon, 12 Jun 2017 18:04:50 +0200
swh-loader-git (0.0.31-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.31
* Migrate from swh.core.hashutil to swh.model.hashutil
* Only send objects that are actually missing
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 17 Mar 2017 17:40:17 +0100
swh-loader-git (0.0.30-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.30
* Fix handling of mergetag headers
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 09 Mar 2017 11:30:08 +0100
swh-loader-git (0.0.29-1~swh1) unstable-swh; urgency=medium
* v0.0.29
* GitLoaderFromArchive: Use the same configuration file as
* GitLoader (permit to deploy both as the same unit)
* git reader: Refactor to allow listing revisions as well as contents
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Mon, 20 Feb 2017 11:32:24 +0100
swh-loader-git (0.0.28-1~swh1) unstable-swh; urgency=medium
* v0.0.28
* loader: Fix fetch_date override
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Wed, 15 Feb 2017 18:43:32 +0100
swh-loader-git (0.0.27-1~swh1) unstable-swh; urgency=medium
* v0.0.27
* Add loader-git from archive
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Tue, 14 Feb 2017 18:56:52 +0100
swh-loader-git (0.0.26-1~swh1) unstable-swh; urgency=medium
* v0.0.26
* Add a git loader able to deal with git repository in archive
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Tue, 14 Feb 2017 16:24:50 +0100
swh-loader-git (0.0.25-1~swh1) unstable-swh; urgency=medium
* v0.0.25
* Fix to permit to actually pass the fetch date as parameter for
* the loading git disk loader
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Fri, 10 Feb 2017 17:34:35 +0100
swh-loader-git (0.0.24-1~swh1) unstable-swh; urgency=medium
* v0.0.24
* Update storage configuration reading
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Thu, 15 Dec 2016 18:40:29 +0100
swh-loader-git (0.0.23-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.23
* Make the save_data mechanism generic
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 02 Dec 2016 15:34:05 +0100
swh-loader-git (0.0.22-1~swh1) unstable-swh; urgency=medium
* v0.0.22
* Improve reader to permit to use it as analyzer tool
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Fri, 04 Nov 2016 10:37:24 +0100
swh-loader-git (0.0.21-1~swh1) unstable-swh; urgency=medium
* v0.0.21
* Improve the reader git to load all contents from a pack.
* Improve to avoid unnecessary readings from db
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Wed, 26 Oct 2016 17:06:12 +0200
swh-loader-git (0.0.20-1~swh1) unstable-swh; urgency=medium
* v0.0.20
* Add new reader git task
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Tue, 25 Oct 2016 18:40:17 +0200
swh-loader-git (0.0.19-1~swh1) unstable-swh; urgency=medium
* v0.0.19
* Update git loaders to register origin_visit's state
-- Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Tue, 23 Aug 2016 16:34:15 +0200
swh-loader-git (0.0.18-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.18
* Properly handle skipped contents
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 19 Aug 2016 18:12:44 +0200
swh-loader-git (0.0.16-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.16
* Add exist_ok to packfile cache directory creation
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Mon, 01 Aug 2016 15:53:07 +0200
swh-loader-git (0.0.15-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.15
* Absence of remote refs doesn't throw an error in updater
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Wed, 15 Jun 2016 01:20:37 +0200
swh-loader-git (0.0.14-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.14
* Add a disk loader using dulwich
* Rework the loader logic to use a single pattern for both loaders
* Allow caching of packfiles for the remote loader
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 14 Jun 2016 18:10:21 +0200
swh-loader-git (0.0.13-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.13
* Update for latest schema revision
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 08 Apr 2016 16:46:41 +0200
swh-loader-git (0.0.12-1~swh1) unstable-swh; urgency=medium
* Release swh-loader-git v0.0.12
* Update to use new swh.storage api for object listing
* Add a size limit to packfiles
* Return a proper eventfulness for empty repositories
* Do not crawl the pack file if unnecessary
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 25 Feb 2016 18:21:34 +0100
swh-loader-git (0.0.11-1~swh1) unstable-swh; urgency=medium
* Release swh.loader.git v0.0.11
* Implement git updater
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 19 Feb 2016 19:13:22 +0100
swh-loader-git (0.0.10-1~swh1) unstable-swh; urgency=medium
* Prepare swh.loader.git release v0.0.10
* Update for swh.model
* Use new swh.storage
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Mon, 07 Dec 2015 18:59:46 +0100
swh-loader-git (0.0.9-1~swh1) unstable-swh; urgency=medium
* Prepare deployment of swh.loader.git v0.0.9
* Close fetch_history on failure too
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Wed, 04 Nov 2015 10:54:37 +0100
swh-loader-git (0.0.8-1~swh1) unstable-swh; urgency=medium
* Prepare deployment of swh.loader.git v0.0.8
* New database schema (v028)
* Populate fetch_history (T121)
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 27 Oct 2015 18:11:26 +0100
swh-loader-git (0.0.7-1~swh1) unstable-swh; urgency=medium
* Prepare swh.loader.git v0.0.7 deployment
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Mon, 19 Oct 2015 12:37:09 +0200
swh-loader-git (0.0.6-1~swh1) unstable-swh; urgency=medium
* Prepare deployment of swh.loader.git v0.0.6
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 09 Oct 2015 17:50:35 +0200
swh-loader-git (0.0.5-1~swh1) unstable-swh; urgency=medium
* Prepare deployment of swh.loader.git v0.0.5
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 06 Oct 2015 17:42:11 +0200
swh-loader-git (0.0.4-1~swh1) unstable-swh; urgency=medium
* Prepare deployment of swh.loader.git v0.0.4
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 02 Oct 2015 14:54:04 +0200
swh-loader-git (0.0.3-1~swh1) unstable-swh; urgency=medium
* Prepare deployment of swh.loader.git v0.0.3
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Thu, 01 Oct 2015 11:36:28 +0200
swh-loader-git (0.0.2-1~swh1) unstable-swh; urgency=medium
* Prepare deploying swh.loader.git v0.0.2
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Tue, 29 Sep 2015 17:22:09 +0200
swh-loader-git (0.0.1-1~swh1) unstable-swh; urgency=medium
* Initial release
* Tagging swh.loader.git v0.0.1
-- Nicolas Dandrimont <nicolas@dandrimont.eu> Fri, 25 Sep 2015 16:04:00 +0200
diff --git a/debian/control b/debian/control
index 7c2bae9..9bfadc6 100644
--- a/debian/control
+++ b/debian/control
@@ -1,34 +1,35 @@
Source: swh-loader-git
Maintainer: Software Heritage developers <swh-devel@inria.fr>
Section: python
Priority: optional
Build-Depends: debhelper (>= 9),
dh-python (>= 2),
+ git,
python3-all,
python3-click,
python3-dulwich (>= 0.18.7~),
python3-pytest,
python3-pytest-mock,
python3-pytest-postgresql,
python3-retrying,
python3-setuptools,
python3-setuptools-scm,
python3-swh.core (>= 0.1.0~),
python3-swh.core.db.pytestplugin,
python3-swh.loader.core (>= 0.16~),
python3-swh.model (>= 0.4.0~),
python3-swh.scheduler (>= 0.3.0~),
python3-swh.storage (>= 0.22~)
Standards-Version: 3.9.6
Homepage: https://forge.softwareheritage.org/diffusion/DLDG/
Package: python3-swh.loader.git
Architecture: all
Depends: python3-swh.core (>= 0.1.0~),
python3-swh.loader.core (>= 0.16~),
python3-swh.model (>= 0.4.0~),
python3-swh.scheduler (>= 0.3.0~),
python3-swh.storage (>= 0.22~),
${misc:Depends},
${python3:Depends}
Description: Software Heritage Git loader
diff --git a/mypy.ini b/mypy.ini
index f916912..20fe337 100644
--- a/mypy.ini
+++ b/mypy.ini
@@ -1,21 +1,24 @@
[mypy]
namespace_packages = True
warn_unused_ignores = True
# 3rd party libraries without stubs (yet)
[mypy-celery.*]
ignore_missing_imports = True
[mypy-dulwich.*]
ignore_missing_imports = True
[mypy-pkg_resources.*]
ignore_missing_imports = True
[mypy-pytest.*]
ignore_missing_imports = True
+[mypy-urllib3.*]
+ignore_missing_imports = True
+
[mypy-swh.loader.*]
ignore_missing_imports = True
diff --git a/swh.loader.git.egg-info/PKG-INFO b/swh.loader.git.egg-info/PKG-INFO
index f4e2782..0607540 100644
--- a/swh.loader.git.egg-info/PKG-INFO
+++ b/swh.loader.git.egg-info/PKG-INFO
@@ -1,106 +1,106 @@
Metadata-Version: 2.1
Name: swh.loader.git
-Version: 1.0.1
+Version: 1.1.0
Summary: Software Heritage git loader
Home-page: https://forge.softwareheritage.org/diffusion/DLDG/
Author: Software Heritage developers
Author-email: swh-devel@inria.fr
License: UNKNOWN
Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Source, https://forge.softwareheritage.org/source/swh-loader-git
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-loader-git/
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: testing
License-File: LICENSE
License-File: AUTHORS
swh-loader-git
==============
The Software Heritage Git Loader is a tool and a library to walk a local
Git repository and inject into the SWH dataset all contained files that
weren't known before.
The main entry points are
- :class:`swh.loader.git.loader.GitLoader` for the main loader which ingests a remote git
repository's contents.
- :class:`swh.loader.git.from_disk.GitLoaderFromDisk` which ingests a local git clone
repository.
- :class:`swh.loader.git.loader.GitLoaderFromArchive` which ingests a git repository
wrapped in an archive.
License
-------
This program is free software: you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
See top-level LICENSE file for the full text of the GNU General Public
License along with this program.
Dependencies
------------
### Runtime
- python3
- python3-dulwich
- python3-retrying
- python3-swh.core
- python3-swh.model
- python3-swh.storage
- python3-swh.scheduler
### Test
- python3-nose
Requirements
------------
- implementation language, Python3
- coding guidelines: conform to PEP8
- Git access: via dulwich
CLI Run
----------
You can run the loader from a remote origin (*loader*) or from an origin on disk
(*from_disk*) directly by calling:
```
swh loader -C <config-file> run git <git-repository-url>
```
or "git_disk".
## Configuration sample
/tmp/git.yml:
```
storage:
cls: remote
args:
url: http://localhost:5002/
```
diff --git a/swh.loader.git.egg-info/SOURCES.txt b/swh.loader.git.egg-info/SOURCES.txt
index 2c3e5eb..63bd401 100644
--- a/swh.loader.git.egg-info/SOURCES.txt
+++ b/swh.loader.git.egg-info/SOURCES.txt
@@ -1,57 +1,58 @@
.gitignore
.pre-commit-config.yaml
AUTHORS
CODE_OF_CONDUCT.md
CONTRIBUTORS
LICENSE
MANIFEST.in
Makefile
README.md
conftest.py
mypy.ini
pyproject.toml
pytest.ini
requirements-swh.txt
requirements-test.txt
requirements.txt
setup.cfg
setup.py
tox.ini
bin/dir-git-repo-meta.sh
docs/.gitignore
docs/Makefile
docs/conf.py
docs/index.rst
docs/_static/.placeholder
docs/_templates/.placeholder
docs/attic/api-backend-protocol.txt
docs/attic/git-loading-design.txt
resources/local-loader-git.ini
resources/remote-loader-git.ini
resources/updater.ini
resources/test/back.ini
resources/test/db-manager.ini
swh/__init__.py
swh.loader.git.egg-info/PKG-INFO
swh.loader.git.egg-info/SOURCES.txt
swh.loader.git.egg-info/dependency_links.txt
swh.loader.git.egg-info/entry_points.txt
swh.loader.git.egg-info/requires.txt
swh.loader.git.egg-info/top_level.txt
swh/loader/__init__.py
swh/loader/git/__init__.py
swh/loader/git/converters.py
+swh/loader/git/dumb.py
swh/loader/git/from_disk.py
swh/loader/git/loader.py
swh/loader/git/py.typed
swh/loader/git/tasks.py
swh/loader/git/utils.py
swh/loader/git/tests/__init__.py
swh/loader/git/tests/conftest.py
swh/loader/git/tests/test_converters.py
swh/loader/git/tests/test_from_disk.py
swh/loader/git/tests/test_loader.py
swh/loader/git/tests/test_tasks.py
swh/loader/git/tests/test_utils.py
swh/loader/git/tests/data/testrepo.tgz
swh/loader/git/tests/data/git-repos/example-submodule.bundle
\ No newline at end of file
diff --git a/swh/loader/git/dumb.py b/swh/loader/git/dumb.py
new file mode 100644
index 0000000..088974b
--- /dev/null
+++ b/swh/loader/git/dumb.py
@@ -0,0 +1,197 @@
+# Copyright (C) 2021 The Software Heritage developers
+# See the AUTHORS file at the top-level directory of this distribution
+# License: GNU General Public License version 3, or any later version
+# See top-level LICENSE file for more information
+
+from __future__ import annotations
+
+from collections import defaultdict
+import logging
+import stat
+from tempfile import SpooledTemporaryFile
+from typing import TYPE_CHECKING, Callable, Dict, Iterable, List, Set, cast
+
+from dulwich.client import HttpGitClient
+from dulwich.objects import S_IFGITLINK, Commit, ShaFile, Tree
+from dulwich.pack import Pack, PackData, PackIndex, load_pack_index_file
+from urllib3.response import HTTPResponse
+
+if TYPE_CHECKING:
+ from .loader import RepoRepresentation
+
+logger = logging.getLogger(__name__)
+
+
+class DumbHttpGitClient(HttpGitClient):
+ """Simple wrapper around dulwich.client.HttpGitClient
+ """
+
+ def __init__(self, base_url: str):
+ super().__init__(base_url)
+ self.user_agent = "Software Heritage dumb Git loader"
+
+ def get(self, url: str) -> HTTPResponse:
+ logger.debug("Fetching %s", url)
+ response, _ = self._http_request(url, headers={"User-Agent": self.user_agent})
+ return response
+
+
+def check_protocol(repo_url: str) -> bool:
+ """Checks if a git repository can be cloned using the dumb protocol.
+
+ Args:
+ repo_url: Base URL of a git repository
+
+ Returns:
+ Whether the dumb protocol is supported.
+
+ """
+ if not repo_url.startswith("http"):
+ return False
+ http_client = DumbHttpGitClient(repo_url)
+ url = http_client.get_url("info/refs?service=git-upload-pack")
+ response = http_client.get(url)
+ content_type = response.content_type
+ return response.status in (200, 304,) and (
+ # header is not mandatory in protocol specification
+ content_type is None
+ or not content_type.startswith("application/x-git-")
+ )
+
+
+class GitObjectsFetcher:
+ """Git objects fetcher using dumb HTTP protocol.
+
+ Fetches a set of git objects for a repository according to its archival
+ state by Software Heritage and provides iterators on them.
+
+ Args:
+ repo_url: Base URL of a git repository
+ base_repo: State of repository archived by Software Heritage
+ """
+
+ def __init__(self, repo_url: str, base_repo: RepoRepresentation):
+ self.http_client = DumbHttpGitClient(repo_url)
+ self.base_repo = base_repo
+ self.objects: Dict[bytes, Set[bytes]] = defaultdict(set)
+ self.refs = self._get_refs()
+ self.head = self._get_head()
+ self.packs = self._get_packs()
+
+ def fetch_object_ids(self) -> None:
+ """Fetches identifiers of git objects to load into the archive.
+ """
+ wants = self.base_repo.determine_wants(self.refs)
+
+ # process refs
+ commit_objects = []
+ for ref in wants:
+ ref_object = self._get_git_object(ref)
+ if ref_object.get_type() == Commit.type_num:
+ commit_objects.append(cast(Commit, ref_object))
+ self.objects[b"commit"].add(ref)
+ else:
+ self.objects[b"tag"].add(ref)
+
+ # perform DFS on commits graph
+ while commit_objects:
+ commit = commit_objects.pop()
+ # fetch tree and blob ids recursively
+ self._fetch_tree_objects(commit.tree)
+ for parent in commit.parents:
+ if (
+ # commit not already seen in the current load
+ parent not in self.objects[b"commit"]
+ # commit not already archived by a previous load
+ and parent not in self.base_repo.heads
+ ):
+ commit_objects.append(cast(Commit, self._get_git_object(parent)))
+ self.objects[b"commit"].add(parent)
+
+ def iter_objects(self, object_type: bytes) -> Iterable[ShaFile]:
+ """Returns a generator on fetched git objects per type.
+
+ Args:
+ object_type: Git object type, either b"blob", b"commit", b"tag" or b"tree"
+
+ Returns:
+ A generator fetching git objects on the fly.
+ """
+ return map(self._get_git_object, self.objects[object_type])
+
+ def _http_get(self, path: str) -> SpooledTemporaryFile:
+ url = self.http_client.get_url(path)
+ response = self.http_client.get(url)
+ buffer = SpooledTemporaryFile(max_size=100 * 1024 * 1024)
+ buffer.write(response.data)
+ buffer.flush()
+ buffer.seek(0)
+ return buffer
+
+ def _get_refs(self) -> Dict[bytes, bytes]:
+ refs = {}
+ refs_resp_bytes = self._http_get("info/refs")
+ for ref_line in refs_resp_bytes.readlines():
+ ref_target, ref_name = ref_line.replace(b"\n", b"").split(b"\t")
+ refs[ref_name] = ref_target
+ return refs
+
+ def _get_head(self) -> Dict[bytes, bytes]:
+ head_resp_bytes = self._http_get("HEAD")
+ _, head_target = head_resp_bytes.readline().replace(b"\n", b"").split(b" ")
+ return {b"HEAD": head_target}
+
+ def _get_pack_data(self, pack_name: str) -> Callable[[], PackData]:
+ def _pack_data() -> PackData:
+ pack_data_bytes = self._http_get(f"objects/pack/{pack_name}")
+ return PackData(pack_name, file=pack_data_bytes)
+
+ return _pack_data
+
+ def _get_pack_idx(self, pack_idx_name: str) -> Callable[[], PackIndex]:
+ def _pack_idx() -> PackIndex:
+ pack_idx_bytes = self._http_get(f"objects/pack/{pack_idx_name}")
+ return load_pack_index_file(pack_idx_name, pack_idx_bytes)
+
+ return _pack_idx
+
+ def _get_packs(self) -> List[Pack]:
+ packs = []
+ packs_info_bytes = self._http_get("objects/info/packs")
+ packs_info = packs_info_bytes.read().decode()
+ for pack_info in packs_info.split("\n"):
+ if pack_info:
+ pack_name = pack_info.split(" ")[1]
+ pack_idx_name = pack_name.replace(".pack", ".idx")
+ # pack index and data file will be lazily fetched when required
+ packs.append(
+ Pack.from_lazy_objects(
+ self._get_pack_data(pack_name),
+ self._get_pack_idx(pack_idx_name),
+ )
+ )
+ return packs
+
+ def _get_git_object(self, sha: bytes) -> ShaFile:
+ # try to get the object from a pack file first to avoid flooding
+ # git server with numerous HTTP requests
+ for pack in self.packs:
+ if sha in pack:
+ return pack[sha]
+ # fetch it from objects/ directory otherwise
+ sha_hex = sha.decode()
+ object_path = f"objects/{sha_hex[:2]}/{sha_hex[2:]}"
+ return ShaFile.from_file(self._http_get(object_path))
+
+ def _fetch_tree_objects(self, sha: bytes) -> None:
+ if sha not in self.objects[b"tree"]:
+ tree = cast(Tree, self._get_git_object(sha))
+ self.objects[b"tree"].add(sha)
+ for item in tree.items():
+ if item.mode == S_IFGITLINK:
+ # skip submodules as objects are not stored in repository
+ continue
+ if item.mode & stat.S_IFDIR:
+ self._fetch_tree_objects(item.sha)
+ else:
+ self.objects[b"blob"].add(item.sha)
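The text formats consumed by `_get_refs`, `_get_packs` and `_get_git_object`, and the response check in `check_protocol`, can be sketched without any network access. This is a minimal illustration, not the loader's code: the helper names (`parse_info_refs`, `parse_packs_info`, `loose_object_path`, `looks_like_dumb_response`) are hypothetical, and the payload layouts are assumed to be those written by `git update-server-info` (`<sha>\t<ref>` lines in `info/refs`, `P <pack name>` lines in `objects/info/packs`).

```python
from typing import Dict, List, Optional


def parse_info_refs(payload: bytes) -> Dict[bytes, bytes]:
    """Map ref names to hex object ids from an ``info/refs`` payload."""
    refs = {}
    for line in payload.splitlines():
        if not line:
            continue
        # each line is "<hex sha>\t<ref name>"
        target, name = line.split(b"\t", 1)
        refs[name] = target
    return refs


def parse_packs_info(payload: str) -> List[str]:
    """Return the pack file names listed in ``objects/info/packs``."""
    return [
        line.split(" ", 1)[1]
        for line in payload.splitlines()
        if line.startswith("P ")
    ]


def loose_object_path(sha_hex: str) -> str:
    """Path of a loose object, relative to the repository root."""
    return f"objects/{sha_hex[:2]}/{sha_hex[2:]}"


def looks_like_dumb_response(status: int, content_type: Optional[str]) -> bool:
    """True when the reply carries no smart-protocol content type.

    The parentheses matter: without them, ``and`` binds tighter than ``or``
    and a missing header would be dereferenced on non-200 replies.
    """
    return status in (200, 304) and (
        content_type is None or not content_type.startswith("application/x-git-")
    )
```

A pack index name is then derived from each pack name by swapping the `.pack` suffix for `.idx`, as `_get_packs` does.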
diff --git a/swh/loader/git/loader.py b/swh/loader/git/loader.py
index c1ccbab..bdfeabc 100644
--- a/swh/loader/git/loader.py
+++ b/swh/loader/git/loader.py
@@ -1,478 +1,501 @@
# Copyright (C) 2016-2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from dataclasses import dataclass
import datetime
import logging
import os
import pickle
import sys
from tempfile import SpooledTemporaryFile
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Set, Type
import dulwich.client
from dulwich.errors import GitProtocolError, NotGitRepository
from dulwich.object_store import ObjectStoreGraphWalker
from dulwich.objects import ShaFile
from dulwich.pack import PackData, PackInflater
from swh.loader.core.loader import DVCSLoader
from swh.loader.exception import NotFound
from swh.model import hashutil
from swh.model.model import (
BaseContent,
Directory,
Origin,
Release,
Revision,
Snapshot,
SnapshotBranch,
TargetType,
)
from swh.storage.algos.snapshot import snapshot_get_latest
from swh.storage.interface import StorageInterface
-from . import converters, utils
+from . import converters, dumb, utils
logger = logging.getLogger(__name__)
class RepoRepresentation:
"""Repository representation for a Software Heritage origin."""
def __init__(
self, storage, base_snapshot: Optional[Snapshot] = None, ignore_history=False
):
self.storage = storage
self.ignore_history = ignore_history
if base_snapshot and not ignore_history:
self.base_snapshot: Snapshot = base_snapshot
else:
self.base_snapshot = Snapshot(branches={})
self.heads: Set[bytes] = set()
def get_parents(self, commit: bytes) -> List[bytes]:
"""This method should return the list of known parents"""
return []
def graph_walker(self) -> ObjectStoreGraphWalker:
return ObjectStoreGraphWalker(self.heads, self.get_parents)
def determine_wants(self, refs: Dict[bytes, bytes]) -> List[bytes]:
"""Get the list of bytehex sha1s that the git loader should fetch.
This compares the remote refs sent by the server with the base snapshot
provided by the loader.
"""
if not refs:
return []
# Cache existing heads
local_heads: Set[bytes] = set()
for branch_name, branch in self.base_snapshot.branches.items():
if not branch or branch.target_type == TargetType.ALIAS:
continue
local_heads.add(hashutil.hash_to_hex(branch.target).encode())
self.heads = local_heads
# Get the remote heads that we want to fetch
remote_heads: Set[bytes] = set()
for ref_name, ref_target in refs.items():
if utils.ignore_branch_name(ref_name):
continue
remote_heads.add(ref_target)
return list(remote_heads - local_heads)
@dataclass
class FetchPackReturn:
remote_refs: Dict[bytes, bytes]
symbolic_refs: Dict[bytes, bytes]
pack_buffer: SpooledTemporaryFile
pack_size: int
class GitLoader(DVCSLoader):
"""A bulk loader for a git repository"""
visit_type = "git"
def __init__(
self,
storage: StorageInterface,
url: str,
base_url: Optional[str] = None,
ignore_history: bool = False,
repo_representation: Type[RepoRepresentation] = RepoRepresentation,
pack_size_bytes: int = 4 * 1024 * 1024 * 1024,
temp_file_cutoff: int = 100 * 1024 * 1024,
save_data_path: Optional[str] = None,
max_content_size: Optional[int] = None,
):
"""Initialize the bulk updater.
Args:
repo_representation: swh's repository representation
which is in charge of filtering between known and remote
data.
"""
super().__init__(
storage=storage,
save_data_path=save_data_path,
max_content_size=max_content_size,
)
self.origin_url = url
self.base_url = base_url
self.ignore_history = ignore_history
self.repo_representation = repo_representation
self.pack_size_bytes = pack_size_bytes
self.temp_file_cutoff = temp_file_cutoff
# state initialized in fetch_data
self.remote_refs: Dict[bytes, bytes] = {}
self.symbolic_refs: Dict[bytes, bytes] = {}
self.ref_object_types: Dict[bytes, Optional[TargetType]] = {}
def fetch_pack_from_origin(
self,
origin_url: str,
- base_snapshot: Optional[Snapshot],
+ base_repo: RepoRepresentation,
do_activity: Callable[[bytes], None],
) -> FetchPackReturn:
"""Fetch a pack from the origin"""
- pack_buffer = SpooledTemporaryFile(max_size=self.temp_file_cutoff)
- base_repo = self.repo_representation(
- storage=self.storage,
- base_snapshot=base_snapshot,
- ignore_history=self.ignore_history,
- )
+ pack_buffer = SpooledTemporaryFile(max_size=self.temp_file_cutoff)
# Hardcode the use of the tcp transport (for GitHub origins)
# Even if the Dulwich API lets us process the packfile in chunks as it's
# received, the HTTP transport implementation needs to entirely allocate
# the packfile in memory *twice*, once in the HTTP library, and once in
# a BytesIO managed by Dulwich, before passing chunks to the `do_pack`
# method Overall this triples the memory usage before we can even try to
# interrupt the loader before it overruns its memory limit.
# In contrast, the Dulwich TCP transport just gives us the read handle
# on the underlying socket, doing no processing or copying of the bytes.
# We can interrupt it as soon as we've received too many bytes.
transport_url = origin_url
if transport_url.startswith("https://github.com/"):
transport_url = "git" + transport_url[5:]
client, path = dulwich.client.get_transport_and_path(
transport_url, thin_packs=False
)
size_limit = self.pack_size_bytes
def do_pack(data: bytes) -> None:
cur_size = pack_buffer.tell()
would_write = len(data)
if cur_size + would_write > size_limit:
raise IOError(
f"Pack file too big for repository {origin_url}, "
f"limit is {size_limit} bytes, current size is {cur_size}, "
f"would write {would_write}"
)
pack_buffer.write(data)
pack_result = client.fetch_pack(
path,
base_repo.determine_wants,
base_repo.graph_walker(),
do_pack,
progress=do_activity,
)
remote_refs = pack_result.refs or {}
symbolic_refs = pack_result.symrefs or {}
pack_buffer.flush()
pack_size = pack_buffer.tell()
pack_buffer.seek(0)
logger.debug("Fetched pack size: %s", pack_size)
+ # check if the repository only supports the git dumb transfer protocol;
+ # the fetched pack file will be empty in that case, as dulwich does
+ # not support it and does not fetch any refs
+ self.dumb = transport_url.startswith("http") and client.dumb
+
return FetchPackReturn(
remote_refs=utils.filter_refs(remote_refs),
symbolic_refs=utils.filter_refs(symbolic_refs),
pack_buffer=pack_buffer,
pack_size=pack_size,
)
def prepare_origin_visit(self) -> None:
self.visit_date = datetime.datetime.now(tz=datetime.timezone.utc)
self.origin = Origin(url=self.origin_url)
def get_full_snapshot(self, origin_url) -> Optional[Snapshot]:
return snapshot_get_latest(self.storage, origin_url)
def prepare(self) -> None:
assert self.origin is not None
prev_snapshot: Optional[Snapshot] = None
if not self.ignore_history:
prev_snapshot = self.get_full_snapshot(self.origin.url)
if self.base_url and prev_snapshot is None:
base_origin = list(self.storage.origin_get([self.base_url]))[0]
if base_origin:
prev_snapshot = self.get_full_snapshot(base_origin.url)
if prev_snapshot is not None:
self.base_snapshot = prev_snapshot
else:
self.base_snapshot = Snapshot(branches={})
def fetch_data(self) -> bool:
assert self.origin is not None
+ base_repo = self.repo_representation(
+ storage=self.storage,
+ base_snapshot=self.base_snapshot,
+ ignore_history=self.ignore_history,
+ )
+
def do_progress(msg: bytes) -> None:
sys.stderr.buffer.write(msg)
sys.stderr.flush()
try:
fetch_info = self.fetch_pack_from_origin(
- self.origin.url, self.base_snapshot, do_progress
+ self.origin.url, base_repo, do_progress
)
except NotGitRepository as e:
raise NotFound(e)
except GitProtocolError as e:
# unfortunately, that kind of error is not specific to a not found
# scenario... It depends on the value of message within the exception.
for msg in [
"Repository unavailable", # e.g DMCA takedown
"Repository not found",
"unexpected http resp 401",
]:
if msg in e.args[0]:
raise NotFound(e)
# otherwise transmit the error
raise
+ except (AttributeError, NotImplementedError, ValueError):
+ # with old dulwich versions, these exception types can be raised
+ # by the fetch_pack operation when encountering a repository with
+ # dumb transfer protocol, so we check whether the repository supports
+ # it here and continue the loading if that is the case
+ self.dumb = dumb.check_protocol(self.origin_url)
+ if not self.dumb:
+ raise
+
+ if self.dumb:
+ logger.debug("Fetching objects with HTTP dumb transfer protocol")
+ self.dumb_fetcher = dumb.GitObjectsFetcher(self.origin_url, base_repo)
+ self.dumb_fetcher.fetch_object_ids()
+ self.remote_refs = utils.filter_refs(self.dumb_fetcher.refs)
+ self.symbolic_refs = self.dumb_fetcher.head
+ else:
+ self.pack_buffer = fetch_info.pack_buffer
+ self.pack_size = fetch_info.pack_size
+ self.remote_refs = fetch_info.remote_refs
+ self.symbolic_refs = fetch_info.symbolic_refs
- self.pack_buffer = fetch_info.pack_buffer
- self.pack_size = fetch_info.pack_size
-
- self.remote_refs = fetch_info.remote_refs
self.ref_object_types = {sha1: None for sha1 in self.remote_refs.values()}
- self.symbolic_refs = fetch_info.symbolic_refs
-
self.log.info(
"Listed %d refs for repo %s" % (len(self.remote_refs), self.origin.url),
extra={
"swh_type": "git_repo_list_refs",
"swh_repo": self.origin.url,
"swh_num_refs": len(self.remote_refs),
},
)
# No more data to fetch
return False
def save_data(self) -> None:
"""Store a pack for archival"""
assert isinstance(self.visit_date, datetime.datetime)
write_size = 8192
pack_dir = self.get_save_data_path()
pack_name = "%s.pack" % self.visit_date.isoformat()
refs_name = "%s.refs" % self.visit_date.isoformat()
with open(os.path.join(pack_dir, pack_name), "xb") as f:
self.pack_buffer.seek(0)
while True:
r = self.pack_buffer.read(write_size)
if not r:
break
f.write(r)
self.pack_buffer.seek(0)
with open(os.path.join(pack_dir, refs_name), "xb") as f:
pickle.dump(self.remote_refs, f)
def iter_objects(self, object_type: bytes) -> Iterator[ShaFile]:
"""Read all the objects of type `object_type` from the packfile"""
- self.pack_buffer.seek(0)
- for obj in PackInflater.for_pack_data(
- PackData.from_file(self.pack_buffer, self.pack_size)
- ):
- if obj.type_name != object_type:
- continue
- yield obj
+ if self.dumb:
+ yield from self.dumb_fetcher.iter_objects(object_type)
+ else:
+ self.pack_buffer.seek(0)
+ for obj in PackInflater.for_pack_data(
+ PackData.from_file(self.pack_buffer, self.pack_size)
+ ):
+ if obj.type_name != object_type:
+ continue
+ yield obj
def get_contents(self) -> Iterable[BaseContent]:
"""Format the blobs from the git repository as swh contents"""
for raw_obj in self.iter_objects(b"blob"):
if raw_obj.id in self.ref_object_types:
self.ref_object_types[raw_obj.id] = TargetType.CONTENT
yield converters.dulwich_blob_to_content(
raw_obj, max_content_size=self.max_content_size
)
def get_directories(self) -> Iterable[Directory]:
"""Format the trees as swh directories"""
for raw_obj in self.iter_objects(b"tree"):
if raw_obj.id in self.ref_object_types:
self.ref_object_types[raw_obj.id] = TargetType.DIRECTORY
yield converters.dulwich_tree_to_directory(raw_obj, log=self.log)
def get_revisions(self) -> Iterable[Revision]:
"""Format commits as swh revisions"""
for raw_obj in self.iter_objects(b"commit"):
if raw_obj.id in self.ref_object_types:
self.ref_object_types[raw_obj.id] = TargetType.REVISION
yield converters.dulwich_commit_to_revision(raw_obj, log=self.log)
def get_releases(self) -> Iterable[Release]:
"""Retrieve all the release objects from the git repository"""
for raw_obj in self.iter_objects(b"tag"):
if raw_obj.id in self.ref_object_types:
self.ref_object_types[raw_obj.id] = TargetType.RELEASE
yield converters.dulwich_tag_to_release(raw_obj, log=self.log)
def get_snapshot(self) -> Snapshot:
"""Get the snapshot for the current visit.
The main complexity of this function is mapping target objects to their
types, as the `refs` dictionaries returned by the git server only give
us the identifiers for the target objects, and not their types.
The loader itself only knows the types of the objects that it has
fetched from the server (as it has parsed them while loading them to
the archive). As we only fetched an increment between the previous
snapshot and the current state of the server, we are missing the type
information for the objects that would already have been referenced by
the previous snapshot, and that the git server didn't send us. We infer
the type of these objects from the previous snapshot.
"""
branches: Dict[bytes, Optional[SnapshotBranch]] = {}
unfetched_refs: Dict[bytes, bytes] = {}
# Retrieve types from the objects loaded by the current loader
for ref_name, ref_object in self.remote_refs.items():
if ref_name in self.symbolic_refs:
continue
target = hashutil.hash_to_bytes(ref_object.decode())
target_type = self.ref_object_types.get(ref_object)
if target_type:
branches[ref_name] = SnapshotBranch(
target=target, target_type=target_type
)
else:
# The object pointed at by this ref was not fetched, supposedly
# because it existed in the base snapshot. We record it here,
# and we can get it from the base snapshot later.
unfetched_refs[ref_name] = target
dangling_branches = {}
# Handle symbolic references as alias branches
for ref_name, target in self.symbolic_refs.items():
branches[ref_name] = SnapshotBranch(
target_type=TargetType.ALIAS, target=target,
)
if target not in branches and target not in unfetched_refs:
# This handles the case where the pointer is "dangling".
# There's a chance that a further symbolic reference
# override this default value, which is totally fine.
dangling_branches[target] = ref_name
branches[target] = None
if unfetched_refs:
# Handle inference of object types from the contents of the
# previous snapshot
unknown_objects = {}
base_snapshot_reverse_branches = {
branch.target: branch
for branch in self.base_snapshot.branches.values()
if branch and branch.target_type != TargetType.ALIAS
}
for ref_name, target in unfetched_refs.items():
branch = base_snapshot_reverse_branches.get(target)
branches[ref_name] = branch
if not branch:
unknown_objects[ref_name] = target
if unknown_objects:
# This object was referenced by the server; We did not fetch
# it, and we do not know it from the previous snapshot. This is
# likely a bug in the loader.
raise RuntimeError(
"Unknown objects referenced by remote refs: %s"
% (
", ".join(
f"{name.decode()}: {hashutil.hash_to_hex(obj)}"
for name, obj in unknown_objects.items()
)
)
)
utils.warn_dangling_branches(
branches, dangling_branches, self.log, self.origin_url
)
self.snapshot = Snapshot(branches=branches)
return self.snapshot
def load_status(self) -> Dict[str, Any]:
"""The load was eventful if the current snapshot is different to
the one we retrieved at the beginning of the run"""
eventful = False
if self.base_snapshot and self.snapshot:
eventful = self.snapshot.id != self.base_snapshot.id
elif self.snapshot:
eventful = bool(self.snapshot.branches)
return {"status": ("eventful" if eventful else "uneventful")}
if __name__ == "__main__":
import click
logging.basicConfig(
level=logging.DEBUG, format="%(asctime)s %(process)d %(message)s"
)
@click.command()
@click.option("--origin-url", help="Origin url", required=True)
@click.option("--base-url", default=None, help="Optional Base url")
@click.option(
"--ignore-history/--no-ignore-history",
help="Ignore the repository history",
default=False,
)
def main(origin_url: str, base_url: str, ignore_history: bool) -> Dict[str, Any]:
from swh.storage import get_storage
storage = get_storage(cls="memory")
loader = GitLoader(
storage, origin_url, base_url=base_url, ignore_history=ignore_history,
)
return loader.load()
main()
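When `self.dumb` is set, `fetch_data` hands object discovery to `GitObjectsFetcher.fetch_object_ids`, which walks the commit graph from the wanted refs and stops at heads already present in the base snapshot. The traversal can be sketched on an in-memory graph. This is an illustration under assumptions: `commits_to_load` is a hypothetical helper, commits are plain strings, and `parents` stands in for the parent lists that the real code reads from fetched `Commit` objects.

```python
from typing import Dict, List, Set


def commits_to_load(
    wants: List[str], parents: Dict[str, List[str]], archived_heads: Set[str]
) -> Set[str]:
    """Collect commits reachable from ``wants``, stopping at archived heads."""
    seen: Set[str] = set(wants)
    stack = list(wants)
    while stack:
        commit = stack.pop()
        for parent in parents.get(commit, []):
            # skip commits already queued in this load or archived as a head
            if parent not in seen and parent not in archived_heads:
                seen.add(parent)
                stack.append(parent)
    return seen


# linear history d -> c -> b -> a, where b was a head in the previous snapshot
history = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
new_commits = commits_to_load(["d"], history, archived_heads={"b"})
```

As in the diff, the walk is only cut at archived heads. An archived commit that is not itself a head of the base snapshot would still be visited if it were reached through another path.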
diff --git a/swh/loader/git/tests/test_loader.py b/swh/loader/git/tests/test_loader.py
index f18d9c2..a40e4b1 100644
--- a/swh/loader/git/tests/test_loader.py
+++ b/swh/loader/git/tests/test_loader.py
@@ -1,119 +1,270 @@
# Copyright (C) 2018-2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
+from functools import partial
+from http.server import HTTPServer, SimpleHTTPRequestHandler
import os
+import subprocess
+from threading import Thread
from dulwich.errors import GitProtocolError, NotGitRepository, ObjectFormatException
+from dulwich.porcelain import push
import dulwich.repo
import pytest
from swh.loader.git.loader import GitLoader
from swh.loader.git.tests.test_from_disk import FullGitLoaderTests
-from swh.loader.tests import assert_last_visit_matches, prepare_repository_from_archive
+from swh.loader.tests import (
+ assert_last_visit_matches,
+ get_stats,
+ prepare_repository_from_archive,
+)
class CommonGitLoaderNotFound:
@pytest.fixture(autouse=True)
def __inject_fixtures(self, mocker):
"""Inject required fixtures in unittest.TestCase class
"""
self.mocker = mocker
@pytest.mark.parametrize(
"failure_exception",
[
GitProtocolError("Repository unavailable"), # e.g DMCA takedown
GitProtocolError("Repository not found"),
GitProtocolError("unexpected http resp 401"),
NotGitRepository("not a git repo"),
],
)
def test_load_visit_not_found(self, failure_exception):
"""Ingesting an unknown url result in a visit with not_found status
"""
# simulate an initial communication error (e.g no repository found, ...)
mock = self.mocker.patch(
"swh.loader.git.loader.GitLoader.fetch_pack_from_origin"
)
mock.side_effect = failure_exception
res = self.loader.load()
assert res == {"status": "uneventful"}
assert_last_visit_matches(
self.loader.storage,
self.repo_url,
status="not_found",
type="git",
snapshot=None,
)
@pytest.mark.parametrize(
"failure_exception",
[IOError, ObjectFormatException, OSError, ValueError, GitProtocolError,],
)
def test_load_visit_failure(self, failure_exception):
"""Failing during the fetch pack step result in failing visit
"""
# simulate a fetch communication error after the initial connection
# server error (e.g IOError, ObjectFormatException, ...)
mock = self.mocker.patch(
"swh.loader.git.loader.GitLoader.fetch_pack_from_origin"
)
mock.side_effect = failure_exception("failure")
res = self.loader.load()
assert res == {"status": "failed"}
assert_last_visit_matches(
self.loader.storage,
self.repo_url,
status="failed",
type="git",
snapshot=None,
)
class TestGitLoader(FullGitLoaderTests, CommonGitLoaderNotFound):
"""Prepare a git directory repository to be loaded through a GitLoader.
This tests all git loader scenario.
"""
@pytest.fixture(autouse=True)
def init(self, swh_storage, datadir, tmp_path):
archive_name = "testrepo"
archive_path = os.path.join(datadir, f"{archive_name}.tgz")
tmp_path = str(tmp_path)
self.repo_url = prepare_repository_from_archive(
archive_path, archive_name, tmp_path=tmp_path
)
self.destination_path = os.path.join(tmp_path, archive_name)
self.loader = GitLoader(swh_storage, self.repo_url)
self.repo = dulwich.repo.Repo(self.destination_path)
class TestGitLoader2(FullGitLoaderTests, CommonGitLoaderNotFound):
"""Mostly the same loading scenario but with a base-url different than the repo-url.
To walk slightly different paths, the end result should stay the same.
"""
@pytest.fixture(autouse=True)
def init(self, swh_storage, datadir, tmp_path):
archive_name = "testrepo"
archive_path = os.path.join(datadir, f"{archive_name}.tgz")
tmp_path = str(tmp_path)
self.repo_url = prepare_repository_from_archive(
archive_path, archive_name, tmp_path=tmp_path
)
self.destination_path = os.path.join(tmp_path, archive_name)
base_url = f"base://{self.repo_url}"
self.loader = GitLoader(swh_storage, self.repo_url, base_url=base_url)
self.repo = dulwich.repo.Repo(self.destination_path)
+
+
+class DumbGitLoaderTestBase(FullGitLoaderTests):
+ """Prepare a git repository to be loaded using the HTTP dumb transfer protocol.
+ """
+
+ @pytest.fixture(autouse=True)
+ def init(self, swh_storage, datadir, tmp_path):
+ # remove any proxy settings in order to successfully spawn a local HTTP server
+ http_proxy = os.environ.get("http_proxy")
+ https_proxy = os.environ.get("https_proxy")
+ if http_proxy:
+ del os.environ["http_proxy"]
+ if https_proxy:
+ del os.environ["https_proxy"]
+
+        # prepare test base repository using smart transfer protocol
+        archive_name = "testrepo"
+        archive_path = os.path.join(datadir, f"{archive_name}.tgz")
+        tmp_path = str(tmp_path)
+        base_repo_url = prepare_repository_from_archive(
+            archive_path, archive_name, tmp_path=tmp_path
+        )
+        destination_path = os.path.join(tmp_path, archive_name)
+        self.destination_path = destination_path
+        with_pack_files = self.with_pack_files
+
+        if with_pack_files:
+            # create a bare clone of that repository in another folder,
+            # all objects will be contained in one or two pack files in that case
+            bare_repo_path = os.path.join(tmp_path, archive_name + "_bare")
+            subprocess.run(
+                ["git", "clone", "--bare", base_repo_url, bare_repo_path], check=True,
+            )
+        else:
+            # otherwise serve objects from the bare repository located in
+            # the .git folder of the base repository
+            bare_repo_path = os.path.join(destination_path, ".git")
+
+        # spawn local HTTP server that will serve the bare repository files
+        hostname = "localhost"
+        handler = partial(SimpleHTTPRequestHandler, directory=bare_repo_path)
+        httpd = HTTPServer((hostname, 0), handler, bind_and_activate=True)
+
+        def serve_forever(httpd):
+            with httpd:
+                httpd.serve_forever()
+
+        thread = Thread(target=serve_forever, args=(httpd,))
+        thread.start()
+
+        repo = dulwich.repo.Repo(self.destination_path)
+
+        class DumbGitLoaderTest(GitLoader):
+            def load(self):
+                """
+                Override load method to ensure the bare repository will be synchronized
+                with the base one as tests can modify its content.
+                """
+                if with_pack_files:
+                    # ensure HEAD ref will be the same for both repositories
+                    with open(os.path.join(bare_repo_path, "HEAD"), "wb") as fw:
+                        with open(
+                            os.path.join(destination_path, ".git/HEAD"), "rb"
+                        ) as fr:
+                            head_ref = fr.read()
+                            fw.write(head_ref)
+
+                    # push possibly modified refs in the base repository to the bare one
+                    for ref in repo.refs.allkeys():
+                        if ref != b"HEAD" or head_ref in repo.refs:
+                            push(
+                                repo,
+                                remote_location=f"file://{bare_repo_path}",
+                                refspecs=ref,
+                            )
+
+                # generate or update the info/refs file used in dumb protocol
+                subprocess.run(
+                    ["git", "-C", bare_repo_path, "update-server-info"], check=True,
+                )
+
+                return super().load()
+
+        # bare repository with dumb protocol only URL
+        self.repo_url = f"http://{httpd.server_name}:{httpd.server_port}"
+        self.loader = DumbGitLoaderTest(swh_storage, self.repo_url)
+        self.repo = repo
+
+        yield
+
+        # shutdown HTTP server
+        httpd.shutdown()
+        thread.join()
+
+        # restore HTTP proxy settings if any
+        if http_proxy:
+            os.environ["http_proxy"] = http_proxy
+        if https_proxy:
+            os.environ["https_proxy"] = https_proxy
+
+    @pytest.mark.parametrize(
+        "failure_exception", [AttributeError, NotImplementedError, ValueError]
+    )
+    def test_load_despite_dulwich_exception(self, mocker, failure_exception):
+        """Checks that the repository can still be loaded when dulwich raises an
+        exception on encountering a repository served over the dumb transfer protocol.
+        """
+
+        fetch_pack_from_origin = mocker.patch(
+            "swh.loader.git.loader.GitLoader.fetch_pack_from_origin"
+        )
+
+        fetch_pack_from_origin.side_effect = failure_exception("failure")
+
+        res = self.loader.load()
+
+        assert res == {"status": "eventful"}
+
+        stats = get_stats(self.loader.storage)
+        assert stats == {
+            "content": 4,
+            "directory": 7,
+            "origin": 1,
+            "origin_visit": 1,
+            "release": 0,
+            "revision": 7,
+            "skipped_content": 0,
+            "snapshot": 1,
+        }
+
+
+class TestDumbGitLoaderWithPack(DumbGitLoaderTestBase):
+    @classmethod
+    def setup_class(cls):
+        cls.with_pack_files = True
+
+
+class TestDumbGitLoaderWithoutPack(DumbGitLoaderTestBase):
+    @classmethod
+    def setup_class(cls):
+        cls.with_pack_files = False