Reopening, as I'm still refactoring/cleaning up more modules.
Jul 16 2020
Jul 10 2020
Jul 9 2020
Jul 8 2020
Jul 7 2020
In the end, it's mostly dead code, since it's only exercised when the storage backend is an in-memory instance.
This is no longer the case; the tests now use a pg-storage instance.
Jul 6 2020
Jun 19 2020
The heuristic you're talking about only applies to branches whose names start with refs/. All other branches are passed through unscathed, I think (which is a good thing, because most snapshots we generate as swh don't use refs/).
As a related data point, the current graph export code applies the following heuristic to decide which outbound edges from snapshot nodes to emit (see the sketch after this list):
- keep branch names starting with refs/heads/
- keep branch names starting with refs/tags/
- drop everything else
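For illustration, here is a minimal sketch of that export-side filter; the helper name and the example values are hypothetical, not the actual graph export code:

```python
# Minimal sketch of the branch filter described above (hypothetical helper,
# not the actual swh graph export code). Branch names are shown as bytes,
# as swh.model stores them.
KEPT_PREFIXES = (b"refs/heads/", b"refs/tags/")


def keep_snapshot_branch(branch_name: bytes) -> bool:
    """Decide whether an outbound snapshot edge should be emitted."""
    return branch_name.startswith(KEPT_PREFIXES)


# Example: only refs/heads/ and refs/tags/ branches survive.
branches = [b"refs/heads/master", b"refs/tags/v1.0", b"HEAD", b"master"]
print([b for b in branches if keep_snapshot_branch(b)])
# [b'refs/heads/master', b'refs/tags/v1.0']
```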
We still need to try to ingest the zeq2 repo, but that can be done in a follow-up task.
May 30 2020
The following repositories failed to import. Their on-disk structure is either completely empty, or only contains refs (no actual git objects stored):
May 29 2020
After the first (naive, I guess) pass, 1470 repositories are still missing.
May 19 2020
The code for loading git repositories from disk hasn't been run in production in a while, so I've decided to run the imports of the missing repos manually.
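As a side note, a quick way to check whether a bare repository on disk actually contains git objects (and not just refs) is something like the sketch below; the root path is hypothetical and this is not the script that was actually used:

```python
# Sketch: flag bare repositories that hold refs but no actual git objects.
# The root path below is hypothetical.
import os


def has_git_objects(repo_path: str) -> bool:
    """True if objects/ contains loose objects or at least one packfile."""
    objects_dir = os.path.join(repo_path, "objects")
    if not os.path.isdir(objects_dir):
        return False
    for entry in os.scandir(objects_dir):
        if entry.name == "pack" and entry.is_dir():
            if any(f.name.endswith(".pack") for f in os.scandir(entry.path)):
                return True
        elif entry.is_dir() and entry.name != "info":
            # Loose-object fan-out directory ("00".."ff") with content.
            if any(os.scandir(entry.path)):
                return True
    return False


root = "/srv/storage/gitorious"  # hypothetical location of the on-disk repos
for name in sorted(os.listdir(root)):
    repo = os.path.join(root, name)
    if os.path.isdir(repo) and not has_git_objects(repo):
        print("empty or refs-only:", name)
```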
We also have a single origin with no full visit:
After dumping all origins in the archive starting with https://gitorious.org/:
Apr 28 2020
Currently running this again with debug logs...
Thanks for the input.
Reading this again, and seeing that the workers have 16GB of RAM, there's something weird going on that's not related to the size of the packfile (which is 2GB max).
The base logic of the git loader regarding packfiles hasn't really been touched since it was first implemented: it's never really been profiled or optimized with respect to its memory usage. This issue isn't specific to the staging infra; it's just more salient there because the workers were provisioned with tight constraints.
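If someone wants a first-order measurement before optimizing anything, a minimal sketch with tracemalloc could look like this (`loader.fetch_pack` stands in for whichever loader step is under suspicion and is not an actual swh-loader-git call):

```python
# Sketch: measure Python-level peak memory around a suspect step with tracemalloc.
# Note: tracemalloc only accounts for allocations made through Python's
# allocator, so memory held by C extensions won't show up here.
import tracemalloc


def profile_peak(fn, *args, **kwargs):
    """Run fn and report current/peak traced memory in MiB."""
    tracemalloc.start()
    try:
        return fn(*args, **kwargs)
    finally:
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"current={current / 2**20:.1f} MiB, peak={peak / 2**20:.1f} MiB")


# Usage (hypothetical): profile_peak(loader.fetch_pack, origin_url)
```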
Apr 22 2020
[2] I will add some swap to that node to check whether the loading gets further with it.
Apr 21 2020
Apr 15 2020
I ran into the exact same situation when I updated the mercurial loader to use swh-model objects.
Build is green
Improve the git commit
Jan 22 2020
I agree that this may be a useful optimization for some upstreams where getting the state of the remote repository is expensive.
Jan 21 2020
Nov 19 2019
This has been fixed by cb42fea77070.
Reproduced.
Nov 15 2019
Nov 5 2019
Note that this doesn't solve the question of pulling release notes from e.g. GitHub release pages, which is something that would need to be done by some other component (T17 comes to mind).
Oct 1 2019
Sep 30 2019
To ease the analysis, here is an aggregate of the latest failures as of 09/2019:
New dashboards with latest errors as of 09/2019 [1]
Sep 10 2019
I've backported dulwich 0.19.13-1 to our stretch repo and upgraded all workers; they're restarting now.
Sep 7 2019
And nice work on the investigation and the fix within dulwich ;)
Sep 6 2019
May 25 2019
This is done; I've forked off the part about consistently documenting configuration options into T1758.
Feb 5 2019
That's a fairly large repo (as can be seen from how the content bundles get spread out to limit their size). It looks like it has some large directories (e.g. the .bugs directory seems to have a lot of entries), so I'm not too surprised.