Jan 8 2023
Dec 15 2022
Dec 1 2022
Nov 4 2022
swh.loader.git 2.1.0 has now been deployed on all workers.
Nov 3 2022
Oct 19 2022
Jun 21 2022
May 30 2022
May 20 2022
I did some profiling early this week, and found that when incrementally loading a linux fork we already visited:
May 16 2022
In T4219#84994, @vlorentz wrote: This indicates we should load incrementally from the last snapshot of the origin AND the last snapshot of its parent, so we would capture these new commits without reloading half of the parent's history. As @olasd puts it, "that's a (very) lightweight way of doing global deduplication".
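To illustrate the idea, here is a rough sketch of seeding the "already known" set from both snapshots. This is not the actual swh.loader.git code: the `Snapshot` type (a plain branch name to target id mapping) and the functions `known_heads` and `wanted_refs` are hypothetical names made up for the example.

```python
from typing import Dict, Optional, Set

# Hypothetical simplified snapshot: branch name -> target object id (hex string).
Snapshot = Dict[str, str]


def known_heads(
    origin_snapshot: Optional[Snapshot],
    parent_snapshot: Optional[Snapshot],
) -> Set[str]:
    """Union of branch targets from the origin's and its parent's last snapshot."""
    heads: Set[str] = set()
    for snapshot in (origin_snapshot, parent_snapshot):
        if snapshot:  # either snapshot may be missing (first visit, no parent)
            heads.update(snapshot.values())
    return heads


def wanted_refs(remote_refs: Dict[str, str], known: Set[str]) -> Dict[str, str]:
    """Keep only the remote refs whose target is not already known."""
    return {name: target for name, target in remote_refs.items() if target not in known}
```

With this, a fork whose branches mostly point at commits already present in the parent's snapshot would only ask for the refs that are new to both.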
May 13 2022
https://grafana.softwareheritage.org/d/FqGC4zu7z/vlorentz-loader-metrics?orgId=1&var-environment=production&var-interval=1h&var-visit_type=git&var-has_parent_origins=True shows we spend a considerable amount of time loading data from git repositories with an existing visit + a parent:
May 10 2022
Currently can't do it on GitLab while logged out: https://gitlab.com/gitlab-org/gitlab/-/issues/361952
May 6 2022
May 3 2022
May 2 2022
Apr 29 2022
I'm closing this. I've submitted T4216 to track the actual packfile limit issue.
Apr 28 2022
Apr 27 2022
Apr 26 2022
Apr 22 2022
Apr 21 2022
Apr 19 2022
In summary, we would archive everything with priority "high" or "mid", as well as the "license" and "main language" fields, as they are all easy to fetch and store.
Apr 11 2022
dealt with (at least in terms of only using https for clones)
Looks like *what* we want to collect is a solved issue.
Feb 18 2022
Jan 18 2022
Jan 11 2022
I guess this is then related to T3653 somehow
Oh, and now that we've moved workers to have a large swap space, the issue of downloading the full packfile in ram before rejecting it should be less disruptive than it's been in the past (where the whole worker would get killed because it ran out of its memory allocation).
For now, I've disabled our hardcoding of the TCP transport for GitHub origins.
The first GitHub milestone was reached today.
All GitHub loading tasks have been failing since 00:00 UTC.
Nov 15 2021
Oct 21 2021
Sep 2 2021
I updated the task with a breakdown of the cost of getting each piece of information.
Sep 1 2021
In T3544#69746, @olasd wrote: I can see a few alternatives to using git:// over tcp:
- Give our swh bot accounts SSH keys, and use that to clone from GitHub over ssh.
The dulwich HTTP(s) support is implemented on top of urllib(3?).
no and yes, respectively
do we need the "list of forks" if we keep the "fork of what"? I mean these are the 2 ends of the fork relation, right?
Aug 31 2021
"topics" (these are the "tags", right?)
Here's an opinionated and prioritized list.
In T3542#69656, @moranegg wrote: At the moment, I think that all the properties you have selected in the task are needed.
+1 for License (it is something they show on the interface even if it is based on a heuristic).