Sep 22 2021
Sep 21 2021
Sep 20 2021
Fix [1] first
Sep 17 2021
Sep 1 2021
In T3544#69746, @olasd wrote: I can see a few alternatives to using git:// over tcp:
- Give our swh bot accounts SSH keys, and use that to clone from GitHub over ssh.
The dulwich HTTP(s) support is implemented on top of urllib(3?).
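For reference, a minimal sketch of how dulwich picks its transport per URL scheme (illustrative only; the repository URLs are placeholders and this is not the loader's code). The HTTP(S) client is layered on urllib3, git:// speaks the plain pack protocol over TCP, and ssh:// would rely on the bot account's SSH key.

```python
# Minimal sketch: dulwich chooses a different client class per URL scheme.
from dulwich.client import get_transport_and_path

for url in (
    "https://github.com/example/repo.git",    # HTTP(S) client, urllib3-based
    "git://github.com/example/repo.git",      # TCPGitClient, plain git protocol
    "ssh://git@github.com/example/repo.git",  # SSHGitClient, needs an SSH key for GitHub
):
    client, path = get_transport_and_path(url)
    print(url, "->", type(client).__name__)
```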
Aug 10 2021
Another example in production: during the stop phase of a worker, the loader was alone on the server (with 12GB of RAM) and was OOM-killed:
```
Aug 10 08:53:24 worker05 python3[871]: [2021-08-10 08:53:24,745: INFO/ForkPoolWorker-1] Load origin 'https://github.com/evands/Specs' with type 'git'
Aug 10 08:54:17 worker05 python3[871]: [62B blob data]
Aug 10 08:54:17 worker05 python3[871]: [586B blob data]
Aug 10 08:54:17 worker05 python3[871]: [473B blob data]
Aug 10 08:54:29 worker05 python3[871]: Total 782419 (delta 6), reused 5 (delta 5), pack-reused 782401
Aug 10 08:54:29 worker05 python3[871]: [2021-08-10 08:54:29,044: INFO/ForkPoolWorker-1] Listed 6 refs for repo https://github.com/evands/Specs
Aug 10 08:59:21 worker05 kernel: [ 871] 1004 871 247194 161634 1826816 46260 0 python3
Aug 10 09:08:29 worker05 systemd[1]: swh-worker@loader_git.service: Unit process 871 (python3) remains running after unit stopped.
Aug 10 09:15:29 worker05 kernel: [ 871] 1004 871 412057 372785 3145728 0 0 python3
Aug 10 09:16:57 worker05 kernel: [ 871] 1004 871 823648 784496 6443008 0 0 python3
Aug 10 09:24:44 worker05 kernel: CPU: 2 PID: 871 Comm: python3 Not tainted 5.10.0-0.bpo.7-amd64 #1 Debian 5.10.40-1~bpo10+1
Aug 10 09:24:44 worker05 kernel: [ 871] 1004 871 2800000 2760713 22286336 0 0 python3
Aug 10 09:24:44 worker05 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0-2,oom_memcg=/system.slice/system-swh\x2dworker.slice,task_memcg=/system.slice/system-swh\x2dworker.slice/swh-worker@loader_git.service,task=python3,pid=871,uid=1004
Aug 10 09:24:44 worker05 kernel: Memory cgroup out of memory: Killed process 871 (python3) total-vm:11200000kB, anon-rss:11038844kB, file-rss:4008kB, shmem-rss:0kB, UID:1004 pgtables:21764kB oom_score_adj:0
Aug 10 09:24:45 worker05 kernel: oom_reaper: reaped process 871 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```
Aug 9 2021
[3] possibly T2373
Aug 5 2021
It's exactly the same issue AFAIK
For information, @vlorentz opened a related issue in dulwich [1].
Aug 4 2021
Jul 13 2021
I wonder if it would be worth submitting these recursive origins with "save code now", so we can try to get submodule updates close to the update of the main repository.
Jul 12 2021
I also wonder if we have a somewhat common approach to handle the SVN externals as well.
I think this is worthwhile in general, at least for repositories that are still live.
May 6 2021
In T3311#64737, @vlorentz wrote: I think the only issue with (3) is not being retroactive.
This is a good idea, thanks for raising it.
Apr 14 2021
Apr 5 2021
Apr 4 2021
I am just here to say: swh-loader-git doesn't have a CONTRIBUTORS file. You may ask the contributor to add it as well :)
Mar 15 2021
Mar 5 2021
Mar 1 2021
Lowering task priority to normal, nothing critical here.
Feb 3 2021
After mulling this over with @zack, and looking at the starved worker logs for a while, I suspect that we're also being bitten by our (early, early) choice of using celery acks_late, which only acknowledges tasks when they're done: when a worker is OOM-killed, it will never send task acknowledgements to rabbitmq, which will keep re-sending it the tasks.
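For clarity, here is what that setting looks like in a minimal Celery app (a sketch for illustration only, not the production configuration; the app name and broker URL are placeholders): with acks_late, the acknowledgement is sent only after the task finishes, so an OOM-killed worker never acks and the broker re-delivers the task.

```python
# Sketch only, not the production config: with task_acks_late the worker
# acknowledges a task after it has run to completion, so a worker killed by
# the OOM killer never sends the ack and RabbitMQ re-delivers the same task.
from celery import Celery

app = Celery("swh", broker="amqp://")  # placeholder broker URL
app.conf.task_acks_late = True         # ack after execution, not on receipt
```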
My current workaround attempt is switching pack fetches from https://github.com/* to git://github.com/*, transparently in the git loader; dulwich's git over TCP transport doesn't have to do the same "double-buffering" as the https transport, so it should allow us to fail earlier (hopefully without involving the oom killer).
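The rewrite described above could look roughly like the following (a hypothetical sketch; the helper name is made up and this is not the actual loader change):

```python
# Hypothetical sketch of the workaround described above: rewrite GitHub
# HTTPS URLs to the plain git protocol before fetching packs.
def rewrite_github_url(url: str) -> str:
    prefix = "https://github.com/"
    if url.startswith(prefix):
        return "git://github.com/" + url[len(prefix):]
    return url


assert rewrite_github_url("https://github.com/example/repo") == "git://github.com/example/repo"
```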
Attempts at mitigating the issue:
Jan 13 2021
Jan 7 2021
In T2926#56128, @rdicosmo wrote: Thanks Antoine, any way to have this kind of errors also reported in the admin dashboard for save code now?
For the record, the load failure on 2021-01-04T17:05:11Z was due to a network error (found via Kibana):
Jan 6 2021
The repository was correctly ingested on 05 January 2021, 11:56 UTC.
Jan 4 2021
Oct 16 2020
Sep 24 2020
I don't think so; the loader is storing the data elsewhere, but it still doesn't write the archive type in each of these entries.
Sep 22 2020
I suspect that this is superseded by work done by @vlorentz for the extrinsic metadata store.
I am running some of the sources on production. I have submitted the guix and nixpkgs repositories through "save code now"; I could also add the linux kernel (if the visit is old enough).
Sep 21 2020
I have opened a "fresher" dashboard on kibana with the errors (grouped by error message as kibana filters; they need toggling on/off to actually see them) [1]
I think we need to cross-reference those filtered messages with sentry to actually have some context though... (as we don't really have any with that board...).
fwiw, loader-core v0.11.0 deployed in production.
In T2373#49214, @ardumont wrote: fwiw, loader-core v0.11.0 deployed in production.
Sep 20 2020
I can confirm that with the current master HEAD of swh-loader-core (452fa224f9ca635a979cf1a8e98c88bb560ca98a), loading of the Linux kernel repo no longer OOMs.
(It failed after ~24 hours, but apparently for unrelated reasons.)
Sep 18 2020
Status on this: loader-core has been tagged 0.11.0, which includes D3976.
Build is green
rebase
Build is green
Sep 17 2020
Adding pagination to these endpoints seems like overkill.
In T2373#48877, @ardumont wrote: So the content_missing call explodes mid-air client side (`"POST /content/missing HTTP/1.1" 200 9475383`, so the client received the data). It so happens that the content_missing api takes an unlimited number of byte ids as input [1] and then "tries" to stream the results to the client (the rpc layer in the middle makes that moot).
FTR, in a test setup I made a few days ago on docker, I had a git loader crunching ~28GB of RES mem (out of the 32 available on that machine). Not sure which repo it was ingesting, but it was on codeberg.
Very likely the same issue, thanks @ardumont!
Given what @olasd said in that issue (the ingestion logic having remained pretty much the same since ever), and that I can confirm linux.git was loading just fine on my laptop no more than a year ago, the increased memory usage probably comes from elsewhere.
Anyway, it looks like a potentially important issue, so I'm raising priority and also removing the association with the docker env (as you could also reproduce this on staging).
Possibly related to T2373.
Sep 16 2020
Sep 11 2020
Let's call it fixed (until further notice).