Rescheduled the lister instance to scrape clojars; it now continues on, skipping (rather than failing) when it cannot retrieve artifact information:
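The skip-instead-of-fail behavior can be sketched roughly as follows. This is purely illustrative, assuming a hypothetical fetch_artifact callable; it is not the actual swh-lister code:

```python
# Sketch of the skip-instead-of-fail pattern, assuming a hypothetical
# fetch_artifact callable; not the actual swh-lister implementation.
import logging

logger = logging.getLogger(__name__)


def list_artifacts(artifact_ids, fetch_artifact):
    """Yield artifact info, skipping (and logging) retrieval failures."""
    for artifact_id in artifact_ids:
        try:
            yield fetch_artifact(artifact_id)
        except Exception as exc:
            # Log and continue instead of aborting the whole listing run
            logger.warning("Skipping artifact %s: %s", artifact_id, exc)
            continue
```

With this pattern, a 404 on one artifact only drops that artifact from the results instead of killing the whole listing run.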
Apr 14 2022
The maven loader is now ingesting without failing:
Deploy new version:
root@pergamon:~# clush -b -w @staging-workers "dpkg -l python3-swh.lister python3-swh.loader.core"
---------------
worker[0-3].internal.staging.swh.network (4)
---------------
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                    Version              Architecture Description
+++-=======================-====================-============-=================================================================
ii  python3-swh.lister      2.8.0-1~swh2~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)
ii  python3-swh.loader.core 2.6.2-1~swh1~bpo10+1 all          Software Heritage Loader Core
Apr 13 2022
So the gist of the deployment is done; let's fix those lister and loader issues in the dedicated task [1].
The swh-scheduler-scheduler-recurrent service needed a restart to pick up the maven tasks to be loaded.
Maven Central listing is actually ongoing (at least until the lister finds some 404, at which point it will behave the same: crash and stop).
Still, some origins are now present in listed_origins:
Maven central scheduled as well:
Scheduled it and the lister kicked in [1], but it fails on a 404 [2], which stopped the listing.
Apr 11 2022
Triggering the run for maven-central caused an issue due to the high volume of data for that one.
I'll debug some more tomorrow.
Trigger the run for clojars:
root@maven-exporter0:~# systemctl start maven_index_exporter@clojars
root@maven-exporter0:~# systemctl status maven_index_exporter@clojars
● maven_index_exporter@clojars.service - Software Heritage Maven Index Exporter clojars
     Loaded: loaded (/etc/systemd/system/maven_index_exporter@.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/maven_index_exporter@clojars.service.d
             └─parameters.conf
     Active: active (running) since Mon 2022-04-11 16:53:51 UTC; 2s ago
TriggeredBy: ● maven_index_exporter@clojars.timer
   Main PID: 4569 (bash)
      Tasks: 9 (limit: 4675)
     Memory: 56.4M
        CPU: 160ms
     CGroup: /system.slice/system-maven_index_exporter.slice/maven_index_exporter@clojars.service
             ├─4569 bash /usr/local/bin/run_maven_index_exporter.sh clojars
             └─4571 docker run -v /srv/softwareheritage/maven-index-exporter//clojars/work:/work -v /var/www/maven_index_exporter:/publish -e MVN_IDX_EXPORTER_BASE_URL=http://clojars.org/repo/ softwareheritage/maven-index-exporter:v0.2.0
...
root@maven-exporter0:~# ls -lah /var/www/maven_index_exporter/export-clojars.fld
-rwxrwxrwx 1 root root 61M Apr 11 16:54 /var/www/maven_index_exporter/export-clojars.fld
root@maven-exporter0:~# zfs get all | grep compress
data                 compressratio     8.26x   -
data                 compression       off     default
data                 refcompressratio  1.00x   -
data/mvn-idx-publish compressratio     18.88x  -
data/mvn-idx-publish compression       zstd    local
data/mvn-idx-publish refcompressratio  18.88x  -
data/mvn-idx-work    compressratio     5.84x   -
data/mvn-idx-work    compression       zstd    local
data/mvn-idx-work    refcompressratio  5.84x   -
Configure zfs partitions:
root@maven-exporter0:~# lsblk | grep vdb
vdb    254:16   0  50G  0 disk
root@maven-exporter0:~# zpool create -f data /dev/vdb
root@maven-exporter0:~# zpool status
  pool: data
 state: ONLINE
config:
Apr 8 2022
Apr 7 2022
Apr 6 2022
Apr 5 2022
Apr 1 2022
Yes, thx. Will do.
I see there is a lot of progress here, nice!
I try to follow the thread as time allows, but if you're stuck please do not hesitate to notify me.
Mar 28 2022
Mar 24 2022
Mar 22 2022
Mar 17 2022
The worst case scenario is that someone maliciously creates repositories generated on the fly that refer to each other via .gitmodules, so we end up in an infinite loop of loading garbage.
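One common guard against that worst case (purely illustrative, with a hypothetical fetch_submodule_urls hook; not the actual swh-loader-git logic) is a visited set plus a recursion depth cap while following submodule references:

```python
# Illustrative guard against mutually-referencing submodules; a sketch,
# not the actual swh-loader-git implementation.

def load_with_submodules(url, fetch_submodule_urls, visited=None, depth=0,
                         max_depth=10):
    """Return the set of repository URLs loaded, following submodules.

    fetch_submodule_urls(url) -> list of submodule URLs (hypothetical hook).
    The visited set and the depth cap prevent infinite loops on repositories
    whose .gitmodules files refer to each other.
    """
    if visited is None:
        visited = set()
    if url in visited or depth > max_depth:
        return visited
    visited.add(url)
    for sub_url in fetch_submodule_urls(url):
        load_with_submodules(sub_url, fetch_submodule_urls, visited, depth + 1)
    return visited
```

Even if two on-the-fly repositories reference each other via .gitmodules, the traversal terminates after each URL has been seen once.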
Mar 16 2022
In T3311#80997, @olasd wrote: I'm not comfortable always creating high priority tasks in this context either, as I'm not sure what the throttling implications are when we inevitably end up on a repository that references a commit in a submodule that doesn't exist.
I think the approach in D7332 is interesting, but it feels a bit expensive to be doing it for every instance of a .gitmodules file found in any new directory for all git repos that are being loaded, as well as doing it again for the top level of any known branch in the git snapshot being loaded currently.
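A cheap mitigation for that repeated cost (again only a sketch with a hypothetical parser hook, not the D7332 implementation) would be to memoize the work on the .gitmodules blob hash, so each distinct file is only processed once per load:

```python
# Sketch: memoize .gitmodules parsing by blob hash so each distinct file
# is only processed once per repository load. Hypothetical helper names.
import hashlib


def make_cached_gitmodules_parser(parse_gitmodules):
    """Wrap parse_gitmodules so identical blobs are parsed only once."""
    cache = {}

    def parse(blob_bytes):
        key = hashlib.sha1(blob_bytes).hexdigest()
        if key not in cache:
            cache[key] = parse_gitmodules(blob_bytes)
        return cache[key]

    return parse
```

Since the same .gitmodules content typically recurs across many directories and branches of a repository, keying on the blob hash collapses those repeated checks into one.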
Mar 14 2022
It's been more or less discussed above, but IMHO it would make sense to:
Mar 10 2022
Feb 22 2022
Feb 18 2022
The actual issues were not being reported properly; that's now fixed.
Feb 17 2022
Currently deployed in staging and production.
So future listing will do the right thing.
Feb 16 2022
Feb 15 2022
Could you please also open a diff with the necessary changes required for the docker stack (the swh-environment/docker changes you had to make to actually have the loader run properly)?
Feb 14 2022
Feb 13 2022
Hi @ardumont, sorry for the delay, wild week here. And thanks for the ISO 8601 fix.