Page MenuHomeSoftware Heritage

staging: Running nixguix on guix sources
Closed, MigratedEdits Locked

Description

Scheduling command:

swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add load-nixguix url=http://guix.gnu.org/sources.json
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1224855
  Next run: just now (2020-07-06 14:03:59+00:00)
  Interval: 1 day, 0:00:00
  Type: load-nixguix
  Policy: recurring
  Args:
  Keyword args:
    url: 'http://guix.gnu.org/sources.json'

From irc ping:

15:39 <zimoun> lewo: Hi, the sources.json for Guix is served at http://guix.gnu.org/sources.json so could you add to your staging for testing?
15:40 <+olasd> nice!
15:40 <+olasd> ardumont: ^
15:41 <lewo> zimoun: yeah! Cool ;)
15:42 <+ardumont> ack on this, nice!
15:42 <+ardumont> lemme finish my stuff first ;)

Event Timeline

ardumont triaged this task as Normal priority.Jul 6 2020, 4:06 PM
ardumont created this task.

Patched staging nixguix loader worker with the diffs above on staging and triggered back a run.
It seems to no longer complain.

Let's see how long that will take to finish its run then.

Run completed.

Jul 07 00:35:58 worker1 python3[955117]: [2020-07-07 00:35:58,004: INFO/ForkPoolWorker-1] Task swh.loader.package.nixguix.tasks.LoadNixguix[4f26ecb9-ae86-459d-b5cf-2e360f5074b3] succeeded in 31961.72341875988s: {'status': 'eventful', 'snapshot_id': '869153d018394df0b75789134d87992eb2353bd4'}

Hello!

This is great news, thank you! :-)

@ardumont , https://archive.softwareheritage.org/api/1/snapshot/869153d018394df0b75789134d87992eb2353bd4/ says this particular snapshot could not be found. Am I missing something?

Ludo'.

Second run btw (forgot to hit enter a while back):

Jul 07 12:10:49 worker2 python3[475116]: [2020-07-07 12:10:49,714: INFO/ForkPoolWorker-1] Task swh.loader.package.nixguix.tasks.LoadNixguix[082dd536-6294-421a-881e-e0bf28e94e0b] succeeded in 4497.450984489056s: {'status': 'uneventful', 'snapshot_id': 'ae96e93d0e24fb4ec484d56109c669da0b267908'}

Note: status uneventful with a different snapshot is kinda unexpected for me. Not something drastically problematic though. I'll dig in at some point.

@ardumont , https://archive.softwareheritage.org/api/1/snapshot/869153d018394df0b75789134d87992eb2353bd4/ says this particular snapshot could not be found. Am I missing something?

What you are missing is that archive is the production.
I only ran the loader on the guix source into staging to expose problems if any (well, there were and they are fixed now ;)

Cheers,

Note: status uneventful with a different snapshot is kinda unexpected for me. Not something drastically problematic though. I'll dig in at some point.

@ardumont: did you load the same sources.json? Because http://guix.gnu.org/sources.json is refreshed every X hours and some stats of the commits after 2018-12-05 (v0.16.0) says mean at 21 and median at 13, both per day. And since loading requires ~1h15min, you need some luck to read the same son file twice.

Speaking about stats, the first test took 31961.72341875988s which is ~9h and the second 4497.450984489056s which is ~1h15mn. The difference comes from that the 2nd time, the loader skips a lot of sources because there are already ingested, right? However, between the 2 tests, only few sources have changed so I am a bit doubtful that it takes so "long". Does it mean that this ~1h+ is almost trying to download non-.{tar,gz,bz2} files as .gem files which obviously fail like @lewo has explained?

Note: status uneventful with a different snapshot is kinda unexpected for me. Not something drastically problematic though. I'll dig in at some point.

To clarify, I would have expected the loader to say "eventful".
As the snapshot is different, that means it loaded some new tarballs.

@ardumont: did you load the same sources.json?

yes.
The scheduled origin to load is the sources.json url you gave to me.
So that's what the loader is regularly passing on (expectedtly that fill changes over time ;).

Because http://guix.gnu.org/sources.json is refreshed every X hours and some stats of the commits after 2018-12-05 (v0.16.0) says mean at 21 and median at 13, both per day.
And since loading requires ~1h15min, you need some luck to read the same son file twice.

ok.

Speaking about stats, the first test took 31961.72341875988s which is ~9h and the second 4497.450984489056s which is ~1h15mn. The difference comes from that the 2nd time, the loader skips a lot of sources because there are already ingested, right?

Yes.

However, between the 2 tests, only few sources have changed so I am a bit doubtful that it takes so "long". Does it mean that this ~1h+ is almost trying to download non-.{tar,gz,bz2} files as .gem files which obviously fail like @lewo has explained?

Most probably, yes.
There is no filtering on a per extension basis within the loader itself.
As mentioned in T1991, there is still room for improvments ;)

loader-core 0.9.0 which includes T2510 improvment got deployed on staging to see if that improves time/performance.
(both run for guix and nix sources)

I'll check when i'm getting back (heading for some rest now ;)

Cheers,

Seems to have reduced the cost (from ~4500s to ~1500s) but there might still be margin for improvments [1]
For one, the extensions to skip were not finely analyzed (from the top of my head, we could add ".el' extensions to filter out for example).

That's still progress though ;)

[1]

Aug 08 08:22:22 worker0 python3[2927840]: [2020-08-08 08:22:22,603: INFO/ForkPoolWorker-2] Task swh.loader.package.nixguix.tasks.LoadNixguix[cb6353a5-c8fc-4fa6-ac84-5670630686f3] 
succeeded in 1539.961107660085s: {'status': 'eventful', 'snapshot_id': 'f8851ebb6b03bec0605588f430b3ca7ccc1b4d5f'}

Aug 08 11:03:37 worker0 python3[2927840]: [2020-08-08 11:03:37,919: INFO/ForkPoolWorker-2] Task swh.loader.package.nixguix.tasks.LoadNixguix[af68c000-176c-4554-b32b-3db83634a6ec] 
succeeded in 1444.3805841361172s: {'status': 'eventful', 'snapshot_id': '8060598856f3e209381b604c5af3cce698b206f3'}

For one, the extensions to skip were not finely analyzed (from the top of my head, we could add ".el' for example).

work is in progress to improve that part both analysis [1] and ease of maintenance [2]

[1] https://forge.softwareheritage.org/D3746

[2] https://forge.softwareheritage.org/D3745

fwiw, the nix sources benefit from this as well

Aug 08 06:25:21 worker0 python3[2927840]: [2020-08-08 06:25:21,761: INFO/ForkPoolWorker-2] Task swh.loader.package.nixguix.tasks.LoadNixguix[7bcb0acc-d501-4aa7-9105-3188bb958142] 
succeeded in 962.0553193497472s: {'status': 'eventful', 'snapshot_id': '4034058e8183d95cc2bbcac72c0d77f64f89d1f9'}

Aug 08 11:22:41 worker0 python3[2927840]: [2020-08-08 11:22:41,402: INFO/ForkPoolWorker-2] Task swh.loader.package.nixguix.tasks.LoadNixguix[c7deac62-dc7e-48b2-82f1-bd89c14e2366] 
succeeded in 971.4250831687823s: {'status': 'eventful', 'snapshot_id': '4034058e8183d95cc2bbcac72c0d77f64f89d1f9'}
ardumont claimed this task.