
Inject antepedia contents (backuped in sesi and not in swh) into swh
Closed, Migrated

Event Timeline

ardumont renamed this task from Download antepedia's sesi contents not in swh to Download antepedia contents (backuped in sesi) not in swh and injects in swh. Mar 7 2016, 10:54 AM
ardumont changed the task status from Open to Work in Progress.
ardumont claimed this task.

worker01 is producing messages in the swh_antelink_sesi_download queue.
Each message represents a batch to scp: either roughly 100 MiB of compressed files or 1000 compressed files.
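
A minimal sketch of how such batches could be built. This is not the actual producer code: the function name, the (path, size) input shape and the rule "close the batch as soon as either limit would be exceeded" are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_BATCH_BYTES = 100 * 1024 * 1024  # roughly 100 MiB per message
MAX_BATCH_FILES = 1000               # or 1000 files per message


def batch_files(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = MAX_BATCH_BYTES,
    max_files: int = MAX_BATCH_FILES,
) -> Iterator[List[str]]:
    """Group (path, size) pairs into batches, one batch per queue message."""
    batch: List[str] = []
    batch_bytes = 0
    for path, size in files:
        # Assumption: close the current batch before it would exceed a limit.
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_files):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:
        # Emit the last, possibly partial, batch.
        yield batch
```

Each yielded list would then become the payload of one message in the queue.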

worker01 is ingesting those messages.
Each job performs the following steps:

  • download a block of files from sesi to local tmp storage
  • compute checksums and filter out the files that may have been corrupted in the meantime
  • store them in swh
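
The checksum step could look roughly like the sketch below. It assumes the expected checksums are already known as hex-encoded SHA-1 values keyed by local path. Both the hash algorithm and the function names are illustrative assumptions, not the real worker code.

```python
import hashlib
from typing import Dict, List, Tuple


def sha1_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that large files are never fully loaded in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def split_by_checksum(expected: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (intact, corrupted) paths given {path: expected_sha1_hex}."""
    intact: List[str] = []
    corrupted: List[str] = []
    for path, expected_sha1 in expected.items():
        if sha1_of_file(path) == expected_sha1:
            intact.append(path)
        else:
            corrupted.append(path)
    return intact, corrupted
```

Only the paths in the first list would go on to the storage step; the second list can be logged for later inspection.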

For now, only files with size <= 100 MiB are included.

51056 files with size > 100 MiB will remain to be injected after the first batch is consumed.

Multiple problems:

  • slow
  • ssh connection problems when too many jobs run concurrently (either with ssh multiplexing or with plain scp).

So:
I have stopped the current sesi injection and purged the queue.

Analysis:
What we want to do is inject the files into swh, plain and simple.
The fact that the files are remote is an implementation detail.
So, following the separation of concerns principle, why not use a mount point?
sshfs seems to address exactly this need.

Consequently:
I have adapted sesidownloader to read files from /antelink, an sshfs mount of sesi:/antelink.
So far, so good.

The new jobs are now being re-injected into the sesi queue.

Pros:

  • separation of concerns
  • no need to separate policy between files (the big files were kept outside the scope for later)

Cons or concern:
The only concern is making sure the sshfs mount point does not drop (the way NFS mounts sometimes do...)
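
One possible way to guard against a dropped mount is a small check run before each job. The sketch below is an assumption, not something sesidownloader is known to do; the function name is hypothetical and only standard-library calls are used.

```python
import os


def mount_is_alive(mount_point: str = "/antelink") -> bool:
    """Return True if mount_point is currently mounted and can be listed."""
    # os.path.ismount returns False for a plain directory, which is what
    # remains once the sshfs mount has been dropped.
    if not os.path.ismount(mount_point):
        return False
    try:
        # A broken mount can still look mounted while failing on access.
        os.listdir(mount_point)
    except OSError:
        return False
    return True
```

A worker could call this before picking up a message and skip or retry when it returns False. Note that a hung network mount may make these calls block rather than fail, so this check does not cover every failure mode.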

ardumont renamed this task from Download antepedia contents (backuped in sesi) not in swh and injects in swh to Inject antepedia contents (backuped in sesi and not in swh) into swh. Mar 8 2016, 7:31 PM
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM
gitlab-migration changed the task status from Resolved to Migrated. Jan 8 2023, 4:18 PM
gitlab-migration claimed this task.
gitlab-migration changed the status of subtask T317: Inject sesi files hashes in antelink db from Resolved to Migrated.
gitlab-migration changed the status of subtask T320: sesi content downloader and injection in swh worker from Resolved to Migrated.
gitlab-migration added a subscriber: gitlab-migration.