Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T309 Delete duplicated antelink/antepedia content from s3
Migrated | gitlab-migration | T322 Inject antepedia contents (backuped in sesi and not in swh) into swh
Migrated | gitlab-migration | T317 Inject sesi files hashes in antelink db
Migrated | gitlab-migration | T316 List and compute hashes of actual sesi files
Migrated | gitlab-migration | T320 sesi content downloader and injection in swh worker
Migrated | gitlab-migration | T347 ingest antelink s3 contents in swh
Event Timeline
worker01 is producing messages in the swh_antelink_sesi_download queue.
Each message represents either roughly 100MiB of compressed files or 1000 compressed files to scp.
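For illustration, here is a minimal sketch of what that batching could look like, assuming the producer holds a list of (path, compressed_size) pairs and some send_message publisher; both names are hypothetical and the real producer code may be structured differently.

```python
# Hypothetical producer-side batching: group files into messages of at
# most ~100MiB of compressed data or 1000 files, whichever comes first.
# `send_message` stands in for the real queue publisher.
MAX_BLOCK_SIZE = 100 * 1024 * 1024  # ~100MiB
MAX_BLOCK_FILES = 1000


def produce_blocks(files, send_message):
    """files: iterable of (path, compressed_size) tuples."""
    block, block_size = [], 0
    for path, size in files:
        if block and (block_size + size > MAX_BLOCK_SIZE
                      or len(block) >= MAX_BLOCK_FILES):
            send_message(block)  # one message per block of files
            block, block_size = [], 0
        block.append(path)
        block_size += size
    if block:
        send_message(block)  # flush the last, partially filled block
```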
worker01 is also consuming those messages.
Each job performs the following steps (a sketch follows the list):
- download a block of files from sesi to local temporary storage
- compute their checksums and filter out any files that have since become corrupted
- store the remaining files in swh
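A minimal sketch of such a job body, assuming the expected sha1 of each file is already recorded in the antelink db and that a store_in_swh callable hands contents to the archive; the choice of sha1 and all names here are assumptions for illustration, not the actual worker code.

```python
import hashlib
import os
import subprocess
import tempfile


def process_block(paths, expected_sha1s, store_in_swh):
    """Hypothetical job body: scp a block of files from sesi to a local
    temporary directory, recompute each file's sha1, drop any file whose
    hash no longer matches the one indexed in the antelink db (possible
    corruption), and hand the rest to swh via `store_in_swh`."""
    with tempfile.TemporaryDirectory() as tmpdir:
        for remote_path in paths:
            local_path = os.path.join(tmpdir, os.path.basename(remote_path))
            # step 1: download the file from sesi to local tmp storage
            subprocess.run(
                ['scp', 'sesi:%s' % remote_path, local_path], check=True)
            # step 2: recompute the checksum and filter corrupted files
            with open(local_path, 'rb') as f:
                data = f.read()
            if hashlib.sha1(data).hexdigest() != expected_sha1s[remote_path]:
                continue  # changed since it was indexed; skip it
            # step 3: store the content in swh
            store_in_swh(remote_path, data)
```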
For now, only files with size <= 100MiB are handled.
51056 files with size > 100MiB will remain to be injected after this first batch is consumed.
Multiple problems:
- it is slow
- ssh connection failures occur when too many jobs run concurrently (both with ssh multiplexing and with plain scp)
So:
I have stopped the current sesi injection and purged the queue.
Analysis:
What we want to do is inject the files into swh, plain and simple.
The fact that the files are remote is an implementation detail.
So, following the separation of concerns principle, why not use a mount point?
sshfs seems to solve exactly this need.
Consequently:
I have adapted sesidownloader to read files from an sshfs mount point /antelink, backed by sesi:/antelink (a sketch of the adapted reading path follows).
So far so good.
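For comparison, a sketch of the same job body once the mount is in place (mounted with something like `sshfs sesi:/antelink /antelink`); the function and parameter names are hypothetical, and the point is only that the download step disappears and files are read like any local file.

```python
import hashlib
import os


def process_block_from_mount(relative_paths, expected_sha1s, store_in_swh,
                             mount_point='/antelink'):
    """Hypothetical adapted job body: with the sshfs mount in place there
    is no scp step; remoteness is the mount's problem, not the loader's."""
    for rel_path in relative_paths:
        # strip any leading '/' so the path stays under the mount point
        local_path = os.path.join(mount_point, rel_path.lstrip('/'))
        with open(local_path, 'rb') as f:
            data = f.read()
        # same corruption filter as before
        if hashlib.sha1(data).hexdigest() != expected_sha1s[rel_path]:
            continue
        store_in_swh(rel_path, data)
```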
The new jobs are now being re-injected into the sesi queue.
Pros:
- separation of concerns
- no need for a separate policy per file size (the big files had been kept out of scope for later)
Cons or concerns:
The only concern is to check that the sshfs mount point does not drop (the way nfs sometimes does...); a minimal guard sketch follows.
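One possible mitigation, sketched below under the assumption that the worker can (re)mount the share itself: check that /antelink is still a mount point before each job, and remount it with sshfs's reconnect option if it has dropped. This is not existing swh tooling, just an illustration.

```python
import os
import subprocess


def ensure_antelink_mount(mount_point='/antelink', remote='sesi:/antelink'):
    """Best-effort guard to run before each job: if the sshfs mount has
    dropped, lazily unmount any stale entry and remount it."""
    if os.path.ismount(mount_point):
        return
    # lazy-unmount a possibly stale FUSE mount, then remount via sshfs
    subprocess.run(['fusermount', '-uz', mount_point], check=False)
    subprocess.run(['sshfs', remote, mount_point,
                    '-o', 'reconnect,ServerAliveInterval=15'], check=True)
```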