Page MenuHomeSoftware Heritage

azure: Specialize get_batch() to fetch objects concurrently
ClosedPublic

Authored by vlorentz on May 17 2021, 1:40 PM.

Details

Summary

The default implementation is a naive 'for' loop calling 'get'
sequentially.

This replaces this default implementation with a local asyncio loop,
which manages a task for each of the blobs to fetch.

This reduces my cooker test run time by about 30%

Depends on D5742

Test Plan

bleh

Diff Detail

Repository
rDOBJS Object storage
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5743 (id=20534)

Could not rebase; Attempt merge onto 326fcef47b...

Updating 326fcef..e003bb4
Fast-forward
 swh/objstorage/backends/azure.py              | 108 ++++++++++++++++++++++++--
 swh/objstorage/tests/test_objstorage_azure.py |  90 +++++++++++++++------
 2 files changed, 168 insertions(+), 30 deletions(-)
Changes applied before test
commit e003bb4491c0d8e761b150c9bec4cc68bdaa53b2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 17 13:40:18 2021 +0200

    azure: Specialize get_batch() to fetch objects concurrently
    
    The default implementation is a naive 'for' loop calling 'get'
    sequentially.
    
    This replaces this default implementation with a local asyncio loop,
    which manages a task for each of the blobs to fetch.

commit 5a3111c6d02c0d6fcdf6c295ff6b12aff9fd4fb7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 17 13:32:07 2021 +0200

    azure: Use container URLs instead of ContainerClient instances as attributes
    
    A future commit will start using aio.ContainerClient instances, which must
    be created on the fly from these URLs.

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/23/ for more details.

Rather than duplicate most of the get() code into _get_async, would it make sense to turn get into a shim over _get_async? I guess the question is, how expensive is it to start up and tear down the local asyncio loop? (it makes sense for get_batch, but does it make sense for get itself too?)

dedup code by making get() use the async impl

Build is green

Patch application report for D5743 (id=20545)

Could not rebase; Attempt merge onto 326fcef47b...

Updating 326fcef..09fea27
Fast-forward
 swh/objstorage/backends/azure.py              | 116 +++++++++++++++++++++++---
 swh/objstorage/tests/test_objstorage_azure.py |  90 +++++++++++++++-----
 2 files changed, 172 insertions(+), 34 deletions(-)
Changes applied before test
commit 09fea27df2f1f2ecda48b5b3d9e3de3e6f5764ac
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 17 13:40:18 2021 +0200

    azure: Specialize get_batch() to fetch objects concurrently
    
    The default implementation is a naive 'for' loop calling 'get'
    sequentially.
    
    This replaces this default implementation with a local asyncio loop,
    which manages a task for each of the blobs to fetch.
    
    To avoid code duplication, this also rewrites 'get' to use the same
    code as 'get_batch'. Quick benchmarks did not show a noticeable
    slowdown when doing this.

commit 5a3111c6d02c0d6fcdf6c295ff6b12aff9fd4fb7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon May 17 13:32:07 2021 +0200

    azure: Use container URLs instead of ContainerClient instances as attributes
    
    A future commit will start using aio.ContainerClient instances, which must
    be created on the fly from these URLs.

See https://jenkins.softwareheritage.org/job/DOBJS/job/tests-on-diff/24/ for more details.

douardda added inline comments.
swh/objstorage/backends/azure.py
157

I may have missed something completely here, but why is this generating a dict with only one "" entry?

swh/objstorage/backends/azure.py
157

I want it to be a dict for consistency with the prefixed implementation, but in this implementation, only one entry is used, so that's a compromise

This revision is now accepted and ready to land.May 21 2021, 10:10 AM
swh/objstorage/backends/azure.py
157

I got after writing this comment, however IMHO this should be explicitly explained somewhere... Thinking of it now, wouldn't it be possible to only keep the Prefixed version of this class (and make it compatible with an empty prefix or something)? Not related to this diff, obviously, just wondering...

swh/objstorage/backends/azure.py
157

I guess so