Page MenuHomeSoftware Heritage

Setup async interface for discovery module
ClosedPublic

Authored by Alphare on Sep 26 2022, 6:29 PM.

Details

Summary

This will allow us to use this interface in async code like `swh-scanner`.

Unfortunately, this means calling `asyncio.run` for sync code, but the
performance impact should be negligible.

The `swh_storage.*missing*` APIs are inconsistent for each type, which
requires a lot of boilerplate code. This should be addressed in a
follow-up.

There is also one hack (marked as such inline) needed to have an API that
isn't also plagued by implementation details of needing to keep
ID -> object matchers around. This should be a removed as a direct
consequence of aligning the storage APIs.

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8538 (id=30790)

Rebasing onto 798f749e66...

Current branch diff-target is up to date.
Changes applied before test
commit 19db5769cbe618b77c322189b66994222e2c8668
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:21:54 2022 +0200

    Setup async interface for discovery module
    
    This will allow us to use this interface in async code like ``swh-scanner``.
    
    Unfortunately, this means calling ``asyncio.run`` for sync code, but the
    performance impact should be negligible.
    
    The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
    requires a lot of boilerplate code. This should be addressed in a
    follow-up.
    
    There is also one hack (marked as such inline) needed to have an API that
    isn't also plagued by implementation details of needing to keep
    ID -> object matchers around. This should be a removed as a direct
    consequence of aligning the storage APIs.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/896/ for more details.

One question for reviewers: I've tried using Protocol for ArchiveDiscoveryInterface (should I?), but I'm not sure if it's possible to define a concrete method within the protocol, namely if __init__ could be implemented by default in ArchiveDiscoveryInterface.

classes should not inherit from protocols. Use an abstract class instead of a protocol if you want to use inheritance.

Build is green

Patch application report for D8538 (id=30795)

Rebasing onto 798f749e66...

Current branch diff-target is up to date.
Changes applied before test
commit fe6610629b1bff545b30ea86743d8e90a0487a47
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:21:54 2022 +0200

    Setup async interface for discovery module
    
    This will allow us to use this interface in async code like ``swh-scanner``.
    
    Unfortunately, this means calling ``asyncio.run`` for sync code, but the
    performance impact should be negligible.
    
    The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
    requires a lot of boilerplate code. This should be addressed in a
    follow-up.
    
    There is also one hack (marked as such inline) needed to have an API that
    isn't also plagued by implementation details of needing to keep
    ID -> object matchers around. This should be a removed as a direct
    consequence of aligning the storage APIs.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/897/ for more details.

I don't understand why inconsistencies in swh-storage's API require passing this dict around.

Could you explain what needs to change in the swh-storage API, and how this code would look like after the change?

swh/loader/core/discovery.py
44–53

You need to decorate them with @abc.abstractmethod

swh/loader/core/discovery.py
29

docstring, please

Alphare marked an inline comment as done.

Add docstring, abstract method decorators, explain hack

Build is green

Patch application report for D8538 (id=30851)

Rebasing onto 798f749e66...

Current branch diff-target is up to date.
Changes applied before test
commit 404c1b9a30928f42a8f32bbfbcfe64986fdc8126
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:21:54 2022 +0200

    Setup async interface for discovery module
    
    This will allow us to use this interface in async code like ``swh-scanner``.
    
    Unfortunately, this means calling ``asyncio.run`` for sync code, but the
    performance impact should be negligible.
    
    The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
    requires a lot of boilerplate code. This should be addressed in a
    follow-up.
    
    There is also one hack (marked as such inline) needed to have an API that
    isn't also plagued by implementation details of needing to keep
    ID -> object matchers around. This should be a removed as a direct
    consequence of aligning the storage APIs. See inline comment.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/899/ for more details.

I still don't understand why you need the _all_contents hack instead of doing this:

async def skipped_content_missing(
    self, skipped_contents: List[Sha1Git]
) -> Iterable[Sha1Git]:
    """List skipped content missing from the archive by sha1"""
    contents = [{"sha1_git": s} s in skipped_contents]
    return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

I still don't understand why you need the _all_contents hack instead of doing this:

async def skipped_content_missing(
    self, skipped_contents: List[Sha1Git]
) -> Iterable[Sha1Git]:
    """List skipped content missing from the archive by sha1"""
    contents = [{"sha1_git": s} s in skipped_contents]
    return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

Ah. I was sure skipped_content_missing took SkippedContent instead of dicts. Sorry about that, will update this patch soon.

Build is green

Patch application report for D8538 (id=30912)

Rebasing onto 798f749e66...

Current branch diff-target is up to date.
Changes applied before test
commit 1facea3cd215155f77ae4083a33837b7c6f642b0
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Mon Sep 26 18:21:54 2022 +0200

    Setup async interface for discovery module
    
    This will allow us to use this interface in async code like ``swh-scanner``.
    
    Unfortunately, this means calling ``asyncio.run`` for sync code, but the
    performance impact should be negligible.
    
    The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
    requires a lot of boilerplate code. This should be addressed in a
    follow-up.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/905/ for more details.

This revision is now accepted and ready to land.Sep 28 2022, 5:51 PM

I'm coming late to the battle ¯\_(ツ)_/¯.

Still a couple of remarks/questions.

  1. Isn't this missing an asyncio dependency requirements in /requirements.txt
  1. Why is this defined here and not in the storage since all the current dicsovery methods involve the storage?

Still a couple of remarks/questions.

  1. Isn't this missing an asyncio dependency requirements in /requirements.txt

nope, it's fine [1]

[1]

10:35 <+ardumont> is asyncio part of python3.7?
10:36 <+ardumont> (for loader.core and the new discovery async code ^)
10:36 <Alphare> yes
10:36 <Alphare> https://docs.python.org/3.7/library/asyncio.html
  1. Why is this defined here and not in the storage since all the current dicsovery methods involve the storage?

see [2]

[2] https://forge.softwareheritage.org/D8539?id=30916#inline-60749