
Refactor output of indexer storage's `get` methods.
Open, Normal, Public

Description

In the Indexer Storage API, most get methods (e.g. content_ctags_get) yield items in this format:

{"id": sha1, "tool": TOOL, "ctags": ctags1}
{"id": sha1, "tool": TOOL, "ctags": ctags2}

Starting with T782/D301, content_fossology_license_get yields items in this format:

{sha1: {"tool": TOOL, "licenses": [license1, license2]}}

This task is twofold:

  • first, change content_fossology_license_get to return a single dictionary instead of yielding dictionaries that each hold a single key-value pair
  • second, refactor the other _get methods to use the same format.
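The first step can be illustrated with a minimal sketch. It is not the actual Indexer Storage code: the helper name merge_single_key_dicts, the sample keys and the tool/license values are all hypothetical, and it assumes each sha1 appears at most once in the yielded items.

```python
from typing import Any, Dict, Hashable, Iterable


def merge_single_key_dicts(
    items: Iterable[Dict[Hashable, Any]]
) -> Dict[Hashable, Any]:
    """Merge an iterable of {sha1: value} dictionaries into one dictionary."""
    merged: Dict[Hashable, Any] = {}
    for item in items:
        for sha1, value in item.items():
            if sha1 in merged:
                # Assumption: each sha1 appears at most once in the input.
                # Fail loudly rather than silently overwrite a result.
                raise ValueError(f"duplicate key: {sha1!r}")
            merged[sha1] = value
    return merged


# Hypothetical sample data, shaped like the items currently yielded.
yielded = [
    {b"sha1-a": {"tool": "tool-1", "licenses": ["GPL-3.0"]}},
    {b"sha1-b": {"tool": "tool-1", "licenses": ["MIT", "BSD-3-Clause"]}},
]
result = merge_single_key_dicts(yielded)
```

With this shape, a caller can look up result[sha1] directly instead of iterating over single-key dictionaries.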

The files that should be edited are:

  • swh/indexer/tests/storage/test_storage.py: these are the test cases for both Indexer Storage implementations. They should be adapted to test for the new format.
  • swh/indexer/storage/in_memory.py: a fully in-memory implementation of the Indexer Storage. This is the easiest implementation to start with.
  • swh/indexer/storage/__init__.py and swh/indexer/storage/converters.py: an implementation of the Indexer Storage backed by PostgreSQL. Look at D301 for examples of how to do it.

Event Timeline

vlorentz created this task. Dec 6 2018, 4:04 PM
vlorentz triaged this task as Low priority.
vlorentz updated the task description. Dec 6 2018, 4:07 PM
vlorentz updated the task description.
vlorentz raised the priority of this task from Low to Normal. Dec 13 2018, 1:56 PM
Sowmya added a subscriber: Sowmya. Mar 9 2019, 3:43 AM
twitu added a subscriber: twitu. Edited Sun, Jul 7, 4:23 PM

I am familiar with the web APIs and went through the discussion in T782. By "output a single dictionary", I believe you mean something like this:

{
  sha1: [
    {tool: TOOL, licenses: [licenses]},
    {tool: TOOL, licenses: [licenses]}
  ],

  sha1: [
    {tool: TOOL, licenses: [licenses]},
    {tool: TOOL, licenses: [licenses]}
  ]
}
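As a rough illustration of this shape, here is a small sketch that groups flat rows by their id, so that each sha1 maps to a list of per-tool results. The function name group_by_id and the sample rows are hypothetical; the real storage code may build its results differently.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_id(rows: Iterable[Dict[str, Any]]) -> Dict[Any, List[Dict[str, Any]]]:
    """Group flat {"id": ..., ...} rows into {id: [row without "id", ...]}."""
    grouped = defaultdict(list)
    for row in rows:
        # Copy everything except "id" so the input rows are left untouched.
        entry = {key: value for key, value in row.items() if key != "id"}
        grouped[row["id"]].append(entry)
    return dict(grouped)


# Hypothetical rows: two tools indexed sha1-a, one tool indexed sha1-b.
rows = [
    {"id": b"sha1-a", "tool": "tool-1", "licenses": ["MIT"]},
    {"id": b"sha1-a", "tool": "tool-2", "licenses": ["MIT", "Apache-2.0"]},
    {"id": b"sha1-b", "tool": "tool-1", "licenses": ["GPL-3.0"]},
]
grouped = group_by_id(rows)
```

Mapping each sha1 to a list keeps results from several tools for the same content, which a plain one-value-per-key dictionary would overwrite.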

Following the setup guide, I have set up the indexer locally and will start refactoring the APIs one by one.

vlorentz updated the task description. Mon, Jul 8, 1:25 PM
twitu added a comment. Tue, Jul 9, 5:28 AM

I went through all the tests in test_storage.py. It appears that only content_fossology_license_get needs to be refactored. All other storage methods return a dictionary or a list of dictionaries, where each dictionary has multiple keys.