
Web API: make endpoints that expose extracted metadata return *lists* of factual information
Closed, Migrated, Edits Locked

Description

We are currently following quite closely our principle of "storing only facts" in the archive: we never report bare statements like "the licence is GPL", but qualified statements like "according to tool X, the licence is GPL".

See for example https://archive.softwareheritage.org/api/1/content/sha1:2d8280fbabf9a1eabbcbc562b9763cb07952118b/license/

{
    "content_url": "/api/1/content/sha1:2d8280fbabf9a1eabbcbc562b9763cb07952118b/",
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "licenses": [
        "Dual-license",
        "GPL",
        "MIT",
        "MIT-style"
    ],
    "tool": {
        "configuration": {
            "command_line": "nomossa <filepath>"
        },
        "id": 1,
        "name": "nomos",
        "version": "3.1.0rc2-31-ga2cbb8c"
    }
}

The JSON format in the above example, though, is not general enough. In the future, we will store several qualified statements for the same property, like "according to tool X, licence is GPL, and according to tool Y, licence is MPL".

Event Timeline

rdicosmo created this object in space Restricted Space.
zack triaged this task as Normal priority. Sep 23 2017, 9:27 AM

Right, we should make that JSON output a list of dictionaries, rather than a single dictionary.

zack renamed this task from "Prepare JSON output for storing multiple value/tool entries" to "Web API: make endpoints that expose extracted metadata return *lists* of factual information". Sep 23 2017, 9:29 AM
zack shifted this object from the Restricted Space space to the S1 Public space.

@zack, I would like to work on this issue.

The value of tool must look like this:

{
    ...
    "tool": [
        {
            "version": "3.1.0rc2-31-ga2cbb8c",
            "name": "nomos",
            "id": 1,
            "configuration": {
                "command_line": "nomossa <filepath>"
            }
        }
    ]
}

Right?

In T782#17393, @s wrote:
{
    ...
    "tool": [
        {
            "version": "3.1.0rc2-31-ga2cbb8c",
            "name": "nomos",
            "id": 1,
            "configuration": {
                "command_line": "nomossa <filepath>"
            }
        }
    ]
}

Right?

yep, that is correct!

@zack, I've been getting myself acquainted with the swh codebase for the past few days.

Would updating the db_to_* functions in [[https://forge.softwareheritage.org/source/swh-indexer/browse/master/swh/indexer/storage/converters.py||swh.indexer.storage.converters]] resolve this task?

Also, I want to know the canonical way to set up the development environment (my host system runs Debian unstable).

What I've already done:

  • Cloned swh-environment
  • Pulled all swh repos using mr up
  • Set up Arcanist
  • Created ~/.config/swh/storage.yml and ~/.config/swh/webapp/webapp.yml
  • And ran:
# terminal 0

## install external python deps.
cd /path/to/swh-environment
virtualenv --python=python3 .venv
source .venv/bin/activate
for i in $(bin/ls-py-modules); do
    (cd "$i" && pip3 install -r requirements.txt)
done

## dump test data into postgres
make rebuild-testdata

# terminal 0

## start storage server
cd /path/to/swh-environment
source .venv/bin/activate
source pythonpath.sh
python3 -m swh.storage.api.server ~/.config/swh/storage.yml

# terminal 1

## start webapp
cd /path/to/swh-environment
source .venv/bin/activate
source pythonpath.sh 
python3 -m swh.web.manage runserver

Now, if I go to http://127.0.0.1:5004/api/1/content/sha1:1fc6129a692e7a87b5450e2ba56e7669d0c5775d/license/, it returns a 503:

{
    "exception": "StorageAPIError",
    "reason": "An unexpected error occurred in the api backend: HTTPConnectionPool(host='127.0.0.1', port=5007): Max retries exceeded with url: /content/fossology_license (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))"
}

How do I make swh-indexer available at 127.0.0.1:5007?
Is there anything else that I've to setup in the development environment?

Hi @s, and thanks for your interest in helping us out!

First of all, I should apologize for the lack of "getting started" documentation for developers. It's on my plate, and the stuff you see at https://docs.softwareheritage.org/devel/ is the beginning of it, but it's "not there yet", so your questions are entirely legitimate. We'll try to walk you through what's needed here, hoping that in the future the doc will directly address needs like yours.

@ardumont can you give a hand to @s here?

@s Hello, following your nicely summed-up actions, you'd need one more terminal and one more configuration file.

The main storage and the indexer storage have been split, so you'd need:

In ~/.config/swh/storage/indexer.yml:

storage:
  cls: local
  args:
    db: softwareheritage-indexer-dev

Note: the make rebuild-testdata you mention already set up the db for you.

And then run (following your initial logic):

# terminal 2
## start indexer storage server
$ cd /path/to/swh-environment
$ source .venv/bin/activate
$ source pythonpath.sh
$ python3 -m swh.indexer.storage.api.server
 * Running on http://0.0.0.0:5007/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 113-212-857

Cheers,

In T782#17395, @zack wrote:
In T782#17393, @s wrote:
{
    ...
    "tool": [
        {
            "version": "3.1.0rc2-31-ga2cbb8c",
            "name": "nomos",
            "id": 1,
            "configuration": {
                "command_line": "nomossa <filepath>"
            }
        }
    ]
}

Right?

yep, that is correct!

Not totally so: licenses and tool must be side by side to be factual.
Nomossa actually detected all 4 licenses (we have only one tool for license detection).

So that'd be more:

{
    "content_url": "/api/1/content/sha1:2d8280fbabf9a1eabbcbc562b9763cb07952118b/",
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "facts": [
        {
            "licenses": [
                "Dual-license",
                "GPL",
                "MIT",
                "MIT-style"
            ],
            "tool": {
                "configuration": {
                    "command_line": "nomossa <filepath>"
                },
                "id": 1,
                "name": "nomos",
                "version": "3.1.0rc2-31-ga2cbb8c"
            }
        }
    ]
}

Would updating the db_to_* functions in swh.indexer.storage.converters resolve this task?

I'd say along those lines, yes, though probably one step before, starting from the get endpoint: the storage will return the content's associated licenses (including the tool).

Aggregating those according to my previous comment...

...

Something is fishy here.

That endpoint returns a list already...

Yes, there is an issue in the web app: it extracts only the first result, instead of listing all licenses detected by each tool for the sha1.

So the result that the current endpoint should actually return is:

[{  // <- that's a list of {tool, licenses, id}
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "licenses": [
        "Dual-license",
        "GPL",
        "MIT",
        "MIT-style"
    ],
    "tool": {
        "configuration": {
            "command_line": "nomossa <filepath>"
        },
        "id": 1,
        "name": "nomos",
        "version": "3.1.0rc2-31-ga2cbb8c"
    }
},
{ 
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "licenses": [
        "MIT",
        "MIT-style"
    ],
    "tool": {
        "configuration": {
            "command_line": "scancode <filepath>"
        },
        "id": 2,
        "name": "scancode",
        "version": "9.2.3"
    }
}]

Which correctly answers @rdicosmo's initial concern (note: there is redundancy, since the 'id' is repeated for each tool).

We can take the opportunity to rewrite the endpoint's result as mentioned.

Starting from the get endpoint.

That would indeed be one place to convert the result as proposed.

And then adapt the webapp endpoint to actually list all detected licenses for that sha1 (instead of taking only the first result, as it currently does).
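The grouping described in this thread can be sketched in plain Python. This is only an illustration: the function name, the row shapes, and the values are hypothetical, not the actual swh-indexer storage code.

```python
# Sketch: group flat per-tool license rows into {id, facts} entries,
# so each content id carries one fact per tool. Row shapes are made up
# for the example and do not match the real storage output exactly.
def aggregate_by_id(rows):
    grouped = {}
    for row in rows:
        fact = {"licenses": row["licenses"], "tool": row["tool"]}
        grouped.setdefault(row["id"], []).append(fact)
    return [{"id": id_, "facts": facts} for id_, facts in grouped.items()]

rows = [
    {"id": "2d8280fb", "licenses": ["GPL", "MIT"], "tool": {"name": "nomos"}},
    {"id": "2d8280fb", "licenses": ["MIT"], "tool": {"name": "scancode"}},
]
result = aggregate_by_id(rows)
# One entry per content id, each holding one fact per tool.
```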

Cheers,

Still, following the sample I proposed, that would then give something like:

[{  // <- that's still a list because there can be other different ids
    "id": "2d8280fbabf9a1eabbcbc562b9763cb07952118b",
    "facts": [
        {
            "licenses": [
                "Dual-license",
                "GPL",
                "MIT",
                "MIT-style"
            ],
            "tool": {
                "configuration": {
                    "command_line": "nomossa <filepath>"
                },
                "id": 1,
                "name": "nomos",
                "version": "3.1.0rc2-31-ga2cbb8c"
            }
        },
        {
            "licenses": [
                "MIT",
                "MIT-style"
            ],
            "tool": {
                "configuration": {
                    "command_line": "scancode <filepath>"
                },
                "id": 2,
                "name": "scancode",
                "version": "9.2.3"
            }
        }
    ]
}]

Note that this could also be a dict with key <id> and the current "facts" as values (that would just diverge from the current pattern we have in the storage api: list in, list out).
That conversion could be done in the web layer (if that makes sense).
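A sketch of that web-layer conversion, turning the list-shaped storage result into a dict keyed by content id. The function name and entry shapes are hypothetical, not the actual swh-web code.

```python
# Turn the "list in, list out" storage result into a dict keyed by
# content id. Entry shapes here are illustrative examples only.
def facts_by_id(results):
    return {entry["id"]: entry["facts"] for entry in results}

results = [
    {"id": "abc", "facts": [{"licenses": ["MIT"], "tool": {"name": "nomos"}}]},
]
by_id = facts_by_id(results)
# by_id["abc"] is the list of facts detected for that content
```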

Cheers,

@ardumont Thank you very much for helping out.

I was able to start the swh-indexer at 127.0.0.1:5007.

I modified the storage module in swh-indexer to make IndexerStorage.content_fossology_license_get return a list of dicts as suggested in T782#18418.

Here's the diff:

diff --git a/swh/indexer/storage/__init__.py b/swh/indexer/storage/__init__.py
index e71523d..1c6f0c9 100644
--- a/swh/indexer/storage/__init__.py
+++ b/swh/indexer/storage/__init__.py
@@ -298,9 +298,16 @@ class IndexerStorage():
         db = self.db
         db.store_tmp_bytea(ids, cur)
 
+        d = {}
         for c in db.content_fossology_license_get_from_temp():
             license = dict(zip(db.content_fossology_license_cols, c))
-            yield converters.db_to_fossology_license(license)
+
+            id_ = license['id']
+            if  id_ not in d:
+                d[id_] = []
+            d[id_].append(converters.db_to_fossology_license(license))
+
+        yield self._generate(d)
 
     @db_transaction
     def content_fossology_license_add(self, licenses,
@@ -548,3 +555,10 @@ class IndexerStorage():
         if not idx:
             return None
         return dict(zip(self.db.indexer_configuration_cols, idx))
+
+    def _generate(self, d):
+        for id_, facts in d.items():
+            yield {
+                'id': id_,
+                'facts': facts
+                }
diff --git a/swh/indexer/storage/converters.py b/swh/indexer/storage/converters.py
index db7a295..3cf5da1 100644
--- a/swh/indexer/storage/converters.py
+++ b/swh/indexer/storage/converters.py
@@ -129,7 +129,6 @@ def db_to_metadata(metadata):
 
 def db_to_fossology_license(license):
     return {
-        'id': license['id'],
         'licenses': license['licenses'],
         'tool': {
             'id': license['tool_id'],

The modified files:

I would like to know if I'm going in the right direction.

Does make rebuild-storage-testdata add objects (hashes) with license information connected to them? If yes, how do I find these objects? -- I would like to test the /api/1/content/sha1:OBJ_HASH/license endpoint with these objects.

@ardumont Thank you very much for helping out.

You are welcome ;)

I was able to start the swh-indexer at 127.0.0.1:5007.

Awesome!

I modified the storage module in swh-indexer to make IndexerStorage.content_fossology_license_get return a list of dicts as suggested in T782#18418.
I would like to know if I'm going in the right direction.

Yes, sounds like it :)

Does make rebuild-storage-testdata add objects (hashes) with license information connected to them? If yes, how do I find these objects? -- I would like to test the /api/1/content/sha1:OBJ_HASH/license endpoint with these objects.

No, that routine drops all dbs and rebuilds them to their latest schema (well, the current HEAD in your swh-environment).
To have data, you need to run our different modules (loader, then indexer).

I started writing something, but it takes more time to be thorough, so I stopped midway to reply to you on the first points already.
More to come soon on how you could have data locally ;)

Cheers,

More to come soon on how you could have data locally ;)

Here, it goes. Mounting partial dumps (softwareheritage-dev, softwareheritage-indexer-dev dbs).

  1. Follow make rebuild-testdata.
  2. Mount a dump with the following small data for the indexer db:
  • Retrieve the file from F3040109.
  • Then mount the dump using something like:
cat swh-index-dev.114-small-data.sql | psql softwareheritage-indexer-dev

This will only mount the data for the indexer api though.

  3. Consistently mount the data dump for the softwareheritage db

To actually see the data from the webapp, you need to have something consistent in both dbs.

For this, mount the following dump, F3040114.

cat 118-hylang-hy-origin-data.sql | psql softwareheritage-dev

And there you should be good to go.

Note that there is a more complete explanation on actually loading and indexing the data yourself, but it's kinda long (way more than this one is ;)
For what it's worth, that was my initial response: P238.

Cheers,

@ardumont, Thanks for the data dumps (F3040109 and F3040114). I'm able [[ https://ricketyspace.net/tmp/t782-swh-indexer-changes.webm | to test the /api/1/content/sha1:HASH/license/ endpoint ]].

Here's an updated diff:

diff --git a/swh/indexer/storage/__init__.py b/swh/indexer/storage/__init__.py
index e71523d..1f06056 100644
--- a/swh/indexer/storage/__init__.py
+++ b/swh/indexer/storage/__init__.py
@@ -292,15 +292,24 @@ class IndexerStorage():
             list: dictionaries with the following keys:
 
             - id (bytes)
-            - licenses ([str]): associated licenses for that content
+            - facts ([str]): associated licenses for that content
 
         """
         db = self.db
         db.store_tmp_bytea(ids, cur)
 
+        d = {}
         for c in db.content_fossology_license_get_from_temp():
             license = dict(zip(db.content_fossology_license_cols, c))
-            yield converters.db_to_fossology_license(license)
+
+            id_ = license['id']
+            if  id_ not in d:
+                d[id_] = []
+            d[id_].append(converters.db_to_fossology_license(license))
+
+        for id_, facts in d.items():
+            yield { 'id': id_, 'facts': facts}
+
 
     @db_transaction
     def content_fossology_license_add(self, licenses,
diff --git a/swh/indexer/storage/converters.py b/swh/indexer/storage/converters.py
index db7a295..3cf5da1 100644
--- a/swh/indexer/storage/converters.py
+++ b/swh/indexer/storage/converters.py
@@ -129,7 +129,6 @@ def db_to_metadata(metadata):
 
 def db_to_fossology_license(license):
     return {
-        'id': license['id'],
         'licenses': license['licenses'],
         'tool': {
             'id': license['tool_id'],

If the changes look good, I'll update the failing tests and submit a patch for code review.

Note that there is a more complete explanation on actually loading and indexing the data yourself, but it's kinda long (way more than this one is ;)
For what it's worth, that was my initial response: P238.

Thanks for writing the quick howto! I'll check it out.

@ardumont, Thanks for the data dumps (F3040109 and F3040114).

Sure

I'm able to test the /api/1/content/sha1:HASH/license/ endpoint.

Awesome. That way of demoing is quite cool, loving it, thanks.

Setup

If the changes look good, I'll update the failing tests and submit a patch for code review.

I'd be more than happy if you could open a diff now with Arcanist
(even if tests are not completely ok yet ;)

Once the setup is done, it's only an arc diff origin/master command away.
More detail can be found in our wiki documentation.

That would help in reviewing and possibly suggesting code adaptation in context (as explained below).
And as you are quite advanced in the contributions now, that would make sense ;)

Thanks in advance.

Changes

I currently see some possible changes that your really cool demo illustrates.

  1. The value returned should be directly the d dictionary's content:
  • So either yield from d should be ok (possibly; not checked),
  • or your return must change to plain d (you would then possibly need to change the decorator from @db_transaction_generator to @db_transaction).

Either way, another change is needed [[ https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/service.py;10e6e3f2cf54665adbce7c80845873b639a5b86f$218 | in the actual buggy webapp's current endpoint service ]]. That should be adapted to whatever you choose.

Note: it's buggy because, in the current state prior to your adaptation, we could already have multiple results (one per tool), even though there is only one sha1 in the endpoint's input (T782#18417).
It's not that much of a deal now, because only one tool is currently running, but long term it's not correct. Hence this issue in the first place, btw ;)

  2. Implementation-wise, you could use from collections import defaultdict and initialize d as a defaultdict(list); that way you can remove the if id_ not in d test and directly use d[id_].append. You can check some sample use cases in the python3 documentation.
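That defaultdict suggestion would look roughly like this, as a standalone sketch outside the real IndexerStorage class (the function name and row shapes are hypothetical):

```python
from collections import defaultdict

# Same grouping as in the proposed diff, but defaultdict(list) removes
# the explicit "if id_ not in d" membership test.
def group_licenses(license_rows):
    d = defaultdict(list)
    for license in license_rows:
        d[license["id"]].append(
            {"licenses": license["licenses"], "tool": license["tool"]}
        )
    for id_, facts in d.items():
        yield {"id": id_, "facts": facts}

rows = [
    {"id": "2d8280fb", "licenses": ["GPL"], "tool": {"id": 1, "name": "nomos"}},
    {"id": "2d8280fb", "licenses": ["MIT"], "tool": {"id": 2, "name": "scancode"}},
]
out = list(group_licenses(rows))
```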

Cheers,

Setup

If the changes look good, I'll update the failing tests and submit a patch for code review.

I'd be more than happy if you could open a diff now with Arcanist
(even if tests are not completely ok yet ;)

Ok! I've opened D301 for review.

In the meanwhile, I will update the tests.

Ok! I've opened D301 for review.

Awesome! Thanks. Heading there.

ardumont changed the task status from Open to Work in Progress. Mar 21 2018, 5:10 PM
ardumont assigned this task to s.

@ardumont, Can I:

  • Merge and push the changes in D301 to swh-indexer master branch?
  • Merge and push the changes in D302 to swh-web master branch?

Also, I would like to know if the following methods in swh.indexer.storage.IndexerStorage and the corresponding endpoints in swh-web must be updated to work like the updated content_fossology_license_get method:

  • swh.indexer.storage.IndexerStorage.content_ctags_get
  • swh.indexer.storage.IndexerStorage.content_mimetype_get
  • swh.indexer.storage.IndexerStorage.content_language_get
  • swh.indexer.storage.IndexerStorage.content_metadata_get
In T782#18978, @s wrote:

Can I:

  • Merge and push the changes in D301 to swh-indexer master branch?
  • Merge and push the changes in D302 to swh-web master branch?

Hi @s, first of all thanks a lot for your work on this. I can congratulate you for being the first non-staff contributor to the Software Heritage code base, and that's awesome!

Accompanying that record, though, is the fact that we hadn't yet tested our processes for incoming contributions, which explains why we kept you waiting for a while.

The last blocker on this front is that Inria needs to make sure it can redistribute the code you're contributing, to avoid the risk that the code has to be pulled later on, when it would be hard to do. In practical terms, that means you'll have to tick a box under a declaration stating that you're allowed to contribute your patch (what is generally called a "contribution agreement"). The good news is that the process will be automated and integrated into this forge. The bad news is that the agreement is not ready yet. We're now working on a first draft, which we will publicly announce when it's ready, seeking comments from our community and documenting the process for everyone. I expect a first version that we can use for this first contribution to be ready in a couple of weeks; I hope that's fine with you. If you're interested, I can share working drafts with you before they're public; in that case, feel free to reach out to me via email at zack@upsilon.cc . If not, just stay tuned and I'll ping you when I have a first version.

I'm really sorry to keep you waiting, but I can assure you that this process we're going through is gonna help future contributors *a lot*. So thanks again not only for your contribution, but also for bearing with us!

"zack (Stefano Zacchiroli)" <forge@softwareheritage.org> writes:

zack added a comment.

In T782#18978 <https://forge.softwareheritage.org/T782#18978>, @s wrote:

> Can I:
>
> - Merge and push the changes in D301 <https://forge.softwareheritage.org/D301> to swh-indexer master branch?
> - Merge and push the changes in D302 <https://forge.softwareheritage.org/D302> to swh-web master branch?

Hi @s, first of all thanks a lot for your work on this. I can
congratulate for being the first non-staffer contributor to the
Software Heritage code base, and that's awesome!

Accompanying that record, though, there is the fact that we hadn't yet
tested our processes for incoming contributions, which explains why we
had you waiting for a while.

The last blocker on this front is that Inria needs to make sure they
can redistribute the code you're contributing, to avoid risks the code
will have to be pulled later on when it will be hard to do. In
practical terms that means that you'll have to tick a box under a
declaration stating you're allowed to contribute your patch (what is
generally called a "contribution agreement"). The good news is that
the process will be automated and integrated into this forge. The bad
news is that the agreement is not ready yet. We're now working on a
first draft of it, which we will publicly announce when it's ready,
seeking comments from our community, and documenting the process for
everyone. I expect a first version that we can use for this first
contribution will be ready in a couple of weeks, I hope that's fine
with you. If you're interested, I can share working drafts with you
before it's public; in that case feel free to reach out to me via
email at zack@upsilon.cc . If not, just stay tuned and I'll ping you
when I've a first version.

@zack, Thanks for getting back to me on this. Let me know when the
Contributor Agreement is ready for signing.

So, good news at last! The CLA is ready, here it is: L3 . (It's missing a preamble, but it's pretty short anyway without one.)
@s : Can you have a look and sign it (using your legal name)?
Happy to answer any question you might have!

In T782#20044, @zack wrote:

So, good news at last! The CLA is ready, here it is: L3 . (It's missing a preamble, but it's pretty short anyway without one.)
@s : Can you have a look and sign it (using your legal name)?
Happy to answer any question you might have!

@zack, I signed it.

Hello,

Thanks again for your contributions.

This has been merged and deployed.

@team, I added a wiki page about the team workflow for integrating external contributions [1].

[1] https://wiki.softwareheritage.org/index.php?title=External_contribution_integration

Cheers,