Page MenuHomeSoftware Heritage

Add the origin intrinsic metadata indexer
ClosedPublic

Authored by vlorentz on Oct 22 2018, 11:42 AM.

Details

Summary

Import indexer from D537 (+ apply @moranegg's comments)

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-metadata-indexer
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1856
Build 2234: arc lint + arc unit

Event Timeline

I'm not done yet ;)

swh/indexer/__init__.py
17

Can you please drop the SWH prefix (and the Task suffix as well), that's redundant (we are in a swh module, as per the fqdn shows already).

I'm trying to progressively remove that when i stumble upon it, same for the other task.

swh/indexer/origin_head.py
41

persistent

58

It's redundant, isn't it?

You cannot pass here since the method will already have returned if one of them is None.

73

For information, you may want to update this with the latest swh.storage.algo.snapshot_get_all_branches api olasd opened [1]

[1] rDSTOad922485bf1cdfd3ad8a911ab390466e76a6881e

swh/indexer/tasks.py
24

Something is not clear here, what's the indexer? Is that an indexer or a chain?

At first i thought you called the indexer twice...
Now i'm just wondering ;)

I have the feeling that's the indexer's next_step call which returns a celery chain (or something, well, a celery task or group of tasks...). But i'm 100% sure ;)

Can you please rename indexer to something more explicit?

vlorentz marked 5 inline comments as done.
swh/indexer/origin_head.py
58

Oops, should have been an and in the condition, not an or.

swh/indexer/tasks.py
24

Oops, good catch. run was not meant to be called twice, that's totally a TypeError (which is not caught by tests because this code is mocked).

This comment was removed by vlorentz.
  • Remove useless constant.
swh/indexer/tests/__init__.py
28

To avoid repeating myself, D518#inline-2716.

Also, maybe stack this diff onto D518?

vlorentz retitled this revision from Add the origin intrinsic metadata to Add the origin intrinsic metadata indexer.Oct 23 2018, 11:36 AM

I have the following error: P321
Don't know if this is due to local configuration or about the other unlanded diff.

Anyway, looks good.

swh/indexer/metadata.py
298

I'm kinda curious about the argument annotation *, could you point me to documentation explaining its usage?
or tell me what's the advantage here..

310

indentation?

swh/indexer/__init__.py
17

same for the other task.

I was unclear, i was just speaking about the other new one introduced here OriginMetadataTask.

Since everything is touched now... and that impacts the production anyway.
We could benefit from modifying everything correctly (once ;).

Could you please also remove the suffix Task from all those tasks?
(I'll deal with the production impacts when times comes to package and deploy)

Thanks in advance.

  • Fix blocking test and remove Celery test-specific hacks.
swh/indexer/origin_head.py
73

What i failed to explain is that currently, the snapshot_get_latest is a paginated result so you won't have all branches if you keep the old api call.

I think that's relevant here.

vlorentz added inline comments.
swh/indexer/metadata.py
298

It prevents the next arguments from being called as positional arguments, ie. index('foo', 'bar', 'baz') is invalid; you have to use this: index('foo', revisions_metadata='bar', origin_head_map='baz').

That's intended as a way to prevent mistakes that would cause hard-to-find bugs in future versions with a different API.

310

I too wanted to indent, but flakes8 doesn't agree with us

swh/indexer/origin_head.py
73

Will do in a later diff.

swh/indexer/origin_head.py
12

oopsy ;)

34

doopsy, cf. patch below.

diff --git a/swh/indexer/origin_head.py b/swh/indexer/origin_head.py
index 7551fde..7ec69ce 100644
--- a/swh/indexer/origin_head.py
+++ b/swh/indexer/origin_head.py
@@ -10,7 +10,7 @@ import logging
 from celery import chain

 from swh.indexer.indexer import OriginIndexer
-from swh.indexer.tasks import OriginMetadataTask, RevisionMetadataTask
+from swh.indexer.tasks import OriginMetadata, RevisionMetadata


 class OriginHeadIndexer(OriginIndexer):
@@ -31,8 +31,8 @@ class OriginHeadIndexer(OriginIndexer):

     CONFIG_BASE_FILENAME = 'indexer/origin_head'

-    revision_metadata_task = RevisionMetadataTask()
-    origin_intrinsic_metadata_task = OriginMetadataTask()
+    revision_metadata_task = RevisionMetadata()
+    origin_intrinsic_metadata_task = OriginMetadata()

     def filter(self, ids):
         yield from ids
73

sure, the explanation was needed though ;)

That sounds fine as long as [1] is done.

make test ok (with [1])

[1] D558#inline-2791

This revision now requires changes to proceed.Oct 24 2018, 1:49 PM
This revision is now accepted and ready to land.Oct 24 2018, 1:54 PM
This revision was automatically updated to reflect the committed changes.
ardumont added inline comments.
swh/indexer/__init__.py
17
swh/indexer/tasks.py
22–28

@vlorentz And that's wrong. I did not see that ;)

In production:

Oct 25 17:32:54 worker01 python3[79159]: [2018-10-25 17:32:54,017: ERROR/MainProcess] Task swh.indexer.tasks.OrchestratorAllContents[b39b8a5b-a046-4e0e-98b0-88aeab65af5a] raised unexpected: NotImplementedError('Tasks must define the run method.',)
                                         Traceback (most recent call last):
                                           File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 240, in trace_task
                                             R = retval = fun(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 438, in __protected_call__
                                             return self.run(*args, **kwargs)
                                           File "/usr/lib/python3/dist-packages/celery/app/task.py", line 437, in run
                                             raise NotImplementedError('Tasks must define the run method.')
                                         NotImplementedError: Tasks must define the run method.

We must keep the swh.scheduler.task here.

vlorentz added inline comments.
swh/indexer/tasks.py
22–28

Let's just rename run_task to run