
enable metadata injection from loader core
Closed, Public

Authored by moranegg on Nov 8 2017, 5:19 PM.

Details

Summary

Explicit implementation of this method should be in the subclass.

Depends on a future diff to adapt the storage persistence of the origin_metadata (to sandbox the tool/provider bootstrap)

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont requested changes to this revision. Nov 8 2017, 5:29 PM

Just remove the comment and it's good to go.

swh/loader/core/loader.py
776

The comment is not necessary since you make it explicit in the method's docstring :)

This revision now requires changes to proceed. Nov 8 2017, 5:29 PM
olasd requested changes to this revision. Nov 8 2017, 5:36 PM
olasd added a subscriber: olasd.

I think metadata should be added after the data has been fetched (maybe in store_data()?), as it might not be available before.

The critical part of loading metadata is the storage calls, so that's really what should be implemented here, as send_origin_metadata / send_origin_visit_metadata functions that call the underlying storage methods.

I think metadata should be added after the data has been fetched (maybe in store_data()?), as it might not be available before.

We are biased by the deposit (we do have it before).

But indeed, sounds more extensible that way.
Plus, we keep the symmetry with other objects (origin, content, directory, revision...).

The critical part of loading metadata is the storage calls, so that's really what should be implemented here, as send_origin_metadata / send_origin_visit_metadata functions that call the underlying storage methods.

Right.

  • Refactor adding origin_metadata to the loaders
In D266#5422, @olasd wrote:

I think metadata should be added after the data has been fetched (maybe in store_data()?), as it might not be available before.

ack but I'm keeping it in store_metadata()

The critical part of loading metadata is the storage calls, so that's really what should be implemented here, as send_origin_metadata / send_origin_visit_metadata functions that call the underlying storage methods.

Excellent remark.
store_metadata() is now in the generic load path and implemented in the subclass, which calls send_origin_metadata() to do the actual exchanges with storage.
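
For illustration, a minimal sketch of that split (the class names and the origin_metadata_add signature are assumptions, not the actual code):

# Sketch: the generic loader declares the hook, the subclass implements it
class CoreLoader:
    def store_metadata(self):
        # No-op by default; subclasses that have metadata override this.
        pass

    def send_origin_metadata(self, visit_date, provider_id, tool_id, metadata):
        # The actual exchange with swh-storage happens here.
        self.storage.origin_metadata_add(self.origin_id, visit_date,
                                         provider_id, tool_id, metadata)

class DepositLoader(CoreLoader):
    def store_metadata(self):
        self.send_origin_metadata(self.visit_date, self.provider_id,
                                  self.tool_id, self.metadata)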

  • Change metadata provider and tool as additional config

Huh?

I think you arc diff --update'd the wrong diff. It should have been arc diff --update D265.

swh/deposit/api/private/deposit_read.py
88 (On Diff #893)

I don't think you need it for that endpoint.
That might be a remnant from your tests :)

Sorry about the mix-up, here it is again.

  • Refactor adding origin_metadata to the loaders
swh/loader/core/loader.py
258

It did not strike me earlier, but still.

This shows that it would be better to hide the resolution of the tool and the provider in swh-storage.
Instead of 1 request to store the metadata, we do 3 for each call.

Maybe a FIXME for later?
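
For context, the 3 requests look roughly like this (a sketch; the two lookup method names are assumptions):

tool = self.storage.indexer_configuration_get(self.tool)          # request 1
provider = self.storage.metadata_provider_get_by(self.provider)   # request 2
self.storage.origin_metadata_add(self.origin_id, self.visit_date, # request 3
                                 provider['id'], tool['id'],
                                 self.metadata)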

Sorry about the mix-up, here it is again.

no worries there, it happens :)

moranegg added inline comments.
swh/loader/core/loader.py
258

Yes, you are right.
And after your conversation with olasd yesterday about self-contained tools, it seems they should be created on the fly if missing, so we don't need to store the data in swh-data.sql.

It's time to open a new diff on storage for this, and have the storage resolve the missing data.

  • Refactor adding origin_metadata to the loaders
ardumont added inline comments.
swh/loader/core/loader.py
258

Thinking more about this.

I realize that what stands out here is not so much the 3 requests.
It is really that it is all done in the same place (which is, for example, a divergence from all the other methods).
When we call that method, we should already have everything we need to just store the metadata.

I think what we should do here is decouple the origin_metadata creation from the tool and provider creation.
A way to do this is to open new methods here for tool and provider creation (send_tool, send_provider, or better names if you have some :).

Then for example, in our loader-deposit's case we can:

  • adapt the prepare method implementation: the loader will register its tool and provider (resulting in a tool_id and provider_id).
  • then, in the store_metadata implementation, we can call the method implementation you propose with those ids (tool_id, provider_id).

And there we obtain a self-contained loader.
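
A sketch of that decoupling (the helper names follow the proposal above; the rest is illustrative):

class DepositLoader(CoreLoader):
    def prepare(self, *args, **kwargs):
        # Register tool and provider up front, keeping only the ids
        self.tool_id = self.send_tool(self.tool)
        self.provider_id = self.send_provider(self.provider)

    def store_metadata(self):
        # Everything is already resolved: a single storage exchange
        self.send_origin_metadata(self.visit_date, self.provider_id,
                                  self.tool_id, self.metadata)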

Checking the storage, the only change needed there is to open the indexer_configuration_add endpoint.
This does not exist yet, since we currently add new values directly in the DB.

This relates to T851: to implement this, we will need to open such an endpoint (I'm heading towards it now, since I'm waiting for other stuff to finish anyway).

This revision now requires changes to proceed. Nov 17 2017, 10:00 AM

The question remains: is it the storage's job or the other packages' job to know that a certain provider has a certain id?
Your solution [solA], sending tool and provider as part of the prepare method, is a good idea, but
the resolution is made by the loader and not by the storage, so it keeps the resolution in the loader's scope.

If we want to start the new logic [solB] we were discussing, that only the storage resolves the ids
and all the other elements that send stuff to the storage are oblivious to the creation of tools or providers.
That will leave the resolution out of scope for the loader and all other future tools that will implement the origin_metadata_add function.

Advantages for solA:

  • no changes to storage today
  • storage only deals with storing and not id resolution, which might depend on context

Advantages for solB:

  • origin_metadata will be updated by loaders, listers and registries_crawlers, so each one will have to prepare the tool and provider and resolve the ids;

if the resolution is at the storage level, there is no need for send_tool and send_provider to be implemented for each tool

  • a more separated implementation of the control on the inserts to the DB

With all that, I'm still not sure what is better.
We can do solA now (which is easier to implement today) and solB later when refactoring.

In D266#5544, @moranegg wrote:

The question remains: is it the storage's job or the other packages' job to know that a certain provider has a certain id?
Your solution [solA], sending tool and provider as part of the prepare method, is a good idea, but
the resolution is made by the loader and not by the storage, so it keeps the resolution in the loader's scope.

Yes, but the loader doing the resolution does not mean it is not self-contained already.
For example, we already do this for the origin.

Why would we need to do it differently for the tool, the provider, and whatever new object we would need in the future?

If we want to start the new logic [solB] we were discussing,

In my opinion, it's already done.

that only the storage resolves the ids

That's where we diverge: for me, it is not required that the storage resolve the ids for the loader to be self-contained.

For example, today other layers need to be able to deal with tool registering (all the indexers + loader-deposit).

If we hide this in the storage, we need to duplicate that logic wherever it is hidden (in my understanding of your approach, that would be in the origin_metadata_add entrypoint and in content_mimetype_add, content_language_add...).

I don't have time to answer the rest yet.
I'll answer a little later :)

If we want to start the new logic [solB] we were discussing, that only the storage resolves the ids

I don't think that's needed anymore (as mentioned in my previous comment).

and all the other elements that send stuff to the storage are oblivious to the creation of tools or providers.
That will leave the resolution out of scope for the loader and all other future tools that will implement the origin_metadata_add function.

Advantages for solA:

  • no changes to storage today
  • storage only deals with storing and not id resolution, which might depend on context

Yes, and that's what we are doing today.

Also, now T851 is done so the new endpoint to register tools exists (and the one for provider existed already).
So we can implement and finish the loader deposit soon.

Advantages for solB:

  • origin_metadata will be updated by loaders, listers and registries_crawlers, so each one will have to prepare the tool and provider and resolve the ids;

Yes, as that is the case for loaders for the origin case :)

if the resolution is at the storage level, there is no need for send_tool and send_provider to be implemented for each tool

Because you are only seeing the origin_metadata case (think about the indexers, which also need the tool-registering step).
Maybe what you say is true for the provider part (which may be specific to the origin_metadata endpoint), but it's not the case for the tools.

  • a more separated implementation of the control on the inserts to the DB

That would not be separated but bound to the origin_metadata case.

With all that, I'm still not sure what is better.

For me, it's solA.
Also, that would be the wise choice with regard to the deadline (it's near now :).

we can do solA now (which is easier to implement today)

Yes, let's do that.

Cheers,

After some thought this weekend:
I agree that it is more logical to have a send_provider and a send_tool at the loader level when the tool or provider is not in storage.

I'm now working on these methods in the loader core:

in prepare_metadata:
    get provider_id and tool_id from storage
    if there are no ids:
        send_tool or send_provider is called to create the entries and return the ids
send_metadata sends the metadata entry to storage as it is stored now (no storage refactoring)

That seems to be the way we are already dealing with origin_id.
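
A sketch of that flow, assuming the storage exposes the lookup endpoints used below:

def prepare_metadata(self):
    # Resolve the ids from storage first; create entries only if missing
    tool = self.storage.indexer_configuration_get(self.tool)
    self.tool_id = tool['id'] if tool else self.send_tool(self.tool)
    provider = self.storage.metadata_provider_get_by(self.provider)
    if provider:
        self.provider_id = provider['id']
    else:
        self.provider_id = self.send_provider(self.provider)

def send_metadata(self):
    # Plain storage call; no storage-side refactoring needed
    self.storage.origin_metadata_add(self.origin_id, self.visit_date,
                                     self.provider_id, self.tool_id,
                                     self.metadata)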

This comment was removed by moranegg.
  • Refactor for provider_id and tool_id resolution at loader level
moranegg retitled this revision from "Added metadata injection possible from loader core" to "enable metadata injection from loader core". Nov 20 2017, 3:35 PM
ardumont added inline comments.
swh/loader/core/loader.py
251

The indexer_configuration_add endpoint takes a list of dicts (with the usual tool keys) as parameter.
It creates the tools not already present in the storage.
In any case, it returns the list of all tools (a list of dicts), populated with their identifiers.
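
For example (a sketch; the tool values are made up):

tools = self.storage.indexer_configuration_add([{
    'tool_name': 'swh-deposit',
    'tool_version': '0.0.1',
    'tool_configuration': {'sword_version': 2},
}])
# each returned dict is the corresponding input tool plus its 'id'
tool_id = tools[0]['id']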

653

I did not think to add this here. I guess that depends on this question:
will this always be the way an origin_metadata is prepared?

If yes, then ok ;)

663

You can then (due to the earlier change) use send_tool directly here.

So I think this would do:

tools = self.send_tool([tool])
tool_id = tools[0]['id']
670

I think we should raise an error since we cannot really do anything without a tool after this point.
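
Something along these lines, combined with the send_tool call above (a sketch):

tools = self.send_tool([tool])
if not tools:
    raise ValueError('no tool available, cannot store origin_metadata')
tool_id = tools[0]['id']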

679

As a note here, I noticed that we can call the provider_add endpoint multiple times and it will happily oblige, without filtering out any existing provider. So we need to do the filtering ourselves, as you did here.

This may need to change to work the same way other more common exists

682

Same here, we cannot really do anything after this point without a provider, so we should raise an error.

This revision now requires changes to proceed. Nov 21 2017, 10:34 AM
swh/loader/core/loader.py
653

We have to access the storage to fetch the data; as I understood it, this shouldn't be in the loader-deposit.

And yes, we will always have to resolve the ids somewhere. I thought the storage should do it,
but if we're doing it in the loader, it should be in the loader-core.

663

ack

670

OK

679

Could you rephrase? I don't get that.

682

OK

swh/loader/core/loader.py
679

Could you rephrase? I don't get that.

oops, yes, typos, sorry.

This may need to change to work the same way other more common exists

This may need to change to work the same way other more common *endpoints* do.

The metadata_provider_add method in the storage does a simple insert.
So if we call it multiple times:

  • if the right index exists on the field, this will break (we do not check for errors here, so the failure will cascade)
  • otherwise, if there is no index, this will insert a new provider with the same configuration multiple times.

(I did not check the state of the index on the provider table, hence the conditional in my reasoning.)

What we have in most other endpoints is to only add what does not already exist.
This is the behavior of content_add, directory_add, etc.
We filter out duplicates so as to only insert new entries.

I implemented this in the new indexer_configuration_add if you want to check ;)
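
The behavior in question looks roughly like this (a sketch, not the actual implementation; _insert_tool is a hypothetical internal helper):

def indexer_configuration_add(self, tools):
    # Only insert the tools that are not already present, then return
    # every requested tool populated with its id
    results = []
    for tool in tools:
        existing = self.indexer_configuration_get(tool)
        if existing is None:
            existing = self._insert_tool(tool)
        results.append(existing)
    return results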

swh/loader/core/loader.py
251

This was done that way, IIRC, for the metadata indexer for example, which will use several tools, not only one.

  • Align prepare_metadata and send_tool with Storage capacity to return id for existing tool
This revision was automatically updated to reflect the committed changes.