
enable metadata injection from loader core
Closed, Public

Authored by moranegg on Nov 8 2017, 5:19 PM.

Details

Summary

Explicit implementation of this method should be in the subclass.

Depends on a future diff to adapt the storage persistence of the origin_metadata (to sandbox the tool/provider bootstrap)

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont requested changes to this revision. Nov 8 2017, 5:29 PM

Just remove the comment and it's good to go.

swh/loader/core/loader.py
776

The comment is not necessary since you make it explicit in the method's docstring :)

This revision now requires changes to proceed. Nov 8 2017, 5:29 PM
olasd requested changes to this revision. Nov 8 2017, 5:36 PM
olasd added a subscriber: olasd.

I think metadata should be added after the data has been fetched (maybe in store_data()?), as it might not be available before.

The critical part of loading metadata is the storage calls, so that's really what should be implemented here, as send_origin_metadata / send_origin_visit_metadata functions that call the underlying storage methods.

I think metadata should be added after the data has been fetched (maybe in store_data()?), as it might not be available before.

We are biased by the deposit (we do have it before).

But indeed, sounds more extensible that way.
Plus, we keep the symmetry with other objects (origin, content, directory, revision...).

The critical part of loading metadata is the storage calls, so that's really what should be implemented here, as send_origin_metadata / send_origin_visit_metadata functions that call the underlying storage methods.

Right.

  • Refactor adding origin_metadata to the loaders
In D266#5422, @olasd wrote:

I think metadata should be added after the data has been fetched (maybe in store_data()?), as it might not be available before.

ack but I'm keeping it in store_metadata()

The critical part of loading metadata is the storage calls, so that's really what should be implemented here, as send_origin_metadata / send_origin_visit_metadata functions that call the underlying storage methods.

Excellent remark.
store_metadata() is now in the generic load path and implemented in the subclass, which calls send_origin_metadata() to do the actual exchanges with storage.
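
For illustration, a minimal sketch of that split (the class names and the origin_metadata_add signature are assumptions, not the actual code):

# Sketch: the generic loader declares the hook, the subclass implements it
class CoreLoader:
    def store_metadata(self):
        # No-op by default; subclasses that have metadata override this.
        pass

    def send_origin_metadata(self, visit_date, provider_id, tool_id, metadata):
        # The actual exchange with swh-storage happens here.
        self.storage.origin_metadata_add(self.origin_id, visit_date,
                                         provider_id, tool_id, metadata)

class DepositLoader(CoreLoader):
    def store_metadata(self):
        self.send_origin_metadata(self.visit_date, self.provider_id,
                                  self.tool_id, self.metadata)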

  • Change metadata provider and tool as additional config

Huh?

I think you arc diff --update'd the wrong diff. It should have been arc diff --update D265.

swh/deposit/api/private/deposit_read.py
88 (On Diff #893)

I don't think you need it for that endpoint.
That might be a remnant from your tests :)

Sorry about the mix-up, here it is again.

  • Refactor adding origin_metadata to the loaders
swh/loader/core/loader.py
258

It did not strike me earlier, but still.

This shows that it would be better to hide the resolution of the tool and the provider in swh-storage.
Instead of 1 request to store the metadata, we do 3 for each call.

Maybe a FIXME for later?
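
For context, the 3 requests look roughly like this (a sketch; the two lookup method names are assumptions):

tool = self.storage.indexer_configuration_get(self.tool)          # request 1
provider = self.storage.metadata_provider_get_by(self.provider)   # request 2
self.storage.origin_metadata_add(self.origin_id, self.visit_date, # request 3
                                 provider['id'], tool['id'],
                                 self.metadata)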

Sorry about the mix-up, here it is again.

no worries there, it happens :)

moranegg added inline comments.
swh/loader/core/loader.py
258

Yes, you are right.
And after your conversation with olasd yesterday about self-contained tools, it seems they should be created on the fly if missing, so we don't need to store the data in swh-data.sql.

It's time to open a new diff on storage for this, and have the storage resolve the missing data.

  • Refactor adding origin_metadata to the loaders
ardumont added inline comments.
swh/loader/core/loader.py
258

Thinking more about this.

I realize that what stands out here is not so much the 3 requests.
It is really that it is all done in the same place (which is, for example, a divergence from all the other methods).
When we call that method, we should already have everything we need to just store the metadata.

I think what we should do here is decouple the origin_metadata creation from the tool and provider creation.
A way to do this is to open new methods here for tool and provider creation (send_tool, send_provider, or better names if you have some :).

Then for example, in our loader-deposit's case we can:

  • adapt the prepare method implementation: the loader will register its tool and provider (resulting in a tool_id and provider_id).
  • then, in the store_metadata implementation, we can call the method implementation you propose with those ids (tool_id, provider_id).

And there we obtain a self-contained loader.
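
A sketch of that decoupling (the helper names follow the proposal above; the rest is illustrative):

class DepositLoader(CoreLoader):
    def prepare(self, *args, **kwargs):
        # Register tool and provider up front, keeping only the ids
        self.tool_id = self.send_tool(self.tool)
        self.provider_id = self.send_provider(self.provider)

    def store_metadata(self):
        # Everything is already resolved: a single storage exchange
        self.send_origin_metadata(self.visit_date, self.provider_id,
                                  self.tool_id, self.metadata)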

Checking the storage, the only change needed there is to open the indexer_configuration_add endpoint.
This does not exist yet, since we currently add new values directly in the DB.

This relates to T851: to implement this, we will need to open such an endpoint (I'm heading towards it now, since I'm waiting for other stuff to finish anyway).

This revision now requires changes to proceed. Nov 17 2017, 10:00 AM

The question remains: is it the storage's job or the other packages' job to know that a certain provider has a certain id?
Your solution [solA], sending tool and provider as part of the prepare method, is a good idea, but
the resolution is made by the loader and not by the storage, so it keeps the resolution in the loader's scope.

If we want to start the new logic [solB] we were discussing, that only the storage resolves the ids
and all the other elements that send stuff to the storage are oblivious to the creation of tools or providers.
That will leave the resolution out of scope for the loader and all other future tools that will implement the origin_metadata_add function.

Advantages for solA:

  • no changes to storage today
  • storage only deals with storing and not id resolution, which might depend on context

Advantages for solB:

  • origin_metadata will be updated by loaders, listers and registries_crawlers, so each one will have to prepare the tool and provider and resolve the ids;

if the resolution is at the storage level, there is no need for send_tool and send_provider to be implemented for each tool

  • a more separated implementation of the control on the inserts to the DB

With all that, I'm still not sure what is better.
We can do solA now (which is easier to implement today) and solB later when refactoring.

In D266#5544, @moranegg wrote:

The question remains: is it the storage's job or the other packages' job to know that a certain provider has a certain id?
Your solution [solA], sending tool and provider as part of the prepare method, is a good idea, but
the resolution is made by the loader and not by the storage, so it keeps the resolution in the loader's scope.

Yes, but the loader doing the resolution does not mean it is not self-contained already.
For example, we already do this for the origin.

Why would we need to do it differently for the tool, the provider, and whatever new object we would need in the future?

If we want to start the new logic [solB] we were discussing,

In my opinion, it's already done.

that only the storage resolves the ids

That's where we diverge: for me, it is not required that the storage resolve the ids for the loader to be self-contained.

For example, today other layers need to be able to deal with tool registering (all the indexers + loader-deposit).

If we hide this in the storage, we need to duplicate that logic wherever it is hidden (in my understanding of your approach, that would be in the origin_metadata_add entrypoint and in content_mimetype_add, content_language_add...).

I don't have time to answer the rest yet.
I'll answer a little later :)

If we want to start the new logic [solB] we were discussing, that only the storage resolves the ids

I don't think that's needed anymore (as mentioned in my previous comment).

and all the other elements that send stuff to the storage are oblivious to the creation of tools or providers.
That will leave the resolution out of scope for the loader and all other future tools that will implement the origin_metadata_add function.

Advantages for solA:

  • no changes to storage today
  • storage only deals with storing and not id resolution, which might depend on context

Yes, and that's what we are doing today.

Also, now T851 is done so the new endpoint to register tools exists (and the one for provider existed already).
So we can implement and finish the loader deposit soon.

Advantages for solB:

  • origin_metadata will be updated by loaders, listers and registries_crawlers, so each one will have to prepare the tool and provider and resolve the ids;

Yes, as that is the case for loaders for the origin case :)

if the resolution is at the storage level, there is no need for send_tool and send_provider to be implemented for each tool

Because you are only seeing the origin_metadata case (think about the indexers, which also need the tool-registering step).
Maybe what you say is true for the provider part (which may be specific to the origin_metadata endpoint), but it's not the case for the tools.

  • a more separated implementation of the control on the inserts to the DB

That would not be separated but bound to the origin_metadata case.

With all that, I'm still not sure what is better.

For me, it's solA.
Also, that would be the wise choice with regard to the deadline (it's near now :).

we can do solA now (which is easier to implement today)

Yes, let's do that.

Cheers,

After some thought this weekend:
I agree that it is more logical to have a send_provider and a send_tool at the loader level when the tool or provider is not in storage.

I'm now working on these methods in the loader core:

in prepare_metadata:
    get provider_id and tool_id from storage
    if there are no ids:
        send_tool or send_provider is called to create the entries and return the ids
send_metadata sends the metadata entry to storage as it is stored now (no storage refactoring)

That seems to be the way we are already dealing with origin_id.
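
A sketch of that flow, assuming the storage exposes the lookup endpoints used below:

def prepare_metadata(self):
    # Resolve the ids from storage first; create entries only if missing
    tool = self.storage.indexer_configuration_get(self.tool)
    self.tool_id = tool['id'] if tool else self.send_tool(self.tool)
    provider = self.storage.metadata_provider_get_by(self.provider)
    if provider:
        self.provider_id = provider['id']
    else:
        self.provider_id = self.send_provider(self.provider)

def send_metadata(self):
    # Plain storage call; no storage-side refactoring needed
    self.storage.origin_metadata_add(self.origin_id, self.visit_date,
                                     self.provider_id, self.tool_id,
                                     self.metadata)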

This comment was removed by moranegg.
  • Refactor for provider_id and tool_id resolution at loader level
moranegg retitled this revision from "Added metadata injection possible from loader core" to "enable metadata injection from loader core". Nov 20 2017, 3:35 PM
ardumont added inline comments.
swh/loader/core/loader.py
251

The indexer_configuration_add endpoint takes a list of dicts (with the usual tool keys) as parameter.
It creates the tools not already present in the storage.
In any case, it returns the list of all tools (a list of dicts), populated with their identifiers.
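
For example (a sketch; the tool values are made up):

tools = self.storage.indexer_configuration_add([{
    'tool_name': 'swh-deposit',
    'tool_version': '0.0.1',
    'tool_configuration': {'sword_version': 2},
}])
# each returned dict is the corresponding input tool plus its 'id'
tool_id = tools[0]['id']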

653

I did not think to add this here. I guess that depends on this question:
will this always be the way an origin_metadata is prepared?

If yes, then ok ;)

663

You can then (due to the earlier change) use send_tool directly here.

So I think this would do:

tools = self.send_tool([tool])
tool_id = tools[0]['id']
670

I think we should raise an error since we cannot really do anything without a tool after this point.
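
Something along these lines, combined with the send_tool call above (a sketch):

tools = self.send_tool([tool])
if not tools:
    raise ValueError('no tool available, cannot store origin_metadata')
tool_id = tools[0]['id']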

679

As a note here, I noticed that we can call the provider_add endpoint multiple times and it will happily oblige, without filtering out any existing provider. So we need to do the filtering ourselves, as you did here.

This may need to change to work the same way other more common exists

682

Same here, we cannot really do anything after this point without a provider, so we should raise an error.

This revision now requires changes to proceed. Nov 21 2017, 10:34 AM
swh/loader/core/loader.py
653

We have to access the storage to fetch the data; as I understood it, this shouldn't be in the loader-deposit.

And yes, we will always have to resolve the ids somewhere. I thought the storage should do it,
but if we're doing it in the loader, it should be in the loader-core.

663

ack

670

OK

679

Could you rephrase? I don't get that.

682

OK

swh/loader/core/loader.py
679

Could you rephrase? I don't get that.

oops, yes, typos, sorry.

This may need to change to work the same way other more common exists

This may need to change to work the same way other more common *endpoints* do.

The metadata_provider_add method in the storage does a simple insert.
So if we call it multiple times:

  • if the right index exists on the field, this will break (we do not check for errors here, so the failure will cascade)
  • otherwise, if there is no index, this will insert a new provider with the same configuration multiple times.

(I did not check the state of the index on the provider table, hence the conditional in my reasoning.)

What we have in most other endpoints is to only add what does not already exist.
This is the behavior of content_add, directory_add, etc.
We filter out duplicates so as to only insert new entries.

I implemented this in the new indexer_configuration_add if you want to check ;)
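
The behavior in question looks roughly like this (a sketch, not the actual implementation; _insert_tool is a hypothetical internal helper):

def indexer_configuration_add(self, tools):
    # Only insert the tools that are not already present, then return
    # every requested tool populated with its id
    results = []
    for tool in tools:
        existing = self.indexer_configuration_get(tool)
        if existing is None:
            existing = self._insert_tool(tool)
        results.append(existing)
    return results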

swh/loader/core/loader.py
251

This was done that way, IIRC, for the metadata indexer for example, which will use several tools, not only one.

  • Align prepare_metadata and send_tool with Storage capacity to return id for existing tool
This revision was automatically updated to reflect the committed changes.