
Add to_model() method to from_disk.{Content,Directory}, to convert to canonical model objects.
Closed, Public

Authored by vlorentz on Feb 20 2020, 4:51 PM.

Details

Summary

These methods will be used by loaders, so they can deal only with canonical
model objects instead of each having to do the same conversion themselves.

Depends on D2702.
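For illustration, the intended loader-side usage looks roughly like this (a minimal sketch; the checkout path is made up, and to_model() is the method added by this diff):

```
from swh.model import from_disk

# Scan a checkout from disk as usual (the path is a made-up example):
directory = from_disk.Directory.from_disk(path=b'/tmp/checkout')

# New in this diff: get a canonical swh.model.model.Directory directly,
# instead of each loader massaging the get_data() dicts itself.
dir_model = directory.to_model()
```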

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

What do you think?

swh/model/from_disk.py
185–191

By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.
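The proposal amounts to something like the following standalone sketch (purely illustrative: it does not inherit from the real attrs-based Content and is not the shipped design):

```
import functools


class DiskBackedContentSketch:
    """Content-like object that keeps hashes, length and a path, and only
    reads the file bytes from disk the first time .data is accessed."""

    def __init__(self, sha1, sha1_git, sha256, blake2s256, length, path):
        self.sha1 = sha1
        self.sha1_git = sha1_git
        self.sha256 = sha256
        self.blake2s256 = blake2s256
        self.length = length
        self.path = path

    @functools.cached_property
    def data(self):
        # Lazy read: the full file contents are only held in memory once
        # something actually needs them.
        with open(self.path, 'rb') as f:
            return f.read()
```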

By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

No, in that case it's a SkippedContent.

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

In D2703#64521, @olasd wrote:

That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

They don't exactly return the same thing; from_disk.Content.get_data() also returns perms, name, and path.

As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

I don't understand what collect() has to do with it.

By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

No, in that case it's a SkippedContent.

Then you're generating SkippedContents for all files that are supposed to be lazy-loaded from disk. That's even more wrong than my original expectation.

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

The difference between what and what?

In D2703#64521, @olasd wrote:

That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

They don't exactly return the same thing; from_disk.Content.get_data() also returns perms, name, and path.

That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned Contents and Directorys is sensible (see also the lazy loading comments).

As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

I don't understand what collect() has to do with it.

collect() is the external API of Directory that's used by loaders to get the lists of objects to load. That function calls get_data() on every node in the tree, which currently converts the data structure to a dict. So you don't have any Content objects on which to call .to_model by the time you've collect()ed, which makes to_model's usefulness on its own quite limited.
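For context, the flow being described looks roughly like this (a sketch assuming collect()'s {object_type: {id: dict}} return shape; the checkout path and variable names are made up):

```
from swh.model import from_disk

directory = from_disk.Directory.from_disk(path=b'/tmp/checkout')

# collect() walks the tree and calls get_data() on every node, so by this
# point everything is already a plain dict grouped by object type: there
# is no from_disk.Content left on which to call .to_model().
objects = directory.collect()
content_dicts = list(objects.get('content', {}).values())
directory_dicts = list(objects.get('directory', {}).values())
```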

In D2703#64548, @olasd wrote:

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

The difference between what and what?

I mean, what's the point of lazy-loading, instead of using the data argument of from_disk?

In D2703#64549, @olasd wrote:

That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned Contents and Directorys is sensible (see also the lazy loading comments).

ack

collect() is the external API of Directory that's used by loaders to get the lists of objects to load. That function calls get_data() on every node in the tree, which currently converts the data structure to a dict. So you don't have any Content objects on which to call .to_model by the time you've collect()ed, which makes to_model's usefulness on its own quite limited.

ack

In D2703#64548, @olasd wrote:

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

The difference between what and what?

I mean, what's the point of lazy-loading, instead of using the data argument of from_disk?

Ah, right.

Most loaders that are using from_disk will be re-processing files that they've already imported in one way or another: for instance, the svn loader runs it for every commit in turn, and package manager loaders for every version in turn. Lazy-loading allows us to only carry the full data in memory at the point where we're actually sending it to the archive, rather than all the time. In testing, the memory consumption savings were quite substantial.
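A rough sketch of the memory pattern being argued for; every helper here is a hypothetical stand-in, and only the "read the bytes just before sending them to the archive" idea comes from the discussion:

```
def load_all_versions(versions, storage, scan_version, content_objects_of):
    """Hypothetical loader loop illustrating the deferred-read pattern.

    ``storage``, ``scan_version`` and ``content_objects_of`` are stand-ins
    for whatever a real loader uses to talk to the archive, check out one
    version, and enumerate its content-like objects (each carrying a path
    but no data).
    """
    for version in versions:
        tree = scan_version(version)          # hashes files, keeps no bytes
        contents = content_objects_of(tree)

        # Bytes are only read here, right before being sent to the archive,
        # instead of being held in memory for the whole run.
        storage.content_add([c.with_data() for c in contents])
```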

refactor diff to defer loading content data until after creating the content object.

Thanks!

swh/model/tests/test_from_disk.py
84

I don't think the expected content is that useful. The test would be easier to follow with just assert content_with_data.data == b'bar'.

Using explicit contents such as b'written before instantiation', b'written before with_data()', b'written after with_data()' instead of metasyntactic stuff would make the test even clearer.
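Concretely, the suggested test could look something like this (a sketch: the fixture layout and the from_file()/to_model() chain are assumptions, but the final assertion is the one suggested above):

```
import os

from swh.model import from_disk


def test_content_with_data(tmp_path):
    path = tmp_path / 'file'
    path.write_bytes(b'bar')

    # Assumed conversion chain: scan the file, convert to a model object,
    # then load the data from disk on demand.
    content = from_disk.Content.from_file(path=os.fsencode(path)).to_model()
    content_with_data = content.with_data()

    assert content_with_data.data == b'bar'
```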

This revision is now accepted and ready to land. Feb 24 2020, 2:42 PM