
Add to_model() method to from_disk.{Content,Directory}, to convert to canonical model objects.
Closed, Public

Authored by vlorentz on Feb 20 2020, 4:51 PM.

Details

Summary

These methods will be used by loaders, so they can deal only with canonical
model objects instead of each having to do the same conversion themselves.

Depends on D2702.
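For illustration, the intended loader-side usage looks roughly like this (a minimal sketch; the checkout path is made up, and to_model() is the method added by this diff):

```
from swh.model import from_disk

# Scan a checkout from disk as usual (the path is a made-up example):
directory = from_disk.Directory.from_disk(path=b'/tmp/checkout')

# New in this diff: get a canonical swh.model.model.Directory directly,
# instead of each loader massaging the get_data() dicts itself.
dir_model = directory.to_model()
```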

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

What do you think?

swh/model/from_disk.py
185–191

By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.
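The proposal amounts to something like the following standalone sketch (purely illustrative: it does not inherit from the real attrs-based Content and is not the shipped design):

```
import functools


class DiskBackedContentSketch:
    """Content-like object that keeps hashes, length and a path, and only
    reads the file bytes from disk the first time .data is accessed."""

    def __init__(self, sha1, sha1_git, sha256, blake2s256, length, path):
        self.sha1 = sha1
        self.sha1_git = sha1_git
        self.sha256 = sha256
        self.blake2s256 = blake2s256
        self.length = length
        self.path = path

    @functools.cached_property
    def data(self):
        # Lazy read: the full file contents are only held in memory once
        # something actually needs them.
        with open(self.path, 'rb') as f:
            return f.read()
```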

By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

No, in that case it's a SkippedContent.

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

In D2703#64521, @olasd wrote:

That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

They don't exactly return the same thing; from_disk.Content.get_data() also returns perms, name, and path.

As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

I don't understand what collect() has to do with it.

By default, from_disk will not pull the full data from disk into memory, so by doing this there's a good chance you'll end up with a Content object that has an empty data attribute.

No, in that case it's a SkippedContent.

Then you're generating SkippedContents for all files that are supposed to be lazy-loaded from disk. That's even more wrong than my original expectation.

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

The difference between what and what?

In D2703#64521, @olasd wrote:

That'd be a lot more work downstream, but wouldn't it make sense to have get_data return model objects directly instead of dicts?

They don't exactly return the same thing; from_disk.Content.get_data() also returns perms, name, and path.

That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned Contents and Directorys is sensible (see also the lazy loading comments).

As most users are calling collect() on a directory object, that would remove a large chunk of indirection from loaders.

I don't understand what collect() has to do with it.

collect() is the external API of Directory that's used by loaders to get the lists of objects to load. That function calls get_data() on every node in the tree, which currently converts the data structure to a dict. So you don't have any Content objects on which to call .to_model by the time you've collect()ed, which makes to_model's usefulness on its own quite limited.
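For context, the flow being described looks roughly like this (a sketch assuming collect()'s {object_type: {id: dict}} return shape; the checkout path and variable names are made up):

```
from swh.model import from_disk

directory = from_disk.Directory.from_disk(path=b'/tmp/checkout')

# collect() walks the tree and calls get_data() on every node, so by this
# point everything is already a plain dict grouped by object type: there
# is no from_disk.Content left on which to call .to_model().
objects = directory.collect()
content_dicts = list(objects.get('content', {}).values())
directory_dicts = list(objects.get('directory', {}).values())
```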

In D2703#64548, @olasd wrote:

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

The difference between what and what?

I mean, what's the point of lazy-loading, instead of using the data argument of from_disk?

In D2703#64549, @olasd wrote:

That's really just an abstraction failure/leak in its implementation. These fields are not useful to the callers if the API of the returned Contents and Directorys is sensible (see also the lazy loading comments).

ack

collect() is the external API of Directory that's used by loaders to get the lists of objects to load. That function calls get_data() on every node in the tree, which currently converts the data structure to a dict. So you don't have any Content objects on which to call .to_model by the time you've collect()ed, which makes to_model's usefulness on its own quite limited.

ack

In D2703#64548, @olasd wrote:

We already have lots of issues with loaders having an "optimistic" usage of memory, so I'd be tempted to introduce a DiskBackedContent (DiskContent?) model, inheriting from Content, with a path attribute as well as a lazy data attribute that will read data from disk the first time it's accessed.

What's the difference?

The difference between what and what?

I mean, what's the point of lazy-loading, instead of using the data argument of from_disk?

Ah, right.

Most loaders that are using from_disk will be re-processing files that they've already imported in one way or another: for instance, the svn loader runs it for every commit in turn, and package manager loaders for every version in turn. Lazy-loading allows us to only carry the full data in memory at the point where we're actually sending it to the archive, rather than all the time. In testing, the memory consumption savings were quite substantial.
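A rough sketch of the memory pattern being argued for; every helper here is a hypothetical stand-in, and only the "read the bytes just before sending them to the archive" idea comes from the discussion:

```
def load_all_versions(versions, storage, scan_version, content_objects_of):
    """Hypothetical loader loop illustrating the deferred-read pattern.

    ``storage``, ``scan_version`` and ``content_objects_of`` are stand-ins
    for whatever a real loader uses to talk to the archive, check out one
    version, and enumerate its content-like objects (each carrying a path
    but no data).
    """
    for version in versions:
        tree = scan_version(version)          # hashes files, keeps no bytes
        contents = content_objects_of(tree)

        # Bytes are only read here, right before being sent to the archive,
        # instead of being held in memory for the whole run.
        storage.content_add([c.with_data() for c in contents])
```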

refactor diff to defer loading content data until after creating the content object.

Thanks!

swh/model/tests/test_from_disk.py
84

I don't think the expected content is that useful. The test would be easier to follow with just assert content_with_data.data == b'bar'.

Using explicit contents such as b'written before instantiation', b'written before with_data()', b'written after with_data()' instead of metasyntactic stuff would make the test even clearer.
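Concretely, the suggested test could look something like this (a sketch: the fixture layout and the from_file()/to_model() chain are assumptions, but the final assertion is the one suggested above):

```
import os

from swh.model import from_disk


def test_content_with_data(tmp_path):
    path = tmp_path / 'file'
    path.write_bytes(b'bar')

    # Assumed conversion chain: scan the file, convert to a model object,
    # then load the data from disk on demand.
    content = from_disk.Content.from_file(path=os.fsencode(path)).to_model()
    content_with_data = content.with_data()

    assert content_with_data.data == b'bar'
```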

This revision is now accepted and ready to land. Feb 24 2020, 2:42 PM