Page MenuHomeSoftware Heritage

storage.algos.dir_iterators: Fixes and improvements
ClosedPublic

Authored by anlambert on Sep 26 2018, 3:01 PM.

Details

Reviewers
olasd
seirl
Group Reviewers
Reviewers
Summary

Some fixes and improvements regarding recursive directory iteration
implemented in the storage.algos.dir_iterators sub-module.

  • _empty_dir_hash variable did not have correct type (str instead of bytes) so empty directory test based on hash comparison was always failing
  • in the step method of DirectoryIterator, no need to push a new frame for an empty directory as this will stop the iteration
  • add Python iterator protocol support in the DirectoryIterator class in order to easily visit in a recursive way any directory stored in the archive.
  • add convenient function dir_iterator wrapping the instantiation of the DirectoryIterator class
  • add tests

Those changes are needed to use client-side iteration in the directory
cooker from the vault.

Related T1177

Diff Detail

Repository
rDSTO Storage manager
Branch
dir_iterators-fixes-and-improvements
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1464
Build 1808: arc lint + arc unit

Event Timeline

I'm not accepting this until the doubt about the __iter__ protocol is resolved, but apart from that this looks good to go.

swh/storage/algos/dir_iterators.py
207–209

Shouldn't this return a new iterator rather than reset the current one? (I honestly have no idea what the proper protocol should be).

swh/storage/tests/algos/test_dir_iterator.py
27

Ideally we'd use the constants from swh.model.from_disk.DentryPerms but that's really cosmetic.

swh/storage/algos/dir_iterators.py
207–209

Quoting the Python doc:

object.__iter__(self)

   This method is called when an iterator is required for a container. This method should return a new iterator object that can iterate over all the objects in the container. For mappings, it should iterate over the keys of the container.

So effectively, returning a new instance of DirectoryIterator seems to be the right way here.

swh/storage/tests/algos/test_dir_iterator.py
27

Good to know, I will update the diff accordingly

swh/storage/algos/dir_iterators.py
78

It's kind of bad here because you have to do an additional N writes for no reason. I think you can use reversed() (it's an iterator, so it'll keep the list in the same order but iterate from the end)

Update diff acording to comments:

  • iter() method must return a new iterator instance
  • use swh.model.from_disk.DPerms enum
swh/storage/algos/dir_iterators.py
78

In fact, the sorted function as an optional 'reverse' parameter, quoting the doc:

reverse is a boolean value. If set to True, then the list elements are sorted as if each comparison were reversed.

I should have use it in the first place, will update diff accordingly

Another diff update: do not reverse a list of dir entries after sorting it,
simply sort it in descending order.

This revision is now accepted and ready to land.Sep 28 2018, 2:25 PM

Landed in 9adaed700be3 and 2eced78da85d (commit ids changed due to rebase)