
from_disk: convert on-disk data to Software Heritage archive objects
Closed, Public

Authored by olasd on Sep 19 2017, 4:13 PM.

Details

Summary

This module is a reimplementation of swh.model.git, with the underlying goal of
replacing it and fixing T709 in the process.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

It looks great to me. I just added some remarks in places where it wasn't immediately obvious what the code was doing, plus some other little nitpicks.

swh/model/from_disk.py
124 ↗(On Diff #818)

Why do you use chunks=chunks here? Shouldn't you already have it in the closure?

171 ↗(On Diff #818)

Not important, but have you considered using pathlib here? It generally makes the code more readable when there's some os.path.* clutter. (Anyway, probably not worth rewriting it).

swh/model/tests/test_from_disk.py
122–126 ↗(On Diff #818)

With pathlib:

for filename, content in self.contents.items():
    path = directory / filename
    path.write_bytes(content['data'])
    path.chmod(content['mode'])

:-P

145–152 ↗(On Diff #818)

I get why you need to pop() here, but it's not immediately obvious, plus popping from self.contents prevents people from writing other tests in this class. Wouldn't you rather do something like:

for k in ('x', 'y', 'z'): # keys you want to check
    self.assertEqual(conv_content[k], content[k])

and use get() everywhere?

If you don't think that matters, maybe just a comment to understand what the last line is comparing exactly would be helpful.

166–169 ↗(On Diff #818)

ditto

180 ↗(On Diff #818)

I'm not sure I understand why it's needed to run all the tests both while checking and not checking that data matches. Maybe you could add a comment?

194 ↗(On Diff #818)

same as before, and here it would allow you to remove that copy.

I really like this module, the API is very clean and what I'd like to use.

My only nit is using symbolic constants for perms (see comments).

swh/model/from_disk.py
15–21 ↗(On Diff #819)

Notwithstanding the dictionary, how about defining constants and/or enums for the various mode here?

They will come in handy in a couple of places noted below in separate comments.

34–37 ↗(On Diff #819)

If we have symbolic names for these permissions, we can mention them in this docstring, relieving client code from the obligation of using magic numbers.
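Such symbolic names could take the form of an IntEnum over git's entry modes. The following is an illustrative sketch (the class and member names are assumptions, not necessarily what the module ends up defining); the octal values themselves are git's standard directory-entry modes:

```python
from enum import IntEnum

class DentryPerms(IntEnum):
    """Git-compatible modes for directory entries (illustrative sketch)."""
    content = 0o100644             # regular file
    executable_content = 0o100755  # executable file
    symlink = 0o120000             # symbolic link
    directory = 0o040000           # sub-directory

# Client code can then avoid magic numbers:
mode = DentryPerms.content
assert mode == 0o100644
```

Since IntEnum members compare equal to plain ints, existing callers passing raw modes keep working.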

swh/model/tests/test_from_disk.py
22–37 ↗(On Diff #819)

symbolic constants will allow this test case and a couple of other ones below to be a bit more readable (at least in most cases)

sounds good.

swh/model/from_disk.py
245 ↗(On Diff #819)

For information, local tests fail when patching with this diff (the automated build does not reflect this), probably due to this current limitation:

$ arc patch D248 && make test
...
======================================================================
FAIL: test_directory_to_objects_ignore_empty (model.tests.test_from_disk.DirectoryToObjects)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/tests/test_from_disk.py", line 256, in test_directory_to_objects_ignore_empty
    self.assertEqual(len(objs['directory']), 4)
AssertionError: 6 != 4

======================================================================
FAIL: test_directory_to_objects_ignore_name (model.tests.test_from_disk.DirectoryToObjects)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/tests/test_from_disk.py", line 271, in test_directory_to_objects_ignore_name
    self.assertEqual(len(objs['directory']), 5)
AssertionError: 6 != 5

----------------------------------------------------------------------
Ran 94 tests in 0.102s

FAILED (failures=2)
../Makefile.python:17: recipe for target 'test' failed
make: *** [test] Error 1
swh/model/tests/test_from_disk.py
101 ↗(On Diff #819)

duplicated key

122–126 ↗(On Diff #818)

it is clearer ;)

Use a Merkle tree structure to model on-disk data (yes tests are broken)

Use an enum for Permissions

I've updated the approach after looking at some of the call sites for swh.model.git:

  • swh.loader.dir: only uses compute_hashes_from_directory -> Directory.from_disk
  • swh.loader.svn:
    • uses compute_revision_sha1_git -> identifiers.revision_identifier
    • uses compute_blob_metadata -> Content.from_file
    • however, ra.SWHDirEditor uses the guts of swh.model.git to minimize updates. It should be reasonably easy to port to the Merkle data structure now.
swh/model/from_disk.py
124 ↗(On Diff #818)

I like being explicit when using Python's lexical scoping because I hate it, but yes, it's not needed.
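The distinction being discussed can be shown with a tiny (hypothetical) example: passing `chunks=chunks` as a default argument binds the current value at definition time, while a plain closure looks the name up at call time, which bites when the name is a loop variable:

```python
def make_readers():
    readers = []
    for chunks in ([1], [2]):
        # Explicit default-argument binding: captures the value of
        # `chunks` at function definition time.
        def read_bound(chunks=chunks):
            return chunks

        # Plain closure: `chunks` is looked up when called, so every
        # closure created in the loop sees the final value.
        def read_closed():
            return chunks

        readers.append((read_bound, read_closed))
    return readers

(b1, c1), (b2, c2) = make_readers()
assert b1() == [1] and b2() == [2]  # bound copies differ per iteration
assert c1() == [2] and c2() == [2]  # closures both see the last value
```

When the function is defined and called within a single iteration (as here), the closure alone is enough, which is why the explicit binding is "not needed".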

171 ↗(On Diff #818)

So, I've investigated pathlib, and it really, really, really wants you to use unicode (with os.fsdecode and the associated gory surrogateescape handling), whereas we use bytes everywhere. I'm afraid it doesn't map to what we do.

Furthermore, even if we shoehorned pathlib in everywhere, it would break down when we use os.readlink, as the os functions were only updated to accept path objects in Python 3.6 (sigh).
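The bytes-vs-unicode friction can be seen in a minimal round-trip through os.fsdecode/os.fsencode (assuming a POSIX system with a UTF-8 filesystem encoding):

```python
import os

# A filename that is valid on a POSIX filesystem but not valid UTF-8
# (e.g. a latin-1 encoded name).
raw = b'caf\xe9'

# os.fsdecode maps the undecodable byte to a lone surrogate
# (the surrogateescape error handler).
name = os.fsdecode(raw)

# The round-trip back to bytes is lossless...
assert os.fsencode(name) == raw

# ...but the intermediate str carries surrogates that most text
# processing (printing, re-encoding as strict UTF-8) will choke on.
```

This is why an all-bytes path representation is simpler when filenames are not guaranteed to be valid UTF-8.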

245 ↗(On Diff #819)

Yeah, the tests were voluntarily broken (they still are, currently), as I hadn't yet updated them for the new classmethods.

olasd marked an inline comment as done.

Update tests for new from_disk API

Remove useless 'name' and 'type' attributes from Content

This sounds quite fine.

For information, I've tested the loader-dir/tar migration locally with this, using an adapter like the following code in swh.loader.dir.loader:

Python
   ...
        from swh.model.from_disk import Content, Directory
        directory = Directory.from_disk(path=dir_path)

        def all_entries_from(directory):
            """Retrieve recursively all entries from the top directory.

            """
            files = []
            dirs = []
            for name, child in directory.items():
                if isinstance(child, Content):
                    files.append(directory.child_to_directory_entry(
                        name, child))
                    continue
                elif isinstance(child, Directory):
                    dirs.append(directory.child_to_directory_entry(
                        name, child))
                    subfiles, subdirs = all_entries_from(child)
                    files.extend(subfiles)
                    dirs.extend(subdirs)
                else:
                    raise ValueError('Unknown child')

            return files, dirs

        files, dirs = all_entries_from(directory)

        tree_hash = directory.hash
        full_rev = _revision_from(tree_hash, revision)

        objects = {
            BLOB: files,
            TREE: dirs,
            COMM: [full_rev],
            RELE: []
        }

  ...

prior version:

Python
       objects_per_path = git.compute_hashes_from_directory(dir_path)

       tree_hash = objects_per_path[dir_path]['checksums']['sha1_git']
       full_rev = _revision_from(tree_hash, revision)

       objects = {
           GitType.BLOB: list(
               git.objects_per_type(GitType.BLOB, objects_per_path)),
           GitType.TREE: list(
               git.objects_per_type(GitType.TREE, objects_per_path)),
           GitType.COMM: [full_rev],
           GitType.RELE: []
       }

And the loader-dir/tar tests were fine after that.

In D248#5132, @ardumont wrote:

This sounds quite fine.

[…]

And the loader-dir/tar tests were fine after that.

You should use the collect method on the root directory to generate all the objects recursively; you can then use the hash method to get the identifier for the toplevel directory (to be passed to the revision) and you should be golden.

Clearly this needs to be better documented :)
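The collect/hash pattern described above can be illustrated with a self-contained Merkle-node sketch (a hypothetical `Node` class, not the actual swh.model.merkle API):

```python
import hashlib

class Node:
    """Minimal Merkle node sketch: a node's id covers its own data and,
    recursively, the ids of all its children."""

    def __init__(self, data=b''):
        self.data = data
        self.children = {}

    @property
    def hash(self):
        h = hashlib.sha1(self.data)
        for name in sorted(self.children):  # deterministic child order
            h.update(self.children[name].hash)
        return h.digest()

    def collect(self):
        """Return this node and all of its descendants, recursively."""
        objects = [self]
        for child in self.children.values():
            objects.extend(child.collect())
        return objects

root = Node(b'root')
root.children[b'a'] = Node(b'file a')
root.children[b'b'] = Node(b'file b')

all_objects = root.collect()  # the whole tree, in one call
top_hash = root.hash          # id of the toplevel directory
assert len(all_objects) == 3
```

Compared to the hand-rolled `all_entries_from` walk in the adapter, a single `collect()` on the root replaces the recursion, and `hash` on the root replaces digging the tree id out of a per-path dict.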

In D248#5133, @olasd wrote:
In D248#5132, @ardumont wrote:

This sounds quite fine.

[…]

And the loader-dir/tar tests were fine after that.

You should use the collect method on the root directory to generate all the objects recursively; you can then use the hash method to get the identifier for the toplevel directory (to be passed to the revision) and you should be golden.

way better!

Clearly this needs to be better documented :)

Oh, well, it is documented, I just did not make the right connections there.
At the time I read typeA, typeB, it did not yet make sense to me...
Now it does.

Thanks.

I've re-looked at this after the switch to the Merkle AST, and it looks good to me.

Only a question/comment: it'd be nice to have, in test_from_disk.py, a unit test checking that the resulting checksums of 1) a content and 2) a directory match what we can obtain by hand using something else (e.g., git itself, or the current implementation before the switch to from_disk.py). Unless I'm missing something, there are no such tests ATM. (Possible mismatches should be caught by unit tests of the loaders that use from_disk, but having them in the inner circle would be nice too.)
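For the content half of such a test, the reference value is easy to compute independently, since git hashes a blob as sha1 over a `blob <size>\0` header followed by the data (sketch independent of swh.model; the helper name is made up):

```python
import hashlib

def git_blob_sha1(data):
    """Compute a blob id the way git does:
    sha1 of b'blob <size>\\0' + data."""
    header = b'blob %d\x00' % len(data)
    return hashlib.sha1(header + data).hexdigest()

# git's well-known id for the empty blob:
assert git_blob_sha1(b'') == 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
```

A test could then assert that the checksum produced by from_disk for a file's content equals `git_blob_sha1` of its bytes (the directory case needs the tree serialization as well).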

swh/model/merkle.py
27–33

I couldn't find where this Sphinx syntax for source code examples is documented. I'm sure you've tried it and hence that it works, but if anyone has pointers to where this is documented I'd love to note it down.

olasd marked an inline comment as done. Oct 2 2017, 5:30 PM
In D248#5138, @zack wrote:

Only a question/comment: it'd be nice to have in test_from_disk.py a unit test that the resulting checksums of 1) a content, and 2) a directory match what we can obtain by hand using something else (e.g., git itself, or the current implementation before switching to from_disk.py). Unless I'm missing something, there are no such tests ATM. (Possible mismatches there should be caught by unit tests of loaders that use from_disk, but having them in the inner circle would be nice too.)

I'll pull the test logic from swh.model.git which tries to do that. I'm not too excited about the dependency on swh-storage-testdata but apart from that it's sensible to do.

swh/model/merkle.py
27–33

It's the doctest syntax, which mimics an interactive python session:

https://docs.python.org/3.6/library/doctest.html

I've updated swh-environment to run these.
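For reference, the doctest syntax embeds an interactive-session transcript in a docstring, and the doctest module checks that the calls produce the shown output (toy `add` function for illustration):

```python
import doctest

def add(a, b):
    """Return the sum of a and b.

    The indented lines below use the doctest syntax: they mimic an
    interactive Python session, are rendered as code examples by
    Sphinx, and are checked by the doctest module.

    >>> add(1, 2)
    3
    >>> add(-1, 1)
    0
    """
    return a + b

# Find and run the examples embedded in add()'s docstring.
runner = doctest.DocTestRunner(verbose=False)
for test in doctest.DocTestFinder().find(add):
    runner.run(test)
assert runner.failures == 0
```

Running a whole module's doctests is usually done with `doctest.testmod()` or `python -m doctest file.py`, which is presumably what the swh-environment hook wires up.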

olasd marked an inline comment as done.

Cleaner tests

olasd marked 8 inline comments as done.

Rebase from master

Big round of test updates for from_disk: cleaned up the equality tests to address @seirl's comments.

Add some toplevel documentation to make it more obvious how to use this.

Merge expected tomorrow.

This revision was automatically updated to reflect the committed changes.