
from_disk: convert on-disk data to Software Heritage archive objects
Closed, Public

Authored by olasd on Sep 19 2017, 4:13 PM.

Details

Summary

This module is a reimplementation of swh.model.git, with the underlying goal of
replacing it and fixing T709 in the process.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

It looks great to me. I just added some remarks in places where it wasn't immediately obvious what the code was doing, plus some other little nitpicks.

swh/model/from_disk.py
124 ↗(On Diff #818)

Why do you use chunks=chunks here? Shouldn't you already have it in the closure?

171 ↗(On Diff #818)

Not important, but have you considered using pathlib here? It generally makes the code more readable when there's some os.path.* clutter. (Anyway, probably not worth rewriting it).

swh/model/tests/test_from_disk.py
122–126 ↗(On Diff #818)

With pathlib:

for filename, content in self.contents.items():
    path = directory / filename
    path.write_bytes(content['data'])
    path.chmod(content['mode'])

:-P

145–152 ↗(On Diff #818)

I get why you need to pop() here, but it's not immediately obvious, plus popping from self.contents prevents people from writing other tests in this class. Wouldn't you rather do something like:

for k in ('x', 'y', 'z'): # keys you want to check
    self.assertEqual(conv_content[k], content[k])

and use get() everywhere?

If you don't think that matters, maybe just a comment to understand what the last line is comparing exactly would be helpful.

166–169 ↗(On Diff #818)

ditto

180 ↗(On Diff #818)

I'm not sure I understand why it's needed to run all the tests both while checking and not checking that data matches. Maybe you could add a comment?

194 ↗(On Diff #818)

same as before, and here it would allow you to remove that copy.

I really like this module, the API is very clean and what I'd like to use.

My only nit is using symbolic constants for perms (see comments).

swh/model/from_disk.py
15–21 ↗(On Diff #819)

Notwithstanding the dictionary, how about defining constants and/or enums for the various mode here?

They will come in handy in a couple of places noted below in separate comments.

34–37 ↗(On Diff #819)

If we have symbolic names for these permissions, we can mention them in this docstring, relieving client code from the obligation of using magic numbers.
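Such symbolic names could take the form of an IntEnum over git's entry modes. The following is an illustrative sketch (the class and member names are assumptions, not necessarily what the module ends up defining); the octal values themselves are git's standard directory-entry modes:

```python
from enum import IntEnum

class DentryPerms(IntEnum):
    """Git-compatible modes for directory entries (illustrative sketch)."""
    content = 0o100644             # regular file
    executable_content = 0o100755  # executable file
    symlink = 0o120000             # symbolic link
    directory = 0o040000           # sub-directory

# Client code can then avoid magic numbers:
mode = DentryPerms.content
assert mode == 0o100644
```

Since IntEnum members compare equal to plain ints, existing callers passing raw modes keep working.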

swh/model/tests/test_from_disk.py
22–37 ↗(On Diff #819)

symbolic constants will allow this test case and a couple of other ones below to be a bit more readable (at least in most cases)

sounds good.

swh/model/from_disk.py
245 ↗(On Diff #819)

For information, local tests fail when patching with this diff (the automated build does not reflect this), probably due to this current limitation:

$ arc patch D248 && make test
...
======================================================================
FAIL: test_directory_to_objects_ignore_empty (model.tests.test_from_disk.DirectoryToObjects)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/tests/test_from_disk.py", line 256, in test_directory_to_objects_ignore_empty
    self.assertEqual(len(objs['directory']), 4)
AssertionError: 6 != 4

======================================================================
FAIL: test_directory_to_objects_ignore_name (model.tests.test_from_disk.DirectoryToObjects)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/tests/test_from_disk.py", line 271, in test_directory_to_objects_ignore_name
    self.assertEqual(len(objs['directory']), 5)
AssertionError: 6 != 5

----------------------------------------------------------------------
Ran 94 tests in 0.102s

FAILED (failures=2)
../Makefile.python:17: recipe for target 'test' failed
make: *** [test] Error 1
swh/model/tests/test_from_disk.py
101 ↗(On Diff #819)

duplicated key

122–126 ↗(On Diff #818)

it is clearer ;)

Use a Merkle tree structure to model on-disk data (yes tests are broken)

Use an enum for Permissions

I've updated the approach after looking at some of the call sites for swh.model.git:

  • swh.loader.dir: only uses compute_hashes_from_directory -> Directory.from_disk
  • swh.loader.svn:
    • uses compute_revision_sha1_git -> identifiers.revision_identifier
    • uses compute_blob_metadata -> Content.from_file
    • however, ra.SWHDirEditor uses the guts of swh.model.git to minimize updates. It should be reasonably easy to port to the Merkle data structure now.
swh/model/from_disk.py
124 ↗(On Diff #818)

I like being explicit when using Python's lexical scoping because I hate it, but yes, it's not needed.
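The distinction being discussed can be shown with a tiny (hypothetical) example: passing `chunks=chunks` as a default argument binds the current value at definition time, while a plain closure looks the name up at call time, which bites when the name is a loop variable:

```python
def make_readers():
    readers = []
    for chunks in ([1], [2]):
        # Explicit default-argument binding: captures the value of
        # `chunks` at function definition time.
        def read_bound(chunks=chunks):
            return chunks

        # Plain closure: `chunks` is looked up when called, so every
        # closure created in the loop sees the final value.
        def read_closed():
            return chunks

        readers.append((read_bound, read_closed))
    return readers

(b1, c1), (b2, c2) = make_readers()
assert b1() == [1] and b2() == [2]  # bound copies differ per iteration
assert c1() == [2] and c2() == [2]  # closures both see the last value
```

When the function is defined and called within a single iteration (as here), the closure alone is enough, which is why the explicit binding is "not needed".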

171 ↗(On Diff #818)

So, I've investigated pathlib, and it really, really, really wants you to use unicode (with os.fsdecode and the associated gory surrogateescape handling), whereas we use bytes everywhere. I'm afraid it doesn't map to what we do.

Furthermore, even if we shoehorned pathlib in everywhere, it would break down when we use os.readlink, as the os functions were only updated to accept path objects in Python 3.6 (sigh).
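The bytes-vs-unicode friction can be seen in a minimal round-trip through os.fsdecode/os.fsencode (assuming a POSIX system with a UTF-8 filesystem encoding):

```python
import os

# A filename that is valid on a POSIX filesystem but not valid UTF-8
# (e.g. a latin-1 encoded name).
raw = b'caf\xe9'

# os.fsdecode maps the undecodable byte to a lone surrogate
# (the surrogateescape error handler).
name = os.fsdecode(raw)

# The round-trip back to bytes is lossless...
assert os.fsencode(name) == raw

# ...but the intermediate str carries surrogates that most text
# processing (printing, re-encoding as strict UTF-8) will choke on.
```

This is why an all-bytes path representation is simpler when filenames are not guaranteed to be valid UTF-8.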

245 ↗(On Diff #819)

Yeah, the tests were voluntarily broken (they still are, currently), as I hadn't yet updated them for the new classmethods.

olasd marked an inline comment as done.

Update tests for new from_disk API

Remove useless 'name' and 'type' attributes from Content

This sounds quite fine.

For information, I've tested the loader-dir/tar migration locally with this, using an adapter like the following code in swh.loader.dir.loader:

Python
   ...
        from swh.model.from_disk import Content, Directory
        directory = Directory.from_disk(path=dir_path)

        def all_entries_from(directory):
            """Retrieve recursively all entries from the top directory.

            """
            files = []
            dirs = []
            for name, child in directory.items():
                if isinstance(child, Content):
                    files.append(directory.child_to_directory_entry(
                        name, child))
                    continue
                elif isinstance(child, Directory):
                    dirs.append(directory.child_to_directory_entry(
                        name, child))
                    subfiles, subdirs = all_entries_from(child)
                    files.extend(subfiles)
                    dirs.extend(subdirs)
                else:
                    raise ValueError('Unknown child')

            return files, dirs

        files, dirs = all_entries_from(directory)

        tree_hash = directory.hash
        full_rev = _revision_from(tree_hash, revision)

        objects = {
            BLOB: files,
            TREE: dirs,
            COMM: [full_rev],
            RELE: []
        }

  ...

prior version:

Python
       objects_per_path = git.compute_hashes_from_directory(dir_path)

       tree_hash = objects_per_path[dir_path]['checksums']['sha1_git']
       full_rev = _revision_from(tree_hash, revision)

       objects = {
           GitType.BLOB: list(
               git.objects_per_type(GitType.BLOB, objects_per_path)),
           GitType.TREE: list(
               git.objects_per_type(GitType.TREE, objects_per_path)),
           GitType.COMM: [full_rev],
           GitType.RELE: []
       }

And the loader-dir/tar tests were fine after that.

In D248#5132, @ardumont wrote:

This sounds quite fine.

[…]

And the loader-dir/tar tests were fine after that.

You should use the collect method on the root directory to generate all the objects recursively; you can then use the hash method to get the identifier for the toplevel directory (to be passed to the revision) and you should be golden.

Clearly this needs to be better documented :)
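The collect/hash pattern described above can be illustrated with a self-contained Merkle-node sketch (a hypothetical `Node` class, not the actual swh.model.merkle API):

```python
import hashlib

class Node:
    """Minimal Merkle node sketch: a node's id covers its own data and,
    recursively, the ids of all its children."""

    def __init__(self, data=b''):
        self.data = data
        self.children = {}

    @property
    def hash(self):
        h = hashlib.sha1(self.data)
        for name in sorted(self.children):  # deterministic child order
            h.update(self.children[name].hash)
        return h.digest()

    def collect(self):
        """Return this node and all of its descendants, recursively."""
        objects = [self]
        for child in self.children.values():
            objects.extend(child.collect())
        return objects

root = Node(b'root')
root.children[b'a'] = Node(b'file a')
root.children[b'b'] = Node(b'file b')

all_objects = root.collect()  # the whole tree, in one call
top_hash = root.hash          # id of the toplevel directory
assert len(all_objects) == 3
```

Compared to the hand-rolled `all_entries_from` walk in the adapter, a single `collect()` on the root replaces the recursion, and `hash` on the root replaces digging the tree id out of a per-path dict.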

In D248#5133, @olasd wrote:
In D248#5132, @ardumont wrote:

This sounds quite fine.

[…]

And the loader-dir/tar tests were fine after that.

You should use the collect method on the root directory to generate all the objects recursively; you can then use the hash method to get the identifier for the toplevel directory (to be passed to the revision) and you should be golden.

way better!

Clearly this needs to be better documented :)

Oh, well, it is documented, I just did not make the right connections there.
At the time I read typeA, typeB, it did not yet make sense to me...
Now it does.

Thanks.

I've re-looked at this after the switch to the Merkle AST, and it looks good to me.

Only a question/comment: it'd be nice to have, in test_from_disk.py, a unit test checking that the resulting checksums of 1) a content and 2) a directory match what we can obtain by hand using something else (e.g., git itself, or the current implementation before the switch to from_disk.py). Unless I'm missing something, there are no such tests ATM. (Possible mismatches should be caught by unit tests of the loaders that use from_disk, but having them in the inner circle would be nice too.)
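For the content half of such a test, the reference value is easy to compute independently, since git hashes a blob as sha1 over a `blob <size>\0` header followed by the data (sketch independent of swh.model; the helper name is made up):

```python
import hashlib

def git_blob_sha1(data):
    """Compute a blob id the way git does:
    sha1 of b'blob <size>\\0' + data."""
    header = b'blob %d\x00' % len(data)
    return hashlib.sha1(header + data).hexdigest()

# git's well-known id for the empty blob:
assert git_blob_sha1(b'') == 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391'
```

A test could then assert that the checksum produced by from_disk for a file's content equals `git_blob_sha1` of its bytes (the directory case needs the tree serialization as well).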

swh/model/merkle.py
27–33

I couldn't find where this Sphinx syntax for source code examples is documented. I'm sure you've tried it and hence that it works, but if anyone has pointers to where this is documented I'd love to note it down.

olasd marked an inline comment as done. Oct 2 2017, 5:30 PM
In D248#5138, @zack wrote:

Only a question/comment: it'd be nice to have in test_from_disk.py a unit test that the resulting checksums of 1) a content, and 2) a directory match what we can obtain by hand using something else (e.g., git itself, or the current implementation before switching to from_disk.py). Unless I'm missing something, there are no such tests ATM. (Possible mismatches there should be caught by unit tests of loaders that use from_disk, but having them in the inner circle would be nice too.)

I'll pull the test logic from swh.model.git which tries to do that. I'm not too excited about the dependency on swh-storage-testdata but apart from that it's sensible to do.

swh/model/merkle.py
27–33

It's the doctest syntax, which mimics an interactive python session:

https://docs.python.org/3.6/library/doctest.html

I've updated swh-environment to run these.
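For reference, the doctest syntax embeds an interactive-session transcript in a docstring, and the doctest module checks that the calls produce the shown output (toy `add` function for illustration):

```python
import doctest

def add(a, b):
    """Return the sum of a and b.

    The indented lines below use the doctest syntax: they mimic an
    interactive Python session, are rendered as code examples by
    Sphinx, and are checked by the doctest module.

    >>> add(1, 2)
    3
    >>> add(-1, 1)
    0
    """
    return a + b

# Find and run the examples embedded in add()'s docstring.
runner = doctest.DocTestRunner(verbose=False)
for test in doctest.DocTestFinder().find(add):
    runner.run(test)
assert runner.failures == 0
```

Running a whole module's doctests is usually done with `doctest.testmod()` or `python -m doctest file.py`, which is presumably what the swh-environment hook wires up.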

olasd marked an inline comment as done.

Cleaner tests

olasd marked 8 inline comments as done.

Rebase from master

Big round of test updates for from_disk: cleaned up the equality tests to address @seirl's comments.

Add some toplevel documentation to make it more obvious how to use this.

Merge expected tomorrow.

This revision was automatically updated to reflect the committed changes.