Page MenuHomeSoftware Heritage

Implement efficient comparison of revision trees
ClosedPublic

Authored by anlambert on Feb 1 2018, 11:28 AM.

Details

Reviewers
olasd
Group Reviewers
Reviewers
Summary

This diff adds implementation of an efficient algorithm for comparing two
revision trees in order to compute the list of introduced file changes.
The algorithm detects the insertion / deletion / modification of files and can also
track their renaming if requested.

That algorithm can be found in the diff module, located in the new
namespace swh.storage.algos.

That feature is needed to enrich the revision view in the browse application
in swh-web. Besides giving the list of changes for a particular revision, it
will allow to compute the diffs for each changed file and display them as in
every software forge web interface.

Regarding the location of that new utility module, I put it in a new
namespace swh.storage.algos. I am not sure if it is the adequate solution,
maybe it could be put in a new sub-project swh-algos, centralizing high level
algorithms and usefull operations that can be applied to the swh objects
contained in the archive.

Related T921

Diff Detail

Repository
rDSTO Storage manager
Branch
revision_diff
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1180
Build 1524: arc lint + arc unit

Event Timeline

Small update compared to my first Diff:

  • rename swh.storage.utils.diff_revisions module into swh.storage.utils.diff
  • add function diff_directories in diff module
  • fix a couple of typos

LGTM.

Regarding the placement of the code, the (ideal) data structure you need for this is a lazy version of the DAG, because you need to only fetch the nodes that are different, and the only module that can do that for you is indeed swh.storage.
So I'm fine having this in swh.storage. I'm not sure I like "utils" (which sounds low-level), maybe "operations" or "algo" (which sound higher-level, like this functionality actually is) would be more appropriate here? I've no strong opinion either...
Either way, I'd like to uniform with what @seirl has done for toposort in swh.model. So whatever name we pick here, toposort in swh.model should go under swh.model.THATNAME.toposort .

On an unrelated note, isn't the Directory iterator you've implemented here something reusable for other needs?
If so, it might deserve its own model and to drop the leading underscore.

swh/storage/utils/diff.py
8 ↗(On Diff #991)

You should probably say "inspired from" here, as AFAICT you didn't reuse any code, but simply reused the same approach.

Regarding the placement of the code, the (ideal) data structure you need for this is a lazy version of the DAG, because you need to only fetch the nodes that are different, and the only module that can do that for you is indeed swh.storage.
So I'm fine having this in swh.storage.

From my point of view, this could have been implemented outside the swh.storage namespace (put in a swh.algos one for instance)
but for commodity of use and easy setup, it will stay in storage at the moment.

I'm not sure I like "utils" (which sounds low-level), maybe "operations" or "algo" (which sound higher-level, like this functionality actually is) would be more appropriate here? I've no strong opinion either...
Either way, I'd like to uniform with what @seirl has done for toposort in swh.model. So whatever name we pick here, toposort in swh.model should go under swh.model.THATNAME.toposort .

I would go for "swh.storage.algos" as it is concise and self explanatory.

On an unrelated note, isn't the Directory iterator you've implemented here something reusable for other needs?
If so, it might deserve its own model and to drop the leading underscore.

Sure, this could be reused. I could move the implemented iterators in a "swh.storage.algos.dir_iterators"
for instance.

Update adressing @zack comments:

  • rename swh.storage.utils to swh.storage.algos
  • move directory iterators to swh.storage.algos.dir_iterators

Also improve documentation and turns some methods private.

I feel like this looks a lot like the diff tree algorithm that's already here: https://forge.softwareheritage.org/source/swh-vault/browse/master/swh/vault/cookers/revision_gitfast.py;bca66197586f31c10042d43897717d7aec90e797$21 but I don't think anyone cares enough to look at how we could merge those two algorithms now, so I guess we can just have both separately :-P

I think that the swh.storage.Storage API should expose at least the last two, and probably all three of the following operations:

  • diff two directories (basic building block of this API)
  • diff the directories for two arbitrary revisions
  • get the diffs of a revision with (all) its parents

The last function is a superset of the current exposed API. In the case of a commit with a single parent, it would return the single diff representation. In case of a merge commit with several parents, it would give the diff to each parent, which would then allow frontends to show a git-like Combined Diff (https://git-scm.com/docs/diff-format#_combined_diff_format)

New update adressing @olasd comments:

  • add methods diff_directories and diff_revisions to the storage API for comparing two arbitrary directories/revisions
  • regarding the diffs of a revision and all its parents, I think the current API state is sufficient to easily implement it from client code. As we do not have an urgent need to compute combined diffs, I preferred not to integrate that feature in the API.
This revision is now accepted and ready to land.Feb 20 2018, 11:13 AM