Page MenuHomeSoftware Heritage

Revisited history graph implementation
ClosedPublic

Authored by aeviso on Aug 9 2021, 3:35 PM.

Details

Summary

A new class was added to properly represent a history graph (a tree of
HistoryNodes was used until now), and the origin-revision layer algorithm
was revisited to avoid processing with a merged path in the graph.

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6071 (id=21982)

Rebasing onto 3431de8f49...

Current branch diff-target is up to date.
Changes applied before test
commit b3e51bcca468d8970d4befdb77a82b370ec23a3f
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Aug 9 15:32:10 2021 +0200

    Revisited history graph implementation
    
    A new class was added to properly represent a history graph (a tree of
    `HistoryNode`s was used until now), and the origin-revision layer algorithm
    was revisited to avoid processing with a merged path in the graph.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/291/ for more details.

aeviso requested review of this revision.Aug 9 2021, 3:37 PM

A few remarks:

  • please follow the git commit message good practices, especially:
    • Use the imperative mood in the subject line (and in the whole commit message also IMHO)
    • Use the body to explain what and why vs. how
  • the use of newly introduced as_dict() methods seems unrelated here; unless I'm mistaken, the purpose if this change is better assertion reports by pytest on failure; if so, it should be presented as this in a dedicated revision
  • the refactoring of the test (killing the history_graph_from_dict function) would preferably be in a dedicated revision also (prior to the main refactoring), even if the refactoring of the graph creation code would also need a minor modification in this test.

more importantly, I don't see the purpose of this refactoring. I don't mean it's unneeded, but as it is presented and described, I don't see why it's needed. What feature/function does the introduction of the new class brings?

And I don't understand the part of the commit message about merges. If I get it right, this seems an improvement (prevent from doing things twice), then it should be in a dedicated revision with only this improvement and the test that shows it works as desired.

Thanks

swh/provenance/graph.py
34–38

I'd rather remove them until they are needed: this just adds complexity for (IMHO) no benefit.

67–68

makes no sense to use dict.setdefault(k) in an if k in dict body. Simply use the self._graph[current] = set() form here.

80–87

Not sure I see the point of using "pseudo-private members' here and expose as is a properties. Why not rename them as (resp.) self.head and self.parents and get rid of these properties?

This revision now requires changes to proceed.Aug 11 2021, 12:37 PM
  • the use of newly introduced as_dict() methods seems unrelated here; unless I'm mistaken, the purpose if this change is better assertion reports by pytest on failure; if so, it should be presented as this in a dedicated revision

This method is only used for test purposes but it doesn't make sense without the refactoring (the complete HistoryGraph class was not even present prior to the refactoring),

  • the refactoring of the test (killing the history_graph_from_dict function) would preferably be in a dedicated revision also (prior to the main refactoring), even if the refactoring of the graph creation code would also need a minor modification in this test.

I don't agree. Killing history_graph_from_dict was actually a consequence of the refactoring, not the other way around.

more importantly, I don't see the purpose of this refactoring. I don't mean it's unneeded, but as it is presented and described, I don't see why it's needed. What feature/function does the introduction of the new class brings?

As stated in the commit message, the main reason behind this refactoring is to avoid processing (and generating) duplicated branches in the history graph due the a bad structure choice (it used to be a tree).

And I don't understand the part of the commit message about merges. If I get it right, this seems an improvement (prevent from doing things twice), then it should be in a dedicated revision with only this improvement and the test that shows it works as desired.

This is precisely the revision with the improvement and the test that shows it works as desired. I honestly don't understand the review.

swh/provenance/graph.py
34–38

This is something I need to discuss with Guillaume about how the origin-revision algorithm should work. Current version is just a simplification of what we have discussed because I found out an error in the logic and it was not working. I rather keep this here until this is decided.

aeviso marked an inline comment as done.

rebase

Build is green

Patch application report for D6071 (id=21999)

Rebasing onto 3431de8f49...

Current branch diff-target is up to date.
Changes applied before test
commit 69268268666835cbee176ce67988b1d48004c2d1
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Aug 9 15:32:10 2021 +0200

    Revisit history graph implementation
    
    A new class was added to properly represent a history graph (a tree of
    `HistoryNode`s was used until now), and the origin-revision layer algorithm
    was revisited to avoid processing with a merged path in the graph.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/292/ for more details.

  • the use of newly introduced as_dict() methods seems unrelated here; unless I'm mistaken, the purpose if this change is better assertion reports by pytest on failure; if so, it should be presented as this in a dedicated revision

This method is only used for test purposes but it doesn't make sense without the refactoring (the complete HistoryGraph class was not even present prior to the refactoring),

the fact that this method is only used for tests is exactly my point here. But even without HistoryGraph, it makes sense to have this method on the *former* HistoryNode class and use it in the tests, instead of using the __eq__ operator, to get better assertion reports from pytest in case of failure. pytest will give you by default a much more detailed report of what differs between 2 dicts than using this __eq__ operator (in which case pytest will only report that these 2 objects are not the same, because when the assert expected_node == computed_node failed, it cannot get insights on why these 2 are not the same.
Note that it's possible to add hooks in pytest to make better reports, but just comparing dicts instead of objects is much simpler. If you are interested, see:

https://docs.pytest.org/en/6.2.x/assert.html#defining-your-own-explanation-for-failed-assertions

My point is that this part of the refactoring is somewhat unrelated with the refactoring. I mean conceptually, even if you did made it because yo needed it for your refactoring. You could have very well implemented the __eq__ for the new class and used it in the tests.

  • the refactoring of the test (killing the history_graph_from_dict function) would preferably be in a dedicated revision also (prior to the main refactoring), even if the refactoring of the graph creation code would also need a minor modification in this test.

I don't agree. Killing history_graph_from_dict was actually a consequence of the refactoring, not the other way around.

This is IMHO the other side of the same medal than the previous point. This is (again conceptually) part of the refactoring of the tests to have better assertion reports (i.e. comparing dicts instead of custom equality operators).

If this test refactoring is done prior to the "main" refactoring, then it makes this later easier to do. It's kind of a preparation step.

more importantly, I don't see the purpose of this refactoring. I don't mean it's unneeded, but as it is presented and described, I don't see why it's needed. What feature/function does the introduction of the new class brings?

As stated in the commit message, the main reason behind this refactoring is to avoid processing (and generating) duplicated branches in the history graph due the a bad structure choice (it used to be a tree).

The explanation was not clear to me, so I sensed this, but was a bit confused. So at the very least, a better commit message would help.

Anyway, the HistoryNode now only holds the Revision (self.entry), and since the 2 others flags is_head and in_history are actually unused, the whole object could be removed to keep only the HistoryGraph class.
Basically, you replace a recursive structure (HistoryNode with its parents list) by a flat one (the graph dict in the HistoryGraph class).
You could very well have keep the recursive approach. The fact that it was actually a tree instead of a graph is IMHO in fact a bug: if a Revision is a parent of 2 other revisions, the same (old) HistoryNode should be in both parent lists, it should be 2 copies of the same object (if I get this right).

So it looks to me that this refactoring is actually replacing a recursive structure by a flat one to store the graph to fix a bug in the recursive implementation.

It may very well make sense to prefer the flat structure rather than the recursive one, but it is not *required* to fix the bug.

And I don't understand the part of the commit message about merges. If I get it right, this seems an improvement (prevent from doing things twice), then it should be in a dedicated revision with only this improvement and the test that shows it works as desired.

This is precisely the revision with the improvement and the test that shows it works as desired.

I get this now.
But if I'm right in my claims just above (this is actually a bug fix of the former recursive data model), then I'd like to easily see it explained as is and, more importantly, have a clear view of how this bug fix is checked in tests. The size of the diffstat of the yaml data file makes it impossible (to me) to *see* that the bug fix part of the refactoring is checked.

I honestly don't understand the review.

I am very sorry about that.

That's why I am insisting on the "why" of a commit message. This whole refactoring is in fact a bug fix. It should be advertised as is.
When I see a Diff that is a bugfix, I expect it to be as minimal as possible and provide an easy to read test for the fix.

My claim here is that the refactoring (new data structure) is not needed for this fix. It can be a better data model, but it's not inherently related to the fix.

So I won't annoy you any more on this, hopefully I've made myself clear on why I did this review the way I did.

So let me accept it, but I'd be delighted to see a better commit message at very least.

This revision is now accepted and ready to land.Aug 12 2021, 10:04 AM

Build is green

Patch application report for D6071 (id=22031)

Rebasing onto 3431de8f49...

Current branch diff-target is up to date.
Changes applied before test
commit b0951e2e2199afa6bc3bd72a7c7f2c2e6bb58cb7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Aug 9 15:32:10 2021 +0200

    Revisit history graph implementation
    
    A new class was added to properly represent a history graph (a tree of
    `HistoryNode`s was used until now), and the origin-revision layer algorithm
    was revisited to avoid processing with a merged path in the graph.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/293/ for more details.

This revision was automatically updated to reflect the committed changes.