Page MenuHomeSoftware Heritage

Refactor origin-revision layer
ClosedPublic

Authored by aeviso on Jun 4 2021, 4:04 PM.

Details

Summary

Removed archive parameter from RevisionEntry since is was no longer required.

Refactored OriginEntry to include info about visit date and snapshot.
Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
Also origin_add function was updated accordingly, and CLI command now uses a CSVOriginIterator similar to that previously developed for revisions.
Updated tests as well to ensure nothing was broken during the refactoring.

Diff Detail

Repository
rDPROV Provenance database
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D5822 (id=20828)

Rebasing onto f64a41b6ee...

Current branch diff-target is up to date.
Changes applied before test
commit 6c4f25bb8fdb291d545feecb45cee96b631f1f58
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactored OriginEntry to include info about visit date and snapshot.
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator similar to that previously developed for revisions.
    Updated tests as well to ensure nothing was broken during the refactoring.

commit 1619d26bcd7049a595c9279f0cc158a732637fd4
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Removed archive parameter from RevisionEntry
    since is was no longer required.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/89/ for more details.

aeviso requested review of this revision.Jun 4 2021, 4:06 PM
douardda requested changes to this revision.Jun 4 2021, 5:11 PM
douardda added a subscriber: douardda.

Also please make sure to add a blank line between the "title" and the remaining of the git commit message (in revision 6c4f25bb8fdb "Refactor [...]")

Thanks

swh/provenance/model.py
24 ↗(On Diff #20828)

I'd rather stick to the idea of keeping apart the "retrieve revisions from the archive" from the "iterate on those revisions", put most of this method in a "retrieve_revisions" method and then have a property to access said revisions (and so be consistent with the DirectoryEntry).

As for the DirectoryEntry, it's mostly matter of separation of concerns here.

24 ↗(On Diff #20828)

please also annotate the result of the method here.

53 ↗(On Diff #20828)

i don't uderstand this line. Oh, it's just for building batches of targets... right.
You may want to use swh.core.utils.grouper, like:

for targets in grouper(targets_set, batchsize):
  revisions.update(RevisionEntry(revision.id) 
    for revision in archive.revision_get(targets) if revision is not None)

or so

56 ↗(On Diff #20828)

maybe something like

revisions.update(RevisionEntry(revision.id) 
    for revision in archive.revision_get(targets) if revision is not None)

?

81 ↗(On Diff #20828)

why is this list() needed now? because either archive.revision_get already returns a list (in which case this is not needed) or it returns a generator, then the if revision line above is not what it seems to do...

swh/provenance/origin.py
41 ↗(On Diff #20828)

why not write this as a generator directly in the __iter__ method?
Also I don't think there really is a need for the if date.tzinfo is None part, just use the default_timezone argument of iso8601.parse_date.

e.g.:

def __iter__(self):
  for url, date, snap in self.statuses:
    date = iso8601.parse_date(date, default_timezone=timezone.utc)
    yield OriginEntry(url, date, snap)
50 ↗(On Diff #20828)

there is no need to keep these commented classes. (worst case, we have the git history if one really want to get them back).

This revision now requires changes to proceed.Jun 4 2021, 5:11 PM
swh/provenance/model.py
24 ↗(On Diff #20828)

This iterator follows the same scheme as CSVRevisionIterator that you introduced. I guess this shall be done for both of them to keep code consistent, but that's not the goal of this revision. Maybe we can mark that as a future tasks since I need to continue working on the origin-revision layer algorithm (which is a priority)

53 ↗(On Diff #20828)

Maybe, I don't know how the other module works. This is doing exactly what I wanted it to do.

56 ↗(On Diff #20828)

It is clearer to me this way. What's the advantage of the other syntax?

81 ↗(On Diff #20828)

I don't know why this suddenly started complaining about revision being a generator, hence not possible to index it.

aeviso marked an inline comment as not done.
  • Requested fixes to CSVOriginIterator.

Build is green

Patch application report for D5822 (id=20834)

Rebasing onto 100384fb8f...

First, rewinding head to replay your work on top of it...
Applying: Removed archive parameter from RevisionEntry
Applying: Refactored OriginEntry to include info about visit date and snapshot.
Applying: Requested fixes to CSVOriginIterator.
Changes applied before test
commit 07e59002089eb2226d18398a28253abddcad2905
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:16:32 2021 +0200

    Requested fixes to CSVOriginIterator.

commit 86fc1ca8e6dcadf76d6265237aa2220d6f283913
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactored OriginEntry to include info about visit date and snapshot.
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator similar to that previously developed for revisions.
    Updated tests as well to ensure nothing was broken during the refactoring.

commit 9ac9a7e3ec0fdbb4086472f83ef209f2d10bca57
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Removed archive parameter from RevisionEntry
    since is was no longer required.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/92/ for more details.

swh/provenance/model.py
24 ↗(On Diff #20828)

This iterator follows the same scheme as CSVRevisionIterator that you introduced.

I don't see the link with this OriginEntry object...

56 ↗(On Diff #20828)

it's mostly a matter of taste, but (probably only marginally here) also of performance: it's more efficient to update the set from a genexp than doing a set.add() in a for loop.

81 ↗(On Diff #20828)

it shows there is some typing annotation incorrect or missing, and this needs to be addressed. But I insists, if revision is now a generator, this code will not work, the if revision statement will always be true!

swh/provenance/model.py
24 ↗(On Diff #20828)

I don't see the link with this OriginEntry object...

I mean the fact that CSVOriginIterator is consistent with CSVRevisionIterator is a good thing, but this is not what we are discussing here, since this is the OriginEntry class we are talking about.

  • Added equality check functions to model classes.
  • Merge branch 'master' of ssh://forge.softwareheritage.org/diffusion/222/swh-provenance
  • Added test for isochrone graph topology.

Build is green

Patch application report for D5822 (id=20836)

Rebasing onto 826b3b1041...

First, rewinding head to replay your work on top of it...
Applying: Removed archive parameter from RevisionEntry
Applying: Refactored OriginEntry to include info about visit date and snapshot.
Applying: Requested fixes to CSVOriginIterator.
Applying: Added equality check functions to model classes.
Applying: Added test for isochrone graph topology.
Changes applied before test
commit 6438ae5f6ef4f8782d2ee88fb24fff347b1b269c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 17:09:43 2021 +0200

    Added test for isochrone graph topology.
    
    The expected isochrone graphs for each revision in the test should be
    provided as a dictionary in an associated yaml file.
    Currently onle heuristic lower with depth=1 is being tested.
    
    Also, model clases DirectoryEntry, FileEntry and IsochroneNode were
    modified so that they can be compared by equlity and hashed.

commit 6f3577135bd32667f96731704fb5e082bde93a3c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:45:25 2021 +0200

    Added equality check functions to model classes.

commit 51dbe1fed3470ccaa78fda9bc7a22b474b366548
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:16:32 2021 +0200

    Requested fixes to CSVOriginIterator.

commit 7ffdf03cedd3073e4cd7e40f7456fcafc3f2ac39
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactored OriginEntry to include info about visit date and snapshot.
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator similar to that previously developed for revisions.
    Updated tests as well to ensure nothing was broken during the refactoring.

commit 58e4d02791f85516f0d6b1ef6d95dd5316a88d5b
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Removed archive parameter from RevisionEntry
    since is was no longer required.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/94/ for more details.

please don't include the revisions about testing the isochrone graph (nor the merge, which should not be there at all, do rebase instead) in this diff, it is unrelated and should be in a dedicated diff.

please rework your git history, not including revisions unrelated with this diff, etc.

Also I have several comments/questions I believe are not addressed or answered.

This revision now requires changes to proceed.Jun 9 2021, 11:06 AM

Updated diff with proper git revisions.

Build is green

Patch application report for D5822 (id=20900)

Rebasing onto 6cdd424eba...

Current branch diff-target is up to date.
Changes applied before test
commit 3fe09b908ffb1f7a66e767a0deb82a4a39887b2a
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Mon Jun 7 11:16:32 2021 +0200

    Requested fixes to CSVOriginIterator.

commit 073917e87fbebadcd9a653fd86618f2df7e6b19c
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactored OriginEntry to include info about visit date and snapshot.
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator similar to that previously developed for revisions.
    Updated tests as well to ensure nothing was broken during the refactoring.

commit 14ac342ee62c0411156d0e2328292b3b33fd8040
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Removed archive parameter from RevisionEntry
    since is was no longer required.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/105/ for more details.

Merge requested fixes with actual revision

Build is green

Patch application report for D5822 (id=20901)

Rebasing onto 6cdd424eba...

Current branch diff-target is up to date.
Changes applied before test
commit 134ae3c784326111de4775a9f5da9ac7d9c34cf7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactored OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit 436730ceb8c384141686ee262dc1e87d7f0a5f48
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Removed archive parameter from RevisionEntry
    
    since is was no longer required.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/106/ for more details.

aeviso retitled this revision from Removed archive parameter from RevisionEntry since is was no longer required. to Origin-revision layer refactoring.Jun 10 2021, 12:01 PM
aeviso edited the summary of this revision. (Show Details)
aeviso retitled this revision from Origin-revision layer refactoring to Refactor origin-revision layer.Jun 10 2021, 12:04 PM

Build is green

Patch application report for D5822 (id=20903)

Rebasing onto 6cdd424eba...

Current branch diff-target is up to date.
Changes applied before test
commit 4ebab8d2ce933637c85bf456a796b6da8d12b513
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit 6ea9313800b86e996783f0bf5e37cc8c34f3627e
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/108/ for more details.

Build is green

Patch application report for D5822 (id=20938)

Rebasing onto 075b0d6cd6...

Current branch diff-target is up to date.
Changes applied before test
commit 5a9fb987c9aa169095185b1559a87bce536776b7
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/117/ for more details.

There are several comments that have not been addressed

This revision now requires changes to proceed.Jun 11 2021, 12:26 PM
swh/provenance/model.py
81 ↗(On Diff #20828)

maybe this has to do with the update of the other modules? Because I see no relation between what this diff is aiming to do and this typing issue. The point it that mypy will prevent me from committing anyway. Maybe this should be treated/fixed in a separate revision

swh/provenance/model.py
81 ↗(On Diff #20828)

maybe this has to do with the update of the other modules?

The archive object here is an object defined swh.provenance that should be specified by the ArchiveInterface, but this later does not specify the returned type of the methods. So no, it has nothing to do with other modules.

Now, since the type of object returned by ArchiveInterface.revision_get() is not specified, this piece of code cannot work, since the object revision (resulting from the line revision = archive.revision_get([self.id])) can be a generator. The fact you had to add a list() is a clue. In which case, the if revision will not work.

So either the list is unnecessarily, or the code is buggy.
Both situations demand for a proper fix, which is properly type annotate the ArchiveInterface (which should be a new diff prior to this diff) and make the code compliant with this ArchiveInterface API.

The point it that mypy will prevent me from committing anyway.

It will not prevent you from committing if there are no type annotations at all.

Maybe this should be treated/fixed in a separate revision

yes, but in a diff prior to this one. Once again, as is, the code is wrong. It could be made at least valid by converting revision to a list immediately (i.e. revision = list(archive.revision_get([self.id]))), but that's not a proper fix. (not even sure it's a fix at all, i.e. if revision_get() can return None).

Build is green

Patch application report for D5822 (id=20957)

Rebasing onto 075b0d6cd6...

Current branch diff-target is up to date.
Changes applied before test
commit 9aaaedb3ebc981555276e99616a0c4fc837b78e9
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 15:01:38 2021 +0200

    Refactor OriginEntry to include info about visit date and snapshot
    
    Revisions reachable from an OriginEntry are now queried separately and returned in an iterable.
    Also `origin_add` function was updated accordingly, and CLI command now uses a CSVOriginIterator
    similar to that previously developed for revisions. Updated tests as well to ensure nothing was
    broken during the refactoring.

commit fa4942ddff353c4d1d46c7f61ec570c9a28bc648
Author: Andres Ezequiel Viso <aeviso@softwareheritage.org>
Date:   Fri Jun 4 14:54:56 2021 +0200

    Remove archive parameter from RevisionEntry

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/122/ for more details.

There is still the if revision (RevisonEntry.parents) that needs a fix, otherwise ok.

This revision is now accepted and ready to land.Jun 14 2021, 10:41 AM