Page MenuHomeSoftware Heritage

storage*: type origin_visit_get_latest endpoint result
ClosedPublic

Authored by ardumont on Jul 27 2020, 8:16 AM.

Details

Summary

The endpoint returns an optional OriginVisit object instead of a dict:

def origin_visit_get_latest(...) -> Optional[OriginVisit]

It also fixes the in-memory storage implementation which filtered data too
early. It only filtered on the latest origin visit status associated to the
origin visit. So depending on filters, this could have been wrong. It was not
much of a problem as there is no longer any direct clients of this api (they
are using [1] now).

This will allow in a subsequent diff to simplify the function [1] (and reintroduce
the use of this api, albeit indirectly).

[1] swh.storage.algos.origin.origin_get_latest_origin_visit_status function.

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
refactor-algos-origin
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 13971
Build 21441: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 21440: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D3620 (id=12730)

Rebasing onto 789972fd5f...

First, rewinding head to replay your work on top of it...
Applying: storage*: type origin_visit_get_latest endpoint result
Changes applied before test
commit 68a7ee55c75a61fea8bc1a8c93b21b3a6f18f61c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Jul 26 08:05:10 2020 +0200

    storage*: type origin_visit_get_latest endpoint result
    
    The endpoint returns an optional OriginVisit object instead of a dict:
def origin_visit_get_latest(...) -> Optional[OriginVisit]
```

Note that it also fixes the in-memory storage implementation which filtered
data too early.

Related to T645
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/560/ for more details.
vlorentz added a subscriber: vlorentz.

It also fixes the in-memory storage implementation which filtered data too early.

Could you add a test for that?

swh/storage/in_memory.py
931

visit_statuses

swh/storage/tests/test_storage.py
1650

why wasn't the rounding needed before?

1998–2010

You could switch the equality operands to make it more readable imo (same comment for other instances below)

This revision now requires changes to proceed.Jul 27 2020, 9:42 AM

Could you add a test for that?

I don't see yet how ;)

swh/storage/in_memory.py
931

It was too long and i was annoyed by it...

swh/storage/tests/test_storage.py
1650

No idea.

1998–2010

I did it that way because it matches most of what we do elsewhere (in other modules or even here i mean).

actual == expected.

Which i now found clearer (force of habits i guess).

Could you add a test for that?

I don't see yet how ;)

like this?

visit = origin_visit_add(origin)
st1 = OriginVisitStatus(origin, visit, date1, status="partial")
st2 = OriginVisitStatus(origin, visit, date2, status="full")
origin_visit_status_add(st1)
origin_visit_status_add(st2)

assert st2 == origin_visit_get_latest(allowed_statuses=["partial"])
swh/storage/in_memory.py
931

I get that, but a abbreviations harm readability

swh/storage/tests/test_storage.py
1998–2010

Then could you add temporary variables (eg. with names containing actual and expected) and compare them?

swh/storage/tests/test_storage.py
1650

To be clear, that's been bothering me for a while...

I have no idea why it's kinda not needed anywhere near we deal with date...
Pondering whether we should not make the swh.storage.utils.now do that rounding itself...

Also, I don't really know what that rounding in itself entails for the day we really switch to cass instead of pg... (in terms of data).
Because as far as i understood this, when we'll read from cass, we'll lose that rounded data right?
(That's my current thinking...)

swh/storage/tests/test_storage.py
1650

I have no idea why it's kinda not needed anywhere near we deal with date...

It's only needed when doing equality comparisons.

Because as far as i understood this, when we'll read from cass, we'll lose that rounded data right?

Yes, but it is not useful data.

swh/storage/tests/test_storage.py
1998–2010

That's indeed specifically how it's usually done.

I thought the indirection was unneeded here as we do it a lot...
And the original code did not bother when it came to compare with None values.
So I thought, align everything on that pattern...
The test is also too big to give too much lines...

Note:
The actual value is the one we read.
The expected one is the one we build to check that actual one.
So in that front, i though read and check in one line was enough...

swh/storage/tests/test_storage.py
1998–2010

Usually it is, but Black abuses parentheses here. Compare:

assert (
    swh_storage.origin_visit_status_get_latest(
        origin.url, ov1.visit, require_snapshot=True
    )
    == visit_status_with_snapshot
)

with:

actual_visit_status = swh_storage.origin_visit_status_get_latest(
    origin.url, ov1.visit, require_snapshot=True
)
assert actual_visit_status == visit_status_with_snapshot
swh/storage/in_memory.py
931

oh don't worry, i'm gonna change it.

You fine about the implementation otherwise?
I tried to keep as much as possible the existing implementation pattern.

I initially sorted the visit_statuses as well (order by date desc) and kept the list comprehension filterings as here.
And then just returned the first visit_status of the list (since it should then be the max).

I changed it because the sorted stanza is a bit hard to read (due to the annoying copy...).

swh/storage/tests/test_storage.py
1650

Yes, but it is not useful data.

How so?
Isn't say, revision for example, concerned by this (fields committer_date, date)?

If people wants to retrieve revision data and compute the hashes themselves, won't that impact them?

1998–2010

you convinced me, thanks ;)

swh/storage/in_memory.py
931

yes, it's fine.

931

Actually, why do we need deepcopies? Aren't they immutable objects now?

swh/storage/tests/test_storage.py
1650

These are stored in a TimestampWithTimezone, which preserves exact dates (by storing them as integers in the DB), including microseconds and negative UTC.

For visits we don't need exact dates, so we use datetime.datetime (and store them as the native timestamp type of the DB)

swh/storage/in_memory.py
931

afaiui, in-memory's storage implementation regarding its data are lists, aren't they?

swh/storage/in_memory.py
931
self._origin_visits = {}
self._origin_visit_statuses: Dict[Tuple[str, int], List[OriginVisitStatus]] = {}
swh/storage/tests/test_storage.py
1650

oh right, thank you ;)

swh/storage/in_memory.py
931

but yes for origin_visit_statuses here, it's unneeded (left-over from the other implementation i told you about ;)

swh/storage/in_memory.py
931

It's also not needed for visits, because we don't modify the list nor return it

ardumont added inline comments.
swh/storage/in_memory.py
931

we sort it, don't that impact it?

swh/storage/in_memory.py
931

You're confusing sorted() with .sort(). The former makes a shallow copy, the latter sorts in-place.

swh/storage/in_memory.py
931

indeed, thanks!

Adapt according to discussions:

  • rename variable visit_sts to visit_statuses
  • Remove unneeded copy.deepcopy instructions
  • Use temporary variables to retrieve visit and visit_status to compare with expected data in tests
  • Add another test to add coverage

Build is green

Patch application report for D3620 (id=12737)

Rebasing onto 789972fd5f...

First, rewinding head to replay your work on top of it...
Applying: storage*: type origin_visit_get_latest endpoint result
Changes applied before test
commit e1caad7be2d04407f5b366afc0ff771872e0a3a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Jul 26 08:05:10 2020 +0200

    storage*: type origin_visit_get_latest endpoint result
    
    The endpoint returns an optional OriginVisit object instead of a dict:
def origin_visit_get_latest(...) -> Optional[OriginVisit]
```

Note that it also fixes the in-memory storage implementation which filtered
data too early.

Related to T645
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/562/ for more details.
  • Unify format on some missing part (origin-visit-status-add test)
  • rework commit message to unify with diff description
This revision is now accepted and ready to land.Jul 27 2020, 11:43 AM
swh/storage/tests/test_storage.py
1650

Oh actually, the answer to my original question: it wasn't needed before because we only compared some fields, so date truncation didn't matter.

Build is green

Patch application report for D3620 (id=12738)

Rebasing onto 789972fd5f...

First, rewinding head to replay your work on top of it...
Applying: storage*: type origin_visit_get_latest endpoint result
Changes applied before test
commit 1f1d23f97364f850e1056f25f389b1f9fbc15e52
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Jul 26 08:05:10 2020 +0200

    storage*: type origin_visit_get_latest endpoint result
    
    The endpoint returns an optional OriginVisit object instead of a dict:
def origin_visit_get_latest(...) -> Optional[OriginVisit]
```

It also fixes the in-memory storage implementation which filtered data too
early. It only filtered on the latest origin visit status associated to the
origin visit. So depending on filters, this could have been wrong. It was not
much of a problem as there is no longer any direct clients of this api (they
are using [1] now).

[1] swh.storage.algos.origin.origin_get_latest_origin_visit_status function

Related to T645
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/563/ for more details.
swh/storage/tests/test_storage.py
1650

oh, yeah, cool then.

Build is green

Patch application report for D3620 (id=12741)

Rebasing onto 789972fd5f...

Current branch diff-target is up to date.
Changes applied before test
commit 5344a6ff9c40290e670de9fd0d71c0b84185e4fa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Sun Jul 26 08:05:10 2020 +0200

    storage*: type origin_visit_get_latest endpoint result
    
    The endpoint returns an optional OriginVisit object instead of a dict:
def origin_visit_get_latest(...) -> Optional[OriginVisit]
```

It also fixes the in-memory storage implementation which filtered data too
early. It only filtered on the latest origin visit status associated to the
origin visit. So depending on filters, this could have been wrong. It was not
much of a problem as there is no longer any direct clients of this api (they
are using [1] now).

[1] swh.storage.algos.origin.origin_get_latest_origin_visit_status function

Related to T645
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/565/ for more details.