Details
test_person_get is now executed for the in-memory storage tests.
Diff Detail
- Repository
- rDSTO Storage manager
- Branch
- memory-storage-person_get
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 3284 Build 4233: tox-on-jenkins Jenkins Build 4232: arc lint + arc unit
Event Timeline
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/145/ for more details.
The unicity checks should really happen on the fullname, as the email and name fields are only advisory (i.e. they only exist if we succeeded in parsing them).
What should work is having both a dict mapping fullnames to full person objects (with added id), and a list of full person objects that will allow id -> person lookups. This will make the deduplication a constant time hash lookup, and object lookups a constant time list access.
And we need tests to check this behavior :)
swh/storage/tests/test_api_client.py | ||
---|---|---|
65–67 | You could use revision_add to add persons. |
*nod*, I guess the ship has sailed and the person model is part of our external interface now
swh/storage/tests/test_api_client.py | ||
---|---|---|
65–67 | Or we could just expose person_add in the remote API, as apparently the person stuff is part of our external interface now. |
Ack for the above comments regarding unicity check and optimizations.
Regarding the test_person_get test, I will rather rewrite it adding revisions to storage first, as suggested by vlorentz, before trying to call storage.person_get.
Based on what I read and tested in main storage implementation, the storage._person_add method does not check
if the fullname of a person is already present in the person table, resulting in an exception being raised when
such a case appears (https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/db.py$884).
In the contrary, when a revision is added, a stored procedure adds author and committer to the person table
and checks if they already exist based on their full names (https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-swh-func.sql$616).
As the storage._person_add is only used in test implementation and is not secure, I think it should not be exposed
to the main storage api and it could even be removed as persons are only added when revisions and releases are
ingested into the archive. What do you think ?
Update:
- Address olasd comments regarding fullname unicity and optimizations
- Rewrite test_person_get test to add persons from revisions ingestion instead of calling private _person_add method
- Add a new test to check person fullname unicity
- Reactivate person tests for RemoteStorage
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/146/ for more details.
I have a few stylistic comments that may be addressed before this lands. There's also a question about self._objects which confuses me a bit.
With this change I believe _person_add can be removed from the sql storage
swh/storage/in_memory.py | ||
---|---|---|
1034–1035 | could you swap the if and yield? it'd avoid needing the backslash | |
1262 | I'm not sure what the semantics of self._objects is, but using it for this check doesn't feel right: what if someone's full name ends up being a valid object identifier for another type? | |
swh/storage/tests/test_storage.py | ||
1252–1253 | I'd linebreak after list( to avoid needing backslashes | |
1261–1265 | ditto |
swh/storage/in_memory.py | ||
---|---|---|
1262 | Based on my understanding of vlorentz implementation, the _objects dict enable to easily lookup objects by sha1s Other keys in that dict corresponds to sha1s or (origin_type, origin_url) tuples so it's safe to use fullname as key here. |
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/147/ for more details.
swh/storage/in_memory.py | ||
---|---|---|
1262 | I mean, technically, a full name could match a sha1 (yeah, I know it's more than a bit far-fetched). I don't think it matters much, it's just a bit surprising. |
swh/storage/in_memory.py | ||
---|---|---|
1262 | Ok, will use a tuple key instead and land that diff |
Update: use a tuple key instead of a string one for storing person data in the _objects dict
Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/148/ for more details.