test_person_get is now executed for the in-memory storage tests.
What should work is having both a dict mapping fullnames to full person objects (with added id), and a list of full person objects that will allow id -> person lookups. This will make the deduplication a constant time hash lookup, and object lookups a constant time list access.
Ack for the above comments regarding unicity check and optimizations.
Regarding the test_person_get test, I will rather rewrite it adding revisions to storage first, as suggested by vlorentz, before trying to call storage.person_get.
Based on what I read and tested in main storage implementation, the storage._person_add method does not check
if the fullname of a person is already present in the person table, resulting in an exception being raised when
such a case appears (https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/db.py$884).
In the contrary, when a revision is added, a stored procedure adds author and committer to the person table
and checks if they already exist based on their full names (https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-swh-func.sql$616).
As the storage._person_add is only used in test implementation and is not secure, I think it should not be exposed
to the main storage api and it could even be removed as persons are only added when revisions and releases are
ingested into the archive. What do you think ?
- Address olasd comments regarding fullname unicity and optimizations
- Rewrite test_person_get test to add persons from revisions ingestion instead of calling private _person_add method
- Add a new test to check person fullname unicity
- Reactivate person tests for RemoteStorage
I have a few stylistic comments that may be addressed before this lands. There's also a question about self._objects which confuses me a bit.
With this change I believe _person_add can be removed from the sql storage
could you swap the if and yield? it'd avoid needing the backslash
I'm not sure what the semantics of self._objects is, but using it for this check doesn't feel right: what if someone's full name ends up being a valid object identifier for another type?
I'd linebreak after list( to avoid needing backslashes
Based on my understanding of vlorentz implementation, the _objects dict enable to easily lookup objects by sha1s
Other keys in that dict corresponds to sha1s or (origin_type, origin_url) tuples so it's safe to use fullname as key here.