Page MenuHomeSoftware Heritage

storage.in_memory: Add person_get implementation
ClosedPublic

Authored by anlambert on Jan 8 2019, 2:59 PM.

Details

Summary

Implement person_get for the in-memory storage.
I have also updated the person counter in the stat_counters method.

This is needed to remove all the storage mocks in swh-web tests.

Related T1271
Related T1377

Test Plan

test_person_get is now executed for the in-memory storage tests.

Diff Detail

Repository
rDSTO Storage manager
Branch
memory-storage-person_get
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 3308
Build 4264: tox-on-jenkinsJenkins
Build 4263: arc lint + arc unit

Event Timeline

olasd requested changes to this revision.Jan 8 2019, 3:11 PM
olasd added a subscriber: olasd.

The unicity checks should really happen on the fullname, as the email and name fields are only advisory (i.e. they only exist if we succeeded in parsing them).

This revision now requires changes to proceed.Jan 8 2019, 3:11 PM
In D897#19157, @olasd wrote:

The unicity checks should really happen on the fullname, as the email and name fields are only advisory (i.e. they only exist if we succeeded in parsing them).

What should work is having both a dict mapping fullnames to full person objects (with added id), and a list of full person objects that will allow id -> person lookups. This will make the deduplication a constant time hash lookup, and object lookups a constant time list access.

In D897#19157, @olasd wrote:

The unicity checks should really happen on the fullname, as the email and name fields are only advisory (i.e. they only exist if we succeeded in parsing them).

And we need tests to check this behavior :)

swh/storage/tests/test_api_client.py
65–67 ↗(On Diff #2833)

You could use revision_add to add persons.

In D897#19157, @olasd wrote:

The unicity checks should really happen on the fullname, as the email and name fields are only advisory (i.e. they only exist if we succeeded in parsing them).

And we need tests to check this behavior :)

*nod*, I guess the ship has sailed and the person model is part of our external interface now

swh/storage/tests/test_api_client.py
65–67 ↗(On Diff #2833)

Or we could just expose person_add in the remote API, as apparently the person stuff is part of our external interface now.

Ack for the above comments regarding unicity check and optimizations.

Regarding the test_person_get test, I will rather rewrite it adding revisions to storage first, as suggested by vlorentz, before trying to call storage.person_get.

Based on what I read and tested in main storage implementation, the storage._person_add method does not check
if the fullname of a person is already present in the person table, resulting in an exception being raised when
such a case appears (https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/db.py$884).
In the contrary, when a revision is added, a stored procedure adds author and committer to the person table
and checks if they already exist based on their full names (https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-swh-func.sql$616).

As the storage._person_add is only used in test implementation and is not secure, I think it should not be exposed
to the main storage api and it could even be removed as persons are only added when revisions and releases are
ingested into the archive. What do you think ?

Update:

  • Address olasd comments regarding fullname unicity and optimizations
  • Rewrite test_person_get test to add persons from revisions ingestion instead of calling private _person_add method
  • Add a new test to check person fullname unicity
  • Reactivate person tests for RemoteStorage

I have a few stylistic comments that may be addressed before this lands. There's also a question about self._objects which confuses me a bit.

  • Rewrite test_person_get test to add persons from revisions ingestion instead of calling private _person_add method

With this change I believe _person_add can be removed from the sql storage

swh/storage/in_memory.py
1034–1035

could you swap the if and yield? it'd avoid needing the backslash

1238

I'm not sure what the semantics of self._objects is, but using it for this check doesn't feel right: what if someone's full name ends up being a valid object identifier for another type?

swh/storage/tests/test_storage.py
1252–1253

I'd linebreak after list( to avoid needing backslashes

1261–1265

ditto

anlambert added inline comments.
swh/storage/in_memory.py
1238

Based on my understanding of vlorentz implementation, the _objects dict enable to easily lookup objects by sha1s
and count object by type for the stat counters.

Other keys in that dict corresponds to sha1s or (origin_type, origin_url) tuples so it's safe to use fullname as key here.
I did not want to add another dict as class member as I knew I could use the _objects one.

olasd added inline comments.
swh/storage/in_memory.py
1238

I mean, technically, a full name could match a sha1 (yeah, I know it's more than a bit far-fetched). I don't think it matters much, it's just a bit surprising.

This revision is now accepted and ready to land.Jan 9 2019, 2:28 PM
anlambert added inline comments.
swh/storage/in_memory.py
1238

Ok, will use a tuple key instead and land that diff

Update: use a tuple key instead of a string one for storing person data in the _objects dict

This revision was automatically updated to reflect the committed changes.