Page MenuHomeSoftware Heritage

Add a sync_mailmaps management command
ClosedPublic

Authored by olasd on Feb 4 2022, 3:22 PM.

Details

Summary

The command refreshes all the active mailmaps, and disables every
mailmap that has recently been updated and marked as disabled, once.

Test Plan

new tests added for all expected behaviors of the management command

Diff Detail

Repository
rDWAPPS Web applications
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 26565
Build 41549: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 41548: arc lint + arc unit

Event Timeline

Pin the tail on the pytest

Build was aborted

Patch application report for D7090 (id=25712)

Rebasing onto 4ed4512e7b...

First, rewinding head to replay your work on top of it...
Applying: Pin pytest to < 7.0.0
Using index info to reconstruct a base tree...
M	requirements-test.txt
Falling back to patching base and 3-way merge...
No changes -- Patch already applied.
Applying: Add a sync_mailmaps management command
Changes applied before test
commit e7538629501dba9c4e6e0783a111ac0510027d36
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1332/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1332/console

Build was aborted

Patch application report for D7090 (id=25710)

Rebasing onto 3f465b3025...

First, rewinding head to replay your work on top of it...
Applying: Add a sync_mailmaps management command
Using index info to reconstruct a base tree...
M	requirements-test.txt
Falling back to patching base and 3-way merge...
Auto-merging requirements-test.txt
Changes applied before test
commit a629b99aa6d469ccc715f87f6f3877f389855415
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1331/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1331/console

Build is green

Patch application report for D7090 (id=25713)

Rebasing onto 4ed4512e7b...

Current branch diff-target is up to date.
Changes applied before test
commit dd2934197e12cc2d6795fb6509c309d4da4fab98
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1333/ for more details.

olasd requested review of this revision.Feb 4 2022, 3:44 PM
vlorentz requested changes to this revision.EditedFeb 4 2022, 4:13 PM

Shouldn't storage_db.cursor() be used as a context manager to make sure transactions committed/rolledback appropriately? (forget this, I wrote it before noticing @transaction.atomic)

Why transaction.atomic() in tests? if I understand pytest-postgresql correctly, this is a noop

Why the quotes around "QuerySet[UserMailmap]" type annotations? Both names are in scope

swh/web/auth/management/commands/sync_mailmaps.py
40

What about --dry-run instead, as it is more idiomatic?

110

Looks racey: If someone uses the API to update one of the record between the moment we ran the SELECT and the moment we run this line, then this run uses the old data, and the next run of the script will ignore it.

it should the timestamp at the beginning of the function.

Plus, this uses the time of the host running the script, while updates from the API use the time of the DB host. I don't know if this is an issue or what to do about it.

swh/web/tests/auth/test_mailmap.py
85–99

why fixtures instead of constants?

170–174

this feels like taking DRY a little too far, and makes it harder to understand what the test actually tests. Why not this?

expected_displaynames = {
    "Original Name V A <from_emailVA@example.com>": "Display Name V A",
    "Original Name V A 2 <from_emailVA@example.com>": "Display Name V A",
    "Original Name V A 3 <from_emailVA@example.com>": "Display Name V A",
}

(same below)

This revision now requires changes to proceed.Feb 4 2022, 4:13 PM
olasd marked 4 inline comments as done.

Apply comments from @vlorentz

Shouldn't storage_db.cursor() be used as a context manager to make sure transactions committed/rolledback appropriately?

The management command runs within a *single* swh.storage db transaction (managed manually) and within a single swh.web db transaction (using django's transaction.atomic decorator). Using the cursor as a context manager would commit multiple transactions.

Why transaction.atomic() in tests? if I understand pytest-postgresql correctly, this is a noop

Yeah, that would have been me misunderstanding what transaction=False does in pytest.mark.django_db. I want to make sure that every manual operation on the django db is committed before I'm running the management command, so I think the transaction.atomic() are needed.

Why the quotes around "QuerySet[UserMailmap]" type annotations? Both names are in scope

Because the parametric type annotation isn't part of Django itself, but of types-django, so QuerySet[UserMailmap] is actually undefined even if both names are in scope. I concede I could use from __future__ import annotations, probably.

swh/web/auth/management/commands/sync_mailmaps.py
40

I want dry-run to be the default.

110

The whole management command is run within a single django database transaction (see @transaction.atomic decorator), so no, it shouldn't be racy.

swh/web/tests/auth/test_mailmap.py
85–99

meh.

170–174

The "same below" overflows our line length limit, but point taken :-)

Build is green

Patch application report for D7090 (id=25722)

Rebasing onto 4ed4512e7b...

Current branch diff-target is up to date.
Changes applied before test
commit 9b3c1e97006641c1db96550600b2e904b4444d7d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1335/ for more details.

I just checked, and the update is racy. (on postgresql only; it's not on sqlite since sqlite does not allow this kind of concurrency, at least not by default)

Here is how to replicate:

First, change the code like so:

diff --git a/swh/web/auth/management/commands/sync_mailmaps.py b/swh/web/auth/management/commands/sync_mailmaps.py
index fdfe9914..4a4a4933 100644
--- a/swh/web/auth/management/commands/sync_mailmaps.py
+++ b/swh/web/auth/management/commands/sync_mailmaps.py
@@ -98,16 +98,26 @@ def handle(self, *args, **options):
         )
 
         with psycopg2.connect(options["storage_dbconn"]) as db:
-            self.disable_mailmaps(db, mailmaps_to_disable)
-            self.refresh_mailmaps(db, mailmaps_to_refresh)
+            from django.forms.models import model_to_dict
+            mailmaps_to_disable = list(mailmaps_to_disable)
+            mailmaps_to_refresh = list(mailmaps_to_refresh)
+            print("disabling", [model_to_dict(x) for x in mailmaps_to_disable])
+            print("refreshing", [model_to_dict(x) for x in mailmaps_to_refresh])
+            print("refreshing", [x.last_update_date for x in mailmaps_to_refresh])
+            #self.disable_mailmaps(db, mailmaps_to_disable)
+            #self.refresh_mailmaps(db, mailmaps_to_refresh)
             if not options["perform"]:
                 db.rollback()
             else:
                 db.commit()
 
+        print("waiting...")
+        input("done?")
+
         if options["perform"]:
             updated = verified_mailmaps.update(
                 mailmap_last_processing_date=timezone.now()
             )
         else:
             updated = verified_mailmaps.count()

Then, create a user, and prepare an update (in shell 1):

$ django-admin shell --settings=swh.web.settings.tests

In [1]: from swh.web.auth.models import UserMailmap

In [9]: for x in UserMailmap.objects.all():
   ...:      x.delete()

In [10]: o = UserMailmap.objects.create(from_email="origemail@example.org", from_email_verified=True, display_name="DisplayName 1", display_name_activated=True)

In [11]: o.save()

In [12]: o2 = UserMailmap.objects.get()

In [13]: o2.display_name_activated = False

run the script, don't reply to done? yet (in shell 2):

$ django-admin sync_mailmaps --settings=swh.web.settings.tests service=swh-replica --perform
0 mailmaps to disable, 1 mailmaps to refresh
disabling []
refreshing [{'id': 2, 'user_id': None, 'from_email': 'origemail@example.org', 'from_email_verified': True, 'from_email_verification_request_date': None, 'display_name': 'DisplayName 1', 'display_name_activated': True, 'to_email': None, 'to_email_verified': False, 'to_email_verification_request_date': None, 'mailmap_last_processing_date': datetime.datetime(2022, 2, 4, 17, 16, 13, 182014, tzinfo=<UTC>)}]
refreshing [datetime.datetime(2022, 2, 4, 17, 16, 0, 346677, tzinfo=<UTC>)]
waiting...
done?

then, update the user (in shell 1):

In [14]: o2.save()

answer done? in shell 2, and run it again:

y
Synced 1 mailmaps to swh.storage database

$ django-admin sync_mailmaps --settings=swh.web.settings.tests service=swh-replica --perform
0 mailmaps to disable, 0 mailmaps to refresh
disabling []
refreshing []
refreshing []
waiting...
done?

the display name should be removed, but it's not.

By the way, I just realized the time filter is only on the filter_to_disable query, not on filter_to_update. How come?

swh/web/auth/management/commands/sync_mailmaps.py
40

ok then, could you change the help to "Perform actions (instead of being a dry run)"? I had to check the code to understand what "perform actions" means

110

(continuing this in the main thread)

By the way, I just realized the time filter is only on the filter_to_disable query, not on filter_to_update. How come?

We always need to set the display name for new persons with known emails.

In D7090#184595, @olasd wrote:

We always need to set the display name for new persons with known emails.

Ah yes, of course

I just checked, and the update is racy. (on postgresql only; it's not on sqlite since sqlite does not allow this kind of concurrency, at least not by default)

Not exactly.

Thanks for the replication steps, they've been useful.

In [14]: o2.save()

This resets all fields in the UserMailmap object, including the ones you haven't touched. In particular, this will reset mailmap_last_processing_date to the value received when the object was get()-ten.

I guess the simplest workaround is to tell django not to reset mailmap_last_processing_date on UserMailmap.save.

Harden against the race condition @vlorentz noticed

hello? is it me you're looking for?

Build was aborted

Patch application report for D7090 (id=25731)

Rebasing onto 4ed4512e7b...

Current branch diff-target is up to date.
Changes applied before test
commit 467905ae2d05133bf11f785806263a3c61f3426b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.
    
    To do these actions as atomically as possible, we make sure to select
    all rows for update, and to defer all _date fields in the default
    queryset manager for UserMailmap objects to avoid resetting them to
    earlier values by mistake.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1336/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1336/console

Build has FAILED

Patch application report for D7090 (id=25732)

Rebasing onto 4ed4512e7b...

Current branch diff-target is up to date.
Changes applied before test
commit 9b7b1d1d39260245af5ed98fe06306e1d45cefa6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.
    
    To do these actions as atomically as possible, we make sure to select
    all rows for update, and to defer all _date fields in the default
    queryset manager for UserMailmap objects to avoid resetting them to
    earlier values by mistake.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1337/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1337/console

Deferred auto_now fields don't get updated, apparently.

Appease the sphinx deities

Build was aborted

Patch application report for D7090 (id=25734)

Rebasing onto 4ed4512e7b...

Current branch diff-target is up to date.
Changes applied before test
commit 9ec9e9777065eedc653118daff0267fdd85bedd6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.
    
    To do these actions as atomically as possible, we make sure to select
    all rows for update, and to defer all _date fields in the default
    queryset manager for UserMailmap objects to avoid resetting them to
    earlier values by mistake.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1338/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1338/console

Build is green

Patch application report for D7090 (id=25735)

Rebasing onto 4ed4512e7b...

Current branch diff-target is up to date.
Changes applied before test
commit 39a32f958e1cdfba90aa6ed6b918d30ab3c6d1cd
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Feb 4 15:19:27 2022 +0100

    Add a sync_mailmaps management command
    
    The command refreshes all the active mailmaps, and disables every
    mailmap that has recently been updated and marked as disabled, once.
    
    To do these actions as atomically as possible, we make sure to select
    all rows for update, and to defer all _date fields in the default
    queryset manager for UserMailmap objects to avoid resetting them to
    earlier values by mistake.

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1339/ for more details.

This revision is now accepted and ready to land.Feb 4 2022, 8:57 PM
This revision was automatically updated to reflect the committed changes.