Page MenuHomeSoftware Heritage

Extract the extra_headers from metadata on the Revision model class
ClosedPublic

Authored by douardda on Jul 1 2020, 3:18 PM.

Details

Summary

Add a new extra_headers attribute on Revision and use it for computing
the revision's id instead of extract it from the metadata field.

Add a post init hook to Revision to initialize this new attribute from
given metadata, if any, for bw compat.

Related to T2423.

Diff Detail

Event Timeline

Build is green

Patch application report for D3389 (id=12022)

Rebasing onto e632abed41...

Current branch diff-target is up to date.
Changes applied before test
commit f480e1f8635df16aa52a50f2c0403f2000261215
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Related to T2423.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/84/ for more details.

I'm not sure about popping the extra_headers off of the incoming metadata field right now. This feels like something we want to do long term, but if you do that right now, it means you're putting upgrades of swh.model and swh.storage (with support with the new field) and all loaders in lockstep of one another. If you upgrade swh.model now, you'll lose the extra_headers until swh.storage will be able to store them.

For now I'd be more comfortable just populating the new field, without touching the old one.

I guess the next steps would be:

  • updating the storage backends to be able to store and retrieve data in the new field.
  • deploying that.
  • updating loaders to use the new field for writes and reads (svn loader).
  • then you can probably pop the extra_headers off of the metadata dict.

Finally, we can

  • migrate the remaining data in the postgres storage
  • stop mangling the metadata dict in swh.model

Apart from that, I'll echo the comment I made on IRC about making the "extra header value" type stricter: we should stick to bytes at some point; but that can only happen once all the legacy data is migrated.

improve bw-compat support, tests and hypothesis strategies

Build has FAILED

Patch application report for D3389 (id=12025)

Rebasing onto 40a40f508b...

First, rewinding head to replay your work on top of it...
Applying: Extract the extra_headers from metadata on the Revision model class
Changes applied before test
commit 5673b8383b7f432fbd9121f5814e2567d87e970b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Also amend the revision_d hyptothesis strategy to generate extra_headers.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/85/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/85/console

restrict extra_headers to (bytes, bytes) only

we actually do not support anything else, really (for values at least).

Build has FAILED

Patch application report for D3389 (id=12026)

Rebasing onto 40a40f508b...

First, rewinding head to replay your work on top of it...
Applying: Extract the extra_headers from metadata on the Revision model class
Changes applied before test
commit 6090209793a330063ece83b8423cb983bd48cbb5
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Only accept (bytes, bytes) as extra_header.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Also amend the revision_d hyptothesis strategy to generate extra_headers.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/86/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/86/console

make mypy happy (hopefully)

Build has FAILED

Patch application report for D3389 (id=12027)

Rebasing onto 40a40f508b...

First, rewinding head to replay your work on top of it...
Applying: Extract the extra_headers from metadata on the Revision model class
Changes applied before test
commit 210f4974b8bef8a2d53e7c1b3f434c28dc46aaf2
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Only accept (bytes, bytes) as extra_header.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Also amend the revision_d hyptothesis strategy to generate extra_headers.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/87/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/87/console

more mypy vs. attrs-strict fighting

I give up

Build is green

Patch application report for D3389 (id=12028)

Rebasing onto 40a40f508b...

First, rewinding head to replay your work on top of it...
Applying: Extract the extra_headers from metadata on the Revision model class
Changes applied before test
commit e0457febc0dbe8695a3ac87dade9e60e03ac69bf
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Only accept (bytes, bytes) as extra_header.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Also amend the revision_d hyptothesis strategy to generate extra_headers.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/88/ for more details.

Build is green

Patch application report for D3389 (id=12083)

Rebasing onto 8863b5c186...

Current branch diff-target is up to date.
Changes applied before test
commit e7fe9c6b92c08933a69f1587e4f7564a0a491660
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Only accept (bytes, bytes) as extra_header.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Also amend the revision_d hyptothesis strategy to generate extra_headers.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/97/ for more details.

Accepted, with the following caveat: I'm still not sure what the plan is to deploy this now? In the current state of this diff, as soon as we deploy this on workers, data loss will occur (as the new extra_headers field isn't supported by swh.storage).

Now that I think of it, introducing a new field in swh.model needs a coordinated deployment anyway so maybe it doesn't matter (as long as the swh.storage update to support the new field occurs before the next deployment).

swh/model/hypothesis_strategies.py
237

I'm not sure we should allow empty keys. Of course, that will make validation more annoying in the model.

swh/model/tests/test_model.py
444–455

Shouldn't these two tests be split out to separate unit tests?

This revision is now accepted and ready to land.Jul 3 2020, 2:56 PM
In D3389#83915, @olasd wrote:

Accepted, with the following caveat: I'm still not sure what the plan is to deploy this now? In the current state of this diff, as soon as we deploy this on workers, data loss will occur (as the new extra_headers field isn't supported by swh.storage).

Now that I think of it, introducing a new field in swh.model needs a coordinated deployment anyway so maybe it doesn't matter (as long as the swh.storage update to support the new field occurs before the next deployment).

Yes this needs a counterpart in storage (wip), and needs a coordinated deployment.

swh/model/hypothesis_strategies.py
237

I wasn't sure about this. But each time (almost) we forbid a "degenerated" case, we can find a degenerated repo that somewhat managed to have it...

swh/model/tests/test_model.py
444–455

could do that, sure

In D3389#83915, @olasd wrote:

Accepted, with the following caveat: I'm still not sure what the plan is to deploy this now? In the current state of this diff, as soon as we deploy this on workers, data loss will occur (as the new extra_headers field isn't supported by swh.storage).

Now that I think of it, introducing a new field in swh.model needs a coordinated deployment anyway so maybe it doesn't matter (as long as the swh.storage update to support the new field occurs before the next deployment).

Yes this needs a counterpart in storage (wip), and needs a coordinated deployment.

D3426

split the 2 test functions in parts, and fix the bwcompat hook

passing extra_headers via metadata bypassed type validation.

Build is green

Patch application report for D3389 (id=12133)

Rebasing onto 1ff05161e7...

Current branch diff-target is up to date.
Changes applied before test
commit a7d9aca2b0df15c64d4e42e263422df9908ae42b
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed Jul 1 15:13:59 2020 +0200

    Extract the extra_headers from metadata on the Revision model class
    
    Add a new extra_headers attribute on Revision and use it for computing
    the revision's id instead of extract it from the metadata field.
    
    Only accept (bytes, bytes) as extra_header.
    
    Add a post init hook to Revision to initialize this new attribute from
    given metadata, if any, for bw compat.
    
    Also amend the revision_d hyptothesis strategy to generate extra_headers.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/105/ for more details.