origin-search: Only request 'url' field
ClosedPublic

Authored by vlorentz on Nov 15 2022, 10:04 AM.

Details

Summary

By default, an extra query is sent to swh-indexer to (maybe) populate
the 'metadata' field. The client does not use that field, so the query
only adds latency to search results.
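
As a rough illustration of the client-side change, here is a minimal Python sketch of a search request that asks for the 'url' field only. The base URL, the endpoint path, and the `limit` parameter are assumptions made for the example; only the 'fields' query parameter comes from this revision.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL and endpoint path, used only for illustration.
API_ROOT = "https://archive.example.org/api/1"


def origin_search_url(pattern: str, limit: int = 100) -> str:
    """Build an origin-search URL that asks the API for the 'url' field only."""
    # 'fields=url' tells the API that the 'metadata' field is not needed.
    query = urlencode({"limit": limit, "fields": "url"})
    # Percent-encode the whole pattern, including any '/' it contains.
    return f"{API_ROOT}/origin/search/{quote(pattern, safe='')}/?{query}"
```

Restricting the fields on the client only helps if the server skips the corresponding work, which is presumably why this revision depends on another one.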

Depends on D8843

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8844 (id=31866)

Could not rebase; Attempt merge onto ad8558c69d...

Updating ad8558c6..a1c58db2
Fast-forward
 swh/web/api/tests/views/test_origin.py        | 70 ++++++++++++++++++++++++++-
 swh/web/api/views/origin.py                   |  9 +++-
 swh/web/browse/assets/browse/origin-search.js |  3 ++
 swh/web/utils/archive.py                      | 41 +++++++++++++---
 4 files changed, 113 insertions(+), 10 deletions(-)
Changes applied before test
commit a1c58db2d5983abe4abb2ee4ea427aa04c687d6f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 15 10:03:38 2022 +0100

    origin-search: Only request 'url' field
    
    By default, an extra query is sent to swh-indexer to (maybe) populate
    the 'metadata' field, which is not used by the client, so it unnecessarily
    increases latency to get results

commit f59acd6185603e66c3418e4ffeac2106b2159300
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 15 09:49:59 2022 +0100

    metadata-search: Skip query to swh-indexer when its results would be discarded
    
    The 'fields' query parameter is used by clients to indicate what fields the
    API should return.
    
    If 'metadata' is not among those fields, then the 'metadata' object will be
    discarded by apiresponse, so the call to
    `idx_storage.origin_intrinsic_metadata_get` is useless.
    
    I expect no client actually uses this field, so this could save
    resources.
    
    Additionally, I want to deprecate the field, so this may make it easier
    to figure out whether any client actually requests it by looking at server logs.

commit 76c64ea4dc9e158bdd5c2730340e04b213999609
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 15 09:31:11 2022 +0100

    metadata-search: Return swh-search results even when missing from idx_storage.origin_intrinsic_metadata
    
    This is needed, because swh-search may now return results based on extrinsic metadata,
    in addition to intrinsic metadata.
    
    I do not want to query idx_storage.origin_extrinsic_metadata here, because it
    is not clear how to merge with the existing data structure.
    
    Additionally, I do not think anyone relies on the metadata returned by this
    endpoint because it is undocumented and rather inflexible. Instead, I would
    like to deprecate returning metadata from this endpoint altogether, as there
    is a more appropriate endpoint to get metadata once you have the origin URL.

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/2134/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/2134/console

Harbormaster returned this revision to the author for changes because remote builds failed. Nov 15 2022, 10:12 AM
Harbormaster failed remote builds in B32801: Diff 31866!
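
The two metadata-search commits listed in the report above can be sketched together as follows. This is a simplified illustration, not the swh-web code: `search_origins` and `fetch_metadata` are hypothetical names, and `fetch_metadata` merely stands in for the indexer lookup (`idx_storage.origin_intrinsic_metadata_get` in the commit message).

```python
from typing import Callable, Dict, Iterable, List, Optional


def search_origins(
    urls: Iterable[str],
    fields: Optional[List[str]],
    fetch_metadata: Callable[[List[str]], Dict[str, dict]],
) -> List[dict]:
    """Return one result per search hit, querying the indexer only if needed.

    `fields` mirrors the 'fields' query parameter; None means "no filter".
    `fetch_metadata` is a hypothetical stand-in for the indexer lookup.
    """
    urls = list(urls)
    wants_metadata = fields is None or "metadata" in fields
    if not wants_metadata:
        # The metadata would be dropped from the response anyway,
        # so skip the extra query entirely.
        return [{"url": url} for url in urls]
    found = fetch_metadata(urls)
    # Keep every search hit, even when the indexer has nothing for it.
    return [{"url": url, "metadata": found.get(url, {})} for url in urls]
```

With `fields=["url"]` the lookup function is never called; with no filter, hits that have no indexed metadata are still returned, here with an empty metadata object.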

Build has FAILED

Patch application report for D8844 (id=31867)

Could not rebase; Attempt merge onto ad8558c69d...

Updating ad8558c6..8a8057eb
Fast-forward
 cypress/e2e/origin-search.cy.js               |  8 +--
 swh/web/api/tests/views/test_origin.py        | 70 ++++++++++++++++++++++++++-
 swh/web/api/views/origin.py                   |  9 +++-
 swh/web/browse/assets/browse/origin-search.js |  3 ++
 swh/web/utils/archive.py                      | 41 +++++++++++++---
 5 files changed, 118 insertions(+), 13 deletions(-)
Changes applied before test
commit 8a8057eb8ee5cdd314d739ab2dffff7da848f96c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 15 10:03:38 2022 +0100

    origin-search: Only request 'url' field
    
    By default, an extra query is sent to swh-indexer to (maybe) populate
    the 'metadata' field, which is not used by the client, so it unnecessarily
    increases latency to get results

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/2135/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/2135/console

Harbormaster returned this revision to the author for changes because remote builds failed. Nov 15 2022, 10:47 AM
Harbormaster failed remote builds in B32802: Diff 31867!

Build is green

Patch application report for D8844 (id=31868)

Could not rebase; Attempt merge onto ad8558c69d...

Updating ad8558c6..6531a365
Fast-forward
 cypress/e2e/origin-search.cy.js               |  8 +--
 swh/web/api/tests/views/test_origin.py        | 70 ++++++++++++++++++++++++++-
 swh/web/api/views/origin.py                   |  9 +++-
 swh/web/browse/assets/browse/origin-search.js |  3 ++
 swh/web/utils/archive.py                      | 41 +++++++++++++---
 5 files changed, 118 insertions(+), 13 deletions(-)
Changes applied before test
commit 6531a3653102f017d80af868dadf8d6ddaad630c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Nov 15 10:03:38 2022 +0100

    origin-search: Only request 'url' field
    
    By default, an extra query is sent to swh-indexer to (maybe) populate
    the 'metadata' field, which is not used by the client, so it unnecessarily
    increases latency to get results

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/2136/ for more details.

This revision is now accepted and ready to land. Nov 15 2022, 11:02 AM

Not really a nice catch, as it wasn't a very useful optimization before D8843; I only noticed it when the useless query caused issues ;)