Page MenuHomeSoftware Heritage

Setup storage and store last revision/release date
ClosedPublic

Authored by KShivendu on Jun 17 2021, 12:46 PM.

Details

Summary

Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

Diff Detail

Repository
rDSEA Archive search
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Build is green

Patch application report for D5883 (id=21114)

Rebasing onto 35570b4284...

First, rewinding head to replay your work on top of it...
Applying: Store last_eventful_visit_date
Using index info to reconstruct a base tree...
M	swh/search/elasticsearch.py
M	swh/search/in_memory.py
M	swh/search/interface.py
M	swh/search/journal_client.py
M	swh/search/tests/test_journal_client.py
M	swh/search/tests/test_search.py
Falling back to patching base and 3-way merge...
Auto-merging swh/search/tests/test_search.py
Auto-merging swh/search/elasticsearch.py
No changes -- Patch already applied.
Applying: Add last_revision_daate and last_release_date
Changes applied before test
commit 2bfb347ec05e6a414b7a642a73f6f6eb196e5e2d
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Add last_revision_daate and last_release_date
    
    FRemove duplicate cls param for get_storage
    
    Fix failing tests
    
    Adjust tests related to previous commit
    
    Load swh.storage from config

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/152/ for more details.

Build is green

Patch application report for D5883 (id=21115)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit fe74c7e858310cba1232ca9885a38d3940a773e3
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Add last_revision_daate and last_release_date
    
    FRemove duplicate cls param for get_storage
    
    Fix failing tests
    
    Adjust tests related to previous commit
    
    Load swh.storage from config

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/153/ for more details.

anlambert added a subscriber: anlambert.
anlambert added inline comments.
swh/search/journal_client.py
96–134

storage.snapshot_get_branches does not return all branches, a limit of 1000 is set by default.
In order to iterate on all branches, use snapshot_get_all_branches instead.

Also you should merge the fetch_last_*_date method into a single one to avoid iterating on the whole set of branches twice.

This revision now requires changes to proceed.Jun 18 2021, 11:28 AM

(cosmetic) Also, careful on the commit message / diff description typo ;)

KShivendu retitled this revision from Add last_revision_daate and last_release_date to Add last_revision_date and last_release_date.Jun 18 2021, 1:02 PM
KShivendu edited the summary of this revision. (Show Details)
KShivendu retitled this revision from Add last_revision_date and last_release_date to Setup storage and store last revision/release date.
KShivendu edited the summary of this revision. (Show Details)
  • merge the fetch_last_*_date methods
  • Use snapshot_get_all_branches instead of snapshot_get_branches
  • Fix diff/commit description/message
  • Use in-memory backend in test_cli.py

Build is green

Patch application report for D5883 (id=21128)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit 835edc7e5b1efd21d8486b228252a1d1056ebc92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Setup storage and store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/154/ for more details.

  • Include last revision/release date only when available

Build has FAILED

Patch application report for D5883 (id=21129)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit 7d6c68e37f61409e2002829bdc873084194cc427
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 11:23:02 2021 +0000

    Include last revision/release date only when available

commit 835edc7e5b1efd21d8486b228252a1d1056ebc92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Setup storage and store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/155/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/155/console

swh/search/journal_client.py
127

This include last revision/release date only when snapshot_id is available. Is it okay to do this ?

swh/search/elasticsearch.py
43

Should last_release_date be added too here ?

308

no search on min release date ?

swh/search/in_memory.py
136

same here

swh/search/interface.py
64

same here

swh/search/journal_client.py
127

lgtm but you should test for None snapshot at the beginning of the function, this way you can dedent the code for non None snapshot.

  • Add test for fetch_last_revision_release_date in journal_client
  • Add missing arguments and tests related to last_release_date

Build is green

Patch application report for D5883 (id=21150)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit 426a7d3d72bfd23e6fb547d40ddf43ded1507afe
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:06:34 2021 +0000

    Add test for journal_client and add tests for last_release_date

commit 7d6c68e37f61409e2002829bdc873084194cc427
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 11:23:02 2021 +0000

    Include last revision/release date only when available

commit 835edc7e5b1efd21d8486b228252a1d1056ebc92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Setup storage and store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/156/ for more details.

KShivendu marked an inline comment as done.
  • Throw error in absence of storage config
vlorentz added inline comments.
swh/search/journal_client.py
93

import at the toplevel

95–123

and this function definition too

swh/search/tests/test_journal_client.py
190–295

Is this copy-pasted from swh-model's tests? You can import data from there instead.

(And if not, move the definitions outside the function)

206–217

why is did the value of snapshot_id change? this doesn't seem related to this diff.

swh/search/tests/test_journal_client.py
206–217

You're right. It's not related to this diff. It was stored in D5878

  • Changes suggested by vlorentz

Build is green

Patch application report for D5883 (id=21153)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit 2e6a34b8ed706dc9c5e03a5c062fdfd17e0739a6
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jun 21 06:11:02 2021 +0000

    Polish the code

commit e6fbdc54f916b6fd6c2bfed97b2acaeecc77fdb5
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:20:26 2021 +0000

    Throw error in absence of storage config

commit 426a7d3d72bfd23e6fb547d40ddf43ded1507afe
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:06:34 2021 +0000

    Add test for journal_client and add tests for last_release_date

commit 7d6c68e37f61409e2002829bdc873084194cc427
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 11:23:02 2021 +0000

    Include last revision/release date only when available

commit 835edc7e5b1efd21d8486b228252a1d1056ebc92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Setup storage and store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/158/ for more details.

swh/search/tests/test_journal_client.py
23

Please don't mind this. I'll remove it before final squashing.

Almost there ! Some minor changes can still be added before landing this. Regarding my comment about the painless script formatting, it can be handled in another diff.

swh/search/elasticsearch.py
195–264

The code of that painless script could be improved a little for better readability.

Could you add a second commit (or create another diff once this one gets landed) formatting the code in a way like below:

update_script = dedent(
"""
// utility function to get and parse date
ZonedDateTime getDate(def ctx, String date_field) {
    String default_date = "0001-01-01T00:00:00Z";
    String date = ctx._source.getOrDefault(date_field, default_date);
    return ZonedDateTime.parse(date);
}

// backup current visit_types field value
List visit_types = ctx._source.getOrDefault("visit_types", []);
int nb_visits = ctx._source.getOrDefault("nb_visits", 0);

ZonedDateTime last_visit_date = getDate(ctx, "last_visit_date");

String snapshot_id = ctx._source.getOrDefault("snapshot_id", "");
ZonedDateTime last_eventful_visit_date =
    getDate(ctx, "last_eventful_visit_date");
ZonedDateTime last_revision_date = getDate(ctx, "last_revision_date");
ZonedDateTime last_release_date = getDate(ctx, "last_release_date");

// update origin document with new field values
ctx._source.putAll(params);

// restore previous visit types after visit_types field overriding
if (ctx._source.containsKey("visit_types")) {
    for (int i = 0; i < visit_types.length; ++i) {
        if (!ctx._source.visit_types.contains(visit_types[i])) {
            ctx._source.visit_types.add(visit_types[i]);
        }
    }
}

// Undo overwrite if incoming nb_visits is smaller
if (ctx._source.containsKey("nb_visits")) {
    int incoming_nb_visits = ctx._source.getOrDefault("nb_visits", 0);
    if(incoming_nb_visits < nb_visits){
        ctx._source.nb_visits = nb_visits;
    }
}

// Undo overwrite if incoming last_visit_date is older
if (ctx._source.containsKey("last_visit_date")) {
    ZonedDateTime incoming_last_visit_date = getDate(ctx, "last_visit_date");
    int difference =
        // returns -1, 0 or 1
        incoming_last_visit_date.compareTo(last_visit_date);
    if(difference < 0){
        ctx._source.last_visit_date = last_visit_date;
    }
}

// Undo update of last_eventful_date and snapshot_id if
// snapshot_id hasn't changed OR incoming_last_eventful_visit_date is older
if (ctx._source.containsKey("snapshot_id")) {
    String incoming_snapshot_id = ctx._source.getOrDefault("snapshot_id", "");
    ZonedDateTime incoming_last_eventful_visit_date =
        getDate(ctx, "last_eventful_visit_date");
    int difference =
        // returns -1, 0 or 1
        incoming_last_eventful_visit_date.compareTo(last_eventful_visit_date);
    if(snapshot_id == incoming_snapshot_id || difference < 0){
        ctx._source.snapshot_id = snapshot_id;
        ctx._source.last_eventful_visit_date = last_eventful_visit_date;
    }
}

// Undo overwrite if incoming last_revision_date is older
if (ctx._source.containsKey("last_revision_date")) {
    ZonedDateTime incoming_last_revision_date =
        getDate(ctx, "last_revision_date");
    int difference =
        // returns -1, 0 or 1
        incoming_last_revision_date.compareTo(last_revision_date);
    if(difference < 0){
        ctx._source.last_revision_date = last_revision_date;
    }
}

// Undo overwrite if incoming last_release_date is older
if (ctx._source.containsKey("last_release_date")) {
    ZonedDateTime incoming_last_release_date =
        getDate(ctx, "last_release_date");
    // returns -1, 0 or 1
    int difference = incoming_last_release_date.compareTo(last_release_date);
    if(difference < 0){
        ctx._source.last_release_date = last_release_date;
    }
}
"""
swh/search/tests/test_journal_client.py
55

you can remove the id parameter, it is automatically computed by the model.

71

same here and below for all models.

swh/search/tests/test_search.py
363–441

Both tests are quite similar, you should wrap the code in a private function (taking a parameter indicating if we are testing revision or release) that will be called by those.

This revision now requires changes to proceed.Jun 22 2021, 10:59 AM
  • Improve code quality (as suggested by @anlambert)

Build is green

Patch application report for D5883 (id=21213)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit 790833f7b0348c0574d820e8e31d11587cc0295c
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 23 04:25:27 2021 +0000

    Improve code quality

commit 4abac24fd5e6742a5e1b2e0009f750d5eb4976a5
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jun 21 16:51:59 2021 +0000

    Remove unnecessary comment

commit 2e6a34b8ed706dc9c5e03a5c062fdfd17e0739a6
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jun 21 06:11:02 2021 +0000

    Polish the code

commit e6fbdc54f916b6fd6c2bfed97b2acaeecc77fdb5
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:20:26 2021 +0000

    Throw error in absence of storage config

commit 426a7d3d72bfd23e6fb547d40ddf43ded1507afe
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:06:34 2021 +0000

    Add test for journal_client and add tests for last_release_date

commit 7d6c68e37f61409e2002829bdc873084194cc427
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 11:23:02 2021 +0000

    Include last revision/release date only when available

commit 835edc7e5b1efd21d8486b228252a1d1056ebc92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Setup storage and store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/159/ for more details.

Build is green

Patch application report for D5883 (id=21214)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit b4c804a3d26f08f22f98461a84d7d9382a1cc3a1
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Wed Jun 23 04:25:27 2021 +0000

    Improve code quality

commit 4abac24fd5e6742a5e1b2e0009f750d5eb4976a5
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jun 21 16:51:59 2021 +0000

    Remove unnecessary comment

commit 2e6a34b8ed706dc9c5e03a5c062fdfd17e0739a6
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Jun 21 06:11:02 2021 +0000

    Polish the code

commit e6fbdc54f916b6fd6c2bfed97b2acaeecc77fdb5
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:20:26 2021 +0000

    Throw error in absence of storage config

commit 426a7d3d72bfd23e6fb547d40ddf43ded1507afe
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 15:06:34 2021 +0000

    Add test for journal_client and add tests for last_release_date

commit 7d6c68e37f61409e2002829bdc873084194cc427
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Fri Jun 18 11:23:02 2021 +0000

    Include last revision/release date only when available

commit 835edc7e5b1efd21d8486b228252a1d1056ebc92
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    Setup storage and store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters. These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage by passing the snapshot id. So these changes also configure swh.storage client in swh.search

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/160/ for more details.

Is it okay if I add .vscode in .gitignore. It often gets included by mistake. swh-indexer, swh-storage and swh-web already have it in their .gitignore files.

Is it okay if I add .vscode in .gitignore. It often gets included by mistake. swh-indexer, swh-storage and swh-web already have it in their .gitignore files.

I guess it's fine if it's already in other modules.
And we should also add it in the .gitignore template repository [1].
Can you also please also take care of updating it at the same time?

[1] https://forge.softwareheritage.org/source/swh-py-template/

Is it okay if I add .vscode in .gitignore. It often gets included by mistake. swh-indexer, swh-storage and swh-web already have it in their .gitignore files.

I guess it's fine if it's already in other modules.
And we should also add it in the .gitignore template repository [1].
Can you also please also take care of updating it at the same time?

[1] https://forge.softwareheritage.org/source/swh-py-template/

Indeed, this makes sense. Visual Studio Code is quite a popular text editor nowadays (I also use it).

@KShivendu, code looks good to me now. Before landing this, you should squash all commits into a single one and improve commit message style following that guide.

Also @vlorentz and @vsellier should validate the review too as they are your official GSOC mentors.

swh/search/tests/test_search.py
363

Nitpick: could you rename the type parameter to date_type ?

This revision now requires changes to proceed.Jun 23 2021, 11:06 AM
  • rename the type parameter to date_type
  • add .vscode to .gitignore

Build is green

Patch application report for D5883 (id=21226)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit b4675a3a1babbf257ded46d42c3d3757da3456ee
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    journal_client: Store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters.
    These changes will store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage using the snapshot_id. So these changes also configure swh.storage client in swh.search

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/161/ for more details.

Can you also please also take care of updating it at the same time?

Sure. I just created D5912

Also @vlorentz and @vsellier should validate the review too as they are your official GSOC mentors.

Both of them have already reviewed this diff.

Can you also please also take care of updating it at the same time?

Sure. I just created D5912

Also @vlorentz and @vsellier should validate the review too as they are your official GSOC mentors.

Both of them have already reviewed this diff.

Ok then. Before I can accept this, please wrap your commit message body to 80 chars max.
Something like:

journal_client: Store last revision/release date
    
Last revision/release dates are excellent candidates for sorting options 
and filters.

These changes store these values whenever an OriginVisitStatus with a valid
snapshot id is received. The dates are fetched from swh.storage so these 
changes also configure swh.storage client in swh.search.

Update commit messsage body

Build is green

Patch application report for D5883 (id=21228)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit e16952f22eda3ee5ad04f6d73ba089d56c0502fb
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    journal_client: Store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options and filters.
    
    These changes store these values whenever an OriginVisitStatus with a valid snapshot id is received. The dates are fetched from swh.storage so these changes also configure swh.storage client in swh.search.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/162/ for more details.

Limit each line in commit message to 80 chars

Build is green

Patch application report for D5883 (id=21235)

Rebasing onto 35570b4284...

Current branch diff-target is up to date.
Changes applied before test
commit 9bedaa95a39ed8781b3d665346e6ccc62066445b
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jun 17 10:39:17 2021 +0000

    journal_client: Store last revision/release date
    
    Last revision/release dates are excellent candidates for sorting options
    and filters.
    
    These changes store these values whenever an OriginVisitStatus with a valid
    snapshot id is received. The dates are fetched from swh.storage so these
    changes also configure swh.storage client in swh.search.

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/163/ for more details.

Looks good to me, thanks !

This revision is now accepted and ready to land.Jun 23 2021, 4:36 PM