Page MenuHomeSoftware Heritage

Replace RevisionMetadataIndexer with DirectoryMetadataIndexer
ClosedPublic

Authored by vlorentz on Jun 1 2022, 2:47 PM.

Details

Summary

This will make it easier to support indexing from releases in the future (T4297),
as it will remove the strong dependency on revision ids in the database
and interfaces.

The existence of the indexer/table is mostly to deduplicate work between
origins with the same head revision, and we do not use it outside this
context, so this should have no impact.

The DB migration works by dropping both tables and re-indexing from
scratch; which is necessary as we need to replace revision ids with
directory ids.

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D7937 (id=28590)

Rebasing onto d3f68159d2...

First, rewinding head to replay your work on top of it...
Applying: Replace RevisionMetadataIndexer with DirectoryMetadataIndexer
Changes applied before test
commit e77834e65af04ef7430c826616103f02372b745a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 1 14:42:37 2022 +0200

    Replace RevisionMetadataIndexer with DirectoryMetadataIndexer
    
    This will make it easier to support indexing from releases in the future,
    as it will remove the strong dependency on revision ids in the database
    and interfaces.
    
    The existence of the indexer/table  is mostly to deduplicate work between
    origins with the same head revision, and we do not use it outside this
    context, so this should have no impact.
    
    The DB migration works by dropping both tables and re-indexing from
    scratch; which is necessary as we need to replace revision ids with
    directory ids.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/239/ for more details.

rebase + add missing renames

Build is green

Patch application report for D7937 (id=28596)

Could not rebase; Attempt merge onto d3f68159d2...

Auto-merging swh/indexer/tests/conftest.py
Merge made by the 'recursive' strategy.
 swh/indexer/indexer.py                    |  48 ++++++-----
 swh/indexer/metadata.py                   | 115 ++++++++++++-------------
 swh/indexer/sql/30-schema.sql             |  20 ++---
 swh/indexer/sql/50-func.sql               |  30 +++----
 swh/indexer/sql/60-indexes.sql            |  10 +--
 swh/indexer/sql/upgrades/134.sql          |  18 ++++
 swh/indexer/storage/__init__.py           |  38 +++++----
 swh/indexer/storage/db.py                 |  26 +++---
 swh/indexer/storage/in_memory.py          |  24 +++---
 swh/indexer/storage/interface.py          |  22 ++---
 swh/indexer/storage/model.py              |   6 +-
 swh/indexer/tests/conftest.py             |   2 +-
 swh/indexer/tests/storage/conftest.py     |   6 +-
 swh/indexer/tests/storage/test_storage.py | 134 +++++++++++++++---------------
 swh/indexer/tests/tasks.py                |  10 +--
 swh/indexer/tests/test_cli.py             |  20 ++---
 swh/indexer/tests/test_indexer.py         |  16 ++--
 swh/indexer/tests/test_metadata.py        |  59 ++++++-------
 swh/indexer/tests/test_origin_metadata.py |  76 ++++++++---------
 swh/indexer/tests/utils.py                |  52 ++++--------
 20 files changed, 368 insertions(+), 364 deletions(-)
 create mode 100644 swh/indexer/sql/upgrades/134.sql
Changes applied before test
commit c3d88585f81014b2ca603bdeddb64aab7ef30325
Merge: d3f6815 4865234
Author: Jenkins user <jenkins@localhost>
Date:   Wed Jun 1 14:36:09 2022 +0000

    Merge branch 'diff-target' into HEAD

commit 486523485203ecc9f1eaa2419753ed21badf21db
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 1 14:42:37 2022 +0200

    Replace RevisionMetadataIndexer with DirectoryMetadataIndexer
    
    This will make it easier to support indexing from releases in the future,
    as it will remove the strong dependency on revision ids in the database
    and interfaces.
    
    The existence of the indexer/table  is mostly to deduplicate work between
    origins with the same head revision, and we do not use it outside this
    context, so this should have no impact.
    
    The DB migration works by dropping both tables and re-indexing from
    scratch; which is necessary as we need to replace revision ids with
    directory ids.

commit 746d601aaf64c51e4c7fdc5853df2c439317798e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 1 15:45:42 2022 +0200

    tests: Shorten definition of REVISION

commit a8931ea9f1c469f7cf54a2a70c7eaf87fe7bd77f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 1 15:35:53 2022 +0200

    tests: Simplify definition of ORIGINS list

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/242/ for more details.

ardumont added inline comments.
swh/indexer/metadata.py
197
swh/indexer/sql/upgrades/134.sql
18

good idea providing those are idempotent (i have some doubt regarding schema and higher doubts about indexes).

ardumont requested changes to this revision.Jun 2 2022, 1:35 PM

lgtm

But, please make sure the migration script are actually runnable though (i'd say with docker and swh db upgrade cli).
I'm not sure our schema and indexes sql files are idempotent enough for the migration script 134 to work with it.
^ Hence the request changes here.

This revision now requires changes to proceed.Jun 2 2022, 1:35 PM
vlorentz marked an inline comment as done.

rebase + fix docstring

Build is green

Patch application report for D7937 (id=28658)

Rebasing onto ca4e91e5b7...

Current branch diff-target is up to date.
Changes applied before test
commit 7dc09f93a7ab5bb12a80ed5d81f7ccd590752256
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 1 14:42:37 2022 +0200

    Replace RevisionMetadataIndexer with DirectoryMetadataIndexer
    
    This will make it easier to support indexing from releases in the future,
    as it will remove the strong dependency on revision ids in the database
    and interfaces.
    
    The existence of the indexer/table  is mostly to deduplicate work between
    origins with the same head revision, and we do not use it outside this
    context, so this should have no impact.
    
    The DB migration works by dropping both tables and re-indexing from
    scratch; which is necessary as we need to replace revision ids with
    directory ids.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/248/ for more details.

remove SQL inclusions + bump version

you were right, they don't work at all.

I inlined the relevant stuff, so it works now, assuming I change this:

diff --git a/swh/core/cli/db.py b/swh/core/cli/db.py
index 6cd9121..eee5557 100755
--- a/swh/core/cli/db.py
+++ b/swh/core/cli/db.py
@@ -197,7 +197,7 @@ def db_init(ctx, module, dbname, flavor, initial_version):
             # (expected) db version
             datastore_factory = getattr(import_swhmodule(module), "get_datastore", None)
             if datastore_factory:
-                datastore = datastore_factory(**cfg)
+                datastore = datastore_factory(**cfg).get_db()
                 if not hasattr(datastore, "current_version"):
                     logger.warning(
                         "Datastore %s does not declare the "
@@ -300,7 +300,7 @@ def db_version(ctx, module, show_all, module_config_key=None):
     # instantiate the data source to retrieve the current (expected) db version
     datastore_factory = getattr(import_swhmodule(db_module), "get_datastore", None)
     if datastore_factory:
-        datastore = datastore_factory(**cfg)
+        datastore = datastore_factory(**cfg).get_db()
         code_version = datastore.current_version
         click.secho(
             f"current code version: {code_version}",
@@ -393,7 +393,7 @@ def db_upgrade(ctx, module, to_version, interactive, module_config_key):
         raise click.UsageError(
             "You cannot use this command on old-style datastore backend {db_module}"
         )
-    datastore = datastore_factory(**cfg)
+    datastore = datastore_factory(**cfg).get_db()
     ds_version = datastore.current_version
     if to_version is None:
         to_version = ds_version
diff --git a/swh/core/db/db_utils.py b/swh/core/db/db_utils.py
index d28a0b8..58a244f 100644
--- a/swh/core/db/db_utils.py
+++ b/swh/core/db/db_utils.py
@@ -176,6 +176,10 @@ def swh_db_upgrade(
             "The stored module of the database is different than the given one"
         )
 
+    # wat
+    assert db_version == 1, repr(db_version)
+    db_version = 133
+
     sqlfiles = [
         fname
         for fname in get_sql_for_package(modname, upgrade=True)

Build is green

Patch application report for D7937 (id=28666)

Rebasing onto ca4e91e5b7...

Current branch diff-target is up to date.
Changes applied before test
commit b88e9572f5aaee0707771f2e06b6ecb906a674c1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Jun 1 14:42:37 2022 +0200

    Replace RevisionMetadataIndexer with DirectoryMetadataIndexer
    
    This will make it easier to support indexing from releases in the future,
    as it will remove the strong dependency on revision ids in the database
    and interfaces.
    
    The existence of the indexer/table  is mostly to deduplicate work between
    origins with the same head revision, and we do not use it outside this
    context, so this should have no impact.
    
    The DB migration works by dropping both tables and re-indexing from
    scratch; which is necessary as we need to replace revision ids with
    directory ids.

See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/251/ for more details.

This revision is now accepted and ready to land.Jun 3 2022, 3:07 PM