
metadata-indexer: Configuration tool creating multiple different tools even though they are the same
Closed, Migrated

Description

At the moment, something is off in the tool configuration for the metadata indexer.
The tool registered for the metadata indexer stores a 'dynamic' context in its configuration (a list of mappings whose content and order vary from one entry to the next).
As a result, new tool entries keep being added even though they describe the same tool.

See below for an extract [1]

A priori, the solution would be to remove the context key from the tool_configuration column (seen with @vlorentz).
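
To illustrate why dropping the context key would collapse the duplicates, here is a minimal Python sketch. It is not the indexer's actual code: `tool_key` is a hypothetical helper, and the three rows are copied from the extract [1]. It only assumes that a tool is identified by its name, version and configuration.

```python
import json

# Three swh-metadata-detector rows copied from the extract in this task.
ROWS = [
    (74460503, "swh-metadata-detector", "0.0.2",
     '{"type": "local", "context": ["NpmMapping", "CodemetaMapping"]}'),
    (79181228, "swh-metadata-detector", "0.0.2",
     '{"type": "local", "context": ["NpmMapping", "MavenMapping", '
     '"CodemetaMapping", "GemspecMapping", "PythonPkginfoMapping"]}'),
    (79182115, "swh-metadata-detector", "0.0.2",
     '{"type": "local", "context": ["NpmMapping", "PythonPkginfoMapping", '
     '"MavenMapping", "CodemetaMapping", "GemspecMapping"]}'),
]


def tool_key(name, version, configuration_json, drop_keys=()):
    """Return a hashable identity for a tool (hypothetical helper)."""
    config = json.loads(configuration_json)
    for key in drop_keys:
        config.pop(key, None)
    # sort_keys gives a stable serialization of whatever keys remain.
    return (name, version, json.dumps(config, sort_keys=True))


# With the context kept, every ordering of the mappings is a distinct tool.
with_context = {tool_key(n, v, c) for _, n, v, c in ROWS}
# With the context dropped, all three rows share one identity.
without_context = {tool_key(n, v, c, drop_keys=("context",)) for _, n, v, c in ROWS}

print(len(with_context), len(without_context))
```

The last two rows list the same five mappings in a different order, which is enough to make them distinct as long as the context is part of the tool's identity.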

In any case, the impacts I foresee are:

  • fix the code (according to the desired solution)
  • fix the associated puppet manifest tool configuration
  • SQL scripts to migrate the data in the swh-indexer db (the indexer_configuration entries below should be merged where it makes sense, and the revision_metadata entries should then be updated to link to the right indexer_configuration_id).

[1]

    id     |        tool_name        | tool_version |                                                    tool_configuration
-----------+-------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------
  74460503 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "CodemetaMapping"]}
  74460516 | swh-metadata-translator | 0.0.2        | {"type": "local", "context": "NpmMapping"}
  74496667 | swh-metadata-translator | 0.0.2        | {"type": "local", "context": "MavenMapping"}
  74505359 | swh-metadata-translator | 0.0.2        | {"type": "local", "context": "PythonPkginfoMapping"}
  74608577 | swh-metadata-translator | 0.0.2        | {"type": "local", "context": "CodemetaMapping"}
  79181228 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "MavenMapping", "CodemetaMapping", "GemspecMapping", "PythonPkginfoMapping"]}
  79181505 | swh-metadata-translator | 0.0.2        | {"type": "local", "context": "GemspecMapping"}
  79182115 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "PythonPkginfoMapping", "MavenMapping", "CodemetaMapping", "GemspecMapping"]}
  79183761 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["GemspecMapping", "NpmMapping", "MavenMapping", "PythonPkginfoMapping", "CodemetaMapping"]}
  79183783 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "MavenMapping", "NpmMapping", "GemspecMapping", "CodemetaMapping"]}
  79187154 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["MavenMapping", "PythonPkginfoMapping", "GemspecMapping", "NpmMapping", "CodemetaMapping"]}
  79187160 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["GemspecMapping", "MavenMapping", "NpmMapping", "CodemetaMapping", "PythonPkginfoMapping"]}
  79187161 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "GemspecMapping", "MavenMapping", "CodemetaMapping", "NpmMapping"]}
  79187163 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "CodemetaMapping", "PythonPkginfoMapping", "MavenMapping", "GemspecMapping"]}
  79187164 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["MavenMapping", "CodemetaMapping", "GemspecMapping", "NpmMapping", "PythonPkginfoMapping"]}
  79187165 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "MavenMapping", "GemspecMapping", "PythonPkginfoMapping", "CodemetaMapping"]}
  79187166 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "CodemetaMapping", "PythonPkginfoMapping", "GemspecMapping", "MavenMapping"]}
  79187167 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "CodemetaMapping", "MavenMapping", "GemspecMapping", "NpmMapping"]}
  79187631 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "MavenMapping", "PythonPkginfoMapping", "GemspecMapping", "CodemetaMapping"]}
  79194567 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "PythonPkginfoMapping", "NpmMapping", "GemspecMapping", "MavenMapping"]}
  79202057 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["GemspecMapping", "NpmMapping", "CodemetaMapping", "MavenMapping", "PythonPkginfoMapping"]}
  79211856 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "GemspecMapping", "PythonPkginfoMapping", "NpmMapping", "MavenMapping"]}
  79211865 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "MavenMapping", "NpmMapping", "GemspecMapping", "PythonPkginfoMapping"]}
  79211870 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "NpmMapping", "CodemetaMapping", "MavenMapping", "GemspecMapping"]}
  79211877 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "PythonPkginfoMapping", "MavenMapping", "GemspecMapping", "NpmMapping"]}
  79211883 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "NpmMapping", "MavenMapping", "GemspecMapping", "CodemetaMapping"]}
  79211894 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "NpmMapping", "GemspecMapping", "MavenMapping", "CodemetaMapping"]}
  79211895 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "CodemetaMapping", "GemspecMapping", "PythonPkginfoMapping", "MavenMapping"]}
  79211902 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["MavenMapping", "NpmMapping", "GemspecMapping", "CodemetaMapping", "PythonPkginfoMapping"]}
  79211940 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["MavenMapping", "PythonPkginfoMapping", "GemspecMapping", "CodemetaMapping", "NpmMapping"]}
  79211954 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "MavenMapping", "PythonPkginfoMapping", "NpmMapping", "GemspecMapping"]}
  79268197 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "PythonPkginfoMapping", "GemspecMapping", "NpmMapping", "MavenMapping"]}
  79275070 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "PythonPkginfoMapping", "CodemetaMapping", "GemspecMapping", "MavenMapping"]}
  79276366 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["GemspecMapping", "MavenMapping", "PythonPkginfoMapping", "CodemetaMapping", "NpmMapping"]}
  79280294 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "GemspecMapping", "NpmMapping", "MavenMapping", "CodemetaMapping"]}
  79286793 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "NpmMapping", "MavenMapping", "PythonPkginfoMapping", "GemspecMapping"]}
  79287831 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["GemspecMapping", "MavenMapping", "CodemetaMapping", "PythonPkginfoMapping", "NpmMapping"]}
  79289245 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "GemspecMapping", "CodemetaMapping", "NpmMapping", "MavenMapping"]}
  79346055 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["MavenMapping", "GemspecMapping", "CodemetaMapping", "PythonPkginfoMapping", "NpmMapping"]}
  79346957 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["NpmMapping", "PythonPkginfoMapping", "CodemetaMapping", "MavenMapping", "GemspecMapping"]}
  79347770 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["MavenMapping", "CodemetaMapping", "PythonPkginfoMapping", "GemspecMapping", "NpmMapping"]}
  79348677 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "PythonPkginfoMapping", "NpmMapping", "MavenMapping", "GemspecMapping"]}
  79349895 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "GemspecMapping", "MavenMapping", "NpmMapping", "PythonPkginfoMapping"]}
  79350848 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "NpmMapping", "PythonPkginfoMapping", "MavenMapping", "GemspecMapping"]}
  79351864 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["CodemetaMapping", "PythonPkginfoMapping", "GemspecMapping", "MavenMapping", "NpmMapping"]}
  79352739 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["PythonPkginfoMapping", "CodemetaMapping", "GemspecMapping", "MavenMapping", "NpmMapping"]}
  80838260 | swh-metadata-detector   | 0.0.2        | {"type": "local", "context": ["GemspecMapping", "NpmMapping", "MavenMapping", "CodemetaMapping", "PythonPkginfoMapping"]}
...

Event Timeline

ardumont triaged this task as Normal priority. Feb 25 2019, 9:18 AM
ardumont created this task.

> A priori, the solution would be to remove the context from the tool_configuration column (seen with @vlorentz).

No, only the context key, which doesn't make sense anymore (there's a mappings column in metadata tables).

vlorentz raised the priority of this task from Normal to High. Feb 25 2019, 10:15 AM

> No, only the context key, which doesn't make sense anymore (there's a mappings column in metadata tables).

Right, my bad, that's what I meant ;)

SQL script to migrate (in progress):

begin;

-- origin-intrinsic-metadata

DELETE FROM origin_intrinsic_metadata
WHERE origin_intrinsic_metadata.indexer_configuration_id != (
  SELECT MAX (tmp.indexer_configuration_id)
  FROM origin_intrinsic_metadata AS tmp
  WHERE origin_intrinsic_metadata.origin_id=tmp.origin_id
);

UPDATE origin_intrinsic_metadata
SET indexer_configuration_id=110502138;  -- the right tool for the job

-- revision-metadata

DELETE FROM revision_metadata
WHERE revision_metadata.indexer_configuration_id != (
  SELECT MAX (tmp.indexer_configuration_id)
  FROM revision_metadata AS tmp
  WHERE revision_metadata.id=tmp.id
);

UPDATE revision_metadata
SET indexer_configuration_id=110502138;  -- the right tool for the job

-- softwareheritage-indexer=> select * from indexer_configuration where id=110502138;
--     id     |       tool_name       | tool_version | tool_configuration
--     -----------+-----------------------+--------------+--------------------
--      110502138 | swh-metadata-detector | 0.0.2        | {}
--      (1 row)

-- origin-intrinsic-metadata
ardumont changed the task status from Open to Work in Progress. Feb 25 2019, 4:53 PM
ardumont updated the task description.

More like:

-- clean up origin-intrinsic-metadata
DELETE FROM origin_intrinsic_metadata
WHERE origin_intrinsic_metadata.indexer_configuration_id != (
  SELECT MAX (tmp.indexer_configuration_id)
  FROM origin_intrinsic_metadata AS tmp
  WHERE origin_intrinsic_metadata.origin_id=tmp.origin_id
);

-- create index that will help update query below
create index on origin_intrinsic_metadata (from_revision);

-- drop constraint
alter table origin_intrinsic_metadata drop constraint origin_intrinsic_metadata_revision_metadata_fkey;

-- clean up revision-metadata
DELETE FROM revision_metadata
WHERE revision_metadata.indexer_configuration_id != (
  SELECT MAX (tmp.indexer_configuration_id)
  FROM revision_metadata AS tmp
  WHERE revision_metadata.id=tmp.id
);

-- update to the right tools
WITH src AS (
  UPDATE revision_metadata
  SET indexer_configuration_id=110502138  -- the right tool for the job
  RETURNING id
)
UPDATE origin_intrinsic_metadata dst
SET indexer_configuration_id=110502138
FROM src
WHERE dst.from_revision = src.id;

-- install back foreign key
alter table origin_intrinsic_metadata
add constraint origin_intrinsic_metadata_revision_metadata_fkey
foreign key (from_revision, indexer_configuration_id)
references revision_metadata(id, indexer_configuration_id)
not valid;

-- clean up redundant tools
delete from indexer_configuration
where tool_name='swh-metadata-detector' and
      tool_version='0.0.2'
and id != 110502138;
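
The "keep the row with the highest indexer_configuration_id, then relink everything to one tool" pattern used above can be tried out on a toy dataset. The sketch below is only an illustration under stated assumptions: it uses Python's sqlite3 module as a stand-in for PostgreSQL, a simplified revision_metadata table with just two columns, and made-up revision ids (`rev-a`, `rev-b`, `rev-c`). Only the tool ids come from this task.

```python
import sqlite3

TARGET_TOOL = 110502138  # the single tool kept at the end of the migration

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision_metadata (
  id TEXT NOT NULL,                          -- made-up revision ids
  indexer_configuration_id INTEGER NOT NULL,
  PRIMARY KEY (id, indexer_configuration_id)
);
INSERT INTO revision_metadata VALUES
  ('rev-a', 79181228), ('rev-a', 79182115), ('rev-a', 79183761),
  ('rev-b', 79181228),
  ('rev-c', 79187154), ('rev-c', 80838260);
""")

with conn:  # one transaction, committed on success
    # Keep, for each revision, only the row with the highest tool id.
    conn.execute("""
        DELETE FROM revision_metadata
        WHERE indexer_configuration_id != (
          SELECT MAX(tmp.indexer_configuration_id)
          FROM revision_metadata AS tmp
          WHERE revision_metadata.id = tmp.id
        )
    """)
    survivors = conn.execute(
        "SELECT id, indexer_configuration_id FROM revision_metadata ORDER BY id"
    ).fetchall()
    # One row per revision is left, so relinking cannot violate the primary key.
    conn.execute(
        "UPDATE revision_metadata SET indexer_configuration_id = ?",
        (TARGET_TOOL,),
    )

rows = conn.execute(
    "SELECT id, indexer_configuration_id FROM revision_metadata ORDER BY id"
).fetchall()
print(survivors)
print(rows)
```

Running the DELETE before the UPDATE matters: updating first would try to give several rows of the same revision the same (id, indexer_configuration_id) pair.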

right

The current state: the last delete query (on indexer_configuration) is still running.

The indexers have been restarted nonetheless.

> The current state: the last delete query (on indexer_configuration) is still running.

It got stopped because it was too slow and was also slowing down the indexers.

For the cleanup to actually happen fast, I dropped the constraints
(done), executed the delete (done) and reinstalled the constraints (in
progress).

In the end, the migration script proposed earlier ends with:

-- drop constraints
alter table content_fossology_license drop constraint content_fossology_license_indexer_configuration_id_fkey;
alter table content_language drop constraint content_language_indexer_configuration_id_fkey;
alter table content_mimetype drop constraint content_mimetype_indexer_configuration_id_fkey;
alter table content_ctags drop constraint content_ctags_indexer_configuration_id_fkey;
alter table content_metadata drop constraint content_metadata_indexer_configuration_id_fkey;
alter table revision_metadata drop constraint revision_metadata_indexer_configuration_id_fkey;
alter table origin_intrinsic_metadata drop constraint origin_intrinsic_metadata_indexer_configuration_id_fkey;

-- cleanup
delete from indexer_configuration
where tool_name='swh-metadata-detector' and
      tool_version='0.0.2'
and id != 110502138;

-- install/validate constraints

alter table content_fossology_license add constraint content_fossology_license_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;
alter table content_language add constraint content_language_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;
alter table content_mimetype add constraint content_mimetype_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;
alter table content_ctags add constraint content_ctags_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;
alter table content_metadata add constraint content_metadata_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;
alter table revision_metadata add constraint revision_metadata_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;
alter table origin_intrinsic_metadata add constraint origin_intrinsic_metadata_indexer_configuration_id_fkey
  foreign key (indexer_configuration_id) references indexer_configuration(id) not valid;

alter table content_fossology_license validate constraint content_fossology_license_indexer_configuration_id_fkey;
alter table content_language validate constraint content_language_indexer_configuration_id_fkey;
alter table content_mimetype validate constraint content_mimetype_indexer_configuration_id_fkey;
alter table content_ctags validate constraint content_ctags_indexer_configuration_id_fkey;
alter table content_metadata validate constraint content_metadata_indexer_configuration_id_fkey;
alter table revision_metadata validate constraint revision_metadata_indexer_configuration_id_fkey;
alter table origin_intrinsic_metadata validate constraint origin_intrinsic_metadata_indexer_configuration_id_fkey;
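
Since the drop, add and validate statements above follow the same template for every table, they could also be generated rather than written by hand. The Python sketch below is one hypothetical way to do that. The helper names are made up for illustration; the table list and constraint naming are taken from the script above. Adding the foreign keys as `not valid` and validating them in a separate step means existing rows are checked only during `validate constraint`, not when the constraint is added.

```python
# Tables referencing indexer_configuration(id), as listed in the script above.
TABLES = [
    "content_fossology_license",
    "content_language",
    "content_mimetype",
    "content_ctags",
    "content_metadata",
    "revision_metadata",
    "origin_intrinsic_metadata",
]


def constraint_name(table):
    """Constraint naming scheme observed in the script above."""
    return f"{table}_indexer_configuration_id_fkey"


def drop_statements(tables):
    return [
        f"alter table {t} drop constraint {constraint_name(t)};" for t in tables
    ]


def add_statements(tables):
    return [
        f"alter table {t} add constraint {constraint_name(t)}\n"
        "  foreign key (indexer_configuration_id) "
        "references indexer_configuration(id) not valid;"
        for t in tables
    ]


def validate_statements(tables):
    return [
        f"alter table {t} validate constraint {constraint_name(t)};"
        for t in tables
    ]


if __name__ == "__main__":
    for statement in (
        drop_statements(TABLES) + add_statements(TABLES) + validate_statements(TABLES)
    ):
        print(statement)
```

The cleanup `delete from indexer_configuration ...` would still go between the drop and add statements, as in the script above.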

Cheers,

ardumont updated the task description.

That's better:

13:45:34 softwareheritage-indexer@somerset:5434=> select * from indexer_configuration;
┌───────────┬─────────────────────────┬───────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────┐
│    id     │        tool_name        │     tool_version      │                                      tool_configuration                                      │
├───────────┼─────────────────────────┼───────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│         1 │ nomos                   │ 3.1.0rc2-31-ga2cbb8c  │ {"command_line": "nomossa <filepath>"}                                                       │
│         5 │ universal-ctags         │ ~git7859817b          │ {"command_line": "ctags --fields=+lnz --sort=no --links=no --output-format=json <filepath>"} │
│         6 │ pygments                │ 2.0.1+dfsg-1.1+deb8u1 │ {"type": "library", "debian-package": "python3-pygments"}                                    │
│         7 │ file                    │ 5.22                  │ {"command_line": "file --mime <filepath>"}                                                   │
│         8 │ pygments                │ 2.0.1+dfsg-1.1+deb8u1 │ {"type": "library", "debian-package": "python3-pygments", "max_content_size": 10240}         │
│         9 │ file                    │ 1:5.30-1+deb9u1       │ {"type": "library", "debian-package": "python3-magic"}                                       │
│  74460485 │ origin-metadata         │ 0.0.1                 │ {}                                                                                           │
│  74460516 │ swh-metadata-translator │ 0.0.2                 │ {"type": "local", "context": "NpmMapping"}                                                   │
│  74496667 │ swh-metadata-translator │ 0.0.2                 │ {"type": "local", "context": "MavenMapping"}                                                 │
│  74505359 │ swh-metadata-translator │ 0.0.2                 │ {"type": "local", "context": "PythonPkginfoMapping"}                                         │
│  74608577 │ swh-metadata-translator │ 0.0.2                 │ {"type": "local", "context": "CodemetaMapping"}                                              │
│  79181505 │ swh-metadata-translator │ 0.0.2                 │ {"type": "local", "context": "GemspecMapping"}                                               │
│ 110502138 │ swh-metadata-detector   │ 0.0.2                 │ {}                                                                                           │
│ 110502220 │ swh-metadata-translator │ 0.0.2                 │ {"context": "NpmMapping"}                                                                    │
│ 110502438 │ swh-metadata-translator │ 0.0.2                 │ {"context": "MavenMapping"}                                                                  │
│ 110502755 │ swh-metadata-translator │ 0.0.2                 │ {"context": "GemspecMapping"}                                                                │
│ 110510063 │ swh-metadata-translator │ 0.0.2                 │ {"context": "PythonPkginfoMapping"}                                                          │
│ 110632005 │ swh-metadata-translator │ 0.0.2                 │ {"context": "CodemetaMapping"}                                                               │
└───────────┴─────────────────────────┴───────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────┘
(18 rows)

The same cleanup may be needed for the metadata-translator, but that's not for now.