
Nov 10 2017

moranegg moved T733: add content_metadata logic to storage from Backlog to Documentation on the Metadata workflow board.
Nov 10 2017, 3:06 PM · Metadata workflow, Indexer
moranegg moved T715: create indexing strategy for metadata from Backlog to Documentation on the Metadata workflow board.
Nov 10 2017, 3:03 PM · Metadata workflow, Indexer
moranegg edited projects for T831: review all json schemas in storage for metadata objects (content_metadata, revision_metadata and origin_metadata), added: Metadata workflow; removed Metadata implementation.
Nov 10 2017, 3:01 PM · Metadata workflow, Indexer
moranegg edited projects for T733: add content_metadata logic to storage, added: Metadata workflow; removed Metadata implementation.
Nov 10 2017, 2:58 PM · Metadata workflow, Indexer
moranegg edited projects for T715: create indexing strategy for metadata, added: Metadata workflow; removed Metadata implementation.
Nov 10 2017, 2:57 PM · Metadata workflow, Indexer

Nov 6 2017

moranegg moved T715: create indexing strategy for metadata from Backlog to Done (almost done) on the Metadata implementation board.
Nov 6 2017, 12:22 PM · Metadata workflow, Indexer
moranegg created T831: review all json schemas in storage for metadata objects (content_metadata, revision_metadata and origin_metadata).
Nov 6 2017, 11:46 AM · Metadata workflow, Indexer

Oct 26 2017

ardumont created T818: indexer DB should not use bytea for mimetype and encoding columns.
Oct 26 2017, 12:37 PM · Storage manager, Indexer
zack added a comment to T817: analyze bogus mimetype values in content_mimetype table.

FTR, the query I've used to generate the stats is:


(the encoding there is needed due to T818)

Oct 26 2017, 12:37 PM · Archive content, Indexer
zack created T817: analyze bogus mimetype values in content_mimetype table.
Oct 26 2017, 12:36 PM · Archive content, Indexer

Oct 10 2017

ardumont closed T801: Indexer mimetype - Fix parsing error as Resolved by committing rDCIDXa03975e95723: swh.indexer.mimetype: Fix edge case regarding empty raw content.
Oct 10 2017, 3:24 PM · Indexer
ardumont created T803: Indexer - Retrieval error when contents is too big.
Oct 10 2017, 3:04 PM · Indexer, Object storage
ardumont added a comment to T801: Indexer mimetype - Fix parsing error.

This is not reproduced on a stretch node (python3.5).

Oct 10 2017, 1:47 PM · Indexer
ardumont added a comment to T801: Indexer mimetype - Fix parsing error.

Ok, this is the empty file:

Oct 10 2017, 10:21 AM · Indexer
ardumont added a comment to T801: Indexer mimetype - Fix parsing error.

Adding some log, this is the response from the output.

Oct 10 2017, 10:08 AM · Indexer
ardumont created T801: Indexer mimetype - Fix parsing error.
Oct 10 2017, 10:04 AM · Indexer

Oct 6 2017

zack renamed T713: Index existing contents (mimetype, language, license) from Indexing existing contents (mimetype, language, license) to Index existing contents (mimetype, language, license).
Oct 6 2017, 3:04 PM · Indexer

Sep 29 2017

olasd closed T712: Update existing contents with new hash blake2s256, a subtask of T692: worker to efficiently (re)compute content blob checksums, as Resolved.
Sep 29 2017, 1:57 PM · Indexer
olasd closed T712: Update existing contents with new hash blake2s256 as Resolved.

After "manual" computation of the remaining hashes:

Sep 29 2017, 1:57 PM · Indexer

Sep 27 2017

ardumont updated the task description for T713: Index existing contents (mimetype, language, license).
Sep 27 2017, 5:03 PM · Indexer

Sep 25 2017

olasd added a comment to T712: Update existing contents with new hash blake2s256.

After some more manual poking, we're now in the following status:

Sep 25 2017, 3:48 PM · Indexer

Sep 18 2017

olasd added a comment to T712: Update existing contents with new hash blake2s256.

I've done several things today to try to wrap this up, and we're ever so close (3000 or so objects left).

Sep 18 2017, 8:01 PM · Indexer
ardumont added a comment to T712: Update existing contents with new hash blake2s256.

Schedule the rehash once the archiver has finished its copy to banco/azure. IIRC, the orchestration is already possible through the director's setup.
However, a little code may be needed, since I believe there is a slight discrepancy between the data coming out of the director and the data expected by the rehash job.

Sep 18 2017, 10:10 AM · Indexer
zack added a comment to T712: Update existing contents with new hash blake2s256.

Good call in pausing this to avoid uffizi hangs (assuming this was the cause).
We want the different object storage copies to converge, so I think waiting for the archiver to close the gap (possibly increasing resources to it if that helps) before restarting this is the right solution here.

Sep 18 2017, 9:37 AM · Indexer

Sep 17 2017

ardumont added a comment to T712: Update existing contents with new hash blake2s256.

We have reached a point where the remaining contents to rehash are only stored in uffizi (not on the other mirrors, azure and banco, according to the logs).

Sep 17 2017, 5:17 PM · Indexer

Sep 15 2017

ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
Sep 15 2017, 10:45 AM · Indexer
ardumont added a comment to T712: Update existing contents with new hash blake2s256.

This is currently being solved (as in listing + scheduling those missed).

Sep 15 2017, 10:33 AM · Indexer
ardumont added a comment to T712: Update existing contents with new hash blake2s256.

Current status: overall, the contents (3.6B) are mostly rehashed.
But a known issue (T760) caused some contents to be missed (around 5M).

Sep 15 2017, 10:23 AM · Indexer
ardumont closed T692: worker to efficiently (re)compute content blob checksums as Resolved.
Sep 15 2017, 10:17 AM · Indexer
zack assigned T712: Update existing contents with new hash blake2s256 to ardumont.
Sep 15 2017, 9:50 AM · Indexer

Jul 28 2017

ardumont created P171 test_revision_metadata_indexer failure.
Jul 28 2017, 1:02 PM · Indexer
moranegg closed T738: create Revision Indexer, a subtask of T715: create indexing strategy for metadata, as Resolved.
Jul 28 2017, 12:56 PM · Metadata workflow, Indexer

Jul 26 2017

moranegg moved T715: create indexing strategy for metadata from in progress to Backlog on the Metadata implementation board.
Jul 26 2017, 4:55 PM · Metadata workflow, Indexer

Jul 25 2017

moranegg changed the status of T738: create Revision Indexer, a subtask of T715: create indexing strategy for metadata, from Open to Work in Progress.
Jul 25 2017, 3:35 PM · Metadata workflow, Indexer

Jul 18 2017

moranegg updated the task description for T715: create indexing strategy for metadata.
Jul 18 2017, 5:09 PM · Metadata workflow, Indexer

Jul 17 2017

moranegg added a subtask for T715: create indexing strategy for metadata: T738: create Revision Indexer.
Jul 17 2017, 12:18 PM · Metadata workflow, Indexer
moranegg closed T731: Find tools to parse/translate metadata, a subtask of T715: create indexing strategy for metadata, as Wontfix.
Jul 17 2017, 12:17 PM · Metadata workflow, Indexer
moranegg closed T731: Find tools to parse/translate metadata as Wontfix.
Jul 17 2017, 12:17 PM · Metadata implementation, Indexer
moranegg added a comment to T731: Find tools to parse/translate metadata.

Closing task due to change of plan.

Jul 17 2017, 12:16 PM · Metadata implementation, Indexer

Jul 10 2017

moranegg updated the task description for T715: create indexing strategy for metadata.
Jul 10 2017, 1:47 PM · Metadata workflow, Indexer

Jul 7 2017

moranegg moved T733: add content_metadata logic to storage from in progress to Done (almost done) on the Metadata implementation board.
Jul 7 2017, 3:52 PM · Metadata workflow, Indexer
moranegg updated the task description for T733: add content_metadata logic to storage.
Jul 7 2017, 3:36 PM · Metadata workflow, Indexer

Jul 6 2017

moranegg updated the task description for T733: add content_metadata logic to storage.
Jul 6 2017, 4:15 PM · Metadata workflow, Indexer

Jun 29 2017

moranegg added a revision to T733: add content_metadata logic to storage: D219: Added content_metadata logic to the storage.
Jun 29 2017, 3:06 PM · Metadata workflow, Indexer
moranegg updated the task description for T733: add content_metadata logic to storage.
Jun 29 2017, 12:10 PM · Metadata workflow, Indexer
moranegg updated the task description for T733: add content_metadata logic to storage.
Jun 29 2017, 11:50 AM · Metadata workflow, Indexer

Jun 27 2017

moranegg moved T733: add content_metadata logic to storage from Backlog to in progress on the Metadata implementation board.
Jun 27 2017, 3:14 PM · Metadata workflow, Indexer
moranegg updated the task description for T733: add content_metadata logic to storage.
Jun 27 2017, 2:21 PM · Metadata workflow, Indexer
moranegg created T733: add content_metadata logic to storage.
Jun 27 2017, 12:21 PM · Metadata workflow, Indexer

Jun 16 2017

moranegg added a revision to T715: create indexing strategy for metadata: D215: First draft of the metadata content indexer for npm (package.json) T715.
Jun 16 2017, 5:27 PM · Metadata workflow, Indexer
moranegg updated the task description for T715: create indexing strategy for metadata.
Jun 16 2017, 5:26 PM · Metadata workflow, Indexer

Jun 13 2017

moranegg created T731: Find tools to parse/translate metadata.
Jun 13 2017, 1:50 PM · Metadata implementation, Indexer

Jun 8 2017

moranegg moved T715: create indexing strategy for metadata from Backlog to in progress on the Metadata implementation board.
Jun 8 2017, 2:17 PM · Metadata workflow, Indexer
moranegg added a project to T715: create indexing strategy for metadata: Metadata implementation.
Jun 8 2017, 2:12 PM · Metadata workflow, Indexer

Jun 6 2017

ardumont closed T721: Improve license indexer's unknown license policy as Resolved.
Jun 6 2017, 6:26 PM · Indexer, Storage manager
ardumont updated the task description for T721: Improve license indexer's unknown license policy.
Jun 6 2017, 2:26 PM · Indexer, Storage manager
zack renamed T728: normalize encoding values across mimetype and language indexers from Reuse encoding detected in mimetype indexer for language indexer to normalize encoding values across mimetype and language indexers.
Jun 6 2017, 1:53 PM · Indexer
ardumont created T728: normalize encoding values across mimetype and language indexers.
Jun 6 2017, 1:29 PM · Indexer
ardumont closed T722: Improve language indexer performance as Resolved.
Jun 6 2017, 10:59 AM · Indexer
ardumont added a comment to T722: Improve language indexer performance.
  • take only the first 10k of the raw contents (as a possible configuration option).
Jun 6 2017, 10:58 AM · Indexer
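
The truncation idea from the T722 comment above (feed only the first 10k of the raw content to the language detector, controlled by a configuration option) could look roughly like the sketch below. This is not the actual swh.indexer code: the function name `truncate_content`, the constant `DEFAULT_MAX_CONTENT_SIZE`, and the reading of "10k" as 10 * 1024 bytes are all assumptions made for illustration.

```python
# Hypothetical default; "10k" is read here as 10 * 1024 bytes (an assumption).
DEFAULT_MAX_CONTENT_SIZE = 10 * 1024


def truncate_content(raw, max_size=DEFAULT_MAX_CONTENT_SIZE):
    """Return at most the first `max_size` bytes of `raw`.

    A non-positive `max_size` disables truncation, so the limit can be
    switched off from configuration.
    """
    if max_size <= 0:
        return raw
    return raw[:max_size]


if __name__ == "__main__":
    big = b"x" * 50000
    print(len(truncate_content(big)))      # truncated to the default limit
    print(len(truncate_content(big, 0)))   # truncation disabled
```

The trade-off is that detection only sees a prefix of each file, which bounds the cost per content regardless of its size.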

May 30 2017

ardumont added a comment to T722: Improve language indexer performance.

I created a paste, P163, instead of posting the results directly here.

May 30 2017, 11:40 AM · Indexer
ardumont updated the task description for T722: Improve language indexer performance.
May 30 2017, 11:40 AM · Indexer

May 29 2017

ardumont renamed T722: Improve language indexer performance from Make the language Indexer faster to Improve language indexer performance.
May 29 2017, 1:48 PM · Indexer
ardumont triaged T722: Improve language indexer performance as High priority.
May 29 2017, 1:15 PM · Indexer
ardumont created T722: Improve language indexer performance.
May 29 2017, 12:22 PM · Indexer
ardumont created T721: Improve license indexer's unknown license policy.
May 29 2017, 11:01 AM · Indexer, Storage manager

May 17 2017

moranegg updated the task description for T715: create indexing strategy for metadata.
May 17 2017, 4:43 PM · Metadata workflow, Indexer
moranegg updated the task description for T715: create indexing strategy for metadata.
May 17 2017, 4:43 PM · Metadata workflow, Indexer

May 16 2017

moranegg added a watcher for Indexer: moranegg.
May 16 2017, 1:55 PM

May 11 2017

moranegg updated the task description for T715: create indexing strategy for metadata.
May 11 2017, 5:09 PM · Metadata workflow, Indexer
moranegg updated the task description for T715: create indexing strategy for metadata.
May 11 2017, 4:27 PM · Metadata workflow, Indexer
moranegg created T715: create indexing strategy for metadata.
May 11 2017, 4:25 PM · Metadata workflow, Indexer

May 5 2017

ardumont updated the task description for T713: Index existing contents (mimetype, language, license).
May 5 2017, 2:43 PM · Indexer
ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
May 5 2017, 2:42 PM · Indexer
ardumont changed the status of T712: Update existing contents with new hash blake2s256 from Open to Work in Progress.
May 5 2017, 2:37 PM · Indexer
ardumont changed the status of T712: Update existing contents with new hash blake2s256, a subtask of T692: worker to efficiently (re)compute content blob checksums, from Open to Work in Progress.
May 5 2017, 2:37 PM · Indexer
ardumont changed the status of T713: Index existing contents (mimetype, language, license) from Open to Work in Progress.
May 5 2017, 2:37 PM · Indexer

May 2 2017

ardumont updated the task description for T713: Index existing contents (mimetype, language, license).
May 2 2017, 5:44 PM · Indexer
ardumont updated the task description for T713: Index existing contents (mimetype, language, license).
May 2 2017, 4:02 PM · Indexer
ardumont added a comment to T712: Update existing contents with new hash blake2s256.

So, it turns out that sending all contents to rehash in one shot was dumb...
It cluttered the rabbitmq machine's disk (saatchi).

May 2 2017, 3:31 PM · Indexer

Apr 27 2017

ardumont added a comment to T712: Update existing contents with new hash blake2s256.

3262961641 contents sent in batches of 1000, so ~3.26 million messages in the swh_indexer_content_rehash queue.

Apr 27 2017, 6:34 PM · Indexer
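
As a rough check of the figures in the T712 comment above, and assuming each queue message carries one batch of up to 1000 content identifiers, the number of messages is the total divided by the batch size, rounded up. The helpers `batched` and `message_count` below are hypothetical names for illustration, not part of the swh code base.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def message_count(total, batch_size):
    """Number of messages needed to send `total` items, one batch per message."""
    return -(-total // batch_size)  # ceiling division on integers


if __name__ == "__main__":
    # Small demonstration of the batching itself.
    print(list(batched(range(7), 3)))
    # Figures from the comment: 3262961641 contents, batches of 1000.
    print(message_count(3262961641, 1000))
```

Under that assumption the queue would hold a few million messages rather than one per content.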
ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
Apr 27 2017, 6:25 PM · Indexer
ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
Apr 27 2017, 4:08 PM · Indexer
ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
Apr 27 2017, 4:07 PM · Indexer
ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
Apr 27 2017, 4:01 PM · Indexer

Apr 26 2017

ardumont updated the task description for T712: Update existing contents with new hash blake2s256.
Apr 26 2017, 3:20 PM · Indexer
ardumont updated the task description for T713: Index existing contents (mimetype, language, license).
Apr 26 2017, 2:09 PM · Indexer
ardumont created T713: Index existing contents (mimetype, language, license).
Apr 26 2017, 11:50 AM · Indexer
ardumont closed T703: Make the loaders compute blake2s256 hash for new contents, a subtask of T692: worker to efficiently (re)compute content blob checksums, as Resolved.
Apr 26 2017, 11:43 AM · Indexer
ardumont added subtasks for T692: worker to efficiently (re)compute content blob checksums: T712: Update existing contents with new hash blake2s256, T703: Make the loaders compute blake2s256 hash for new contents.
Apr 26 2017, 11:43 AM · Indexer
ardumont added a parent task for T712: Update existing contents with new hash blake2s256: T692: worker to efficiently (re)compute content blob checksums.
Apr 26 2017, 11:43 AM · Indexer
ardumont created T712: Update existing contents with new hash blake2s256.
Apr 26 2017, 11:42 AM · Indexer

Mar 3 2017

zack edited projects for T692: worker to efficiently (re)compute content blob checksums, added: Indexer; removed Object storage.
Mar 3 2017, 11:26 AM · Indexer

Dec 6 2016

ardumont closed T610: Update indexers' information about the name and version of the tools used for their computations as Resolved.
Dec 6 2016, 12:27 PM · Indexer, General
ardumont closed T610: Update indexers' information about the name and version of the tools used for their computations, a subtask of T574: Pipeline copy content to azure and then compute multiple indexes independently (meta task), as Resolved.
Dec 6 2016, 12:27 PM · Indexer, General
ardumont added a comment to T610: Update indexers' information about the name and version of the tools used for their computations.

What remains is to update the azure workers with the latest indexer.
I'm on it.

Dec 6 2016, 11:43 AM · Indexer, General

Dec 2 2016

ardumont created T610: Update indexers' information about the name and version of the tools used for their computations.
Dec 2 2016, 1:29 PM · Indexer, General

Nov 22 2016

ardumont closed T574: Pipeline copy content to azure and then compute multiple indexes independently (meta task) as Resolved.
Nov 22 2016, 10:38 AM · Indexer, General

Nov 18 2016

ardumont closed T596: Add license indexer as Resolved.
Nov 18 2016, 3:53 PM · Indexer, General
ardumont closed T596: Add license indexer, a subtask of T574: Pipeline copy content to azure and then compute multiple indexes independently (meta task), as Resolved.
Nov 18 2016, 3:53 PM · Indexer, General