Mar 12 2020

douardda updated the diff for D2819: model: use attrs_static to enforce type validation of model objects.

add missing mypy deps in requirements-test

Mar 12 2020, 4:22 PM
douardda added inline comments to D2824: model: improve a bit the TimestampWithTimezone model.
Mar 12 2020, 4:19 PM
douardda created D2824: model: improve a bit the TimestampWithTimezone model.
Mar 12 2020, 4:09 PM
douardda created D2823: tests: add low level tests for the Timestamp model entity.
Mar 12 2020, 4:09 PM
douardda updated the diff for D2819: model: use attrs_static to enforce type validation of model objects.

add support for the 'validator' argument in attrib_typecheck

Mar 12 2020, 4:07 PM
douardda created P613 (An Untitled Masterwork).
Mar 12 2020, 3:33 PM
douardda closed D2818: tests/identifiers: fix 'target', 'directory' and 'parents' object types.
Mar 12 2020, 2:53 PM
douardda committed rDMOD56ae59c5ddbd: test/model: do not test direct instanciation of model objects (authored by douardda).
test/model: do not test direct instanciation of model objects
Mar 12 2020, 2:53 PM
douardda committed rDMOD97af8866ebaf: tests/identifiers: fix 'target', 'directory' and 'parents' object types (authored by douardda).
tests/identifiers: fix 'target', 'directory' and 'parents' object types
Mar 12 2020, 2:53 PM
douardda closed D2817: test/model: do not test direct instanciation of model objects.
Mar 12 2020, 2:53 PM
douardda committed rDMODc74696036e97: tests/models: use d.copy() instead of dict(d) (authored by douardda).
tests/models: use d.copy() instead of dict(d)
Mar 12 2020, 2:53 PM
douardda closed D2816: tests/models: use d.copy() instead of dict(d).
Mar 12 2020, 2:53 PM
douardda closed D2815: model: kill Origin.type attribute.
Mar 12 2020, 2:53 PM
douardda committed rDMODf533f62bbf11: model: kill Origin.type attribute (authored by douardda).
model: kill Origin.type attribute
Mar 12 2020, 2:53 PM
douardda committed rDMOD0a6d7e050d2c: Extract the dictify() function from BaseModel.to_dict() (authored by douardda).
Extract the dictify() function from BaseModel.to_dict()
Mar 12 2020, 2:53 PM
douardda closed D2814: Extract the dictify() function from BaseModel.to_dict().
Mar 12 2020, 2:53 PM
douardda requested review of D2814: Extract the dictify() function from BaseModel.to_dict().
Mar 12 2020, 2:31 PM
douardda updated the diff for D2819: model: use attrs_static to enforce type validation of model objects.

rebase + add missing plugin file

Mar 12 2020, 2:30 PM
douardda updated the diff for D2818: tests/identifiers: fix 'target', 'directory' and 'parents' object types.

replace bhex() by _x() and other stuff reported by olasd

Mar 12 2020, 2:29 PM
douardda added inline comments to D2811: scanner: added test for the model.
Mar 12 2020, 2:15 PM
douardda requested changes to D2769: Fix crash on None snapshot..

The annotation part should be done on the whole module and, most importantly, in a dedicated revision.

Mar 12 2020, 1:57 PM
douardda accepted D2771: Make release_add support adding the same object twice in the same call.
Mar 12 2020, 1:53 PM
douardda added a comment to D2771: Make release_add support adding the same object twice in the same call.

I really think we should either have it for all object types or none at all.

Mar 12 2020, 1:53 PM
douardda added a comment to D2814: Extract the dictify() function from BaseModel.to_dict().

What about adding tests for this? Or do you rely on BaseModel's?

Mar 12 2020, 1:49 PM
douardda triaged T2309: Add support for other hash algo than sha1 in current objstorage implementation as Normal priority.
Mar 12 2020, 1:43 PM · Object storage
douardda created P612 (An Untitled Masterwork).
Mar 12 2020, 10:52 AM

Mar 11 2020

douardda created D2819: model: use attrs_static to enforce type validation of model objects.
Mar 11 2020, 5:56 PM
douardda created D2818: tests/identifiers: fix 'target', 'directory' and 'parents' object types.
Mar 11 2020, 5:55 PM
douardda created D2817: test/model: do not test direct instanciation of model objects.
Mar 11 2020, 5:55 PM
douardda created D2816: tests/models: use d.copy() instead of dict(d).
Mar 11 2020, 5:54 PM
douardda created D2815: model: kill Origin.type attribute.
Mar 11 2020, 5:54 PM
douardda created D2814: Extract the dictify() function from BaseModel.to_dict().
Mar 11 2020, 5:53 PM
douardda updated the task description for T2308: Better Validation in swh.model.
Mar 11 2020, 4:07 PM · Data Model
douardda triaged T2308: Better Validation in swh.model as Normal priority.
Mar 11 2020, 4:06 PM · Data Model
douardda created T2308: Better Validation in swh.model.
Mar 11 2020, 4:06 PM · Data Model
douardda created P611 (An Untitled Masterwork).
Mar 11 2020, 2:11 PM
douardda committed rDSTOaa39be1b3b77: storage/writer: refactor JournalWriter.content_add to send model objects (authored by douardda).
storage/writer: refactor JournalWriter.content_add to send model objects
Mar 11 2020, 10:40 AM
douardda closed D2803: storage/writer: refactor JournalWriter.content_add to send model objects.
Mar 11 2020, 10:39 AM
douardda added a comment to D2803: storage/writer: refactor JournalWriter.content_add to send model objects.
In D2803#67209, @olasd wrote:
In D2803#67024, @olasd wrote:

My main doubt was whether we stopped explicitly converting model objects to dicts altogether (going through the swh.core model serializer instead). But even in that case contents will still be deserializable (as Content.from_dict(d) still works even when d['data'] is None).

What swh.core model serializer do you refer to? The ones in swh.core.api?

Yes. And now that you've pointed it out, I've remembered that it's the swh.storage RPC layer that adds a hook to support model objects.

Mar 11 2020, 10:38 AM
douardda added a comment to D2803: storage/writer: refactor JournalWriter.content_add to send model objects.
In D2803#67024, @olasd wrote:

My main doubt was whether we stopped explicitly converting model objects to dicts altogether (going through the swh.core model serializer instead). But even in that case contents will still be deserializable (as Content.from_dict(d) still works even when d['data'] is None).

Mar 11 2020, 10:35 AM

Mar 10 2020

douardda closed D2801: kafka: normalize KafkaJournalWriter.write_addition[s] API.
Mar 10 2020, 5:35 PM
douardda committed rDJNL82df6acedbb1: kafka: normalize KafkaJournalWriter.write_addition[s] API (authored by douardda).
kafka: normalize KafkaJournalWriter.write_addition[s] API
Mar 10 2020, 5:35 PM
douardda updated the diff for D2801: kafka: normalize KafkaJournalWriter.write_addition[s] API.

remove extra parameter 'anon' mistakenly included in the diff

Mar 10 2020, 5:29 PM
douardda created D2803: storage/writer: refactor JournalWriter.content_add to send model objects.
Mar 10 2020, 4:46 PM
douardda committed rDSTOa97781d21131: storage/validate: small code formatting (authored by douardda).
storage/validate: small code formatting
Mar 10 2020, 4:43 PM
douardda created D2801: kafka: normalize KafkaJournalWriter.write_addition[s] API.
Mar 10 2020, 4:41 PM

Mar 6 2020

douardda created P605 (An Untitled Masterwork).
Mar 6 2020, 5:36 PM
douardda added inline comments to D2777: journal.replay: Batch insert contents/skipped_contents in storage backend.
Mar 6 2020, 1:40 PM
douardda committed rDSTO3b8b718aa0c5: sql: do not attempt to create the plpgsql lang if already exists (authored by douardda).
sql: do not attempt to create the plpgsql lang if already exists
Mar 6 2020, 1:39 PM
douardda closed D2776: sql: do not attempt to create the plpgsql lang if already exists.
Mar 6 2020, 1:39 PM
douardda added a comment to D2776: sql: do not attempt to create the plpgsql lang if already exists.
In D2776#66377, @olasd wrote:

This looks sound but the tests are hanging on the initialization of the postgresql database now... (at least on jenkins)

Mar 6 2020, 1:37 PM
douardda accepted D2778: Add install instructions for Cassandra..
Mar 6 2020, 1:25 PM
douardda accepted D2777: journal.replay: Batch insert contents/skipped_contents in storage backend.

ok (besides my remark).

Mar 6 2020, 11:54 AM
douardda added inline comments to D2777: journal.replay: Batch insert contents/skipped_contents in storage backend.
Mar 6 2020, 11:54 AM
douardda created D2776: sql: do not attempt to create the plpgsql lang if already exists.
Mar 6 2020, 9:31 AM

Mar 4 2020

douardda accepted D2767: Add some tenacity to checking whether an object is in the destination.
Mar 4 2020, 5:35 PM
douardda created P603 (An Untitled Masterwork).
Mar 4 2020, 3:20 PM
douardda created P602 (An Untitled Masterwork).
Mar 4 2020, 2:04 PM

Mar 3 2020

douardda committed rCDFPcc2ae5af9877: images/base: add support for the LOG_LEVEL env var for replayer services (authored by douardda).
images/base: add support for the LOG_LEVEL env var for replayer services
Mar 3 2020, 10:53 AM
douardda committed rCDFPb730b619299c: Update a bit the README file (authored by douardda).
Update a bit the README file
Mar 3 2020, 10:53 AM
douardda committed rCDFP05dde3bf616d: grafana: fix the datasource config (authored by douardda).
grafana: fix the datasource config
Mar 3 2020, 10:53 AM
douardda committed rCDFP6748a2080ca2: grafana: add a backend statistics dashboard, tune a bit the graph replayer one (authored by douardda).
grafana: add a backend statistics dashboard, tune a bit the graph replayer one
Mar 3 2020, 10:53 AM
douardda committed rCDFPa7d896f05aa2: Move nginx listening port to 5081 (authored by douardda).
Move nginx listening port to 5081
Mar 3 2020, 10:53 AM
douardda committed rCDFP9ecc8aa09974: update images entrypoint files (authored by douardda).
update images entrypoint files
Mar 3 2020, 10:53 AM
douardda committed rCDFPefd4b4496e46: images/web: use a better 'shell' CMD support in web's entrypoint (authored by douardda).
images/web: use a better 'shell' CMD support in web's entrypoint
Mar 3 2020, 10:53 AM
douardda committed rCDFP698e861a7056: example: fix the content-replayer.yml.example file (authored by douardda).
example: fix the content-replayer.yml.example file
Mar 3 2020, 10:53 AM
douardda committed rCDFPd4c658bf1a6a: mirror: update the mirror deployment compose file (authored by douardda).
mirror: update the mirror deployment compose file
Mar 3 2020, 10:53 AM
douardda committed rCDFP9fd8cdc38af7: images/web: reduce swh-web image size (authored by douardda).
images/web: reduce swh-web image size
Mar 3 2020, 10:53 AM
douardda committed rCDFP59ae8d7374b0: postgres: improve a bit the Postgresql configuration (authored by douardda).
postgres: improve a bit the Postgresql configuration
Mar 3 2020, 10:53 AM
douardda committed rCDFP390be1b78a7a: Dockerfile: update to buster and add the pgsql.sh utils file (authored by douardda).
Dockerfile: update to buster and add the pgsql.sh utils file
Mar 3 2020, 10:53 AM
douardda committed rCDFPd9078f56c6ed: Add prometheus, statsd and grafana services (authored by douardda).
Add prometheus, statsd and grafana services
Mar 3 2020, 10:53 AM
douardda committed rCDFPa0021f7e1bbb: web: add missing config entries (authored by douardda).
web: add missing config entries
Mar 3 2020, 10:53 AM
douardda committed rCDFP524e7ee6d410: Add a pre-commit config file (authored by douardda).
Add a pre-commit config file
Mar 3 2020, 10:53 AM
douardda committed rCDFP1e603cd4fda5: README: update the README file (authored by douardda).
README: update the README file
Mar 3 2020, 10:53 AM
douardda added a comment to D2751: Add support for the static consumer group feature to journal client.

This is nice.

Mar 3 2020, 9:48 AM

Feb 17 2020

douardda created D2680: Add a paragraph in the README file about installing azure-cli from pip.
Feb 17 2020, 12:57 PM

Feb 12 2020

douardda requested changes to D2651: JournalClient: split main loop in three functions.

You should give a hint in your commit message on why you do this refactoring.

Feb 12 2020, 9:55 AM

Feb 6 2020

douardda requested changes to D2614: scheduler.backend_es: Leave index opened when streaming bulk.

Okay-ish, but the lifecycle of ES-related services/objects is unclear to me.

Feb 6 2020, 11:11 AM
douardda requested changes to D2619: in-memory storage: compute all counters.

Thanks for the contribution.
You must, however, ensure the tests pass before we can accept it. Note that the tests you modify (in tests/test_storage.py) are executed by all the storage backends (postgres, cassandra and the in_memory one you are actually targeting here). So make sure they still pass with all the backends.

Feb 6 2020, 10:57 AM

Feb 3 2020

douardda added inline comments to D2614: scheduler.backend_es: Leave index opened when streaming bulk.
Feb 3 2020, 11:54 AM

Jan 31 2020

douardda accepted D2566: Add Cassandra backend..

Looks good to me, but it would really be nice to have a bit more documentation/explanation of how things work and are organized in Cassandra, be it in the code itself or as documentation material in doc/

Jan 31 2020, 2:09 PM

Jan 29 2020

douardda committed rDMOD57a0e08925d4: cli: add support for reading a file content from stdin in 'swh identify' command (authored by douardda).
cli: add support for reading a file content from stdin in 'swh identify' command
Jan 29 2020, 3:49 PM
douardda closed D2599: cli: add support for reading a file content from stdin in 'swh identify' command.
Jan 29 2020, 3:49 PM
douardda updated the diff for D2599: cli: add support for reading a file content from stdin in 'swh identify' command.

typos

Jan 29 2020, 3:23 PM
douardda added inline comments to D2599: cli: add support for reading a file content from stdin in 'swh identify' command.
Jan 29 2020, 3:22 PM
douardda created D2599: cli: add support for reading a file content from stdin in 'swh identify' command.
Jan 29 2020, 2:57 PM
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.

One question could be 'what is the definitive source of truth in our stack?'

I assumed we wanted to aim for Kafka to be the source of truth

Jan 29 2020, 2:00 PM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.
In T2003#41456, @olasd wrote:

Now that I think of it, we can decompose this in stages in the storage pipeline:

  • add an input validating proxy high up the stack
  • replace the journal writer calls sprinkled in all methods with a journal writing proxy
  • add a "don't insert objects" filter low down the stack

so we'd end up with the following pipeline for workers:

  • input validation proxy
  • object bundling proxy
  • object deduplication against read-only proxy
  • journal writer proxy
  • addition-blocking filter
  • underlying read-only storage

and the following pipeline for the "main storage replayer":

  • underlying read-write storage

(it's a very short pipeline... a pipedash?)
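
The worker pipeline listed above can be read as a chain of proxies, each wrapping the next storage in line. The following is a minimal sketch of that idea, not swh.storage code: every class name here (ReadOnlyStorage, AdditionBlockingFilter, JournalWriterProxy, ValidatingProxy), the dict-based content format and the list standing in for the journal are assumptions made for illustration, and only three of the six stages are shown.

```python
class ReadOnlyStorage:
    """Hypothetical stand-in for the underlying storage backend."""

    def __init__(self):
        self.objects = {}

    def content_add(self, contents):
        for content in contents:
            self.objects[content["sha1"]] = content
        return len(contents)


class AdditionBlockingFilter:
    """Swallows additions so they never reach the underlying storage."""

    def __init__(self, storage):
        self.storage = storage

    def content_add(self, contents):
        return 0  # nothing is inserted by the worker itself


class JournalWriterProxy:
    """Records every added object in the journal, then forwards the call."""

    def __init__(self, storage, journal):
        self.storage = storage
        self.journal = journal  # a plain list standing in for Kafka

    def content_add(self, contents):
        self.journal.extend(contents)
        return self.storage.content_add(contents)


class ValidatingProxy:
    """Rejects malformed input before it enters the rest of the pipeline."""

    def __init__(self, storage):
        self.storage = storage

    def content_add(self, contents):
        for content in contents:
            if "sha1" not in content:
                raise ValueError("content is missing its 'sha1' key")
        return self.storage.content_add(contents)


def build_worker_pipeline(journal):
    """Stack the proxies top-down: validation, journal writer, blocking filter."""
    backend = ReadOnlyStorage()
    pipeline = ValidatingProxy(
        JournalWriterProxy(AdditionBlockingFilter(backend), journal)
    )
    return pipeline, backend
```

With this stacking, a valid object ends up in the journal but not in the worker's backend, which matches the idea that a separate replayer with read-write access performs the actual insertion.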

Jan 29 2020, 11:45 AM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.

We already discussed this at the time we replaced the journal-publisher with journal-writer. Adding to Kafka after inserting to the DB means that Kafka will be missing some messages, and we would need to run a backfiller on a regular basis to fix it.

Jan 29 2020, 11:40 AM · Journal

Jan 28 2020

douardda added inline comments to D2582: Web API endpoint /known/.
Jan 28 2020, 12:12 PM
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.
In T2003#41428, @olasd wrote:

This component would centralize the "has this object already appeared?" logic, as well as the queueing+retry logic, and would replace the current kafka mirror component.

How does that sound?

Jan 28 2020, 9:37 AM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.
In T2003#41429, @olasd wrote:

Key metrics for the filter component:

  • kafka consumer offset
  • min(latest_attempt) where in_flight = true (time it takes for a message from submission in the buffer to (re-)processing by the filter; should stay close to the current time)
  • count(*) where given_up = false group by topic (number of objects pending a retry, should be small)
  • count(*) where in_flight = true group by topic (number of objects buffered for reprocessing, should be small)
  • max(latest_attempt) (last processing time by the requeuing process)
  • count(*) where given_up = true (checks whether the housekeeping process)
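
As a rough illustration of how the metrics above could be computed, here is a sketch against an SQLite table. The column names (topic, in_flight, given_up, latest_attempt) come from the list; the table name retry_buffer, the use of SQLite, the integer encoding of booleans and timestamps, and both function names are assumptions. The Kafka consumer offset is not covered, since it would come from the consumer rather than from this table.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical retry_buffer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE retry_buffer ("
        "topic TEXT, in_flight INTEGER, given_up INTEGER, latest_attempt INTEGER)"
    )
    conn.executemany(
        "INSERT INTO retry_buffer VALUES (?, ?, ?, ?)",
        [
            ("content", 1, 0, 100),
            ("content", 0, 0, 200),
            ("revision", 1, 0, 150),
            ("revision", 0, 1, 50),
        ],
    )
    return conn


def filter_metrics(conn):
    """Run one query per metric and return the results in a dict."""

    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    def per_topic(sql):
        return dict(conn.execute(sql).fetchall())

    return {
        "oldest_in_flight": scalar(
            "SELECT min(latest_attempt) FROM retry_buffer WHERE in_flight = 1"
        ),
        "pending_by_topic": per_topic(
            "SELECT topic, count(*) FROM retry_buffer "
            "WHERE given_up = 0 GROUP BY topic"
        ),
        "in_flight_by_topic": per_topic(
            "SELECT topic, count(*) FROM retry_buffer "
            "WHERE in_flight = 1 GROUP BY topic"
        ),
        "latest_attempt": scalar("SELECT max(latest_attempt) FROM retry_buffer"),
        "given_up": scalar("SELECT count(*) FROM retry_buffer WHERE given_up = 1"),
    }
```

Calling filter_metrics(make_demo_db()) returns one entry per metric, which could then be exported to whatever monitoring system is in use.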
Jan 28 2020, 9:30 AM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.

Note: I haven't read the other comment below; I'm just reacting to this one as I read it.

Jan 28 2020, 9:28 AM · Journal

Jan 23 2020

douardda created P586 (An Untitled Masterwork).
Jan 23 2020, 4:36 PM
douardda added a project to T846: Some objects from the original GitHub import have never actually been imported.: Roadmap 2020.
Jan 23 2020, 2:01 PM · Roadmap 2020, Restricted Project, Archive content
douardda added a subtask for T2207: Improve ingestion efficiency: T846: Some objects from the original GitHub import have never actually been imported..
Jan 23 2020, 2:01 PM · Origin-GitLab, Origin-GitHub, Roadmap 2020
douardda added a parent task for T846: Some objects from the original GitHub import have never actually been imported.: T2207: Improve ingestion efficiency.
Jan 23 2020, 2:01 PM · Roadmap 2020, Restricted Project, Archive content
douardda added a comment to T757: Memory leak in swh.storage.api.server.

Is this still "a thing"?

Jan 23 2020, 1:58 PM · Storage manager
douardda raised the priority of T2003: Content replayer may try to copy objects before they are available from an objstorage from Normal to High.

Since T1914 is high priority, this one is too.

Jan 23 2020, 1:53 PM · Journal
douardda added a comment to T2034: Unbreak journal clients.

What is the status of this issue? Do we still face this bug?

Jan 23 2020, 11:20 AM · Journal