
Feb 3 2020

douardda added inline comments to D2614: scheduler.backend_es: Leave index opened when streaming bulk.
Feb 3 2020, 11:54 AM

Jan 31 2020

douardda accepted D2566: Add Cassandra backend..

Looks good to me, but it would be really nice to have a bit more documentation and explanation of how things work and are organized in Cassandra, both in the code itself and as documentation material in doc/.

Jan 31 2020, 2:09 PM

Jan 29 2020

douardda committed rDMOD57a0e08925d4: cli: add support for reading a file content from stdin in 'swh identify' command (authored by douardda).
Jan 29 2020, 3:49 PM
douardda closed D2599: cli: add support for reading a file content from stdin in 'swh identify' command.
Jan 29 2020, 3:49 PM
douardda updated the diff for D2599: cli: add support for reading a file content from stdin in 'swh identify' command.

typos

Jan 29 2020, 3:23 PM
douardda added inline comments to D2599: cli: add support for reading a file content from stdin in 'swh identify' command.
Jan 29 2020, 3:22 PM
douardda created D2599: cli: add support for reading a file content from stdin in 'swh identify' command.
Jan 29 2020, 2:57 PM
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.

One question could be 'what is the definitive source of truth in our stack?'

I assumed we wanted to aim for Kafka to be the source of truth.

Jan 29 2020, 2:00 PM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.
In T2003#41456, @olasd wrote:

Now that I think of it, we can decompose this in stages in the storage pipeline:

  • add an input validating proxy high up the stack
  • replace the journal writer calls sprinkled in all methods with a journal writing proxy
  • add a "don't insert objects" filter low down the stack

so we'd end up with the following pipeline for workers:

  • input validation proxy
  • object bundling proxy
  • object deduplication against read-only proxy
  • journal writer proxy
  • addition-blocking filter
  • underlying read-only storage

and the following pipeline for the "main storage replayer":

  • underlying read-write storage

(it's a very short pipeline... a pipedash?)
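
The worker pipeline described above can be sketched as a stack of proxies, each wrapping the next layer down. This is only a minimal illustration under stated assumptions: the class names, the `add`/`missing` methods, the dict-based objects and the list standing in for the journal are all hypothetical and are not the swh.storage API. The object bundling proxy is left out for brevity.

```python
from typing import Dict, Iterable, List


class MemoryStorage:
    """Hypothetical stand-in for the underlying storage backend."""

    def __init__(self) -> None:
        self.objects: Dict[str, dict] = {}

    def add(self, objs: Iterable[dict]) -> int:
        added = 0
        for obj in objs:
            if obj["id"] not in self.objects:
                self.objects[obj["id"]] = obj
                added += 1
        return added

    def missing(self, ids: Iterable[str]) -> List[str]:
        return [i for i in ids if i not in self.objects]


class Proxy:
    """Base class: forwards every call to the next layer down."""

    def __init__(self, storage) -> None:
        self.storage = storage

    def add(self, objs: Iterable[dict]) -> int:
        return self.storage.add(objs)

    def missing(self, ids: Iterable[str]) -> List[str]:
        return self.storage.missing(ids)


class ValidationProxy(Proxy):
    """Input validation, high up the stack."""

    def add(self, objs: Iterable[dict]) -> int:
        objs = list(objs)
        for obj in objs:
            if not isinstance(obj.get("id"), str):
                raise ValueError(f"invalid object: {obj!r}")
        return self.storage.add(objs)


class DeduplicationProxy(Proxy):
    """Drops objects that the read-only storage already knows about."""

    def add(self, objs: Iterable[dict]) -> int:
        objs = list(objs)
        unknown = set(self.storage.missing([o["id"] for o in objs]))
        return self.storage.add([o for o in objs if o["id"] in unknown])


class JournalWriterProxy(Proxy):
    """Writes objects to the journal (a plain list in this sketch)."""

    def __init__(self, storage, journal: list) -> None:
        super().__init__(storage)
        self.journal = journal

    def add(self, objs: Iterable[dict]) -> int:
        objs = list(objs)
        self.journal.extend(objs)
        return self.storage.add(objs)


class AdditionBlockingFilter(Proxy):
    """Swallows writes, so the storage below is effectively read-only."""

    def add(self, objs: Iterable[dict]) -> int:
        return 0


def build_worker_pipeline(storage, journal: list) -> Proxy:
    # Outermost layer first, matching the order of the list above.
    return ValidationProxy(
        DeduplicationProxy(
            JournalWriterProxy(AdditionBlockingFilter(storage), journal)
        )
    )
```

In this sketch, a worker calling `add` on the pipeline only ever appends new objects to the journal. The "main storage replayer" would then consume the journal and call `add` directly on the read-write storage.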

Jan 29 2020, 11:45 AM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.

We already discussed this at the time we replaced the journal-publisher with journal-writer. Adding to Kafka after inserting to the DB means that Kafka will be missing some messages, and we would need to run a backfiller on a regular basis to fix it.
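
To illustrate the ordering argument, here is a toy simulation of a crash between the two writes. It is a sketch only: the function names are hypothetical, the journal is a plain list and the database a plain set, and nothing here uses Kafka or the real storage code.

```python
class Crash(Exception):
    """Simulated process failure between the two writes."""


def add_journal_first(obj, journal: list, db: set, crash: bool = False) -> None:
    journal.append(obj)  # journal write happens first
    if crash:
        raise Crash()
    db.add(obj)


def add_db_first(obj, journal: list, db: set, crash: bool = False) -> None:
    db.add(obj)  # database insert happens first
    if crash:
        raise Crash()
    journal.append(obj)


def simulate(add) -> tuple:
    """Run one insertion that crashes midway; return (journal, db)."""
    journal, db = [], set()
    try:
        add("obj-1", journal, db, crash=True)
    except Crash:
        pass
    return journal, db
```

With the journal written first, the crash leaves the object in the journal but not in the database, so a replayer can still deliver it. With the database written first, the object is in the database but missing from the journal, which is the case that would need a backfiller.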

Jan 29 2020, 11:40 AM · Journal

Jan 28 2020

douardda added inline comments to D2582: Web API endpoint /known/.
Jan 28 2020, 12:12 PM
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.
In T2003#41428, @olasd wrote:

This component would centralize the "has this object already appeared?" logic, as well as the queueing+retry logic, and would replace the current kafka mirror component.

How does that sound?

Jan 28 2020, 9:37 AM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.
In T2003#41429, @olasd wrote:

Key metrics for the filter component:

  • kafka consumer offset
  • min(latest_attempt) where in_flight = true (time it takes for a message from submission in the buffer to (re-)processing by the filter; should stay close to the current time)
  • count(*) where given_up = false group by topic (number of objects pending a retry, should be small)
  • count(*) where in_flight = true group by topic (number of objects buffered for reprocessing, should be small)
  • max(latest_attempt) (last processing time by the requeuing process)
  • count(*) where given_up = true (checks whether the housekeeping process)
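
As a rough sketch, the buffer-side metrics above could be computed with queries like the following. The table name `retry_buffer`, the `object_id` column, the integer timestamps and the use of SQLite are assumptions made for illustration. Only the `topic`, `latest_attempt`, `in_flight` and `given_up` names come from the comment. The Kafka consumer offset is not covered here.

```python
import sqlite3


def make_demo_buffer() -> sqlite3.Connection:
    """Create a hypothetical retry buffer with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE retry_buffer ("
        " topic TEXT, object_id TEXT, latest_attempt INTEGER,"
        " in_flight INTEGER, given_up INTEGER)"
    )
    conn.executemany(
        "INSERT INTO retry_buffer VALUES (?, ?, ?, ?, ?)",
        [
            ("content", "a", 100, 1, 0),
            ("content", "b", 200, 0, 0),
            ("revision", "c", 50, 0, 1),
        ],
    )
    return conn


def filter_metrics(conn: sqlite3.Connection) -> dict:
    """Compute the buffer metrics listed above, one query per metric."""

    def scalar(sql: str):
        return conn.execute(sql).fetchone()[0]

    def by_topic(sql: str) -> dict:
        return dict(conn.execute(sql).fetchall())

    return {
        "oldest_in_flight_attempt": scalar(
            "SELECT min(latest_attempt) FROM retry_buffer WHERE in_flight = 1"
        ),
        "pending_by_topic": by_topic(
            "SELECT topic, count(*) FROM retry_buffer"
            " WHERE given_up = 0 GROUP BY topic"
        ),
        "in_flight_by_topic": by_topic(
            "SELECT topic, count(*) FROM retry_buffer"
            " WHERE in_flight = 1 GROUP BY topic"
        ),
        "latest_attempt": scalar("SELECT max(latest_attempt) FROM retry_buffer"),
        "given_up": scalar("SELECT count(*) FROM retry_buffer WHERE given_up = 1"),
    }
```

Calling `filter_metrics(make_demo_buffer())` returns one dictionary entry per metric, which a monitoring probe could then export.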
Jan 28 2020, 9:30 AM · Journal
douardda added a comment to T2003: Content replayer may try to copy objects before they are available from an objstorage.

Note: I haven't read the other comment below; I am just reacting to this one as I read it.

Jan 28 2020, 9:28 AM · Journal

Jan 23 2020

douardda created P586 (An Untitled Masterwork).
Jan 23 2020, 4:36 PM
douardda added a project to T846: Some objects from the original GitHub import have never actually been imported.: Roadmap 2020.
Jan 23 2020, 2:01 PM · Roadmap 2020, Restricted Project, Archive content
douardda added a subtask for T2207: Improve ingestion efficiency : T846: Some objects from the original GitHub import have never actually been imported..
Jan 23 2020, 2:01 PM · Origin-GitLab, Origin-GitHub, Roadmap 2020
douardda added a parent task for T846: Some objects from the original GitHub import have never actually been imported.: T2207: Improve ingestion efficiency .
Jan 23 2020, 2:01 PM · Roadmap 2020, Restricted Project, Archive content
douardda added a comment to T757: Memory leak in swh.storage.api.server.

Is this still "a thing"?

Jan 23 2020, 1:58 PM · Storage manager
douardda raised the priority of T2003: Content replayer may try to copy objects before they are available from an objstorage from Normal to High.

Since T1914 is high priority, this one is too.

Jan 23 2020, 1:53 PM · Journal
douardda added a comment to T2034: Unbreak journal clients.

What is the status of this issue? Do we still face this bug?

Jan 23 2020, 11:20 AM · Journal
douardda lowered the priority of T2158: Properly deal with removed endpoints from High to Low.

Agreed, this no longer needs to be a high priority task.

Jan 23 2020, 11:16 AM · Web app
douardda triaged T2119: Monitoring of workers as Normal priority.
Jan 23 2020, 11:12 AM · Scheduling utilities, Sprint 2019/12 (Monitor and Conquer)

Jan 20 2020

douardda created T2235: Clean existing documentation.
Jan 20 2020, 2:37 PM · Documentation, Roadmap 2020
douardda created T2234: Write use case-specific documentation.
Jan 20 2020, 2:35 PM · Roadmap 2021, meta-task, Documentation
douardda created T2233: Update documentation to current project status.
Jan 20 2020, 2:35 PM · Documentation, meta-task, Roadmap 2020
douardda created T2232: More robust virtualization layer.
Jan 20 2020, 2:34 PM · System administration, Roadmap 2020
douardda created T2231: Continuous deployment.
Jan 20 2020, 2:34 PM · meta-task, Roadmap 2022, Staging environment, Roadmap 2020
douardda created T2230: Uniform deployment for dev/production.
Jan 20 2020, 2:34 PM · Roadmap 2020
douardda created T2229: Backups.
Jan 20 2020, 2:33 PM · System administration, Roadmap 2020
douardda created T2228: Metrics and monitoring.
Jan 20 2020, 2:32 PM · Metrics/monitoring, Roadmap 2020
douardda created T2227: Automatic Puppet deployment.
Jan 20 2020, 2:32 PM · System administration, Roadmap 2020
douardda created T2226: Sysadm / devops.
Jan 20 2020, 2:31 PM · meta-task, Roadmap 2020
douardda created T2225: Migrate to GitLab.
Jan 20 2020, 2:31 PM · meta-task, Roadmap 2022, GitLab migration, Roadmap 2020
douardda created T2224: Development workflow self-assessment.
Jan 20 2020, 2:30 PM · Roadmap 2020
douardda created T2223: Type checking.
Jan 20 2020, 2:30 PM · Roadmap 2020
douardda created T2222: Sprints.
Jan 20 2020, 2:30 PM · Roadmap 2020
douardda created T2221: Development workflow & code quality.
Jan 20 2020, 2:29 PM · meta-task, Roadmap 2020
douardda created T2220: swh-graph in production.
Jan 20 2020, 2:28 PM · Roadmap 2022, meta-task, Roadmap 2021, Compressed graph service
douardda created T2219: Authentication / authorization.
Jan 20 2020, 2:28 PM · Roadmap 2020
douardda created T2218: Orchestration.
Jan 20 2020, 2:27 PM · Scheduling utilities, Roadmap 2020
douardda created T2217: Plumbings.
Jan 20 2020, 2:26 PM · meta-task, Roadmap 2020
douardda created T2216: Packing object storage.
Jan 20 2020, 2:26 PM · Object storage, Roadmap 2020
douardda created T2215: Streaming support everywhere.
Jan 20 2020, 2:25 PM · meta-task, Web app, Object storage, Storage manager, Roadmap 2020
douardda created T2214: Scale-out graph and database storage in production.
Jan 20 2020, 2:24 PM · meta-task, Roadmap 2022, Roadmap 2021, Storage manager
douardda created T2213: Storage.
Jan 20 2020, 2:23 PM · meta-task, Roadmap 2020
douardda created T2212: Specification for swh:2+: identifiers.
Jan 20 2020, 2:23 PM · Data Model, Roadmap 2020
douardda created T2211: Go beyond git expressivity.
Jan 20 2020, 2:22 PM · Mercurial loader, Storage manager, Data Model, Roadmap 2020
douardda created T2210: Data Model.
Jan 20 2020, 2:20 PM · Data Model, Roadmap 2020
douardda created T2209: At least 2 full mirrors up and running.
Jan 20 2020, 2:20 PM · Mirror, Roadmap 2020
douardda created T2208: Durability.
Jan 20 2020, 2:19 PM · meta-task, Roadmap 2020
douardda created T2207: Improve ingestion efficiency .
Jan 20 2020, 2:18 PM · Origin-GitLab, Origin-GitHub, Roadmap 2020
douardda created T2206: Quality of Service.
Jan 20 2020, 2:17 PM · meta-task, Roadmap 2020
douardda created T2205: SLOC-level provenance tracking (prototype).
Jan 20 2020, 2:16 PM · Roadmap 2020
douardda renamed T2204: Full-text search on source code (prototype) from Full-text search on a sizeable archive subset to Full-text search (prototype).
Jan 20 2020, 2:16 PM · Roadmap 2021
douardda created T2204: Full-text search on source code (prototype).
Jan 20 2020, 2:15 PM · Roadmap 2021
douardda created T2203: Intrinsic metadata.
Jan 20 2020, 2:13 PM · Intrinsic metadata, Roadmap 2020
douardda created T2202: Collect extrinsic metadata.
Jan 20 2020, 2:12 PM · Roadmap 2022, meta-task, Roadmap 2021, Extrinsic metadata
douardda created T2201: Indexing / mining.
Jan 20 2020, 2:11 PM · meta-task, Roadmap 2020
douardda created T2200: Automatic forge discovery.
Jan 20 2020, 2:10 PM · Lister, Roadmap 2020
douardda created T2199: "Save Forge Now".
Jan 20 2020, 2:08 PM · Roadmap 2020
douardda created T2198: Robust SVN import.
Jan 20 2020, 2:07 PM · SVN Loader, Roadmap 2020
douardda created T2197: Ingestion / coverage.
Jan 20 2020, 2:06 PM · meta-task, Roadmap 2020
douardda created T2196: Batch APIs.
Jan 20 2020, 2:05 PM · Roadmap 2020
douardda created T2195: Web API 2.
Jan 20 2020, 2:04 PM · Roadmap 2020
douardda created T2194: Archive Integration (Web API).
Jan 20 2020, 2:03 PM · Roadmap 2021, meta-task
douardda renamed T2190: Archive Navigation (Web UI) from Archive Navigation to Archive Navigation (Web UI).
Jan 20 2020, 2:03 PM · Web app, meta-task, Roadmap 2020
douardda triaged T2193: Add provenance feature as Normal priority.
Jan 20 2020, 2:02 PM · Provenance database, Roadmap 2020
douardda triaged T2192: UX improvements as Normal priority.
Jan 20 2020, 2:00 PM · Web app, Roadmap 2020
douardda triaged T2191: Metadata Views as Normal priority.
Jan 20 2020, 1:59 PM · Metadata workflow, Web app, Roadmap 2020
douardda triaged T2190: Archive Navigation (Web UI) as Normal priority.
Jan 20 2020, 1:57 PM · Web app, meta-task, Roadmap 2020
douardda added a watcher for Roadmap 2020: douardda.
Jan 20 2020, 1:55 PM
douardda edited Description on Roadmap 2020.
Jan 20 2020, 1:53 PM
douardda set the icon for Roadmap 2020 to Goal.
Jan 20 2020, 11:47 AM
douardda created Roadmap 2020.
Jan 20 2020, 11:46 AM
douardda committed rDENV2feaa55ecc3e: docker/test: add a pytest based test for the vault stack (authored by douardda).
Jan 20 2020, 11:25 AM
douardda closed D2553: docker/test: add a pytest based test for the vault stack.
Jan 20 2020, 11:25 AM
douardda committed rDENV60551ce76112: docker/test: add a pytest based test for the git loading stack (authored by douardda).
Jan 20 2020, 11:25 AM
douardda closed D2552: docker/test: add a pytest based test for the git loading stack.
Jan 20 2020, 11:25 AM
douardda updated the diff for D2553: docker/test: add a pytest based test for the vault stack.

Fix typos and address ardumont's comments

Jan 20 2020, 11:22 AM
douardda closed D2508: Add a tox.ini for docker tests.

closed by 490c2454749679186ffca9cdd3f480e50d2147c2

Jan 20 2020, 11:16 AM

Jan 17 2020

douardda added a comment to D2552: docker/test: add a pytest based test for the git loading stack.

Why does setup_pip call scheduler_host.check_output()?

Jan 17 2020, 5:26 PM
douardda created D2553: docker/test: add a pytest based test for the vault stack.
Jan 17 2020, 4:12 PM
douardda created D2552: docker/test: add a pytest based test for the git loading stack.
Jan 17 2020, 4:12 PM
douardda committed rDENV7fd9330b3fd2: docker/tox: install pdbpp in py3 environment (authored by douardda).
Jan 17 2020, 2:13 PM
douardda committed rDENV5357798c55a6: docker/tox: add {posargs} to the pytest cli (authored by douardda).
Jan 17 2020, 2:13 PM
douardda updated the task description for T1805: Public API v2.
Jan 17 2020, 10:21 AM · meta-task, Web app
douardda updated the task description for T1805: Public API v2.
Jan 17 2020, 10:21 AM · meta-task, Web app

Jan 16 2020

douardda committed rDSTOCbc0e81c3d140: pre-commit: explicitely whitelist 'iff' when running codespell (authored by douardda).
Jan 16 2020, 5:22 PM
douardda committed rDSTOC29eb5489580f: Fix a few typos reported by codespell (authored by douardda).
Jan 16 2020, 5:22 PM
douardda committed rDSTOC1472c8e89334: fix trailing ws reported by pre-commit (authored by douardda).
Jan 16 2020, 5:22 PM
douardda committed rDSTOCa97db9319dda: Fix swh-storage-add-dir (authored by douardda).
Jan 16 2020, 5:22 PM
douardda committed rDSTOC264cd33b77aa: Add a pre-commit-hooks.yaml config file (authored by douardda).
Jan 16 2020, 5:22 PM
douardda committed rDSTOCc7878084ac48: Remove utils/(dump|fix)_revisions scripts (authored by douardda).
Jan 16 2020, 5:22 PM
douardda committed rDSTOCea9aa4736fa7: and not only for an existing origin visit. (authored by douardda).
Jan 16 2020, 5:22 PM
douardda accepted D2532: cran.loader: Align cran loader with other package loaders.

Not a definitive solution but, as discussed IRL, let's quick-fix the cran lister, then refactor it "the proper way".

Jan 16 2020, 11:04 AM

Jan 15 2020

douardda added inline comments to D2524: cran.lister: Use CRAN's canonical url as origin url.
Jan 15 2020, 4:17 PM
douardda created P583 (An Untitled Masterwork).
Jan 15 2020, 3:26 PM
douardda accepted D2514: Add env var SWH_MAIN_PACKAGE to initialize sentry_sdk with a release..

I'm fine with the code, but as I already said, I'd really like the commit message to have a paragraph on why this is needed and what problem it solves.

Jan 15 2020, 11:13 AM

Jan 14 2020

douardda created P582 Errors when starting swh-web with django 2 (in docker-dev).
Jan 14 2020, 11:35 AM