olasd (Nicolas Dandrimont)
User · Administrator

Projects (8)

User Details

User Since
Sep 7 2015, 3:25 PM (319 w, 22 h)
Roles
Administrator

Recent Activity

Today

olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.

Sent a summary of this discussion to the swh-devel list for input:

Tue, Oct 19, 11:36 AM · Git loader

Yesterday

olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.

B3 I am not convinced a "synthetic" flag on the Snapshot branch makes sense, or at least I find this name confusing, especially considering we already have a synthetic flag on Revision: it's not synthetic in the sense that it's not an object crafted by SWH; it comes from the origin.

Mon, Oct 18, 4:42 PM · Git loader
olasd added a comment to T3661: docs: Activate build on docs diff.

I think the main challenge here will be doing this in such a way that we don't have to do a fresh clone of swh-environment (and all associated repos) every time we build.

Mon, Oct 18, 11:53 AM · Documentation
olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.

I would like us to conclude this discussion soon.

Mon, Oct 18, 11:29 AM · Git loader

Fri, Oct 15

olasd added a comment to T3633: staging/production - Kafka access for ENEA mirror.

The permissions were missing for consumer groups, so no consumer could get started at all.

Fri, Oct 15, 6:53 PM · System administration
olasd accepted D6473: Improve PostgreSQL storage scheme for the `with-path-denormalized` flavor.

Looks sensible to me, thanks.

Fri, Oct 15, 3:58 PM
olasd accepted D6478: Activate sentry for counter journal client.
Fri, Oct 15, 10:23 AM
olasd accepted D6479: bib/install: disable pip self ugprade.
Fri, Oct 15, 10:22 AM
olasd added a project to T3660: Nodes with missing ancestors in SWH DAG / SWH-graph: Archive content.
Fri, Oct 15, 9:17 AM · Archive content
olasd added a comment to T3656: Survey revisions/releases with partially loaded history.
21:57 guest@softwareheritage => select count(distinct id) from revision_history where not exists (select 1 from revision where id=parent_id);
 count 
───────
  2218
(1 row)
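
The query above counts revisions that reference at least one parent missing from the revision table. A minimal sketch of the same check on toy data, assuming a simplified two-table schema (the `revision` and `revision_history` column layout here is an assumption, and the helper names are hypothetical), run against an in-memory SQLite database rather than the production PostgreSQL:

```python
import sqlite3


def count_revisions_with_missing_parents(conn):
    """Count distinct revisions with at least one parent absent from `revision`."""
    query = """
        SELECT count(DISTINCT id)
        FROM revision_history
        WHERE NOT EXISTS (
            SELECT 1 FROM revision
            WHERE revision.id = revision_history.parent_id
        )
    """
    return conn.execute(query).fetchone()[0]


def build_toy_archive():
    """Build a tiny in-memory database (hypothetical, simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE revision (id TEXT PRIMARY KEY);
        CREATE TABLE revision_history (id TEXT, parent_id TEXT);
        """
    )
    # Revisions a, b and c are archived; x and y are not.
    conn.executemany("INSERT INTO revision VALUES (?)", [("a",), ("b",), ("c",)])
    conn.executemany(
        "INSERT INTO revision_history VALUES (?, ?)",
        [("b", "a"), ("c", "b"), ("c", "x"), ("a", "y")],
    )
    return conn


if __name__ == "__main__":
    print(count_revisions_with_missing_parents(build_toy_archive()))
```

In this toy data, revisions `a` and `c` each point at a parent that was never loaded, so both are counted.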
Fri, Oct 15, 8:50 AM · Archive content

Thu, Oct 14

olasd added a comment to T3487: Installation of the new provenance server.

I've run ALTER SYSTEM commands to bump these configuration variables in $DATADIR/postgresql.auto.conf, then ran pg_reload_conf():

Thu, Oct 14, 5:44 PM · System administration
olasd added a comment to T3487: Installation of the new provenance server.

The log is flooded with

2021-10-14 15:24:54.422 UTC [3951720] LOG:  checkpoints are occurring too frequently (28 seconds apart)
2021-10-14 15:24:54.422 UTC [3951720] HINT:  Consider increasing the configuration parameter "max_wal_size".
Thu, Oct 14, 5:27 PM · System administration
olasd added a comment to T3487: Installation of the new provenance server.
17:19:13     +olasd ╡ the postgresql tuning hasn't happened yet, afaict? effective_cache_size isn't set, and shared_buffers is tiny
17:19:46          ⤷ ╡ I'd bump shared_buffers to 128 GB and effective_cache_size to 256 GB, see where that gets you
17:20:19          ⤷ ╡ and probably maintenance_work_mem to something like 16 or 32 GB
17:20:54          ⤷ ╡ as well as random_page_cost to something lower like 1.5
Thu, Oct 14, 5:22 PM · System administration
olasd added a comment to D6477: staging/journal: Declare a new kafka node to migrate journal0.

Ah, now that I read through this again: would it make sense for the zookeeper server to be called using the CNAME instead of the host FQDN?

Thu, Oct 14, 4:14 PM
olasd accepted D6477: staging/journal: Declare a new kafka node to migrate journal0.

Looks good, except for a missing new TLS certificate, I think.

Thu, Oct 14, 4:13 PM
olasd updated subscribers of T3658: Reference bitbucket mercurial origins.
In T3658#72284, @olasd wrote:

We could argue that adding a separate, "virtual" lister instance for these bulk archived origins would make sense, but I don't know if it's worth the bother.

Thu, Oct 14, 3:20 PM · System administration, Mercurial loader
olasd added a comment to T3658: Reference bitbucket mercurial origins.

I was thinking of something ad-hoc such as:

Thu, Oct 14, 3:17 PM · System administration, Mercurial loader
olasd added inline comments to D6472: Add a script for a 'monthly roadmap report' bot email.
Thu, Oct 14, 2:13 PM
olasd added a comment to T1957: Handling missing DAG nodes.

In SWHIDv2, instead of having a hardcoded "pointer to another revision" directory entry type, we could enable pointers to more generic "unresolved external entities". When possible, we should make these pointers compatible with the current ExtID table, so that users of the data can look the contents of the pointed objects up lazily.

Thu, Oct 14, 12:06 PM · Data Model
olasd updated subscribers of T1617: Experiment with generation numbers to improve revisions walk performance.

@vlorentz mentioned this idea in the context of T3655 (git loader global deduplication).

Thu, Oct 14, 12:00 PM · Storage manager
olasd added a comment to T3635: git loader: enable "partial" global deduplication of revisions via the extid mapping table.

Then I don't really get how this can help if we don't load revisions in topological order.

Thu, Oct 14, 11:54 AM · Git loader
olasd updated the task description for T3655: loader git: enable global deduplication of head branches before fetching them.
Thu, Oct 14, 11:41 AM · Git loader
olasd triaged T3656: Survey revisions/releases with partially loaded history as Low priority.
Thu, Oct 14, 11:40 AM · Archive content
olasd added a parent task for T3635: git loader: enable "partial" global deduplication of revisions via the extid mapping table: T3655: loader git: enable global deduplication of head branches before fetching them.
Thu, Oct 14, 11:18 AM · Git loader
olasd added subtasks for T3655: loader git: enable global deduplication of head branches before fetching them: T3635: git loader: enable "partial" global deduplication of revisions via the extid mapping table, T3654: loader git: load revisions in topological order.
Thu, Oct 14, 11:18 AM · Git loader
olasd added a parent task for T3654: loader git: load revisions in topological order: T3655: loader git: enable global deduplication of head branches before fetching them.
Thu, Oct 14, 11:18 AM · Git loader
olasd triaged T3655: loader git: enable global deduplication of head branches before fetching them as Normal priority.
Thu, Oct 14, 11:18 AM · Git loader
olasd renamed T3635: git loader: enable "partial" global deduplication of revisions via the extid mapping table from Reduce git loader work (use extid mapping table) to git loader: enable "partial" global deduplication of revisions via the extid mapping table.
Thu, Oct 14, 11:15 AM · Git loader
olasd added a comment to T3654: loader git: load revisions in topological order.

(I've removed T3653 as parent as this is a somewhat longer term endeavour. Not the topological sorting itself, but making sure that (most) existing revisions aren't dangling, before we can use this topological guarantee)
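
To illustrate what loading revisions in topological order means here, a minimal sketch using Kahn's algorithm: every revision is emitted after all of its parents that are part of the same batch. The input shape (a dict mapping a revision id to its parent ids) and the function name are assumptions for illustration, not the loader's actual data structures. Parents outside the batch (already archived, or dangling) are ignored for ordering purposes.

```python
from collections import deque


def topological_order(parents_by_revision):
    """Return revision ids so that each one comes after its in-batch parents."""
    pending = {}  # revision id -> number of in-batch parents not yet emitted
    children = {rev: [] for rev in parents_by_revision}
    for rev, parents in parents_by_revision.items():
        # Only parents that are part of this batch constrain the order.
        known = {p for p in parents if p in parents_by_revision}
        pending[rev] = len(known)
        for parent in known:
            children[parent].append(rev)

    ready = deque(rev for rev, count in pending.items() if count == 0)
    order = []
    while ready:
        rev = ready.popleft()
        order.append(rev)
        for child in children[rev]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(parents_by_revision):
        raise ValueError("cycle detected in revision graph")
    return order


if __name__ == "__main__":
    batch = {"c": ["b", "x"], "b": ["a"], "a": []}
    print(topological_order(batch))
```

With this ordering, a revision present in the archive implies that its in-batch ancestors were written before it, which is the guarantee the discussion relies on once existing revisions are no longer dangling.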

Thu, Oct 14, 11:13 AM · Git loader
olasd removed a parent task for T3654: loader git: load revisions in topological order: T3653: Stabilize loader git.
Thu, Oct 14, 11:12 AM · Git loader
olasd removed a subtask for T3653: Stabilize loader git: T3654: loader git: load revisions in topological order.
Thu, Oct 14, 11:12 AM · Git loader
olasd triaged T3654: loader git: load revisions in topological order as Low priority.
Thu, Oct 14, 11:11 AM · Git loader
olasd accepted D6458: tests: Turn origin* hypothesis strategies into pytest fixtures.
In D6458#167772, @olasd wrote:
In D6458#167771, @olasd wrote:

Yeah, sure, I don't have a problem with that.

(That is, I don't have a problem with these changes landing first, as long as we make sure that eventually we have a proper way of reproducing test failures that have come out of "random" fixtures.)

I might have a proper solution to reproduce test failures involving random fixtures. I am going to land that pile of diffs and submit a new one for reproducibility afterwards.

Thu, Oct 14, 11:03 AM

Wed, Oct 13

olasd added a comment to D6458: tests: Turn origin* hypothesis strategies into pytest fixtures.
In D6458#167771, @olasd wrote:

Yeah, sure, I don't have a problem with that.

Wed, Oct 13, 5:34 PM
olasd added a comment to D6458: tests: Turn origin* hypothesis strategies into pytest fixtures.

@olasd Could you open a task, so anlambert can land this stack of diffs now before we discuss the next step?

Wed, Oct 13, 5:31 PM
olasd accepted D6464: sysadm: Fix remaining warning on sysadm docs.
Wed, Oct 13, 10:14 AM

Tue, Oct 12

olasd added a comment to D6458: tests: Turn origin* hypothesis strategies into pytest fixtures.

Thanks for working on reducing the number of hypothesis fixtures!

Tue, Oct 12, 6:46 PM
olasd accepted D6462: sphinx: update the plantuml version installed by the debian package.
Tue, Oct 12, 5:43 PM
olasd committed rDDOC8efe43320d21: sysadm: add stub to the data silo pages (authored by olasd).
sysadm: add stub to the data silo pages
Tue, Oct 12, 5:10 PM
olasd committed rDDOCb1d41f68fba1: sysadm: Add some meat to the PostgreSQL section (authored by olasd).
sysadm: Add some meat to the PostgreSQL section
Tue, Oct 12, 5:10 PM
olasd committed rDDOC85752ba1f728: Make sure `make clean` cleans images too (authored by olasd).
Make sure `make clean` cleans images too
Tue, Oct 12, 4:17 PM
olasd committed rDDOCab66ccb0b6b1: Add stubs for data silos (authored by olasd).
Add stubs for data silos
Tue, Oct 12, 4:17 PM
olasd committed rCJSWHcb5bb02aa0e4: swh-docs/dev: publish all built docs in jenkins (authored by olasd).
swh-docs/dev: publish all built docs in jenkins
Tue, Oct 12, 3:19 PM
olasd committed rDDOCdffd9cd4449b: sysadm/server-architecture: add work in progress markers (authored by olasd).
sysadm/server-architecture: add work in progress markers
Tue, Oct 12, 2:42 PM
olasd committed rDDOC615367ea7336: sysadm/puppet: add work in progress markers (authored by olasd).
sysadm/puppet: add work in progress markers
Tue, Oct 12, 2:33 PM
olasd committed rDDOCe816b5e8b429: sysadm: Add backup section to index (authored by olasd).
sysadm: Add backup section to index
Tue, Oct 12, 2:33 PM
olasd committed rDDOCf6a788eef7be: sysadm: bootstrap backup section (authored by olasd).
sysadm: bootstrap backup section
Tue, Oct 12, 12:29 PM
olasd committed rDDOCffbb8dde36dd: sysadm: consistency change: s/puppet/Puppet/g (authored by olasd).
sysadm: consistency change: s/puppet/Puppet/g
Tue, Oct 12, 12:21 PM
olasd committed rDDOC42679ad3f1e5: Bootstrap server architecture documentation (authored by olasd).
Bootstrap server architecture documentation
Tue, Oct 12, 12:19 PM
olasd committed rDDOC2708163004a8: Bootstrap puppet documentation (authored by olasd).
Bootstrap puppet documentation
Tue, Oct 12, 12:19 PM
olasd committed rDDOC96c92d7d6a44: sysadm: proper alignment in toctree directive (authored by olasd).
sysadm: proper alignment in toctree directive
Tue, Oct 12, 11:54 AM

Fri, Oct 8

olasd added a comment to T3621: Create a production read-only objstorage.

Hmm, do we really want this to be open to the world with no authentication whatsoever? (which is what D6448 seems to be doing)

Fri, Oct 8, 5:32 PM · System administration
olasd closed D6447: buffer: add some debug logging for number of objects sent.
Fri, Oct 8, 5:27 PM
olasd committed rDSTO3441f68985ae: buffer: add some debug logging for number of objects sent (authored by olasd).
buffer: add some debug logging for number of objects sent
Fri, Oct 8, 5:27 PM
olasd closed D6446: buffer: add a threshold for the estimated size of revision and release batches.
Fri, Oct 8, 5:05 PM
olasd committed rDSTOb6040142fe72: buffer: add a threshold for the estimated size of revision and release batches (authored by olasd).
buffer: add a threshold for the estimated size of revision and release batches
Fri, Oct 8, 5:05 PM
olasd updated the diff for D6447: buffer: add some debug logging for number of objects sent.

rebase

Fri, Oct 8, 4:54 PM
olasd updated the diff for D6446: buffer: add a threshold for the estimated size of revision and release batches.

Fix revision -> release typo in release_add flush call

Fri, Oct 8, 4:54 PM
olasd closed D6445: buffer: add a threshold for the number of revision parents in one batch.
Fri, Oct 8, 4:53 PM
olasd committed rDSTO7c5b0ec15e40: buffer: add a threshold for the number of revision parents in one batch (authored by olasd).
buffer: add a threshold for the number of revision parents in one batch
Fri, Oct 8, 4:53 PM
olasd requested review of D6447: buffer: add some debug logging for number of objects sent.
Fri, Oct 8, 4:23 PM
olasd requested review of D6446: buffer: add a threshold for the estimated size of revision and release batches.
Fri, Oct 8, 4:08 PM
olasd requested review of D6445: buffer: add a threshold for the number of revision parents in one batch.
Fri, Oct 8, 4:08 PM
olasd added a comment to T3625: Reduce git loader memory footprint.
In T3625#71799, @olasd wrote:

While we're at it, we should probably be adding some thresholds in the buffer proxy for:

  • cumulated length of messages for revisions and releases
Fri, Oct 8, 4:02 PM · Git loader
olasd added a revision to T3625: Reduce git loader memory footprint: D6445: buffer: add a threshold for the number of revision parents in one batch.
Fri, Oct 8, 4:01 PM · Git loader
olasd added a revision to T3625: Reduce git loader memory footprint: D6446: buffer: add a threshold for the estimated size of revision and release batches.
Fri, Oct 8, 3:58 PM · Git loader
olasd closed D6443: buffer: add a threshold for the number of directory entries in one batch.
Fri, Oct 8, 3:56 PM
olasd committed rDSTO5edc0ba7ac12: buffer: add a threshold for the number of directory entries in one batch (authored by olasd).
buffer: add a threshold for the number of directory entries in one batch
Fri, Oct 8, 3:56 PM
olasd committed rDSTOabe95b34a2f9: filter: add filtering for release_add (authored by olasd).
filter: add filtering for release_add
Fri, Oct 8, 3:56 PM
olasd committed rDSTOc52b7b667911: filter: do not call the underlying functions if there's nothing to add (authored by olasd).
filter: do not call the underlying functions if there's nothing to add
Fri, Oct 8, 3:56 PM
olasd closed D6427: swh.storage filter/buffer improvements.
Fri, Oct 8, 3:56 PM
olasd committed rDSTO5d5d4c941eac: buffer: Ensure that we don't send data from empty buffers (authored by olasd).
buffer: Ensure that we don't send data from empty buffers
Fri, Oct 8, 3:56 PM
olasd requested review of D6443: buffer: add a threshold for the number of directory entries in one batch.
Fri, Oct 8, 3:21 PM
olasd added a revision to T3625: Reduce git loader memory footprint: D6443: buffer: add a threshold for the number of directory entries in one batch.
Fri, Oct 8, 3:06 PM · Git loader
olasd added inline comments to D6442: Extract the path slicing logic in a dedicated PathSlicer class.
Fri, Oct 8, 2:46 PM
olasd published D6427: swh.storage filter/buffer improvements for review.

I'll split off the new buffer thresholds in a new diff. This diff now only contains the (small) improvements to the buffer/filter proxies.

Fri, Oct 8, 2:15 PM
olasd accepted D6439: Make workers send task events only when required.

Thanks!

Fri, Oct 8, 10:56 AM

Thu, Oct 7

olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.

Ah, another question I've been thinking about: should we go back to existing visits of git repositories and give them a new, pruned snapshot? Our data model now allows it: we can just append a new final OriginVisitStatus pointing at a pruned snapshot.

Thu, Oct 7, 12:46 PM · Git loader
olasd added a comment to D6424: Perfect hashmap C implementation.

@dachary It'd be nice if you could describe what this is about in the commit message and
the diff description (if you actually provide a commit description, then when you create
the diff, the commit message is used as a description bootstrap). I know it's more work
for you but it happens that:

  1. it helps the reviewers to have some context directly here (without having to navigate
between a multitude of tasks. FYI, I've followed through the task but it's not enough,
I need to also dig in that arborescence of tasks).
  2. it is also how we are doing that in every other module ;)
  3. the curious could learn a thing or two even if they don't do a proper review.

Please and thanks in advance.

Cheers,

Thu, Oct 7, 12:39 PM
olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.
In T3627#71809, @zack wrote:

Thanks for your feedback @olasd. I see three main arguments raised there: (1) the raciness of archiving those data via other means (= related forks), (2) the completeness of our canvassing of synthetic refs, (3) annotating rather than not archiving "synthetic" refs.

For (1), sure, it's racy, hence we could lose stuff that gets removed from GitHub before we have the time to archive it. But this is a drop in the ocean in comparison with our lag/backlog.

Thu, Oct 7, 12:15 PM · Git loader
olasd added a comment to T3487: Installation of the new provenance server.

rSPSITE6a233452cd48 fixed the prometheus node exporter.

Thu, Oct 7, 11:17 AM · System administration
olasd added a comment to T3608: Deprecate most of the /browse/origin/.* URLs.

Awesome, thanks for confirming this!

Thu, Oct 7, 10:54 AM · Web app
olasd committed rSPSITE6a233452cd48: Drop netdev ignored device from prometheus config (authored by olasd).
Drop netdev ignored device from prometheus config
Thu, Oct 7, 10:50 AM
olasd committed rSPSITE09689dd703c7: Add missing newline to data/common/common.yaml (authored by olasd).
Add missing newline to data/common/common.yaml
Thu, Oct 7, 10:50 AM
olasd added a comment to T3608: Deprecate most of the /browse/origin/.* URLs.

I'm asking this because using predictable origin-centric URLs is generally much more user friendly than having to use multiple APIs to look up the SWHID of a given object before being able to construct the URL, and one would always have to do dynamic API calls to generate the URL for browsing the "latest archival" of a given origin.

Thu, Oct 7, 10:22 AM · Web app
olasd added a comment to T3608: Deprecate most of the /browse/origin/.* URLs.

Just to be clear, you're looking to keep these URLs working, but turn them into redirects over to SWHID-centric URLs with context parameters (and drop the original view code from these URLs), correct?

Thu, Oct 7, 10:19 AM · Web app
olasd added a comment to T3625: Reduce git loader memory footprint.

While we're at it, we should probably be adding some thresholds in the buffer proxy for:

  • cumulated length of messages for revisions and releases
  • cumulated number of parents for revisions
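
The two thresholds listed above can be sketched as a small buffering wrapper that flushes a batch of revisions as soon as any limit is reached. This is only an illustration under stated assumptions: the `Revision` and `RevisionBuffer` classes, the `flush_batch` callback and the default limits are all hypothetical and do not reproduce the swh.storage buffer proxy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Revision:
    """Hypothetical, simplified revision object."""
    id: str
    message: bytes
    parents: Sequence[str] = ()


@dataclass
class RevisionBuffer:
    """Accumulate revisions and flush when any threshold is reached."""
    flush_batch: Callable[[List[Revision]], None]
    max_objects: int = 1000            # illustrative default
    max_parents: int = 2000            # illustrative default
    max_message_bytes: int = 1_000_000  # illustrative default
    _batch: List[Revision] = field(default_factory=list, init=False)
    _parents: int = field(default=0, init=False)
    _message_bytes: int = field(default=0, init=False)

    def add(self, revision: Revision) -> None:
        self._batch.append(revision)
        self._parents += len(revision.parents)
        self._message_bytes += len(revision.message)
        if (
            len(self._batch) >= self.max_objects
            or self._parents >= self.max_parents
            or self._message_bytes >= self.max_message_bytes
        ):
            self.flush()

    def flush(self) -> None:
        # Never send an empty batch downstream.
        if not self._batch:
            return
        batch, self._batch = self._batch, []
        self._parents = 0
        self._message_bytes = 0
        self.flush_batch(batch)


if __name__ == "__main__":
    sent = []
    buf = RevisionBuffer(sent.append, max_parents=3)
    buf.add(Revision("a", b"first", ("p1", "p2")))
    buf.add(Revision("b", b"second", ("p3",)))
    print(len(sent), [r.id for r in sent[0]])
```

Counting parents and message bytes, rather than objects alone, bounds the size of a batch even when a few revisions are unusually large.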
Thu, Oct 7, 10:11 AM · Git loader
olasd added a comment to T3625: Reduce git loader memory footprint.

(this also matches the fact that we've seen, on our main ingestion database, directory_add operations that would take multiple hours, and have knock-on effects on backups and replications because of the long-running insertion transactions)

Thu, Oct 7, 10:09 AM · Git loader
olasd added a comment to T3625: Reduce git loader memory footprint.

So, after doing some more analysis of memory usage patterns on these edge case repositories, my suspicion is that the high memory usage is generally being caused by the loader processing batches of large directories, closely packed together, at the same time.

Thu, Oct 7, 10:08 AM · Git loader
olasd requested changes to D6401: Filter out pull request related branches.

This should stay pending until we resolve the archiving policy discussion in T3627, so I'm marking it as such.

Thu, Oct 7, 9:57 AM
olasd added a comment to T3627: Consider dropping pull request references from the git loader ingestion.

Yes, we must filter this stuff out (we discussed this issue with @zack some time ago)

Thu, Oct 7, 9:53 AM · Git loader
olasd added a comment to D6405: Respect task configuration to allow ignoring task result event.

This looks like an okay thing to do, but instead of only ignoring results (which would only cut down a third of the messages), we should probably be deactivating events completely on these workers.

Yes, I started with that config because I did not initially find the way to configure the send_events to False (or something).

Thu, Oct 7, 9:49 AM
olasd added a revision to T3625: Reduce git loader memory footprint: D6427: swh.storage filter/buffer improvements.
Thu, Oct 7, 9:23 AM · Git loader

Wed, Oct 6

olasd committed rCDFJ9486c2dd1559: Just run a full apt dist-upgrade first (authored by olasd).
Just run a full apt dist-upgrade first
Wed, Oct 6, 7:00 PM
olasd committed rCDFJ97ec136332eb: base-buster: Make sure to upgrade apt and dpkg first (authored by olasd).
base-buster: Make sure to upgrade apt and dpkg first
Wed, Oct 6, 6:57 PM
olasd committed rCDFJ8aa3af08f8b7: base-buster: add libcmph-dev for swh.perfecthash (authored by olasd).
base-buster: add libcmph-dev for swh.perfecthash
Wed, Oct 6, 6:50 PM
olasd accepted D6420: Rename imports of swh.model.identifiers to fix deprecation warnings..

Looks fine (i.e. the identifiers DeprecationWarnings are gone in tox, except for one that gets triggered by some pytest internal assertion rewrite).

Wed, Oct 6, 6:47 PM
olasd added a comment to D6408: Stop sending next-gen scheduled task results to scheduler listener.

Rather than doing this, we should probably disable worker task events altogether (that is, run celery worker without the --events/--task-events flag)

Wed, Oct 6, 4:51 PM
olasd accepted D6405: Respect task configuration to allow ignoring task result event.

This looks like an okay thing to do, but instead of only ignoring results (which would only cut down a third of the messages), we should probably be deactivating events completely on these workers.

Wed, Oct 6, 4:41 PM
olasd committed rDENV5003c8e917b6: Add swh-perfecthash (authored by olasd).
Add swh-perfecthash
Wed, Oct 6, 4:38 PM