Sep 14 2021

douardda committed rDENVb0f07795ddff: docker: Document how to consume kafka topics from the host (authored by douardda).
docker: Document how to consume kafka topics from the host
Sep 14 2021, 11:40 AM
douardda closed D6248: docker: allow kafka to be consumed from the host.
Sep 14 2021, 11:40 AM
douardda committed rDENVf612427f663d: docker: allow kafka to be consumed from the host (authored by douardda).
docker: allow kafka to be consumed from the host
Sep 14 2021, 11:40 AM
douardda closed D6247: Commit kafka messages which offset has reach the high limit.

closed by 94be817f869409c64415b181824071d2998e33d5

Sep 14 2021, 11:38 AM
douardda closed D6246: Add a JournalClientOffsetRanges.unsubscribe() method.

closed by a3c1f39013bae1a6982140d51d8bb443dc1b5c9c

Sep 14 2021, 11:37 AM
douardda updated the diff for D6248: docker: allow kafka to be consumed from the host.

Keep port 5092 exposed on host

Sep 14 2021, 11:35 AM
douardda committed rDDATASET94be817f8694: Commit kafka messages which offset has reach the high limit (authored by douardda).
Commit kafka messages which offset has reach the high limit
Sep 14 2021, 11:23 AM
douardda committed rDDATASETa3c1f39013ba: Add a JournalClientOffsetRanges.unsubscribe() method (authored by douardda).
Add a JournalClientOffsetRanges.unsubscribe() method
Sep 14 2021, 11:22 AM
douardda added inline comments to D6248: docker: allow kafka to be consumed from the host.
Sep 14 2021, 11:21 AM

Sep 13 2021

douardda committed rDDATASET0425bdea0789: Fix a missing f-string prefix (authored by douardda).
Fix a missing f-string prefix
Sep 13 2021, 5:17 PM
douardda updated the diff for D6248: docker: allow kafka to be consumed from the host.

Add a bit of documentation in the README file on how to consume kafka from the host

Sep 13 2021, 5:13 PM
douardda requested review of D6248: docker: allow kafka to be consumed from the host.
Sep 13 2021, 4:51 PM
douardda abandoned D6234: Add a --reset option to export_graph cli tool.

It's not worth the trouble, and there is a better solution (server-side)

Sep 13 2021, 4:23 PM
douardda added a comment to D6234: Add a --reset option to export_graph cli tool.

You could also add a command in swh-dataset's entrypoint.sh that calls whatever Kafka's script does

Sep 13 2021, 4:20 PM
douardda added a comment to D6234: Add a --reset option to export_graph cli tool.

So either I kill this diff or it stays "intricate" with the setup of the consumer (so the whole journalprocessor.py)

Note: this feature is mainly useful for testing purposes IMHO, so I suppose it's not that critical to keep it; I just find it handy when "playing" with swh dataset export

Meh. How much easier does it make testing, compared to using Kafka's CLI (from the linked comment)?

Sep 13 2021, 4:11 PM
douardda updated the diff for D6234: Add a --reset option to export_graph cli tool.

rebase

Sep 13 2021, 4:05 PM
douardda requested review of D6247: Commit kafka messages which offset has reach the high limit.
Sep 13 2021, 4:04 PM
douardda abandoned D6235: Commit kafka messages wich offset has reach the high limit.

in favor of D6247 because phab/arcanist won't let me update this later any more (sorry)

Sep 13 2021, 4:04 PM
douardda requested review of D6246: Add a JournalClientOffsetRanges.unsubscribe() method.
Sep 13 2021, 4:02 PM
douardda committed rDDATASET358d84938d01: Reduce the size of the progress bar (authored by douardda).
Reduce the size of the progress bar
Sep 13 2021, 3:33 PM
douardda closed D6233: Make sure the progress bar for the export reaches 100%.
Sep 13 2021, 3:33 PM
douardda committed rDDATASET47713ee38c94: Make sure the progress bar for the export reaches 100% (authored by douardda).
Make sure the progress bar for the export reaches 100%
Sep 13 2021, 3:33 PM
douardda committed rDDATASET2760e322af7c: Simplify the lo/high partition offset computation (authored by douardda).
Simplify the lo/high partition offset computation
Sep 13 2021, 3:33 PM
douardda committed rDDATASETd07b2a632256: Explicitly close the temporary kafka consumer in `get_offsets` (authored by douardda).
Explicitly close the temporary kafka consumer in `get_offsets`
Sep 13 2021, 3:33 PM
douardda closed D6232: Simplify the lo/high partition offset computation.
Sep 13 2021, 3:33 PM
douardda committed rDDATASETe47a3db1287b: Use proper signature for JournalClientOffsetRanges.process() (authored by douardda).
Use proper signature for JournalClientOffsetRanges.process()
Sep 13 2021, 3:33 PM
douardda updated the diff for D6233: Make sure the progress bar for the export reaches 100%.

attempt to trick phab/arcanist

Sep 13 2021, 3:31 PM
douardda updated the diff for D6234: Add a --reset option to export_graph cli tool.

rebase

Sep 13 2021, 3:15 PM
douardda updated the diff for D6235: Commit kafka messages wich offset has reach the high limit.

Rebase (remove D6234 from dependencies)

Sep 13 2021, 3:14 PM
douardda added a comment to D6234: Add a --reset option to export_graph cli tool.

Can we keep the reset stuff outside the journalprocessor.py logic? It's already complex enough

I'll give it a try

Sep 13 2021, 2:59 PM

Sep 10 2021

douardda updated the diff for D6235: Commit kafka messages wich offset has reach the high limit.

rebase, fix typos, squash revisions

Sep 10 2021, 5:54 PM
douardda updated the diff for D6234: Add a --reset option to export_graph cli tool.

rebase and fix --reset help message

Sep 10 2021, 5:52 PM
douardda updated the diff for D6233: Make sure the progress bar for the export reaches 100%.

rebase

Sep 10 2021, 5:52 PM
douardda updated the diff for D6232: Simplify the lo/high partition offset computation.

Add an explicit "skipped" message if nothing is to be consumed for a topic

Sep 10 2021, 5:51 PM
douardda added inline comments to D6235: Commit kafka messages wich offset has reach the high limit.
Sep 10 2021, 5:42 PM
douardda added a comment to D6234: Add a --reset option to export_graph cli tool.

Can we keep the reset stuff outside the journalprocessor.py logic? It's already complex enough

Sep 10 2021, 5:37 PM
douardda added a comment to D6235: Commit kafka messages wich offset has reach the high limit.

lags reported by cmak were completely inconsistent

only because you have a small dataset, right?
With a larger one, the last batch of each partition should have a negligible size.

Sep 10 2021, 5:26 PM
douardda added a comment to D6235: Commit kafka messages wich offset has reach the high limit.

There's a bunch of typos in your commit/diff msg: "wich", "oef", "ony", "ALL offsets that needs to be", "stash" -> "squash"


this is necessary to ensure these messages are committed in kafka,
otherwise, since the (considered) empty partition is unsubscribed from,
it never gets committed in JournalClient.handle_messages() (since the
latter only commits assigned partitions).

Why is this a problem?

Sep 10 2021, 4:35 PM
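
The commit message quoted above describes a partition being unsubscribed from once its offset reaches the high limit, before that last offset has been committed. A minimal sketch of that ordering follows, using a hypothetical in-memory `FakeConsumer` rather than a real Kafka client. None of these names come from swh-dataset or swh-journal, and the per-batch commit is an assumption made for illustration.

```python
from typing import Dict, Set


class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer; all state is kept in memory."""

    def __init__(self, partitions: Set[int]) -> None:
        self.assigned: Set[int] = set(partitions)
        self.position: Dict[int, int] = {}   # next offset to read, per partition
        self.committed: Dict[int, int] = {}  # last committed offset, per partition

    def commit_assigned(self) -> None:
        # Mirrors the behaviour described above: only assigned partitions are committed.
        for partition in self.assigned:
            if partition in self.position:
                self.committed[partition] = self.position[partition]

    def commit_partition(self, partition: int) -> None:
        # Explicit commit of a single partition, assigned or not.
        self.committed[partition] = self.position[partition]

    def unsubscribe(self, partition: int) -> None:
        self.assigned.discard(partition)


def handle_message(
    consumer: FakeConsumer,
    partition: int,
    offset: int,
    high: int,
    commit_before_unsubscribe: bool,
) -> None:
    """Record one consumed message; drop the partition once `high` is reached."""
    consumer.position[partition] = offset + 1
    if offset + 1 >= high:
        if commit_before_unsubscribe:
            consumer.commit_partition(partition)
        consumer.unsubscribe(partition)


def run(commit_before_unsubscribe: bool) -> Dict[int, int]:
    consumer = FakeConsumer({0})
    high = 3  # hypothetical high limit: offsets 0..2 are to be consumed
    for offset in range(high):
        handle_message(consumer, 0, offset, high, commit_before_unsubscribe)
        consumer.commit_assigned()  # assumed periodic commit after each message
    return consumer.committed
```

In this toy model, `run(False)` leaves the final offset uncommitted because the partition is no longer assigned when the periodic commit runs, while `run(True)` commits it explicitly first. That would be consistent with a monitoring tool reporting a lag that never reaches zero, though the real client may behave differently.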

Sep 9 2021

douardda updated the summary of D6235: Commit kafka messages wich offset has reach the high limit.
Sep 9 2021, 6:01 PM
douardda requested review of D6235: Commit kafka messages wich offset has reach the high limit.
Sep 9 2021, 6:01 PM
douardda requested review of D6234: Add a --reset option to export_graph cli tool.
Sep 9 2021, 5:58 PM
douardda updated the diff for D6233: Make sure the progress bar for the export reaches 100%.

add forgotten revision: Reduce the size of the progress bar

Sep 9 2021, 5:56 PM
douardda requested review of D6233: Make sure the progress bar for the export reaches 100%.
Sep 9 2021, 5:56 PM
douardda requested review of D6232: Simplify the lo/high partition offset computation.
Sep 9 2021, 5:54 PM
douardda accepted D6215: docker/conf: Fix search journal client configurations.
Sep 9 2021, 10:11 AM
douardda requested changes to D6220: Added test only method info in the interface doc strings.

Please use imperative style in the git commit message
https://chris.beams.io/posts/git-commit/

Sep 9 2021, 9:03 AM

Sep 3 2021

douardda abandoned D5648: Add a bit of logging in the buffer proxy storage.
Sep 3 2021, 10:50 AM
douardda abandoned D4920: Randomize last_update in generated ListedOrigins in fill_test_data.
Sep 3 2021, 10:49 AM

Sep 1 2021

douardda added a comment to T3542: Decide what metadata we want to / can collect from GitHub.

do we need the "list of forks" if we keep the "fork of what"? I mean these are the 2 ends of the fork relation, right?

Sep 1 2021, 12:06 PM · Origin-GitHub, Extrinsic metadata

Aug 30 2021

douardda added a comment to T3487: Installation of the new provenance server.

yes the idea is to have a beefy enough machine to perform full-size experiments on, that can then be (part of) the production infrastructure dedicated to the provenance index.

Aug 30 2021, 11:28 AM · System administration

Aug 13 2021

douardda accepted D6087: Remove shell scripts from setup.py.
Aug 13 2021, 4:54 PM
douardda added inline comments to D5818: send-to-celery: Add more options to allow scheduling of edge case origins.
Aug 13 2021, 3:35 PM
douardda added a comment to D6084: Rename PostgreSQL backend and code styling.

For the fix of revision_get, there should be a test.

The test is coming later from @jayeshv mongodb branch.

Aug 13 2021, 12:15 PM
douardda resigned from D6084: Rename PostgreSQL backend and code styling.
Aug 13 2021, 12:05 PM
douardda requested changes to D6084: Rename PostgreSQL backend and code styling.

Please don't mix fixes with codestyling/renaming revisions in a single diff, it makes the review much harder.

Aug 13 2021, 11:35 AM
douardda added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

And we could also use zfs-backed thin provisioning for the / of workers to save storage space (and possibly help to ensure consistency of deployed workers... not extra convinced of this latter point)

Aug 13 2021, 10:31 AM · System administration
douardda added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

but that requires some more storage on hypervisors we currently don't have

Don't the hypervisors also serve as OSDs? We could just get a disk per hypervisor (partially?) out of the ceph cluster and use it for the workers' /tmp, or even their whole disk.

Aug 13 2021, 10:25 AM · System administration
douardda accepted D6073: bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID.

but anyway, it looks fine to me

Aug 13 2021, 9:55 AM
douardda added inline comments to D6073: bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID.
Aug 13 2021, 9:54 AM
douardda added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

another improvement might be to slightly modify the profile of the workers (to reduce the load on the ceph cluster):

  • lower the replication factor for workers' volumes (or even use local storage, but that requires some more storage on hypervisors we currently don't have),
  • (probably not very relevant but) stop having swap on workers (since this swap ends up being on the ceph volume, so replicated etc.) (oh this has been done already, good)
Aug 13 2021, 9:24 AM · System administration

Aug 12 2021

douardda added inline comments to D6073: bytes_to_str: Format strings directly, instead of constructing ExtendedSWHID.
Aug 12 2021, 1:57 PM
douardda accepted D6071: Revisited history graph implementation.
Aug 12 2021, 10:04 AM
douardda added a comment to D6071: Revisited history graph implementation.
  • the use of the newly introduced as_dict() methods seems unrelated here; unless I'm mistaken, the purpose of this change is better assertion reports by pytest on failure; if so, it should be presented as such in a dedicated revision

This method is only used for test purposes but it doesn't make sense without the refactoring (the complete HistoryGraph class was not even present prior to the refactoring),

Aug 12 2021, 10:04 AM

Aug 11 2021

douardda requested changes to D6071: Revisited history graph implementation.
Aug 11 2021, 12:37 PM
douardda added a comment to D6071: Revisited history graph implementation.

A few remarks:

Aug 11 2021, 12:37 PM

Aug 10 2021

douardda updated the task description for T3085: Complete and updated copy of the archive on S3 (objects+graph).
Aug 10 2021, 4:00 PM · Roadmap 2022, meta-task, Roadmap 2021, System administration, Object storage
douardda added a parent task for T1954: Up-to-date objstorage mirror on S3: T3477: Add alerting when the copy to S3 starts lagging.
Aug 10 2021, 3:59 PM · System administration, Object storage
douardda added a subtask for T3477: Add alerting when the copy to S3 starts lagging: T1954: Up-to-date objstorage mirror on S3.
Aug 10 2021, 3:59 PM · Roadmap 2021, System administration
douardda triaged T3477: Add alerting when the copy to S3 starts lagging as High priority.
Aug 10 2021, 3:58 PM · Roadmap 2021, System administration
douardda updated the task description for T3085: Complete and updated copy of the archive on S3 (objects+graph).
Aug 10 2021, 3:56 PM · Roadmap 2022, meta-task, Roadmap 2021, System administration, Object storage
douardda added a comment to T1954: Up-to-date objstorage mirror on S3.

well this task should be closed, and a new subtask could be added for the alerting

Aug 10 2021, 3:55 PM · System administration, Object storage
douardda added a comment to T1954: Up-to-date objstorage mirror on S3.

unless I'm mistaken, this task can be closed now, it looks to have reached a steady state where the lag is near 0

Aug 10 2021, 2:18 PM · System administration, Object storage

Aug 9 2021

douardda accepted D6067: cassandra: Fix crash when using _missing() functions with more than 100 ids with ScyllaDB..
Aug 9 2021, 11:33 AM
douardda accepted D6069: from_disk: Do not drop tags with missing tagger or date.
Aug 9 2021, 11:32 AM

Aug 6 2021

douardda created P1116 (An Untitled Masterwork).
Aug 6 2021, 3:21 PM
douardda added a comment to T3453: Refactor the backend to make it scale better.

I've been thinking a bit about the refactoring of the ProvenanceStorageServer as described in the doc, with a series of queues between the public API and the backend database.

Aug 6 2021, 11:08 AM · Provenance database
douardda updated subscribers of T3453: Refactor the backend to make it scale better.
Aug 6 2021, 11:04 AM · Provenance database
douardda accepted D6054: Add test for the different `ProvenanceStorageInterface` implementations.
Aug 6 2021, 10:59 AM
douardda closed D6031: Add a quick start section in the documentation and simplify the configuration file loading mechanism in the cli.
Aug 6 2021, 10:58 AM
douardda committed rDPROV058ed19b0100: Simplify the configuration file loading mechanism in the cli (authored by douardda).
Simplify the configuration file loading mechanism in the cli
Aug 6 2021, 10:58 AM
douardda committed rDPROV3b145f15c2db: Add a quick start section in the documentation (authored by douardda).
Add a quick start section in the documentation
Aug 6 2021, 10:58 AM
douardda closed D6015: Use stored SQL functions for content_find_{all,one}() and merge Provenance*DB classes in a single ProvenanceDB.
Aug 6 2021, 10:58 AM
douardda committed rDPROVfbc5499eb0e2: Use stored SQL functions for content_find_{all,one}() (authored by douardda).
Use stored SQL functions for content_find_{all,one}()
Aug 6 2021, 10:58 AM
douardda committed rDPROVf5e6c283b08e: Merge Provenance*DB classes in a single ProvenanceDB (authored by douardda).
Merge Provenance*DB classes in a single ProvenanceDB
Aug 6 2021, 10:58 AM
douardda closed D5843: Add support for a denormalized version of the provenance DB.
Aug 6 2021, 10:58 AM
douardda committed rDPROV1c3d6426ebd2: Add support for a denormalized version of the provenance DB (authored by douardda).
Add support for a denormalized version of the provenance DB
Aug 6 2021, 10:58 AM
douardda updated the diff for D6031: Add a quick start section in the documentation and simplify the configuration file loading mechanism in the cli.

typos

Aug 6 2021, 10:55 AM

Aug 5 2021

douardda added a comment to D6046: elasticsearch.py: Integrate query langauge translator.

there is a typo in the commit message

Aug 5 2021, 1:34 PM
douardda accepted D6051: changelog: Reference first completion of sourceforge hg origins.
Aug 5 2021, 1:33 PM
douardda requested changes to D6054: Add test for the different `ProvenanceStorageInterface` implementations.

overall ok, but I'd like to see the comments about fixtures addressed first.

Aug 5 2021, 12:27 PM
douardda accepted D6053: Refactor the use of archive `Storage` object for testing.

nice job, thx

Aug 5 2021, 12:11 PM
douardda accepted D6026: Add test for origin-revision layer.
Aug 5 2021, 12:08 PM
douardda added a comment to D5843: Add support for a denormalized version of the provenance DB.

I agree some more tests and validations need to be done on this storage schema, but can we please land it for now as is? I've put a warning in the documentation (in D6031) to point out the fact that this flavor is not "production ready". cc @aeviso

Aug 5 2021, 12:06 PM

Aug 2 2021

douardda added a comment to T3444: 26/07/2021: Unstuck infrastructure outage then post-mortem.

FTR I've tried to investigate a bit to find clues of what the origin of the outage was, but I did not find any obvious culprit.

Aug 2 2021, 10:06 AM · System administration

Jul 30 2021

douardda added a comment to P1110 bad stream_results_optional.

ok then

return itertools.chain([res], stream_results(f, page_token=res.page_token, **kwargs))
Jul 30 2021, 3:44 PM
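
A minimal sketch of the idea behind the snippet above, assuming a paged API where each page carries its results and the token of the following page. `Page`, `fetch_page`, `PAGES`, and the local `stream_results` helper are hypothetical stand-ins written for this example, not the actual swh code. The sketch also chains the first page's results rather than the page object itself, which is a choice made here, not something stated in the comment.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Hypothetical page of results; next_page_token is None on the last page."""

    results: List[int]
    next_page_token: Optional[str] = None


def stream_results(f: Callable[..., Page], **kwargs) -> Iterator[int]:
    """Yield every result of a paged endpoint, following page tokens."""
    page_token = kwargs.pop("page_token", None)
    while True:
        page = f(page_token=page_token, **kwargs)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            return


def stream_results_optional(
    f: Callable[..., Optional[Page]], **kwargs
) -> Optional[Iterator[int]]:
    """Return None if the first page is None, else an iterator over all results."""
    res = f(page_token=None, **kwargs)
    if res is None:
        return None
    if res.next_page_token is None:
        return iter(res.results)
    # Chain the already-fetched first page with the remaining pages.
    return itertools.chain(
        res.results,
        stream_results(f, page_token=res.next_page_token, **kwargs),
    )


# Hypothetical two-page data set used to exercise the helpers.
PAGES = {None: Page([1, 2], "p2"), "p2": Page([3], None)}


def fetch_page(page_token: Optional[str] = None, origin: str = "known") -> Optional[Page]:
    """Return the requested page, or None when the origin is unknown."""
    if origin != "known":
        return None
    return PAGES[page_token]
```

Fetching the first page eagerly lets the caller distinguish "no such object" (`None`) from "object exists but has no results" (an empty iterator), while the remaining pages are still fetched lazily.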
douardda added a comment to P1110 bad stream_results_optional.

why not something like:

Jul 30 2021, 3:36 PM
douardda triaged T3453: Refactor the backend to make it scale better as High priority.
Jul 30 2021, 2:21 PM · Provenance database

Jul 28 2021

douardda updated the diff for D6031: Add a quick start section in the documentation and simplify the configuration file loading mechanism in the cli.

rebase

Jul 28 2021, 2:44 PM
douardda updated the diff for D6015: Use stored SQL functions for content_find_{all,one}() and merge Provenance*DB classes in a single ProvenanceDB.

rebase

Jul 28 2021, 2:43 PM
douardda updated the diff for D5843: Add support for a denormalized version of the provenance DB.

rebase

Jul 28 2021, 2:43 PM