Apr 7 2021

seirl added a comment to D5427: NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID.

The new way of doing things is a lot more natural since we already have the MPH and the .order file

Newcomers aren't aware of this. I had no idea we had those before reading this diff.

Apr 7 2021, 3:02 PM
seirl added a comment to D5427: NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID.

I'm not saying the current state of the docs is good enough; I'm saying this commit message doesn't explain the design, only why we're moving away from the old binary search solution. The new way of doing things is a lot more natural since we already have the MPH and the .order file, so there's no need to document why the old solution was bad in the main docs.

Apr 7 2021, 12:04 PM
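
A minimal Python sketch of the lookup discussed in D5427, for readers unfamiliar with the two files. It is not the Java NodeIdMap code. It assumes the MPH maps each known SWHID to a distinct integer in [0, n) and that the .order file stores a permutation as consecutive big-endian 64-bit integers. A plain dict stands in for the MPH, and the names `write_order_file` and `node_id` are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_order_file(path, order):
    """Write a permutation as consecutive big-endian 64-bit integers (assumed layout)."""
    with open(path, "wb") as f:
        for node in order:
            f.write(struct.pack(">q", node))


class NodeIdMap:
    """Translate SWHID -> node ID with one hash evaluation and one mmapped read."""

    def __init__(self, mph, order_path):
        self.mph = mph  # callable: SWHID string -> int in [0, n)
        self._file = open(order_path, "rb")
        self._order = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def node_id(self, swhid):
        h = self.mph(swhid)  # position of this SWHID in the hash space
        (node,) = struct.unpack_from(">q", self._order, h * 8)
        return node

    def close(self):
        self._order.close()
        self._file.close()


# Toy data: three made-up SWHIDs and a dict standing in for a real MPH.
swhids = [
    "swh:1:cnt:" + "0" * 40,
    "swh:1:dir:" + "1" * 40,
    "swh:1:rev:" + "2" * 40,
]
toy_mph = {s: i for i, s in enumerate(swhids)}.__getitem__

order_path = os.path.join(tempfile.mkdtemp(), "toy.order")
write_order_file(order_path, [2, 0, 1])
node_map = NodeIdMap(toy_mph, order_path)
```

Unlike the dict used here, a real MPH returns an arbitrary value for keys outside its input set, so a production lookup would also need to verify the result.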
seirl added a comment to D5427: NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID.

Where are the .order and MPH computed?

Apr 7 2021, 10:39 AM
seirl added a comment to D5427: NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID.

Thanks for the review. I don't think this needs to be documented elsewhere; it just describes why we're making the change. What should be documented instead is why we're using these data structures in the first place. Right now this is done sparsely across the different source files, and this commit updates the existing documentation.

Apr 7 2021, 10:38 AM

Apr 6 2021

seirl committed rDGRPH15c2da0f084f: java: fix formatting (authored by seirl).
java: fix formatting
Apr 6 2021, 9:09 PM
seirl requested review of D5427: NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID.
Apr 6 2021, 3:46 PM
seirl added a comment to D5411: return a 400 error when accessing endpoints without the arguments.

There's a problem with this diff: it targets an old Java-only backend that isn't the one we use when we run swh graph rpc-serve. The one currently in use is written in Python, at swh/graph/server/app.py.

Apr 6 2021, 12:35 PM

Apr 2 2021

seirl committed rDGRPHf055c4eaf016: Recompress test graph with byte array MPH (authored by seirl).
Recompress test graph with byte array MPH
Apr 2 2021, 3:59 PM
seirl committed rDGRPH7eef7cb3f94b: Compress graph with byte arrays instead of strings (authored by seirl).
Compress graph with byte arrays instead of strings
Apr 2 2021, 3:59 PM

Mar 26 2021

seirl closed D5315: Add LevelDB backend for exporter node sets.
Mar 26 2021, 2:29 PM
seirl committed rDDATASETe16e9c5bb271: Add LevelDB backend for exporter node sets (authored by seirl).
Add LevelDB backend for exporter node sets
Mar 26 2021, 2:29 PM
seirl committed rDDATASETf43ce97371ba: ORC exporter: handle releases with empty authors/dates (authored by seirl).
ORC exporter: handle releases with empty authors/dates
Mar 26 2021, 2:29 PM
seirl updated the diff for D5315: Add LevelDB backend for exporter node sets.

Rebase + fix phabricator incorrect ID

Mar 26 2021, 2:27 PM
seirl reopened D5315: Add LevelDB backend for exporter node sets.
Mar 26 2021, 2:27 PM
seirl closed D5316: Model test data: add Release with no author/date.

Merged in https://forge.softwareheritage.org/rDMOD9523be0552d822be617da77bf0d2ca2f479da572

Mar 26 2021, 12:10 PM
seirl updated the diff for D5316: Model test data: add Release with no author/date.

kSJFGSLHDFSKJGHDKFJHGDKFJHG

Mar 26 2021, 12:09 PM
seirl closed D5315: Add LevelDB backend for exporter node sets.
Mar 26 2021, 12:07 PM
seirl committed rDMOD9523be0552d8: Model test data: add Release with no author/date (authored by seirl).
Model test data: add Release with no author/date
Mar 26 2021, 12:07 PM
seirl updated the diff for D5315: Add LevelDB backend for exporter node sets.

Remove phabricator garbage

Mar 26 2021, 12:07 PM
seirl closed T1847: fully automate export of the graph dataset as Resolved.

The ORC exporter is done, and it's likely that we won't provide CSV exports in the future, or we'll generate them from the ORC format.

Mar 26 2021, 12:04 PM · Compressed graph service, Datasets
seirl closed T1847: fully automate export of the graph dataset, a subtask of T1848: refresh graph dataset export, as Resolved.
Mar 26 2021, 12:04 PM · Datasets

Mar 25 2021

seirl placed T3167: Add a --version option to all the CLI commands up for grabs.
Mar 25 2021, 2:05 PM · Easy hack

Mar 24 2021

seirl updated the task description for T3170: Revisions in the journal with out of range dates.
Mar 24 2021, 6:56 PM · Data Model, Journal
seirl updated the task description for T3170: Revisions in the journal with out of range dates.
Mar 24 2021, 4:11 PM · Data Model, Journal
seirl updated the task description for T3170: Revisions in the journal with out of range dates.
Mar 24 2021, 4:11 PM · Data Model, Journal
seirl updated the task description for T3170: Revisions in the journal with out of range dates.
Mar 24 2021, 4:10 PM · Data Model, Journal
seirl triaged T3170: Revisions in the journal with out of range dates as Normal priority.
Mar 24 2021, 1:13 PM · Data Model, Journal
seirl created P984 (An Untitled Masterwork).
Mar 24 2021, 11:04 AM
seirl requested review of D5316: Model test data: add Release with no author/date.
Mar 24 2021, 12:46 AM

Mar 23 2021

seirl updated the summary of D5315: Add LevelDB backend for exporter node sets.
Mar 23 2021, 10:13 PM
seirl requested review of D5315: Add LevelDB backend for exporter node sets.
Mar 23 2021, 10:13 PM
seirl committed rDGRPH92f810a36bc7: Add permissions on edge labels (authored by haltode).
Add permissions on edge labels
Mar 23 2021, 6:15 PM
seirl closed D4006: WIP: add permissions on edge labels.
Mar 23 2021, 6:15 PM
seirl committed rDGRPH6592ab3fb067: DirEntry: allow for empty permission field (authored by seirl).
DirEntry: allow for empty permission field
Mar 23 2021, 6:15 PM
seirl committed rDGRPHe0be35f0f59e: labels: use -label prefix for all edge labels, instead of -filename-labels (authored by seirl).
labels: use -label prefix for all edge labels, instead of -filename-labels
Mar 23 2021, 6:15 PM
seirl committed rDGRPH5a3d60748fd1: ReadLabelledGraph: use FCL instead of PFCL (authored by seirl).
ReadLabelledGraph: use FCL instead of PFCL
Mar 23 2021, 6:15 PM
seirl committed rDGRPH188608b87753: java: add subdataset exporting functions (authored by seirl).
java: add subdataset exporting functions
Mar 23 2021, 6:15 PM
seirl committed rDGRPH9a20f2e9bc2c: LabelMapBuilder: use low-level scanning of the input file (authored by seirl).
LabelMapBuilder: use low-level scanning of the input file
Mar 23 2021, 6:15 PM
seirl committed rDGRPH278517865425: LabelMapBuilder: restructure in functions (authored by seirl).
LabelMapBuilder: restructure in functions
Mar 23 2021, 6:15 PM
seirl committed rDGRPH7b31937a4715: LabelMapBuilder: non-static builder function (authored by seirl).
LabelMapBuilder: non-static builder function
Mar 23 2021, 6:15 PM
seirl committed rDGRPH2fcd96d7bb21: LabelMapBuilder: remove need for hashtable, sync streams (authored by seirl).
LabelMapBuilder: remove need for hashtable, sync streams
Mar 23 2021, 6:15 PM
seirl committed rDGRPH19f7da78aa54: Use MPH functions operating on byte arrays (authored by seirl).
Use MPH functions operating on byte arrays
Mar 23 2021, 6:15 PM
seirl committed rDGRPH4e2fedc3bce8: LabelMapBuilder: refactor logic in separate line iterators (authored by seirl).
LabelMapBuilder: refactor logic in separate line iterators
Mar 23 2021, 6:15 PM
seirl committed rDGRPH0aa061682e95: LabelMapBuilder: support both sorting methods (authored by seirl).
LabelMapBuilder: support both sorting methods
Mar 23 2021, 6:15 PM
seirl committed rDGRPH968f9c6c2d0e: LabelMapBuilder: add TextualEdgeLabelLineIterator, fix BSort (authored by seirl).
LabelMapBuilder: add TextualEdgeLabelLineIterator, fix BSort
Mar 23 2021, 6:15 PM
seirl committed rDGRPH469d75616934: Merge branch 'label_permissions' (authored by seirl).
Merge branch 'label_permissions'
Mar 23 2021, 6:15 PM
seirl assigned T3168: Proper deployment of swh-graph with debian package to olasd.
Mar 23 2021, 12:24 PM · Compressed graph service, Puppet recipes
seirl placed T3167: Add a --version option to all the CLI commands up for grabs.
Mar 23 2021, 12:24 PM · Easy hack
seirl assigned T3167: Add a --version option to all the CLI commands to olasd.
Mar 23 2021, 12:23 PM · Easy hack
seirl triaged T3168: Proper deployment of swh-graph with debian package as High priority.
Mar 23 2021, 12:19 PM · Compressed graph service, Puppet recipes
seirl updated the task description for T3167: Add a --version option to all the CLI commands.
Mar 23 2021, 12:18 PM · Easy hack
seirl triaged T3167: Add a --version option to all the CLI commands as Low priority.
Mar 23 2021, 12:16 PM · Easy hack
seirl created T3167: Add a --version option to all the CLI commands.
Mar 23 2021, 12:16 PM · Easy hack

Mar 4 2021

seirl created P968 swhgraph.sh.
Mar 4 2021, 12:36 PM

Feb 24 2021

seirl committed rDGRPH9f8c6de06556: Add FindEarliestRevision tool (authored by seirl).
Add FindEarliestRevision tool
Feb 24 2021, 3:29 PM
seirl created P963 FindEarliestRevision.
Feb 24 2021, 3:03 PM

Feb 15 2021

seirl committed rDDATASETcf125983309e: Add ORC exporter (authored by seirl).
Add ORC exporter
Feb 15 2021, 5:45 PM
seirl committed rDDATASET35253c89a722: ORC exporter: Add unit tests (authored by seirl).
ORC exporter: Add unit tests
Feb 15 2021, 5:45 PM
seirl committed rDDATASETbf8d2625d3b3: Refactor export paths in the base Exporter class (authored by seirl).
Refactor export paths in the base Exporter class
Feb 15 2021, 5:45 PM
seirl closed D4762: Add ORC exporter.
Feb 15 2021, 5:45 PM
seirl committed rDDATASET40f068d648d2: ORC exporter: avoid fromtimestamp(), use datetime() from epoch instead (authored by seirl).
ORC exporter: avoid fromtimestamp(), use datetime() from epoch instead
Feb 15 2021, 5:45 PM
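
A short Python sketch of the technique named in the commit title above. The commit message does not state the motivation; a plausible one is that `datetime.fromtimestamp()` relies on the platform's C time functions, can raise `OverflowError` or `OSError` for timestamps the platform does not support, and uses the local timezone when no `tz` is given. Adding a `timedelta` to a fixed epoch avoids both. The helper name `datetime_from_epoch` is hypothetical and is not taken from the exporter.

```python
from datetime import datetime, timedelta, timezone

# Fixed, timezone-aware reference point.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def datetime_from_epoch(seconds, microseconds=0):
    """Convert a Unix timestamp to an aware UTC datetime using plain arithmetic."""
    return EPOCH + timedelta(seconds=seconds, microseconds=microseconds)


# Negative timestamps (dates before 1970) go through the same code path.
before_epoch = datetime_from_epoch(-1)
```

Values outside the range `datetime` itself can represent (years 1 to 9999) still raise `OverflowError`, so callers handling untrusted dates would still need to catch it.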

Feb 12 2021

seirl requested review of D4762: Add ORC exporter.

I added unit tests and reworked the logic, and also addressed @olasd's comment. Could you please re-review? :-)

Feb 12 2021, 10:05 PM
seirl updated the diff for D4762: Add ORC exporter.

typo

Feb 12 2021, 10:04 PM
seirl updated the diff for D4762: Add ORC exporter.

ORC exporter: avoid fromtimestamp(), use datetime() from epoch instead

Feb 12 2021, 10:03 PM
seirl updated the diff for D4762: Add ORC exporter.
  • Add unit tests
  • Refactor export paths in the base Exporter class
Feb 12 2021, 9:54 PM

Feb 2 2021

seirl triaged T3021: Investigate why reading the journal of the content table takes so long as Normal priority.
Feb 2 2021, 2:00 PM · Journal, Datasets

Jan 8 2021

seirl updated the diff for D4006: WIP: add permissions on edge labels.

Rebase on master, include webgraph files

Jan 8 2021, 4:38 PM
seirl commandeered D4006: WIP: add permissions on edge labels.
Jan 8 2021, 4:38 PM
seirl added inline comments to D4821: Add LLP compression to the WebGraph pipeline.
Jan 8 2021, 4:19 PM
seirl planned changes to D4821: Add LLP compression to the WebGraph pipeline.

I'm realizing that this is missing the "simplified" step and needs more changes.

Jan 8 2021, 4:17 PM
seirl closed T2595: Add a default configuration based on graph size (eg: batch_size) as Resolved by committing rDGRPH5a987aae6e93: config: sane default for batch_size using a heuristic on ram size.
Jan 8 2021, 3:27 PM · Compressed graph service
seirl closed D4820: config: sane default for batch_size using a heuristic on ram size.
Jan 8 2021, 3:27 PM
seirl committed rDGRPH5a987aae6e93: config: sane default for batch_size using a heuristic on ram size (authored by seirl).
config: sane default for batch_size using a heuristic on ram size
Jan 8 2021, 3:27 PM
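
A sketch of what a RAM-based default for `batch_size` could look like. The feed entry does not give the actual heuristic, so the rule below (one thousandth of physical RAM, capped below 2^30) is an assumption chosen for illustration, as are the function names.

```python
import os


def total_ram_bytes():
    """Physical RAM in bytes; relies on sysconf names available on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def default_batch_size(ram_bytes, divisor=1000, cap=2**30 - 1):
    """Hypothetical rule: a fixed fraction of RAM, clamped to [1, cap]."""
    return max(1, min(ram_bytes // divisor, cap))


def resolve_batch_size(conf):
    """Use the configured value if present, else fall back to the heuristic."""
    if "batch_size" in conf:
        return conf["batch_size"]
    return default_batch_size(total_ram_bytes())
```

Keeping the heuristic a pure function of `ram_bytes` makes it testable without depending on the machine running the tests.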
seirl updated the diff for D4820: config: sane default for batch_size using a heuristic on ram size.

rebase

Jan 8 2021, 3:24 PM
seirl updated the diff for D4820: config: sane default for batch_size using a heuristic on ram size.

Add task name to commit message

Jan 8 2021, 3:24 PM
seirl committed rDGRPH85da2e78d681: cli: compression: fix weird bug when using ranges of steps (authored by seirl).
cli: compression: fix weird bug when using ranges of steps
Jan 8 2021, 3:22 PM
seirl committed rDGRPH317205722b65: Compression: set custom temporary directory at the java level (authored by seirl).
Compression: set custom temporary directory at the java level
Jan 8 2021, 3:22 PM
seirl committed rDGRPHa4b6570e16ec: Compression: read only src/dst from the labelled edge file (authored by seirl).
Compression: read only src/dst from the labelled edge file
Jan 8 2021, 3:22 PM

Jan 7 2021

seirl requested review of D4821: Add LLP compression to the WebGraph pipeline.
Jan 7 2021, 5:46 PM
seirl requested review of D4820: config: sane default for batch_size using a heuristic on ram size.
Jan 7 2021, 5:45 PM
seirl committed rDGRPH07a8f25eae5e: java: bump unimi dependencies (authored by seirl).
java: bump unimi dependencies
Jan 7 2021, 5:41 PM

Jan 6 2021

seirl accepted D4810: FUSE: tests: remove temporary sleep(2) hack.
Jan 6 2021, 4:32 PM

Jan 5 2021

seirl accepted D4805: FUSE: cache: put by-date/ entries in direntry cache.
Jan 5 2021, 6:01 PM

Dec 17 2020

seirl retitled D4762: Add ORC exporter from Add ORC exporterThis adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3. to Add ORC exporter.
Dec 17 2020, 7:40 PM
seirl retitled D4762: Add ORC exporter from Add ORC exporter This adds a new exporter in columnar format (Apache ORC) using the PyORC library. The output can be used on various clouds like AWS S3. to Add ORC exporterThis adds a new exporter in columnar format (Apache ORC) using the PyORClibrary. The output can be used on various clouds like AWS S3..
Dec 17 2020, 7:40 PM
seirl created D4762: Add ORC exporter.
Dec 17 2020, 7:40 PM
seirl committed rDDATASETe439aa686f22: Edge exporter: use common remove_pull_requests() function (authored by seirl).
Edge exporter: use common remove_pull_requests() function
Dec 17 2020, 7:37 PM

Dec 16 2020

seirl committed rDDATASETcb71cea14def: journalprocessor: be resilient to exporter errors (authored by seirl).
journalprocessor: be resilient to exporter errors
Dec 16 2020, 5:13 PM
seirl committed rDDATASET6577f653f3c6: Export CLI: add a way to exclude specific object types (authored by seirl).
Export CLI: add a way to exclude specific object types
Dec 16 2020, 5:07 PM
seirl committed rDDATASETf3b156598000: journalprocessor: fix hashing of origin_visit_status objects (authored by seirl).
journalprocessor: fix hashing of origin_visit_status objects
Dec 16 2020, 4:41 PM
seirl closed D4750: journalprocessor: fix hashing of origin_visit_status objects.
Dec 16 2020, 4:41 PM
seirl updated the diff for D4750: journalprocessor: fix hashing of origin_visit_status objects.

Normalize .timestamp()

Dec 16 2020, 4:40 PM

Dec 15 2020

seirl added a reviewer for D4750: journalprocessor: fix hashing of origin_visit_status objects: Reviewers.
Dec 15 2020, 10:03 PM
seirl created D4750: journalprocessor: fix hashing of origin_visit_status objects.
Dec 15 2020, 10:03 PM
seirl closed D4718: Rewrite of the export pipeline using Exporters.

Landed, but Phabricator doesn't seem to see it.

Dec 15 2020, 6:49 PM
seirl updated the diff for D4718: Rewrite of the export pipeline using Exporters.
  • journalprocessor: remove comment about deserialize_message overload being a 'hack'
Dec 15 2020, 6:48 PM
seirl updated the diff for D4718: Rewrite of the export pipeline using Exporters.
  • journalprocessor: also partition sqlite files by first byte
  • SQLite on-disk set: disable journalling and synchronous mode
  • tests: fix test_export_origin
Dec 15 2020, 6:46 PM
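
A minimal sketch of the two SQLite changes listed in this update: one database file per first byte of the key, with journalling and synchronous writes disabled. It is not the journalprocessor code. The class name, table layout and file naming are hypothetical, and the sketch assumes keys are non-empty byte strings and that the set can be rebuilt if a crash corrupts it.

```python
import os
import sqlite3
import tempfile


class PartitionedSQLiteSet:
    """On-disk set of byte keys, split across one SQLite file per first byte."""

    def __init__(self, directory):
        self.directory = directory
        self.connections = {}  # first byte of key -> open connection

    def _db_for(self, key):
        prefix = key[0]  # assumes a non-empty key
        conn = self.connections.get(prefix)
        if conn is None:
            path = os.path.join(self.directory, f"{prefix:02x}.sqlite")
            # isolation_level=None puts the connection in autocommit mode.
            conn = sqlite3.connect(path, isolation_level=None)
            # Trade durability for speed: no rollback journal, no fsync.
            conn.execute("PRAGMA journal_mode = OFF")
            conn.execute("PRAGMA synchronous = OFF")
            conn.execute("CREATE TABLE IF NOT EXISTS keys (k BLOB PRIMARY KEY)")
            self.connections[prefix] = conn
        return conn

    def add(self, key):
        """Insert key; return True if it was not already present."""
        cur = self._db_for(key).execute(
            "INSERT OR IGNORE INTO keys VALUES (?)", (key,)
        )
        return cur.rowcount == 1

    def close(self):
        for conn in self.connections.values():
            conn.close()


set_dir = tempfile.mkdtemp()
node_set = PartitionedSQLiteSet(set_dir)
```

Partitioning keeps each B-tree smaller, and returning a boolean from `add` lets a caller test and insert in a single statement.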

Dec 14 2020

seirl committed rMSLD4ffe1a8d0780: Add 2020-12-07 coregraphie (authored by seirl).
Add 2020-12-07 coregraphie
Dec 14 2020, 7:58 AM

Dec 11 2020

seirl updated the diff for D4718: Rewrite of the export pipeline using Exporters.
  • Exporter documentation fixes
  • Journal processor: fetch offsets in parallel
Dec 11 2020, 6:07 PM
seirl updated the diff for D4718: Rewrite of the export pipeline using Exporters.

Fix various coding errors and minor improvements

Dec 11 2020, 5:39 PM