Oct 18 2022

anlambert committed rDLDG66413a1552f4: tests: Simplify git loading tasks creation tests implementation (authored by anlambert).
tests: Simplify git loading tasks creation tests implementation
Oct 18 2022, 3:45 PM
anlambert added a comment to T4641: Jenkins jobs for swh-graph time out while cloning repository.

Passing the --single-branch option to git clone does not result in a timeout, whereas a plain clone fails:

$ git clone https://forge.softwareheritage.org/source/swh-graph.git
Cloning into 'swh-graph'...
error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504
fatal: the remote end hung up unexpectedly
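
For a CI job, the workaround mentioned above could be scripted roughly as follows. This is a minimal sketch, not the actual Jenkins job configuration: the helper names `build_clone_command` and `clone_single_branch` are hypothetical, and only the `--single-branch` and `--branch` options of `git clone` are relied upon.

```python
import subprocess
from typing import List, Optional


def build_clone_command(url: str, dest: str, branch: Optional[str] = None) -> List[str]:
    """Build a `git clone` command restricted to a single branch."""
    cmd = ["git", "clone", "--single-branch"]
    if branch is not None:
        # --branch selects the branch to clone instead of the remote HEAD.
        cmd += ["--branch", branch]
    cmd += [url, dest]
    return cmd


def clone_single_branch(url: str, dest: str, branch: Optional[str] = None) -> None:
    """Run the clone; hypothetical helper, not part of any swh package."""
    # check=True raises CalledProcessError if git exits with a non-zero status.
    subprocess.run(build_clone_command(url, dest, branch), check=True)
```

Keeping the command construction separate from its execution makes the option handling easy to check without network access.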
Oct 18 2022, 2:32 PM · Compressed graph service, System administration, Continuous Integration
anlambert requested review of D8702: tests: Simplify mercurial loading task creation tests implementation.
Oct 18 2022, 2:14 PM
anlambert requested review of D8701: tests: Simplify bzr loading task creation test implementation.
Oct 18 2022, 2:06 PM
anlambert updated the diff for D8700: tests: Simplify cvs loading task creation test implementation.

Parametrize test with extra loader arguments

Oct 18 2022, 1:54 PM
anlambert updated the diff for D8699: tests: Simplify svn loading tasks creation tests implementation.

Parametrize tests with extra loader arguments

Oct 18 2022, 1:52 PM
anlambert updated the diff for D8698: tests: Simplify git loading tasks creation tests implementation.

Parametrize tests with extra loader arguments

Oct 18 2022, 1:46 PM
anlambert requested review of D8700: tests: Simplify cvs loading task creation test implementation.
Oct 18 2022, 1:37 PM
anlambert added a comment to T4625: staging: ingest netbsd.org cvs forge.

@vsellier, I landed all optimizations for the CVS loader and tagged a new version v0.5.0 so you can retry the NetBSD repository loading on staging.

Oct 18 2022, 1:32 PM · System administration, Archive coverage
anlambert requested review of D8699: tests: Simplify svn loading tasks creation tests implementation.
Oct 18 2022, 1:28 PM
anlambert requested review of D8698: tests: Simplify git loading tasks creation tests implementation.
Oct 18 2022, 1:26 PM
anlambert accepted D8694: Improve directory entry name filtering uisng casefolded strings.

Looks good to me.

Oct 18 2022, 12:17 PM
anlambert added inline comments to D8694: Improve directory entry name filtering uisng casefolded strings.
Oct 18 2022, 12:01 PM
anlambert requested changes to D8694: Improve directory entry name filtering uisng casefolded strings.
Oct 18 2022, 11:49 AM
anlambert committed rDDOC14a934fd06df: user/listers,loaders: Remove links to apidocs (authored by anlambert).
user/listers,loaders: Remove links to apidocs
Oct 18 2022, 11:42 AM
anlambert committed rDDOC25e2498c265f: user/loaders: Add new loaders and update some links and statuses (authored by anlambert).
user/loaders: Add new loaders and update some links and statuses
Oct 18 2022, 11:42 AM
anlambert closed D8669: Update user documentation for listers and loaders.
Oct 18 2022, 11:42 AM
anlambert committed rDDOC2b88e3344edf: user/listers: Add conda lister and update some links and statuses (authored by anlambert).
user/listers: Add conda lister and update some links and statuses
Oct 18 2022, 11:42 AM
anlambert added a comment to D8669: Update user documentation for listers and loaders.

apidoc links are sometimes useful, e.g. https://docs.softwareheritage.org/devel/apidoc/swh.lister.crates.html documents the lister's design

Oct 18 2022, 11:41 AM
anlambert closed D8693: replay: Use swh.model.from_disk.Directory.collect.
Oct 18 2022, 11:37 AM
anlambert committed rDLDSVNaaa82617befc: replay: Use swh.model.from_disk.Directory.collect (authored by anlambert).
replay: Use swh.model.from_disk.Directory.collect
Oct 18 2022, 11:37 AM

Oct 17 2022

anlambert committed rDLDCVS3bf543ba318f: test_cvsclient: Mock subprocess (authored by anlambert).
test_cvsclient: Mock subprocess
Oct 17 2022, 7:48 PM
anlambert committed rDLDCVSc23d42501546: loader: Yield only modified objects in process_cvs_changesets (authored by anlambert).
loader: Yield only modified objects in process_cvs_changesets
Oct 17 2022, 7:26 PM
anlambert closed D8682: Improve CVS loader performances.
Oct 17 2022, 7:26 PM
anlambert committed rDLDCVSb976aa6a1f80: loader: Reconstruct repo filesystem incrementally at each revision (authored by anlambert).
loader: Reconstruct repo filesystem incrementally at each revision
Oct 17 2022, 7:26 PM
anlambert committed rDLDCVS4c72bb7e3e89: debian/control: Bump python3-swh.model (authored by anlambert).
debian/control: Bump python3-swh.model
Oct 17 2022, 7:25 PM
anlambert requested review of D8693: replay: Use swh.model.from_disk.Directory.collect.
Oct 17 2022, 7:25 PM
anlambert closed D8689: docs: Add info about CPAN extrinsic metadata format.
Oct 17 2022, 7:22 PM
anlambert committed rDSTO17d9ad235c3e: docs: Add info about CPAN extrinsic metadata format (authored by anlambert).
docs: Add info about CPAN extrinsic metadata format
Oct 17 2022, 7:22 PM
anlambert updated the diff for D8682: Improve CVS loader performances.

Add assert fallback

Oct 17 2022, 7:19 PM
anlambert updated the diff for D8682: Improve CVS loader performances.

Bump swh.model

Oct 17 2022, 7:12 PM
anlambert closed D8688: model: Fix hypothesis integration with attr < 21.3.0.
Oct 17 2022, 7:05 PM
anlambert committed rDMODdd3bab81af29: model: Fix hypothesis integration with attr < 21.3.0 (authored by anlambert).
model: Fix hypothesis integration with attr < 21.3.0
Oct 17 2022, 7:05 PM
anlambert updated the diff for D8689: docs: Add info about CPAN extrinsic metadata format.

Remove double pasted line

Oct 17 2022, 7:04 PM
anlambert added inline comments to D8689: docs: Add info about CPAN extrinsic metadata format.
Oct 17 2022, 7:03 PM
anlambert requested review of D8689: docs: Add info about CPAN extrinsic metadata format.
Oct 17 2022, 5:59 PM
anlambert added a revision to T2833: cpan.loader - archive Perl modules from CPAN: D8689: docs: Add info about CPAN extrinsic metadata format.
Oct 17 2022, 5:43 PM · CPAN lister, Archive coverage
anlambert requested review of D8688: model: Fix hypothesis integration with attr < 21.3.0.
Oct 17 2022, 5:32 PM
anlambert closed D8652: cpan: Collect extrinsic metadata for each module release.
Oct 17 2022, 5:32 PM
anlambert committed rDLDBASE85963318aab6: cpan: Collect extrinsic metadata for each module release (authored by anlambert).
cpan: Collect extrinsic metadata for each module release
Oct 17 2022, 5:32 PM
anlambert closed D8651: cpan: Do not parse intrinsic metadata for getting module author.
Oct 17 2022, 5:32 PM
anlambert committed rDLDBASE7b929606a78f: cpan: Do not parse intrinsic metadata for getting module author (authored by anlambert).
cpan: Do not parse intrinsic metadata for getting module author
Oct 17 2022, 5:32 PM
anlambert closed D8686: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.
Oct 17 2022, 5:20 PM
anlambert closed T4633: Make MerkleNode.collect return a set of MerkleNode instead of a dict as Resolved by committing rDMOD13e7adc3e854: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.
Oct 17 2022, 5:20 PM · Data Model
anlambert committed rDMOD13e7adc3e854: merkle: Make MerkleNode.collect return a set of nodes instead of a dict (authored by anlambert).
merkle: Make MerkleNode.collect return a set of nodes instead of a dict
Oct 17 2022, 5:20 PM
anlambert added a comment to D8682: Improve CVS loader performances.

That's a surprisingly small diff for such a change, nice!

What speedup do you get with this?

Oct 17 2022, 2:49 PM
anlambert updated the diff for D8652: cpan: Collect extrinsic metadata for each module release.

Update: s/cpan-module-json/cpan-release-json/

Oct 17 2022, 2:06 PM
anlambert updated the diff for D8686: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.

Update:

  • use hash builtin instead of adding a new hash_to_int method
  • update tests
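
To illustrate the first point: objects can only be stored in a set if they are hashable, and the builtin `hash` applied to a bytes digest is enough for that. The sketch below uses a hypothetical `Node` class with a `digest` attribute; it is not the swh.model code and only shows the general idea.

```python
class Node:
    """Toy Merkle-like node identified by a bytes digest (hypothetical class)."""

    def __init__(self, digest: bytes) -> None:
        self.digest = digest

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Node) and self.digest == other.digest

    def __hash__(self) -> int:
        # Reuse the builtin hash of the bytes digest rather than
        # converting the digest to an int with a dedicated helper.
        return hash(self.digest)


# Two nodes with the same digest collapse into one set element.
nodes = {Node(b"\x01"), Node(b"\x01"), Node(b"\x02")}
```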
Oct 17 2022, 2:03 PM
anlambert added inline comments to D8686: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.
Oct 17 2022, 1:39 PM
anlambert added a comment to D8652: cpan: Collect extrinsic metadata for each module release.
Oct 17 2022, 1:26 PM
anlambert added inline comments to D8686: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.
Oct 17 2022, 12:04 PM
anlambert closed Restricted Maniphest Task, a subtask of T4625: staging: ingest netbsd.org cvs forge, as Resolved.
Oct 17 2022, 10:55 AM · System administration, Archive coverage
anlambert closed D8684: rlog: Skip rlog entry with missing header in RlogConv.parse_rlog.
Oct 17 2022, 10:55 AM
anlambert committed rDLDCVS734207ba5847: rlog: Skip rlog entry with missing header in RlogConv.parse_rlog (authored by anlambert).
rlog: Skip rlog entry with missing header in RlogConv.parse_rlog
Oct 17 2022, 10:55 AM
anlambert updated the diff for D8684: rlog: Skip rlog entry with missing header in RlogConv.parse_rlog.

Rebase

Oct 17 2022, 10:51 AM
anlambert closed D8683: loader, cvsclient: Read files line by line to reduce memory consumption.
Oct 17 2022, 10:50 AM
anlambert committed rDLDCVScfe7507a7366: loader, cvsclient: Read files line by line to reduce memory consumption (authored by anlambert).
loader, cvsclient: Read files line by line to reduce memory consumption
Oct 17 2022, 10:50 AM

Oct 14 2022

anlambert added a comment to D8682: Improve CVS loader performances.

Build has FAILED

Patch application report for D8682 (id=31362)

Rebasing onto 965c3de498...

Current branch diff-target is up to date.
Changes applied before test
commit b47790a2c8260e5b4e1c3ef8981a76db6563c139
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 18:00:37 2022 +0200

    loader: Yield only modified objects in process_cvs_changesets
    
    Previously, after each revision replay all files and directories of the
    CVS repository being loaded were collected and sent to the storage.
    This is a real bottleneck in terms of loading performances as it delegates
    the filtering of new objects to archive to the storage filtering proxy.
    
    As we know exactly the set of paths that have been modified in a CVS
    revision, prefer to do that filtering on the loader side and only
    send modified objects to storage instead of the whole set of contents
    and directories from the reconstructed filesystem.
    
    This should greatly improve loading performance for large repositories
    but also reduce loader memory consumption.

commit 76a19ee665b39e6ec31399d1c814b95264b26912
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 13 17:30:51 2022 +0200

    loader: Reconstruct repo filesystem incrementally at each revision
    
    Instead of creating a from_disk.Directory instance after each replayed
    CVS revision by recursively scanning all directories of the repository,
    prefer to have a single one as class member kept synchronized with the
    reconstructed filesystem after each revision replay.
    
    This should improve loader in terms of performance, especially when
    dealing with large repositories.

Link to build: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/134/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDCVS/job/tests-on-diff/134/console
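
The idea behind the two commits above (keep one filesystem view up to date across revisions, and send only what changed) can be sketched as follows. This is an illustration under assumptions, not the loader's code: `IncrementalTree`, `apply_revision` and `pop_modified` are hypothetical names, and file contents are kept in a plain dict for brevity.

```python
from typing import Dict, List, Optional, Set, Tuple


class IncrementalTree:
    """Hypothetical in-memory view of a repository, updated per revision."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self._modified: Set[str] = set()

    def apply_revision(self, changes: Dict[str, Optional[bytes]]) -> None:
        """Apply one revision: path -> new content, None meaning deletion."""
        for path, data in changes.items():
            if data is None:
                self.files.pop(path, None)
                self._modified.discard(path)
            else:
                self.files[path] = data
                self._modified.add(path)

    def pop_modified(self) -> List[Tuple[str, bytes]]:
        """Return only objects touched since the previous call, then reset."""
        modified, self._modified = self._modified, set()
        return [(path, self.files[path]) for path in sorted(modified)]


tree = IncrementalTree()
tree.apply_revision({"a.c": b"1", "b.c": b"2"})
first = tree.pop_modified()   # both files are new
tree.apply_revision({"a.c": b"3"})
second = tree.pop_modified()  # only a.c changed in this revision
```

With this shape, the cost of each revision depends on the number of changed paths rather than on the size of the whole repository.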

Oct 14 2022, 4:29 PM
anlambert updated the diff for D8682: Improve CVS loader performances.

Use from_disk.Directory.collect to get added/modified objects instead of maintaining a set of paths.

Oct 14 2022, 4:25 PM
anlambert added a comment to D8682: Improve CVS loader performances.
In D8682#226118, @olasd wrote:
In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.

Oh nice, I did not know we already had such a feature in swh-model. I will try to use it and adapt the implementation if needed.
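
The change tracking described in the quoted comment (a collect call returns only the nodes that changed since the previous call) can be pictured with a toy tree. This is a sketch of the behaviour only, assuming a simple "collected" flag per node; `ToyNode` is hypothetical and is not how swh.model implements it.

```python
from typing import Dict, Optional, Set


class ToyNode:
    """Toy tree node that remembers whether it was already collected."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: Dict[str, "ToyNode"] = {}
        self.parent: Optional["ToyNode"] = None
        self.collected = False

    def add(self, child: "ToyNode") -> None:
        self.children[child.name] = child
        child.parent = self
        # A change below invalidates every ancestor up to the root.
        node: Optional["ToyNode"] = self
        while node is not None:
            node.collected = False
            node = node.parent

    def collect(self) -> Set["ToyNode"]:
        """Return nodes changed since the last call and mark them collected."""
        if self.collected:
            return set()
        self.collected = True
        changed = {self}
        for child in self.children.values():
            changed |= child.collect()
        return changed


root = ToyNode("root")
src = ToyNode("src")
root.add(src)
initial = root.collect()    # everything is new on the first call
unchanged = root.collect()  # nothing changed in between
main_c = ToyNode("main.c")
src.add(main_c)
after_add = root.collect()  # the new file and its ancestors
```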

Oct 14 2022, 4:06 PM
anlambert requested review of D8686: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.
Oct 14 2022, 4:03 PM
anlambert added a revision to T4633: Make MerkleNode.collect return a set of MerkleNode instead of a dict: D8686: merkle: Make MerkleNode.collect return a set of nodes instead of a dict.
Oct 14 2022, 4:00 PM · Data Model
anlambert closed T3858: Add diff features for class from_disk.Directory as Invalid.

Closing this as invalid, as there already exists a method named collect in the merkle.MerkleNode class (the base of from_disk.Directory) that does exactly what is detailed in the task description.
Nevertheless, that method could be improved to give more flexibility in client code (T4633).

Oct 14 2022, 3:50 PM · Data Model
anlambert triaged T4633: Make MerkleNode.collect return a set of MerkleNode instead of a dict as Normal priority.
Oct 14 2022, 3:45 PM · Data Model
anlambert requested review of D8684: rlog: Skip rlog entry with missing header in RlogConv.parse_rlog.
Oct 14 2022, 1:53 PM
anlambert added a comment to T4625: staging: ingest netbsd.org cvs forge.

@vsellier, I found other memory consumption issues in the CVS loader implementation while testing the parsing of the huge rlog output of NetBSD (the script got OOM killed). I fixed those in D8683.
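
As a general illustration of the approach (not the code from D8683), iterating over a file object yields one line at a time, so memory use stays bounded by the longest line instead of the whole output. The function name `iter_lines_with_prefix` and the sample data are made up for this sketch.

```python
import io
from typing import BinaryIO, Iterator


def iter_lines_with_prefix(stream: BinaryIO, prefix: bytes) -> Iterator[bytes]:
    """Yield matching lines one at a time, never holding the whole stream."""
    # Iterating a binary file object reads it lazily, line by line.
    for line in stream:
        if line.startswith(prefix):
            yield line.rstrip(b"\n")


# Made-up sample standing in for a large rlog output.
sample = io.BytesIO(
    b"RCS file: /cvsroot/a.c,v\nhead: 1.2\nRCS file: /cvsroot/b.c,v\n"
)
rcs_files = list(iter_lines_with_prefix(sample, b"RCS file:"))
```

The same function works unchanged on a file opened with `open(path, "rb")`, which is where the memory saving matters.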

Oct 14 2022, 11:44 AM · System administration, Archive coverage
anlambert requested review of D8683: loader, cvsclient: Read files line by line to reduce memory consumption.
Oct 14 2022, 11:40 AM
anlambert added a comment to D8682: Improve CVS loader performances.
In D8682#226118, @olasd wrote:
In D8682#226117, @olasd wrote:

swh.model.from_disk.Directory has a collect method which is supposed to do the change tracking by itself (it only returns the nodes that have changed since the last time .collect() was called). This should allow you to drop the modified_paths tracking altogether.

Ah, collect uses get_data which yields a bunch of dicts. Meh. That should probably be updated to just yield the nodes themselves.

Oct 14 2022, 11:05 AM

Oct 13 2022

anlambert added a comment to T4625: staging: ingest netbsd.org cvs forge.

The loader got killed after it started consuming a lot of memory...

Warning: Permanently added 'anoncvs.netbsd.org,199.233.217.198' (RSA) to the list of known hosts.
DEBUG:swh.loader.cvs.loader.CvsLoader:Fetching CVS rlog from anoncvs.netbsd.org:/cvsroot/src
Killed
swh@loader-cvs-manual:~$
Oct 13 2022, 7:36 PM · System administration, Archive coverage
anlambert requested review of D8682: Improve CVS loader performances.
Oct 13 2022, 7:35 PM
anlambert closed Restricted Maniphest Task, a subtask of T4625: staging: ingest netbsd.org cvs forge, as Resolved.
Oct 13 2022, 7:31 PM · System administration, Archive coverage
anlambert closed D8677: loader: Raise NotFound for missing CVS module when using pserver or ssh.
Oct 13 2022, 7:31 PM
anlambert committed rDLDCVS965c3de498f5: loader: Raise NotFound for missing CVS module when using pserver or ssh (authored by anlambert).
loader: Raise NotFound for missing CVS module when using pserver or ssh
Oct 13 2022, 7:31 PM
anlambert updated the diff for D8677: loader: Raise NotFound for missing CVS module when using pserver or ssh.

Rebase

Oct 13 2022, 7:27 PM
anlambert closed D8675: cvsclient: Handle error in fetch_rlog when path does not exist.
Oct 13 2022, 7:26 PM
anlambert committed rDLDCVS356dfa27f71d: cvsclient: Handle error in fetch_rlog when path does not exist (authored by anlambert).
cvsclient: Handle error in fetch_rlog when path does not exist
Oct 13 2022, 7:26 PM
anlambert updated the diff for D8675: cvsclient: Handle error in fetch_rlog when path does not exist.

Remove unused test archive

Oct 13 2022, 7:22 PM
anlambert updated the diff for D8675: cvsclient: Handle error in fetch_rlog when path does not exist.

Use endswith

Oct 13 2022, 7:18 PM
anlambert accepted D8681: maven: Use real data from github API + rely on requests_mock_datadir.
Oct 13 2022, 7:15 PM
anlambert accepted D8679: maven: Make assertions more useful.
Oct 13 2022, 5:52 PM
anlambert committed rDLS82b936a277af: rubygems: Fix debug log (authored by anlambert).
rubygems: Fix debug log
Oct 13 2022, 4:41 PM
anlambert accepted D8673: packagist: Canonicalize github origins.
Oct 13 2022, 4:34 PM
anlambert accepted D8674: Fix _sanitize_github_url removing suffixes too greedily.
Oct 13 2022, 4:32 PM
anlambert requested review of D8677: loader: Raise NotFound for missing CVS module when using pserver or ssh.
Oct 13 2022, 4:26 PM
anlambert requested review of D8675: cvsclient: Handle error in fetch_rlog when path does not exist.
Oct 13 2022, 3:42 PM
anlambert added inline comments to D8673: packagist: Canonicalize github origins.
Oct 13 2022, 1:51 PM
anlambert requested changes to D8673: packagist: Canonicalize github origins.
Oct 13 2022, 1:21 PM
anlambert accepted D8672: packagist: Actually test listed origins.
Oct 13 2022, 12:11 PM
anlambert added inline comments to D8671: Add a job running swh-mirror tests.
Oct 13 2022, 11:26 AM
anlambert added inline comments to D8616: cpan: Align loader implementation with latest lister improvements.
Oct 13 2022, 10:47 AM

Oct 12 2022

anlambert requested review of D8669: Update user documentation for listers and loaders.
Oct 12 2022, 5:28 PM
anlambert added a revision to T3117: Publish status of existing listers and loaders: D8669: Update user documentation for listers and loaders.
Oct 12 2022, 5:11 PM · Documentation, Roadmap 2022, meta-task, Community Building, Roadmap 2021
anlambert committed rDWAPPS328198ab410e: archive_coverage: Move related assets in application folder (authored by anlambert).
archive_coverage: Move related assets in application folder
Oct 12 2022, 3:05 PM
anlambert committed rDWAPPSf9c4becd2950: package.json: Upgrade dependencies (authored by anlambert).
package.json: Upgrade dependencies
Oct 12 2022, 3:05 PM
anlambert closed D8667: archive_coverage: Add new origin types.
Oct 12 2022, 3:05 PM
anlambert committed rDWAPPS1938d73c2510: archive_coverage: Add new origin types (authored by anlambert).
archive_coverage: Add new origin types
Oct 12 2022, 3:05 PM
anlambert requested review of D8667: archive_coverage: Add new origin types.
Oct 12 2022, 2:08 PM
anlambert renamed T4614: Deploy swh-search v0.16.4 from Deploy swh-search v0.16.3 to Deploy swh-search v0.16.4.
Oct 12 2022, 1:35 PM · System administration, Archive search
anlambert added inline comments to D8665: Pubdev: Add raw_extrinsic_metadata.
Oct 12 2022, 1:33 PM
anlambert closed D8616: cpan: Align loader implementation with latest lister improvements.
Oct 12 2022, 11:38 AM