Archive coverage
Active · Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Details

Description

Tasks related to extending the coverage of the Software Heritage archive.

Recent Activity

Fri, May 7

zack added a subtask for T3315: archive SourceForge: T735: SourceForge lister.
Fri, May 7, 5:25 PM · Archive coverage
zack triaged T3315: archive SourceForge as Normal priority.
Fri, May 7, 5:25 PM · Archive coverage

Thu, May 6

zack added a comment to T3311: Use .gitmodules to discover origins.

I think the only issue with (3) is not being retroactive

Thu, May 6, 6:49 PM · Archive coverage, Git loader
vlorentz added a project to T3311: Use .gitmodules to discover origins: Archive coverage.
Thu, May 6, 6:34 PM · Archive coverage, Git loader

Apr 12 2021

vlorentz added a comment to T3235: Add archival of bug tracker databases as well as an unofficial bug tracker per-project.

You are likely doing a git pull on a periodic basis. Just add git bug bridge pull [<name>] next to it.

Apr 12 2021, 3:37 PM · Archive coverage, Data Model
libEqualizer added a comment to T3235: Add archival of bug tracker databases as well as an unofficial bug tracker per-project.

However, this would require considerable work

Apr 12 2021, 2:48 PM · Archive coverage, Data Model
vlorentz triaged T3235: Add archival of bug tracker databases as well as an unofficial bug tracker per-project as Wishlist priority.

Hi, thanks for the suggestion.

Apr 12 2021, 11:31 AM · Archive coverage, Data Model

Mar 30 2021

zack added a comment to T2833: cpan.loader - preserver Perl modules from CPAN.

awesome, thanks @joenio ! you can also drop by our other devel communication channel if you want to discuss this in other ways: https://www.softwareheritage.org/community/developers/

Mar 30 2021, 3:29 PM · Archive coverage
joenio added a comment to T2833: cpan.loader - preserver Perl modules from CPAN.

Thanks @zack for the info, I'll start learning the SWH dev stack following the instructions I found in the wiki[1].

Mar 30 2021, 2:27 PM · Archive coverage
zack renamed T2833: cpan.loader - preserver Perl modules from CPAN from [feature request] cpan.loader - preserver Perl modules from CPAN to cpan.loader - preserver Perl modules from CPAN.
Mar 30 2021, 8:22 AM · Archive coverage
zack raised the priority of T2833: cpan.loader - preserver Perl modules from CPAN from Wishlist to Normal.
Mar 30 2021, 8:22 AM · Archive coverage
zack added a comment to T2833: cpan.loader - preserver Perl modules from CPAN.

Hey, yes, we want to have one, but nobody is working on it at the moment, and we would rather have someone knowledgeable about that ecosystem work on it. So, if you're interested, you're more than welcome to help there! (And thank you in advance.)

Mar 30 2021, 8:21 AM · Archive coverage
joenio added a comment to T2833: cpan.loader - preserver Perl modules from CPAN.

Hi SWH devs,

Mar 30 2021, 1:56 AM · Archive coverage

Mar 17 2021

rdicosmo added a comment to T1724: Maven Central repository Lister.

After recent exchanges with @hboutemy and Charles Sabourdin, here is a clarification of the scope of this task.
We need a Maven repository lister that addresses the following issues:

Mar 17 2021, 10:40 AM · GSoC 2019, Archive coverage

Mar 15 2021

ardumont added a comment to T3095: Add LIP6 gitlab instance to regular crawling list.

Listing deployed in production:

swhscheduler@saatchi:~$ swh scheduler --url http://saatchi.internal.softwareheritage.org:5008/ task add list-gitlab-incremental url="https://gitlab.lip6.fr/api/v4/" instance=lip6
Created 1 tasks
Mar 15 2021, 11:52 AM · Scientific Community Building, Archive coverage
ardumont added a comment to T3095: Add LIP6 gitlab instance to regular crawling list.

Everything went fine:

worker1.internal.staging.swh.network: Mar 15 10:37:01 worker1 python3[2277003]: [2021-03-15 10:37:01,800: INFO/MainProcess] Received task: swh.lister.gitlab.tasks.IncrementalGitLabLister[86f12806-f321-4ea1-8438-83c6fd0c457b]
worker1.internal.staging.swh.network: Mar 15 10:37:06 worker1 python3[2277067]: [2021-03-15 10:37:06,017: INFO/ForkPoolWorker-4] Task swh.lister.gitlab.tasks.IncrementalGitLabLister[86f12806-f321-4ea1-8438-83c6fd0c457b] succeeded in 4.211625785101205s: {'pages': 5, 'origins': 64}
Mar 15 2021, 11:42 AM · Scientific Community Building, Archive coverage
ardumont added a comment to T3095: Add LIP6 gitlab instance to regular crawling list.

Checking in staging first, with:

Mar 15 2021, 11:39 AM · Scientific Community Building, Archive coverage

Mar 11 2021

rdicosmo added a comment to T1724: Maven Central repository Lister.

@hboutemy : I wonder if you are aware that we now have a grant program in place that can fund the development of listers like this one.
All the information is available at https://www.softwareheritage.org/grants and you can mail me for more info if needed.

Mar 11 2021, 8:32 PM · GSoC 2019, Archive coverage

Mar 8 2021

rdicosmo renamed T3095: Add LIP6 gitlab instance to regular crawling list from Ad LIP6 gitlab instance to regular crawling list to Add LIP6 gitlab instance to regular crawling list.
Mar 8 2021, 7:02 PM · Scientific Community Building, Archive coverage
rdicosmo updated subscribers of T3095: Add LIP6 gitlab instance to regular crawling list.
Mar 8 2021, 7:02 PM · Scientific Community Building, Archive coverage
rdicosmo raised the priority of T3095: Add LIP6 gitlab instance to regular crawling list from Normal to High.

We would like to see this in prod as soon as reasonably possible.

Mar 8 2021, 5:43 PM · Scientific Community Building, Archive coverage
rdicosmo updated the task description for T3098: Save VLC's forge/repositories.
Mar 8 2021, 4:03 PM · Archive coverage
anlambert added a comment to T3098: Save VLC's forge/repositories.

There is also the VideoLAN GitLab instance (which will replace the cgit forge) to archive, located at https://code.videolan.org/.

Mar 8 2021, 3:42 PM · Archive coverage
anlambert added a project to T3098: Save VLC's forge/repositories: Archive coverage.
Mar 8 2021, 3:39 PM · Archive coverage

Mar 7 2021

rdicosmo triaged T3095: Add LIP6 gitlab instance to regular crawling list as Normal priority.
Mar 7 2021, 8:40 AM · Scientific Community Building, Archive coverage

Feb 10 2021

ardumont added a comment to T376: ingest git.eclipse.org repositories.

new listers

Feb 10 2021, 1:38 PM · Archive coverage
rdicosmo added a comment to T376: ingest git.eclipse.org repositories.

Note that this does not mean it is or will be ingested anytime soon, though.
We are still missing at least the one cog to actually schedule those listed origins.

More details in T2345#58247

Feb 10 2021, 12:31 PM · Archive coverage
ardumont placed T376: ingest git.eclipse.org repositories up for grabs.
Feb 10 2021, 9:20 AM · Archive coverage
ardumont added a comment to T376: ingest git.eclipse.org repositories.

Note that this does not mean it is or will be ingested anytime soon, though.
We are still missing at least the one cog to actually schedule those listed origins.

Feb 10 2021, 9:20 AM · Archive coverage

Feb 8 2021

olasd added a comment to T2345: Improve handling of recurrent loading tasks in scheduler.

Here's my understanding of the status of the migration to the next generation scheduler as of today:

Feb 8 2021, 12:01 PM · Sprint 2021 01, Archive coverage, Scheduling utilities
vlorentz reassigned T2973: Implement a scheduler simulator from vlorentz to olasd.
Feb 8 2021, 12:00 PM · Sprint 2021 01, Archive coverage, Scheduling utilities

Feb 4 2021

ardumont added a comment to T376: ingest git.eclipse.org repositories.

Instance cgit scheduled [1]

Feb 4 2021, 9:29 AM · Archive coverage

Feb 2 2021

anlambert closed T2442: Provide a unified API for listers to interact with the scheduler, a subtask of T2345: Improve handling of recurrent loading tasks in scheduler, as Resolved.
Feb 2 2021, 4:08 PM · Sprint 2021 01, Archive coverage, Scheduling utilities

Feb 1 2021

rdicosmo added a comment to T376: ingest git.eclipse.org repositories.

Thanks @ardumont , that's great! If you think this does not need any more support on the Eclipse side, may you let them know?

Feb 1 2021, 5:59 PM · Archive coverage
rdicosmo added a comment to T376: ingest git.eclipse.org repositories.

Thanks @ardumont , that's great! If you think this does not need any more support on the Eclipse side, may you let them know?

Feb 1 2021, 5:58 PM · Archive coverage
ardumont added a comment to T376: ingest git.eclipse.org repositories.

With the latest improvement, we listed the instance in one request [1]

Feb 1 2021, 5:25 PM · Archive coverage

Jan 29 2021

ardumont closed T2999: Optimize the number of HTTP requests sent by the cgit lister, a subtask of T376: ingest git.eclipse.org repositories, as Resolved.
Jan 29 2021, 5:36 PM · Archive coverage
ardumont added a revision to T376: ingest git.eclipse.org repositories: D4968: cgit: Compute origin urls out of a base git url when provided..
Jan 29 2021, 12:21 PM · Archive coverage
ardumont added a comment to T376: ingest git.eclipse.org repositories.

The 500 seems normal

Jan 29 2021, 11:51 AM · Archive coverage
ardumont added a comment to T376: ingest git.eclipse.org repositories.

yes, agreed.

Jan 29 2021, 10:34 AM · Archive coverage
rdicosmo added a comment to T376: ingest git.eclipse.org repositories.

Thanks @ardumont for experimenting with this. The 500 seems normal: we need to tell Eclipse about us first, I'll put you in touch. So maybe it's still a no-brainer, and we just need to document the "contact the owner to get whitelisted" human step :-)

Jan 29 2021, 10:04 AM · Archive coverage
ardumont added a subtask for T376: ingest git.eclipse.org repositories: T2999: Optimize the number of HTTP requests sent by the cgit lister.
Jan 29 2021, 9:24 AM · Archive coverage

Jan 28 2021

ardumont added a comment to T376: ingest git.eclipse.org repositories.

In the context of deploying the next gen lister in staging (T2998), I also tried the Eclipse cgit instance

Jan 28 2021, 5:09 PM · Archive coverage

Jan 25 2021

rdicosmo assigned T376: ingest git.eclipse.org repositories to ardumont.
Jan 25 2021, 9:03 PM · Archive coverage
rdicosmo raised the priority of T376: ingest git.eclipse.org repositories from Low to High.

Now that we have a cgit lister, this should be a no brainer.
If that's the case, we need it up and running quickly.

Jan 25 2021, 9:03 PM · Archive coverage
ardumont closed T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler, a subtask of T2345: Improve handling of recurrent loading tasks in scheduler, as Resolved.
Jan 25 2021, 8:42 AM · Sprint 2021 01, Archive coverage, Scheduling utilities

Jan 20 2021

vlorentz added a comment to T2974: Define (and implement) scheduler performance metrics.
  • "'outdatedest' origin": excluding disabled origins and origins visited after their last_activity (if any), the min(current_time - last_visit) (lower is better)
Jan 20 2021, 5:33 PM · Sprint 2021 01, Archive coverage, Scheduling utilities

Jan 18 2021

douardda added a comment to T2974: Define (and implement) scheduler performance metrics.

thanks, looks like a good starting point.

Jan 18 2021, 4:36 PM · Sprint 2021 01, Archive coverage, Scheduling utilities
olasd added a comment to T2974: Define (and implement) scheduler performance metrics.
  • "origins with pending changes": Number of origins where last_visit < last_activity (lower is better)
Jan 18 2021, 2:29 PM · Sprint 2021 01, Archive coverage, Scheduling utilities
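
A minimal sketch of the "origins with pending changes" metric quoted above, assuming each origin carries optional last_visit and last_activity timestamps. The OriginStatus class and count_pending_origins function are hypothetical names for illustration, not part of the swh.scheduler API. Treating a never-visited origin with known activity as pending is also an assumption; the comment only states the last_visit < last_activity comparison.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class OriginStatus:
    """Hypothetical record; field names mirror those used in the comment."""
    url: str
    last_visit: Optional[datetime]      # None: never visited
    last_activity: Optional[datetime]   # None: no activity known


def count_pending_origins(origins: Iterable[OriginStatus]) -> int:
    """Count origins where last_visit < last_activity (lower is better)."""
    pending = 0
    for origin in origins:
        if origin.last_activity is None:
            # Assumption: without known activity, nothing can be pending.
            continue
        # Assumption: a never-visited origin with activity counts as pending.
        if origin.last_visit is None or origin.last_visit < origin.last_activity:
            pending += 1
    return pending
```

An origin whose last visit is at or after its last activity is not counted, since the comparison is strict.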
olasd added a comment to T2974: Define (and implement) scheduler performance metrics.

Some potentially interesting and "easy" metrics:

Jan 18 2021, 2:27 PM · Sprint 2021 01, Archive coverage, Scheduling utilities