
Jun 2 2019

hboutemy renamed T1724: Maven Central repository support from Maven Lister to Maven Central repository Lister.
Jun 2 2019, 12:10 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage

May 31 2019

nahimilega added a comment to T1724: Maven Central repository support.

Creating Your Own Mirror
The size of the central repository is increasing steadily. To save us bandwidth and you time, mirroring the entire central repository is not allowed. (Doing so will get you automatically banned.) Instead, we suggest you set up a repository manager as a proxy.

May 31 2019, 7:21 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega added a comment to T1724: Maven Central repository support.

It is not recommended that you scrape or rsync:// a full copy of central as there is a large amount of data there and doing so will get you banned. You can use a program such as those described on the Repository Management page to run your internal repository's server, download from the internet as required, and then hold the artifacts in your internal repository for faster downloading later.

May 31 2019, 4:47 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega added a comment to T1722: GNU Lister.

In my view, the best way to check for a tarball would be to split the filename on "." and check whether the component between the last and second-to-last "." is "tar". If it is "tar", the file is useful; otherwise the file does not contain source code.
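
A minimal sketch of that check in Python. The function name `looks_like_tarball` is made up for illustration, and the rule is only the one described in this comment (the second-to-last dot-separated component equals "tar").

```python
def looks_like_tarball(filename: str) -> bool:
    """Return True when the second-to-last dot-separated part is 'tar'."""
    parts = filename.split(".")
    # Need at least three parts: name, "tar", and a suffix (e.g. foo.tar.gz)
    return len(parts) >= 3 and parts[-2] == "tar"
```

Note that, with this rule as stated, a plain `.tar` file, a `.tgz` file, or a detached signature such as `foo.tar.gz.sig` is not recognised as a tarball.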

May 31 2019, 12:27 PM · Archive coverage
nahimilega updated subscribers of T1776: packagist (PHP) Lister.

There is a total of 220570 packages.
I wrote a short script to analyse the VCS and code hosting platform used by the packages; here is the result.

May 31 2019, 12:14 PM · Lister, Archive coverage
nahimilega triaged T1776: packagist (PHP) Lister as Normal priority.
May 31 2019, 12:04 PM · Lister, Archive coverage

May 30 2019

nahimilega added a comment to T1722: GNU Lister.

Here are the extensions which have tar in their name

May 30 2019, 5:29 PM · Archive coverage
nahimilega added a comment to T1722: GNU Lister.

Here is the list of all the different extensions present on the GNU website, with a link to one example of each. The only way I found to identify the extensions is the one mentioned at https://stackoverflow.com/a/35188296/10424705 , but since a lot of GNU packages use "." in filenames to denote version numbers, there is no good way to uniquely identify all the extensions. Although I have tuned the approach to reduce redundancy, you may still find some extensions appearing more than once.

May 30 2019, 5:00 PM · Archive coverage

May 29 2019

nahimilega closed T1774: Create a lister for x.org as Invalid.
May 29 2019, 12:46 PM · Archive coverage
nahimilega added a comment to T1774: Create a lister for x.org.

@zack I agree that archiving https://www.x.org/releases/individual/ is virtually unnecessary because it is a git repo. However, I was concerned about archiving tarballs of other projects which are only present on https://www.x.org/releases/, like x.org/releases/X11R6.8.0/.
However, as you mention about

May 29 2019, 12:45 PM · Archive coverage
zack added a comment to T1774: Create a lister for x.org.

I don't like the idea of this lister.

May 29 2019, 11:47 AM · Archive coverage
nahimilega added a comment to T1734: Create a Lister for launchpad.net.

In my view, we can combine the best of both options to build the lister.
We can use the bare API to list the projects and then use launchpadlib to get all the branches of each project.
This way, we get the indexing quality of the bare API and the simplicity of launchpadlib.

May 29 2019, 10:22 AM · Lister, Archive coverage
nahimilega triaged T1774: Create a lister for x.org as Normal priority.
May 29 2019, 12:01 AM · Archive coverage

May 28 2019

nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 28 2019, 8:36 PM · Lister, Archive coverage
nahimilega added a comment to T1734: Create a Lister for launchpad.net.

Launchpadlib
Pros
The library is available on Debian stretch.
It is easier and faster to get all the branches of a project, as it returns them in one go, whereas the bare API returns them in an indexed fashion.

May 28 2019, 8:35 PM · Lister, Archive coverage
nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 28 2019, 8:05 PM · Lister, Archive coverage

May 27 2019

nahimilega updated subscribers of T1718: Implement a NuGet(.NET) lister.
May 27 2019, 2:16 PM · Archive coverage
anlambert closed T1378: Ingest npm into the Software Heritage archive (meta task) as Resolved.

npm is now archived continuously by Software Heritage (see for instance: https://archive.softwareheritage.org/browse/origin/https://www.npmjs.com/package/webpack/).

May 27 2019, 1:58 PM · Origin-npm, Archive coverage
anlambert closed T1761: Add npm logo to the archive coverage page, a subtask of T1378: Ingest npm into the Software Heritage archive (meta task), as Resolved.
May 27 2019, 1:41 PM · Origin-npm, Archive coverage
anlambert closed T1761: Add npm logo to the archive coverage page as Resolved by committing rDWAPPS26f94f6244a7: misc/coverage: Add npm origin type.
May 27 2019, 1:41 PM · Origin-npm, Archive coverage
anlambert triaged T1761: Add npm logo to the archive coverage page as Normal priority.
May 27 2019, 10:51 AM · Origin-npm, Archive coverage

May 25 2019

zack added a comment to T1378: Ingest npm into the Software Heritage archive (meta task).

I think the only thing missing here is adding the NPM logo to the archive coverage page.

May 25 2019, 5:37 PM · Origin-npm, Archive coverage
zack renamed T1002: ingest Hackage, the Haskell package repository (meta task) from ingest Hackage (Haskell package repository) into the Software Heritage archive (meta task) to ingest Hackage, the Haskell package repository (meta task).
May 25 2019, 5:22 PM · Hackage loader, Hackage lister, Archive coverage
zack closed T328: svn / subversion loader, a subtask of T617: ingest Google Code Subversion repositories, as Resolved.
May 25 2019, 5:01 PM · Archive coverage, Origin-GoogleCode, SVN Loader
zack added a comment to T1246: pypi loader: Analyze existing errors.

how many are left? can we close this as well as T419 now that the PyPI listers/loaders have been in production for a while?

May 25 2019, 5:00 PM · Archive coverage, Origin-Pypi
zack renamed T561: ingest bitbucket (meta task) from ingest bitbucket repositories (meta task) to ingest bitbucket (meta task).
May 25 2019, 4:58 PM · Archive coverage, Origin-Bitbucket

May 24 2019

nahimilega added a comment to T1724: Maven Central repository support.

Extending on what I wrote in the previous comment, I did a bit more research about this.

May 24 2019, 4:13 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 24 2019, 12:09 PM · Lister, Archive coverage
nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 24 2019, 11:49 AM · Lister, Archive coverage
nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 24 2019, 11:48 AM · Lister, Archive coverage

May 23 2019

nahimilega added a comment to T1724: Maven Central repository support.

As recommended by @olasd, I checked out the Maven Central index (https://repo.maven.apache.org/maven2/.index/) this is a

May 23 2019, 3:00 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage

May 22 2019

nahimilega updated the task description for T1724: Maven Central repository support.
May 22 2019, 11:02 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega updated subscribers of T1724: Maven Central repository support.

Comment by @olasd

May 22 2019, 11:02 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega renamed T1724: Maven Central repository support from Maven Central (JAVA) lister to Maven Lister.
May 22 2019, 11:01 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega updated subscribers of T1734: Create a Lister for launchpad.net.
May 22 2019, 10:37 PM · Lister, Archive coverage
nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 22 2019, 10:26 PM · Lister, Archive coverage
nahimilega triaged T1734: Create a Lister for launchpad.net as Normal priority.
May 22 2019, 7:57 PM · Lister, Archive coverage

May 21 2019

nahimilega added a revision to T1724: Maven Central repository support: D1497: Maven Lister.
May 21 2019, 3:31 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega added a revision to T1709: implement an R-cran lister: D1492: CRAN Lister.
May 21 2019, 1:55 PM · GSoC 2019, Archive coverage

May 20 2019

olasd updated subscribers of T1389: Implement a base "package" loader for package managers.

We've discussed a plausible plan for a "base package manager loader" with @ardumont and, to some extent, @anlambert.

May 20 2019, 6:03 PM · Origin-Debian, Origin-CRAN, Origin-GNU, Origin-npm, Origin-Pypi, Archive coverage

May 17 2019

nahimilega triaged T1724: Maven Central repository support as Normal priority.
May 17 2019, 11:26 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
ardumont triaged T1723: GNU Loader as Normal priority.
May 17 2019, 3:11 PM · Archive coverage
ardumont triaged T1722: GNU Lister as Normal priority.
May 17 2019, 3:10 PM · Archive coverage

May 16 2019

nahimilega updated subscribers of T1351: (periodically) ingest GNU package releases.

As suggested by @olasd, what was done in 2015 to ingest packages -

  1. Create origins for all the folders indiscriminately
  2. Only import things that look like tarballs (i.e. that end with .tar.something)
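
The two steps above could be sketched roughly as follows. The flat list of paths and the helper name `group_tarballs_by_folder` are hypothetical; how the 2015 import actually represented folders and origins is not described here.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# "ends with .tar.something": a literal ".tar." followed by a single suffix
TARBALL_RE = re.compile(r"\.tar\.[^./]+$")


def group_tarballs_by_folder(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each folder (one origin per folder) to its tarball filenames."""
    origins: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        folder, _, filename = path.rpartition("/")
        # Step 1: every folder becomes an origin, even if it has no tarball
        files = origins[folder]
        # Step 2: only keep files that look like tarballs
        if TARBALL_RE.search(filename):
            files.append(filename)
    return dict(origins)
```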
May 16 2019, 6:12 PM · Archive coverage
nahimilega added a comment to T1718: Implement a NuGet(.NET) lister.

@olasd recommended trying the listing approach we discussed for the NuGet lister (looking for the repository key in the API response). As recommended, I tried the approach on a small dataset of 1412 repositories, all of them quite recent. I found 0 repository URLs among them, and in 900 of them the repository key was empty (i.e. a blank string). I think we need to change our approach.
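
For reference, a tally like the one above could be reproduced with something along these lines. The input shape (a list of dicts with an optional `repository` string) is an assumption made for illustration, not the documented NuGet API response format.

```python
from typing import Dict, Iterable, Mapping, Optional


def tally_repository_keys(
    entries: Iterable[Mapping[str, Optional[str]]]
) -> Dict[str, int]:
    """Count entries with a usable, empty, or absent 'repository' key."""
    counts = {"with_url": 0, "empty": 0, "missing": 0}
    for entry in entries:
        if "repository" not in entry:
            counts["missing"] += 1
        elif not (entry["repository"] or "").strip():
            # Key present but blank (empty string, whitespace, or None)
            counts["empty"] += 1
        else:
            counts["with_url"] += 1
    return counts
```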

May 16 2019, 4:32 PM · Archive coverage
faux added a project to T1721: Implementation of Gogs Lister: Archive coverage.
May 16 2019, 2:50 PM · Gogs lister, Origin-Gitea/Gogs, Archive coverage, Lister
ardumont updated the task description for T1351: (periodically) ingest GNU package releases.
May 16 2019, 1:44 PM · Archive coverage
nahimilega updated subscribers of T1718: Implement a NuGet(.NET) lister.
May 16 2019, 12:39 PM · Archive coverage
nahimilega added a comment to T1718: Implement a NuGet(.NET) lister.

As discussed on IRC, very few of the repositories include a source code link, and the API response does not mention the version control system used by a repository.
One option: the repository URL and repository type fields are present in the .nuspec file of each project, so we would have to download that file for each project to get the source URL. The problem is that downloading all binary packages for a small chance of finding a link to a source repository sounds like a lot of work, bandwidth and computing power for not much gain, and that would only cover one of the ways package maintainers can set the source code information; the aforementioned blog post listed at least four
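
To illustrate the .nuspec route, here is a rough sketch that pulls the repository type and URL out of a .nuspec document once it has been downloaded. It assumes the fields live in a `<repository type="..." url="..."/>` element; the sample XML and the function name are made up for illustration, and the download step is left out.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def repository_from_nuspec(
    nuspec_xml: str,
) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (type, url) of the first <repository> element, or None."""
    root = ET.fromstring(nuspec_xml)
    for element in root.iter():
        # Drop any "{namespace}" prefix so the schema version does not matter
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "repository":
            return element.get("type"), element.get("url")
    return None


# Made-up sample document, for illustration only
SAMPLE = """<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>Example.Package</id>
    <version>1.0.0</version>
    <repository type="git" url="https://example.org/example/package.git" />
  </metadata>
</package>"""

print(repository_from_nuspec(SAMPLE))
```

This does not address the cost concern raised above: the package still has to be fetched before its .nuspec can be read.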

May 16 2019, 12:38 PM · Archive coverage

May 15 2019

eddelbuettel added a comment to T1709: implement an R-cran lister.

Yes, for history I do not believe we have an easy answer for SWH (and the world at large) to consume. We may have approximations; I'll check with Gabor and others.

May 15 2019, 10:22 PM · GSoC 2019, Archive coverage
zack added a comment to T1709: implement an R-cran lister.

@eddelbuettel yeah, if there isn't a standard way to go all the way back in time, it's OK to currently only ingest what's currently returned as available. In the medium/long term it will converge to having archived everything (w.r.t. the considered time frame) anyway. And we can always retrofit later on stuff that is archived elsewhere. But I wouldn't want to make this a blocker to start archiving what's (easily) listable now.

May 15 2019, 10:17 PM · GSoC 2019, Archive coverage
eddelbuettel added a comment to T1709: implement an R-cran lister.

Nicholas: Sadly, one can't. I kinda/sorta have that implicitly as I have been running CRANberries since 2007 or so.

May 15 2019, 9:31 PM · GSoC 2019, Archive coverage
olasd added a comment to T1709: implement an R-cran lister.

Expanding on what Dirk Eddelbuettel posted on IRC when we talked about that, a minimal R script to fetch the current package information would be:

May 15 2019, 3:37 PM · GSoC 2019, Archive coverage
faux added a comment to T1709: implement an R-cran lister.

@nahimilega it is probably a two line script. install R and do readRDS() and you will get a data.frame object which is just like a table and has columns and then you can extract what you want. Cheers :). BTW when I did readRDS it retrieved a lot of links and I don't know about the lister that much but you can pickup from there.

May 15 2019, 2:06 PM · GSoC 2019, Archive coverage
nahimilega added a comment to T1718: Implement a NuGet(.NET) lister.

API Documentation -
https://docs.microsoft.com/en-us/nuget/api/catalog-resource#base-url

May 15 2019, 1:34 PM · Archive coverage
nahimilega triaged T1718: Implement a NuGet(.NET) lister as Normal priority.
May 15 2019, 1:31 PM · Archive coverage
nahimilega added a comment to T1709: implement an R-cran lister.

@olasd I am not familiar with the R language. Learning the basics and writing this script would take me around a week. I was wondering whether someone at Software Heritage who has some experience with R could write this script, as it would be a matter of minutes for a person who knows R.
Is it possible to do so?

May 15 2019, 11:33 AM · GSoC 2019, Archive coverage
olasd added a comment to T1709: implement an R-cran lister.

Here is an implementation plan for the R-CRAN lister.
I have taken inspiration from the pypi lister.
To write lister.py for R-CRAN, we need to inherit from the SimpleLister class, override the ingest_data() function, and change its first line (where safely_issue_request() is called) to call a function that runs an R script and returns a JSON response.
After that it is handled like any normal response; we just need to implement the following functions: list_packages, compute url, get_model_from_repo, task_dict and transport_response_simplified.

May 15 2019, 11:14 AM · GSoC 2019, Archive coverage

May 13 2019

nahimilega added a comment to T1709: implement an R-cran lister.

Here is an implementation plan for the R-CRAN lister.
I have taken inspiration from the pypi lister.
To write lister.py for R-CRAN, we need to inherit from the SimpleLister class, override the ingest_data() function, and change its first line (where safely_issue_request() is called) to call a function that runs an R script and returns a JSON response.
After that it is handled like any normal response; we just need to implement the following functions: list_packages, compute url, get_model_from_repo, task_dict and transport_response_simplified.
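
As a rough illustration of the "run an R script and get a JSON response" step, the helper below runs an external command and parses its standard output as JSON. The helper name, the `Rscript list_packages.R` command and the script's output format are all assumptions; the SimpleLister subclass that would call it from its overridden ingest_data() is not shown.

```python
import json
import subprocess
from typing import Any, Sequence


def run_json_command(command: Sequence[str], timeout: float = 300.0) -> Any:
    """Run an external command and decode its stdout as JSON."""
    completed = subprocess.run(
        command,
        check=True,  # raise CalledProcessError on a non-zero exit status
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return json.loads(completed.stdout.decode("utf-8"))


# Hypothetical call from the lister (script name is an assumption):
# packages = run_json_command(["Rscript", "list_packages.R"])
```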

May 13 2019, 9:36 PM · GSoC 2019, Archive coverage
zack renamed T1709: implement an R-cran lister from Implementation of R-cran lister to implement an R-cran lister.
May 13 2019, 1:59 PM · GSoC 2019, Archive coverage
nahimilega updated subscribers of T1709: implement an R-cran lister.

@faux mentioned on IRC that there is a public DB dump (https://cran.r-project.org/web/dbs) which might be helpful for this purpose.
This DB dump contains files with the .rds extension, which is used by the R language. Here are a couple of rows from that DB dump: https://forge.softwareheritage.org/P396

May 13 2019, 1:56 PM · GSoC 2019, Archive coverage
nahimilega triaged T1709: implement an R-cran lister as Normal priority.
May 13 2019, 1:48 PM · GSoC 2019, Archive coverage

Apr 19 2019

vlorentz triaged T1681: Use project metadata as a "lister" as Low priority.
Apr 19 2019, 11:03 PM · Archive coverage, Indexer, Metadata workflow

Apr 18 2019

zack added a subtask for T1451: ingest GNU Savannah Git repositories: T1659: rewrite the CGit lister as a proper lister.
Apr 18 2019, 10:38 AM · Archive coverage
zack renamed T1451: ingest GNU Savannah Git repositories from ingest savannah git repositories to ingest GNU Savannah Git repositories.
Apr 18 2019, 10:03 AM · Archive coverage

Apr 5 2019

ardumont added a comment to T1351: (periodically) ingest GNU package releases.

This is not the case. It is updated every time the ftp directory changes,
so you can use the timestamp of the file to see if there have been
any changes.
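
A minimal sketch of such a timestamp check, assuming the file is fetched over HTTP and the server exposes its modification time in a `Last-Modified` header (neither is confirmed here). The function name is hypothetical, and `last_seen` is expected to be a timezone-aware datetime.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def changed_since(last_modified_header: str, last_seen: Optional[datetime]) -> bool:
    """Return True when the file is newer than the last version we processed."""
    if last_seen is None:
        # Nothing processed yet, so treat the file as changed
        return True
    # Parses HTTP dates such as "Thu, 04 Apr 2019 10:00:00 GMT"
    return parsedate_to_datetime(last_modified_header) > last_seen
```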

Apr 5 2019, 10:44 AM · Archive coverage

Apr 4 2019

iank added a comment to T1351: (periodically) ingest GNU package releases.

It is updated daily at.

Apr 4 2019, 6:37 PM · Archive coverage

Apr 3 2019

ardumont added a comment to T1351: (periodically) ingest GNU package releases.

Heads up, there is now a (compressed) JSON file describing the GNU mirror's directory tree.
It is updated daily at.
It's served at [1]

Apr 3 2019, 11:26 PM · Archive coverage
zack triaged T1623: ingest the Codeplex archive as Normal priority.
Apr 3 2019, 10:04 AM · Archive coverage

Apr 2 2019

anlambert closed T1379: npm loader, a subtask of T1378: Ingest npm into the Software Heritage archive (meta task), as Resolved.
Apr 2 2019, 3:38 PM · Origin-npm, Archive coverage
anlambert closed T1379: npm loader as Resolved by committing rDLDNPM35dd880e7725: swh.loader.npm.tasks: Add npm loader celery task.
Apr 2 2019, 3:38 PM · Npm loader, Archive coverage, Origin-npm

Mar 30 2019

ardumont updated the task description for T1139: ingest major gitlab instances.
Mar 30 2019, 1:21 PM · Archive coverage, Origin-GitLab

Mar 18 2019

nahimilega changed the edit policy for P370 pip list result.
Mar 18 2019, 11:19 AM · Archive coverage

Mar 15 2019

anlambert added a project to T1379: npm loader: Npm loader.
Mar 15 2019, 4:43 PM · Npm loader, Archive coverage, Origin-npm

Mar 12 2019

ardumont added a comment to T1351: (periodically) ingest GNU package releases.

@pombreda on #swh-devel suggested to use rsync -r which seems to
provide what we want!

Mar 12 2019, 6:53 PM · Archive coverage

Mar 2 2019

zack triaged T1562: ingest Caml Light/Heavy as Low priority.
Mar 2 2019, 6:01 PM · Archive coverage

Feb 7 2019

anlambert added a comment to T1389: Implement a base "package" loader for package managers.

The table below summarizes how to list all packages and get their metadata from well-known package managers.

Feb 7 2019, 4:32 PM · Origin-Debian, Origin-CRAN, Origin-GNU, Origin-npm, Origin-Pypi, Archive coverage

Feb 5 2019

anlambert raised the priority of T1389: Implement a base "package" loader for package managers from Wishlist to Normal.
Feb 5 2019, 2:31 PM · Origin-Debian, Origin-CRAN, Origin-GNU, Origin-npm, Origin-Pypi, Archive coverage

Dec 27 2018

zack triaged T1451: ingest GNU Savannah Git repositories as Low priority.
Dec 27 2018, 10:44 AM · Archive coverage
zack lowered the priority of T376: ingest git.eclipse.org repositories from Normal to Low.
Dec 27 2018, 10:43 AM · Archive coverage

Dec 18 2018

vlorentz added a parent task for T1389: Implement a base "package" loader for package managers: T1425: refactor the loader stack for package managers.
Dec 18 2018, 4:57 PM · Origin-Debian, Origin-CRAN, Origin-GNU, Origin-npm, Origin-Pypi, Archive coverage
vlorentz added a project to T1379: npm loader: Archive coverage.
Dec 18 2018, 3:38 PM · Npm loader, Archive coverage, Origin-npm

Dec 8 2018

ardumont added a comment to T1351: (periodically) ingest GNU package releases.

This should probably be split in 2 tasks:

Dec 8 2018, 3:11 PM · Archive coverage

Dec 3 2018

anlambert closed T1398: npm incremental lister, a subtask of T1378: Ingest npm into the Software Heritage archive (meta task), as Resolved.
Dec 3 2018, 6:02 PM · Origin-npm, Archive coverage

Nov 27 2018

anlambert triaged T1389: Implement a base "package" loader for package managers as Wishlist priority.
Nov 27 2018, 12:23 PM · Origin-Debian, Origin-CRAN, Origin-GNU, Origin-npm, Origin-Pypi, Archive coverage

Nov 26 2018

anlambert closed T1380: npm lister, a subtask of T1378: Ingest npm into the Software Heritage archive (meta task), as Resolved.
Nov 26 2018, 11:05 AM · Origin-npm, Archive coverage

Nov 22 2018

ardumont updated the task description for T1139: ingest major gitlab instances.
Nov 22 2018, 4:37 PM · Archive coverage, Origin-GitLab
anlambert triaged T1379: npm loader as Normal priority.
Nov 22 2018, 3:51 PM · Npm loader, Archive coverage, Origin-npm
anlambert triaged T1378: Ingest npm into the Software Heritage archive (meta task) as Normal priority.
Nov 22 2018, 3:43 PM · Origin-npm, Archive coverage

Nov 16 2018

zack triaged T1352: ingest Guix (SD) packages as Normal priority.
Nov 16 2018, 12:09 PM · Archive coverage
zack renamed T1351: (periodically) ingest GNU package releases from periodically ingest GNU package releases to (periodically) ingest GNU package releases.
Nov 16 2018, 12:08 PM · Archive coverage
zack added a project to T1351: (periodically) ingest GNU package releases: Archive coverage.
Nov 16 2018, 12:08 PM · Archive coverage

Oct 29 2018

olasd added a comment to T1139: ingest major gitlab instances.

I had added framagit and 0xacab on Friday, forgot to update the task.

Oct 29 2018, 10:49 AM · Archive coverage, Origin-GitLab
olasd updated the task description for T1139: ingest major gitlab instances.
Oct 29 2018, 10:48 AM · Archive coverage, Origin-GitLab
douardda updated the task description for T1139: ingest major gitlab instances.
Oct 29 2018, 10:01 AM · Archive coverage, Origin-GitLab
douardda updated the task description for T1139: ingest major gitlab instances.
Oct 29 2018, 10:00 AM · Archive coverage, Origin-GitLab

Oct 22 2018

zack added a comment to T1262: wiki: Update suggestion box if `all Debian derivatives` can be noted as ingested.

the internship topic on this is now available here: https://wiki.softwareheritage.org/wiki/Ingest_all_Debian_derivatives_(internship)

Oct 22 2018, 7:59 PM · Archive coverage
olasd added a comment to T1262: wiki: Update suggestion box if `all Debian derivatives` can be noted as ingested.
In T1262#23695, @zack wrote:

That's a very good idea, which I'll be happy to draft as a proper internship proposal. Before doing so, however, can you confirm that, scheduling wise, tracking something like ~100 additional derivatives wouldn't be a problem for us in terms of load?

Oct 22 2018, 6:45 PM · Archive coverage
ardumont updated the task description for T1246: pypi loader: Analyze existing errors.
Oct 22 2018, 10:24 AM · Archive coverage, Origin-Pypi

Oct 18 2018

ardumont added a comment to T1246: pypi loader: Analyze existing errors.

Ok, so reworked the group_by_exception snippet to have a more sensible output:

Oct 18 2018, 11:27 AM · Archive coverage, Origin-Pypi