Creating Your Own Mirror
The size of the central repository is increasing steadily. To save us bandwidth and you time, mirroring the entire central repository is not allowed. (Doing so will get you automatically banned.) Instead, we suggest you set up a repository manager as a proxy.
Jun 2 2019
May 31 2019
It is not recommended that you scrape or rsync:// a full copy of central, as there is a large amount of data there and doing so will get you banned. You can use a program such as those described on the Repository Management page to run your internal repository's server, download from the internet as required, and then hold the artifacts in your internal repository for faster downloading later.
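For Maven clients, pointing builds at such a repository manager is typically done with a mirror entry in ~/.m2/settings.xml; a minimal sketch, where the URL is a placeholder for your internal manager:

```xml
<settings>
  <mirrors>
    <mirror>
      <id>internal-repository</id>
      <name>Repository manager proxying Maven Central</name>
      <!-- Placeholder URL: point this at your own repository manager -->
      <url>https://repo.example.com/repository/maven-central/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```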
In my view, the best method to check for a tarball would be to split the filename on "." and check whether the token between the last and second-to-last "." is "tar". If it is "tar", the file is useful; otherwise the file does not contain source code.
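A minimal sketch of that heuristic (the helper name is hypothetical):

```python
def looks_like_tarball(filename):
    """Split the filename on '.' and check whether the token between
    the last and second-to-last dot is 'tar'."""
    parts = filename.split('.')
    return len(parts) >= 2 and parts[-2] == 'tar'

# looks_like_tarball('hello-2.10.tar.gz') -> True
# looks_like_tarball('hello-2.10.zip')    -> False
```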
There are 220570 packages in total.
I made a short script to analyse the VCS and code hosting platform for each package; here is the result.
May 30 2019
Here are the extensions that have "tar" in their name:
Here is the list of all the different extensions present on the GNU website, with a link to one example of each. I found only one way to identify the extensions, namely the approach described at https://stackoverflow.com/a/35188296/10424705, but since GNU frequently uses "." in filenames to denote version numbers, there is no reliable way to uniquely identify all the extensions. Although I have optimised the approach to reduce redundancy, you may still find some extensions appearing more than once.
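One possible heuristic along those lines, as a sketch: treat trailing dot-separated tokens as the extension until a purely numeric token (likely a version component) is hit. The digit test is an assumption and will still misfire on version components containing letters.

```python
from pathlib import Path

def guess_extension(filename):
    """Collect trailing dot-separated suffixes, stopping at the first
    purely numeric one (e.g. the '0' in 'gcc-9.1.0.tar.xz')."""
    tokens = []
    for suffix in reversed(Path(filename).suffixes):
        token = suffix.lstrip('.')
        if token.isdigit():
            break
        tokens.insert(0, token)
    return '.'.join(tokens)

# guess_extension('gcc-9.1.0.tar.xz') -> 'tar.xz'
# guess_extension('hello-2.10.zip')   -> 'zip'
```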
May 29 2019
@zack I agree that archiving https://www.x.org/releases/individual/ is essentially unnecessary because it is a git repo. However, I was concerned about archiving tarballs of other projects which are only present on https://www.x.org/releases/, like x.org/releases/X11R6.8.0/.
However, as you mention about
I don't like the idea of this lister.
In my view, we can use the best of both options to make the lister.
We can use the bare API to list the projects and then use launchpadlib to get all the branches of each project.
In this way, we could combine the indexing quality of the bare API with the simplicity of launchpadlib.
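A sketch of that hybrid approach (the consumer name is arbitrary, and the assumption that project entries expose getBranches() should be verified against the API reference):

```python
import requests
from launchpadlib.launchpad import Launchpad

# Anonymous API login; 'swh-lister' is an arbitrary consumer name.
lp = Launchpad.login_anonymously('swh-lister', 'production', version='devel')

def iter_project_names():
    """List projects via the bare REST API, following the standard
    entries/next_collection_link pagination of Launchpad collections."""
    url = 'https://api.launchpad.net/devel/projects'
    while url:
        page = requests.get(url).json()
        for entry in page.get('entries', []):
            yield entry['name']
        url = page.get('next_collection_link')

def branches_for(name):
    """Fetch all branches of one project through launchpadlib."""
    return lp.projects[name].getBranches()
```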
May 28 2019
Launchpadlib
Pros
The library is available in Debian stretch.
It is easier and faster to get all the branches of a project, as it returns them in one go, whereas the bare API returns them in a paginated, index-by-index fashion.
May 27 2019
npm is now archived continuously by Software Heritage (see for instance: https://archive.softwareheritage.org/browse/origin/https://www.npmjs.com/package/webpack/).
May 25 2019
I think the only thing missing here is adding the NPM logo to the archive coverage page.
How many are left? Can we close this, as well as T419, now that the PyPI listers/loaders have been in production for a while?
May 24 2019
Following up on what I wrote in the previous comment, I did a bit more research about this.
May 23 2019
As recommended by @olasd, I checked out the Maven Central index (https://repo.maven.apache.org/maven2/.index/); this is a
May 22 2019
Comment by @olasd
May 21 2019
May 20 2019
We've discussed a plausible plan for a "base package manager loader" with @ardumont and, to some extent, @anlambert.
May 17 2019
May 16 2019
As suggested by @olasd, here is what was done in 2015 to ingest packages (sketched in code after the list):
- Create origins for all the folders indiscriminately
- Only import things that look like tarballs (i.e. that end with .tar.something)
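A compact sketch of those two rules, assuming a flat list of paths from the mirror listing (the helper is hypothetical):

```python
def plan_ingestion(paths):
    """Every directory becomes an origin; only .tar.* files are kept."""
    origins, tarballs = [], []
    for path in paths:
        if path.endswith('/'):
            origins.append(path)            # rule 1: folders, indiscriminately
        elif '.tar.' in path.rsplit('/', 1)[-1]:
            tarballs.append(path)           # rule 2: .tar.something only
    return origins, tarballs
```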
@olasd recommended trying the listing approach for the NuGet lister we discussed (fetching the repository key in the API response). As recommended, I tried the approach on a small dataset of 1412 repositories, all of them quite recent. I found 0 repository URLs, and in 900 of them the repository key was empty (i.e. a blank string). I think we need to change our approach.
As discussed on IRC, the source code link is present in very few of the repositories, and the version control system used is not mentioned in the API response.
One way: the repository URL and repository type fields are present in the .nuspec file of each project, so we would have to download that file for each project to get the source URL. The problem is that downloading all binary packages for a small chance of finding a link to a source repository sounds like a lot of work, bandwidth, and computing power for not much gain, and it would only cover one of the ways package maintainers can set the source code information; the aforementioned blog post listed at least four.
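For reference, the repository element in a .nuspec is a small piece of XML metadata; a sketch of extracting it (namespace handling is kept loose because nuspec schema versions differ):

```python
import xml.etree.ElementTree as ET

def repository_info(nuspec_path):
    """Return (type, url) from the optional <repository> element of a
    .nuspec file, or None if the element is absent."""
    root = ET.parse(nuspec_path).getroot()
    for elem in root.iter():
        if elem.tag.rsplit('}', 1)[-1] == 'repository':
            return elem.get('type'), elem.get('url')
    return None
```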
May 15 2019
Yes, for history I do not believe we have an easy answer for SWH (and the world at large) to consume. We may have approximations; I'll check with Gabor and others.
@eddelbuettel yeah, if there isn't a standard way to go all the way back in time, it's OK to currently only ingest what's currently returned as available. In the medium/long term it will converge to having archived everything (w.r.t. the considered time frame) anyway. And we can always retrofit later on stuff that is archived elsewhere. But I wouldn't want to make this a blocker to start archiving what's (easily) listable now.
Nicholas: Sadly, one can't. I kinda/sorta have that implicitly as I have been running CRANberries since 2007 or so.
Expanding on what Dirk Eddelbuettel posted on IRC when we talked about that, a minimal R script can fetch the current package information.
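A sketch of the idea, with the R part embedded in a Python wrapper (as the lister plan further down suggests); the packages.rds URL and the jsonlite dependency are assumptions based on the DB dump discussed below:

```python
import json
import subprocess

R_SNIPPET = (
    "db <- as.data.frame(readRDS(url("
    "'https://cran.r-project.org/web/packages/packages.rds')));"
    "cat(jsonlite::toJSON(db[, c('Package', 'Version')]))"
)

def fetch_cran_packages():
    """Run the short R script and parse its JSON output."""
    out = subprocess.run(['Rscript', '-e', R_SNIPPET],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)
```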
@nahimilega it is probably a two-line script: install R and call readRDS(), and you will get a data.frame object, which is just like a table with columns, from which you can extract what you want. Cheers :). BTW, when I did readRDS it retrieved a lot of links; I don't know much about the lister, but you can pick up from there.
API Documentation -
https://docs.microsoft.com/en-us/nuget/api/catalog-resource#base-url
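A sketch of walking that catalog resource (catalog0/index.json is nuget.org's published catalog index; field names follow the linked documentation):

```python
import requests

CATALOG_INDEX = 'https://api.nuget.org/v3/catalog0/index.json'

def iter_catalog_leaves():
    """Yield (package id, version) for every leaf in the catalog."""
    index = requests.get(CATALOG_INDEX).json()
    for page in index['items']:            # one entry per catalog page
        page_doc = requests.get(page['@id']).json()
        for leaf in page_doc['items']:     # one leaf per package event
            yield leaf.get('nuget:id'), leaf.get('nuget:version')
```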
@olasd I do not have any familiarity with the R language; learning the basics and writing this script would take me around a week. I was wondering whether someone in Software Heritage who has some experience with R could write this script, as it would be a matter of minutes for a person who knows R.
Is it possible to do so?
In T1709#31492, @nahimilega wrote: Here is an implementation plan for making an R-CRAN lister.
I have taken inspiration from the PyPI lister.
To make lister.py for R-CRAN, we need to inherit from the SimpleLister class and override the ingest_data() function, changing its first line (where safely_issue_request() is called) to call a function that runs the R script and returns a JSON response.
After that it is handled much like any normal response; we just need to implement the following functions: list_packages, compute url, get_model_from_repo, task_dict and transport_response_simplified.
May 13 2019
Here is an implementation plan for making an R-CRAN lister.
I have taken inspiration from the PyPI lister.
To make lister.py for R-CRAN, we need to inherit from the SimpleLister class and override the ingest_data() function, changing its first line (where safely_issue_request() is called) to call a function that runs the R script and returns a JSON response.
After that it is handled much like any normal response; we just need to implement the following functions: list_packages, compute url, get_model_from_repo, task_dict and transport_response_simplified.
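A skeleton of that plan; only SimpleLister and the function names listed above come from the comment, while the import path, the origin URL scheme, and the model keys are assumptions:

```python
import json
import subprocess

from swh.lister.core.simple_lister import SimpleLister  # import path assumed


class CRANLister(SimpleLister):

    def safely_issue_request(self, identifier):
        # Instead of issuing an HTTP request, run the R script and
        # parse its JSON output (this realizes the "change its first
        # line" step of the plan above).
        raw = subprocess.check_output(['Rscript', 'list_cran_packages.R'])
        return json.loads(raw)

    def list_packages(self, response):
        # The R script already returns a flat list of package records.
        return response

    def get_model_from_repo(self, repo):
        # Map one R data.frame row to the lister model; the keys below
        # are guesses at the model schema.
        origin_url = 'https://cran.r-project.org/package=%s' % repo['Package']
        return {
            'uid': origin_url,
            'name': repo['Package'],
            'origin_url': origin_url,
            'origin_type': 'cran',
        }
```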
@faux on IRC mentioned that there is a public DB dump (https://cran.r-project.org/web/dbs) which might be helpful for this purpose.
This DB dump contains files with the .rds extension, which is used by the R language. Here are a couple of rows from that dump: https://forge.softwareheritage.org/P396
Apr 19 2019
Apr 18 2019
Apr 5 2019
This is not the case. It is updated every time the ftp directory changes, so you can use the timestamp of the file to see if there have been any changes.
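Assuming the file is served over HTTP, one way to implement that check is to compare its Last-Modified header with the value seen on the previous run (the URL is a placeholder):

```python
import requests

def changed_since(url, last_seen):
    """True if the file's Last-Modified header differs from the one
    recorded on the previous run."""
    head = requests.head(url)
    return head.headers.get('Last-Modified') != last_seen
```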
Apr 4 2019
In T1351#30078, @ardumont wrote: It is updated daily at.
Apr 3 2019
Heads up: there is now a JSON file (compressed) describing the gnu mirror's directory tree.
It is updated daily at.
It's served at [1]
Apr 2 2019
Mar 30 2019
Mar 18 2019
Mar 15 2019
Mar 12 2019
@pombreda on #swh-devel suggested using rsync -r, which seems to provide what we want!
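For instance, invoked from Python (rsync with no destination lists the remote tree instead of copying it; the GNU rsync URL is an assumption about the mirror):

```python
import subprocess

listing = subprocess.run(
    ['rsync', '-r', '--list-only', 'rsync://ftp.gnu.org/gnu/'],
    capture_output=True, text=True, check=True,
).stdout
print('\n'.join(listing.splitlines()[:5]))  # peek at the first entries
```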
Mar 2 2019
Feb 7 2019
The table below summarizes how to list all packages and get their metadata from well-known package managers.
Feb 5 2019
Dec 27 2018
Dec 18 2018
Dec 8 2018
This should probably be split into two tasks:
Dec 3 2018
Nov 27 2018
Nov 26 2018
Nov 22 2018
Nov 16 2018
Oct 29 2018
I had added framagit and 0xacab on Friday but forgot to update the task.
Oct 22 2018
the internship topic on this is now available here: https://wiki.softwareheritage.org/wiki/Ingest_all_Debian_derivatives_(internship)
In T1262#23695, @zack wrote: That's a very good idea, which I'll be happy to draft as a proper internship proposal. Before doing so, however, can you confirm that, scheduling-wise, tracking something like ~100 additional derivatives wouldn't be a problem for us in terms of load?
Oct 18 2018
OK, so I reworked the group_by_exception snippet to produce more sensible output: