
Jun 12 2019

nahimilega added a comment to T1777: Rubygems Lister.

I looked into the data dumps provided on rubygems.org.
RubyGems provides a bash script (link to the script) that downloads the most recent weekly dump listed on https://rubygems.org/pages/data and loads it into a PostgreSQL database.

Jun 12 2019, 12:44 PM · RubyGems lister, Archive coverage
nahimilega added a comment to D1561: Update docker-compose logs command in README.

Output of git remote -v?

Jun 12 2019, 12:27 PM
nahimilega added a comment to D1561: Update docker-compose logs command in README.

I am not able to push this change.

Jun 12 2019, 11:54 AM

Jun 11 2019

nahimilega updated the diff for D1492: CRAN Lister.
  • rebased the branch on master
Jun 11 2019, 5:58 PM
nahimilega created P428 Short report on my work on listers in the S1 Public space.
Jun 11 2019, 12:41 PM
nahimilega committed rDLS7c6245e663e0: swh.lister.gnu: Add function to check for file extension. (authored by nahimilega).
swh.lister.gnu: Add function to check for file extension.
Jun 11 2019, 12:07 PM
nahimilega committed rDLS709ba8a6e55c: swh.lister.gnu: Add functionality to list all the tarballs for a package. (authored by nahimilega).
swh.lister.gnu: Add functionality to list all the tarballs for a package.
Jun 11 2019, 12:07 PM
nahimilega committed rDLSebdb959823bc: swh.lister.gnu : Change download method of tree.json file to request (authored by nahimilega).
swh.lister.gnu : Change download method of tree.json file to request
Jun 11 2019, 12:07 PM
nahimilega committed rDLS151f6cd2235c: swh.lister.gnu (authored by nahimilega).
swh.lister.gnu
Jun 11 2019, 12:07 PM
nahimilega closed T1722: GNU Lister, a subtask of T1351: (periodically) ingest GNU package releases, as Resolved.
Jun 11 2019, 12:07 PM · Archive coverage
nahimilega closed T1722: GNU Lister as Resolved by committing rDLS151f6cd2235c: swh.lister.gnu.
Jun 11 2019, 12:07 PM · Archive coverage
nahimilega closed D1482: GNU Lister.
Jun 11 2019, 12:07 PM
nahimilega added a comment to D1482: GNU Lister.

If you do need to rebase, update the diff nonetheless (prior to pushing) so that Phabricator sees the commits and closes the diff itself.

Jun 11 2019, 12:05 PM
nahimilega updated the diff for D1482: GNU Lister.
  • Cleaned up api_response.json, file_structure.json
Jun 11 2019, 11:45 AM

Jun 9 2019

nahimilega updated the diff for D1482: GNU Lister.
  • Checked for typo and added comment for tarball variable
Jun 9 2019, 12:24 PM

Jun 8 2019

nahimilega added a comment to D1492: CRAN Lister.

@douardda Can you please take a look at it and confirm that I can merge it?

Jun 8 2019, 9:32 PM
nahimilega added a comment to D1492: CRAN Lister.

I ran the lister in docker to test it, and it ran without any problems
(the content of tcran.py is the same 4 lines that are present in the README of the lister).

(swh) archit@work-pc:~/swh-environment/swh-lister$ python tcran.py 
DEBUG:swh.lister.core.lister_base:Loading config from lister_cran
INFO:swh.core.config:Loading config file /home/archit/.config/swh/lister_cran.yml
DEBUG:swh.lister.core.lister_base:<swh.lister.cran.lister.CRANLister object at 0x7f03ea55c518> CONFIG={'content_size_limit': 104857600, 'log_db': 'dbname=softwareheritage-log', 'storage': {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}}, 'scheduler': {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}}, 'lister': {'cls': 'local', 'args': {'db': 'postgresql:///lister-cran'}}, 'credentials': [], 'cache_responses': True, 'cache_dir': '/home/archit/.cache/swh/lister/cran/'}
DEBUG:root:models: 10000
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5002
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5008
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1
DEBUG:root:models: 4372
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1

Here is one of the tasks that was created by the lister:
https://forge.softwareheritage.org/P424

Jun 8 2019, 9:28 PM
nahimilega updated the diff for D1492: CRAN Lister.
  • squashed git commits
  • rebased the branch on master
  • removed methods that were not in use but were written because of abstractattribute
  • Ran the lister in docker
Jun 8 2019, 9:16 PM
nahimilega added a comment to D1482: GNU Lister.

I tested the lister with the new changes in the docker container, and it worked fine. Here is one of the loader tasks it created.

Jun 8 2019, 8:03 PM
nahimilega updated the diff for D1482: GNU Lister.
  • Added a class variable instance to store tarballs and use it in task_dict()
Jun 8 2019, 7:25 PM
nahimilega created P426 cli in the S1 Public space.
Jun 8 2019, 7:11 PM
nahimilega updated the diff for D1482: GNU Lister.
  • Added a class variable instance to store tarballs and use it in task_dict()
Jun 8 2019, 7:07 PM
nahimilega created P425 error in swh scheduler in the S1 Public space.
Jun 8 2019, 5:48 PM
nahimilega committed rDLSf8a2ae866bc2: swh.lister.core: Remove abstractmethod (authored by nahimilega).
swh.lister.core: Remove abstractmethod
Jun 8 2019, 5:39 PM
nahimilega closed D1566: swh.lister.core: Remove abstractmethod.
Jun 8 2019, 5:38 PM
nahimilega created P424 cran loading tasks in the S1 Public space.
Jun 8 2019, 3:05 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 8 2019, 1:45 PM
nahimilega updated the summary of D1566: swh.lister.core: Remove abstractmethod.
Jun 8 2019, 1:31 PM
nahimilega updated the summary of D1566: swh.lister.core: Remove abstractmethod.
Jun 8 2019, 1:31 PM
nahimilega updated subscribers of D1566: swh.lister.core: Remove abstractmethod.
Jun 8 2019, 1:30 PM
nahimilega updated the summary of D1566: swh.lister.core: Remove abstractmethod.
Jun 8 2019, 1:29 PM
Herald added a reviewer for D1566: swh.lister.core: Remove abstractmethod: Reviewers.
Jun 8 2019, 1:28 PM

Jun 7 2019

nahimilega added a comment to D1482: GNU Lister.

One way to avoid including tarballs in the model is to add a class variable named tarballs (like LISTER_NAME or TREE_URL), which would contain all the tarballs of each package and could be accessed from the task_dict() function.
This would eliminate the need to add tarballs to the model.
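The idea above can be sketched roughly as follows. This is a minimal illustration, not the code from D1482: the class name `GNUListerSketch`, the helper `remember_tarballs`, and the `task_dict` signature are all hypothetical, and the shape of the returned dict only mirrors the `load-gnu` task output shown elsewhere in this thread.

```python
class GNUListerSketch:
    """Hypothetical stand-in for the GNU lister; not the real swh.lister class."""

    LISTER_NAME = "gnu"
    # Class-level store shared by all instances: package URL -> list of tarballs.
    tarballs = {}

    def remember_tarballs(self, package_url, found):
        # Hypothetical helper: keep tarballs outside the database model.
        type(self).tarballs[package_url] = found

    def task_dict(self, package_name, package_url):
        # Assumed signature; builds the loader task from the class-level store.
        return {
            "type": "load-" + self.LISTER_NAME,
            "args": [package_name, package_url],
            "kwargs": {"tarballs": type(self).tarballs.get(package_url, [])},
        }


lister = GNUListerSketch()
url = "https://ftp.gnu.org/old-gnu/libiconv/"
lister.remember_tarballs(
    url, [{"date": "985114279", "archive": url + "libiconv-1.6.1.tar.gz"}]
)
task = lister.task_dict("libiconv", url)
print(task)
```

Because the store lives on the class, the model only needs the package name and URL; the trade-off is that the mapping is shared by every instance in the same process.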

Jun 7 2019, 6:31 PM
nahimilega created P422 gnu error in the S1 Public space.
Jun 7 2019, 5:53 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 7 2019, 3:23 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 7 2019, 3:13 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 7 2019, 2:53 PM
Herald added a reviewer for D1561: Update docker-compose logs command in README: Reviewers.
Jun 7 2019, 2:45 PM
nahimilega updated the diff for D1482: GNU Lister.
  • removed print statements
Jun 7 2019, 2:11 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 7 2019, 2:04 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 7 2019, 1:57 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 7 2019, 1:53 PM
nahimilega added a comment to D1482: GNU Lister.

@ardumont I checked it in docker, now it is working fine.

Task 765
  Next run: in 3 months (2019-09-05 11:18:21+00:00)
  Interval: 90 days, 0:00:00
  Type: load-gnu
  Policy: recurring
  Status: next_run_not_scheduled
  Priority: 
  Args:
    'libiconv'
    'https://ftp.gnu.org/old-gnu/libiconv/'
  Keyword args:
    tarballs: [{'date': '985114279', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.6.1.tar.gz'}, {'date': '1054061763', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.1.bin.woe32.zip'}, {'date': '1053376580', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.bin.woe32.zip'}, {'date': '1053376846', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.tar.gz'}]
Jun 7 2019, 1:33 PM
nahimilega updated the diff for D1482: GNU Lister.
  • swh.lister.gnu: Add tarball column in model
Jun 7 2019, 1:28 PM
nahimilega added a comment to D1482: GNU Lister.

I am not able to understand why tarballs under Keyword args: is None. Is there some error in the code?

Jun 7 2019, 12:18 PM
nahimilega added a comment to D1482: GNU Lister.

For the problem mentioned on IRC, I replied to you there; here is my take on this:

20:05 <+ardumont> in general, don't only rely on documentation as this can go out of sync
20:05 <+ardumont> take a look also at the code
20:05 <+ardumont> for the docker-dev, that'd the docker-compose file
20:05 <+ardumont> archit_agrawal[m: ^
20:10 <archit_agrawal[m> ardumont: I will surely take a look at docker-compose file
20:12 <archit_agrawal[m> ardumont: as you told, I amended conf/lister.yml with gnu lister, now how shall I proceed further to sucessfully run the lister
20:28 <archit_agrawal[m> ardumont:  Do I have to run the way mentioned in readme of swh-lister ?
21:45 <archit_agrawal[m> I am trying to run gnu lister in docker . I am getting ModuleNotFoundError: No module named 'psycopg2.errors' error, can anyone please help me.
21:45 <archit_agrawal[m> https://forge.softwareheritage.org/P419
21:45 -- Notice(swhbot): P419 (author: nahimilega): request 400 from scheduler <https://forge.softwareheritage.org/P419>
22:18 <kalpitk[m]> I think 'pip install psycopg2' inside virtual env will be enough
22:34 <archit_agrawal[m> kalpitk: It is already installed in virtual env 
22:36 <+pinkieval> archit_agrawal[m: is the scheduler running in the venv?
22:38 <archit_agrawal[m> pinkieval: yes
22:39 <+pinkieval> can you paste its logs?
22:40 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P420   docker-compose ps outpur
22:40 -- Notice(swhbot): P420 (author: nahimilega): docker-compose ps output <https://forge.softwareheritage.org/P420>
22:42 <+pinkieval> if it's running in docker, then it's not running in the venv
22:42 <+pinkieval> and that's not its logs
22:43 <+pinkieval> "docker-compose logs swh-scheduler-api"
22:43 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P421
22:43 -- Notice(swhbot): P421 (author: nahimilega): scheduler api logs <https://forge.softwareheritage.org/P421>
22:44 <archit_agrawal[m> >and that's not its logs, I sent the previous message before I received this message
22:47 <+pinkieval> hmm, it has no issue referring to psycopg2
22:47 <archit_agrawal[m> pinkieval:  >can you paste its logs?  :I sent the previous message before I received this message
22:47 <+pinkieval> so the error is coming from the unpickling
22:48 <+pinkieval> python -c "import psycopg2.errors"
22:48 <+pinkieval> does this work?
22:49 <archit_agrawal[m> ModuleNotFoundError: No module named 'psycopg2.errors'
22:49 <archit_agrawal[m> No
----
09:38 <+ardumont> archit_agrawal[m: pinkieval: there might be 2 errors involved, one triggering the other
09:39 <+ardumont> the first one being there is probably no scheduler task-type gnu-lister referenced in the scheduler
09:39 <+ardumont> thus, when the lister asks for creating that kind of task, it's not happy about it
09:39 <+ardumont> and then the error we see here about psycopg2.error module not found
09:43 <+ardumont> archit_agrawal[m: prior to triggering your gnu lister task in your docker-env, you need to add the associated task-type
09:43 <+ardumont> swh scheduler task-type add --help
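The `python -c "import psycopg2.errors"` check from the log above can also be scripted, which helps when comparing the virtualenv against the interpreter inside a container. The helper name `can_import` is hypothetical. One possible cause of the failure, not confirmed in this thread, is an older psycopg2 release: the `psycopg2.errors` submodule was introduced in psycopg2 2.8.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if the module can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError.
        return False
    return True


# Run this with the same interpreter that runs the failing service.
print("psycopg2 importable:", can_import("psycopg2"))
print("psycopg2.errors importable:", can_import("psycopg2.errors"))
```

If the first line prints True and the second False, the package is installed but the submodule is missing in that environment.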
Jun 7 2019, 12:02 PM

Jun 6 2019

nahimilega created P421 scheduler api logs in the S1 Public space.
Jun 6 2019, 10:42 PM
nahimilega created P420 docker-compose ps output in the S1 Public space.
Jun 6 2019, 10:39 PM
nahimilega updated the diff for D1482: GNU Lister.
  • Fixed a typo with time_updated
Jun 6 2019, 9:53 PM
nahimilega created P419 request 400 from scheduler in the S1 Public space.
Jun 6 2019, 9:45 PM
nahimilega added a comment to D1482: GNU Lister.

To be clear, I'm fine with the diff now.
As @douardda mentioned for the CRAN lister, I'm waiting for confirmation that the code runs sensibly well in the docker env as well.

Thanks @ardumont. As you mentioned, I followed those steps to run it in docker, but something went wrong with the docker container and I had to reinstall docker entirely on my PC, so I have not been able to test this lister yet. I think I will fix the docker issues on my PC by the end of the day, and then I can try to run this.

Jun 6 2019, 3:05 PM

Jun 5 2019

nahimilega updated the summary of D1441: tutorial: How to run a new lister (within docker-dev).
Jun 5 2019, 5:16 PM
nahimilega added a comment to D1482: GNU Lister.

Maybe you could update D1441 with such information, what do you think?

Jun 5 2019, 5:13 PM
nahimilega updated the diff for D1441: tutorial: How to run a new lister (within docker-dev).

Added an important note in lister tutorial

Jun 5 2019, 4:05 PM
nahimilega added a comment to D1441: tutorial: How to run a new lister (within docker-dev).

Build has FAILED

You need to rebase your change on master

Jun 5 2019, 3:54 PM
nahimilega updated the diff for D1441: tutorial: How to run a new lister (within docker-dev).

Wrapped the line within 80 characters

Jun 5 2019, 3:43 PM
nahimilega added a comment to T1777: Rubygems Lister.

On further investigation, I found out there are data dumps provided on rubygems.org
https://rubygems.org/pages/data
This could be used to get the list of all the packages.

Jun 5 2019, 12:27 PM · RubyGems lister, Archive coverage
nahimilega added a comment to D1482: GNU Lister.

Amending the conf/lister.yml file to add the entries:

celery:
  task_broker: amqp://guest:guest@amqp//
  task_modules:
    ...
    - swh.lister.gnu.tasks
  task_queues:
    ...
    - swh.lister.gnu.tasks.GNUListerTask

Thanks @ardumont. This part is not present in any documentation. I guess we can add a section on how to run a lister in the docker environment under the lister tutorial section.

Jun 5 2019, 10:48 AM
nahimilega created P415 Permission denied error. in the S1 Public space.
Jun 5 2019, 10:42 AM

Jun 4 2019

nahimilega added a comment to T1776: packagist (PHP) Lister.

OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from GitHub.

Jun 4 2019, 11:07 PM · Lister, Archive coverage
nahimilega updated the diff for D1482: GNU Lister.
  • Changed list_of_tarballs to tarballs.
Jun 4 2019, 3:27 PM
nahimilega added a comment to T1776: packagist (PHP) Lister.

We have provisions in the scheduler API to deduplicate tasks, so that creating a new git loading task for an existing origin wouldn't duplicate the task but rather just reference the existing one.

Jun 4 2019, 3:15 PM · Lister, Archive coverage
nahimilega added a comment to T1776: packagist (PHP) Lister.
In T1776#32892, @olasd wrote:

There is a total of 220570 packages.
I made a short script to analyse the VCS and code hosting platform for packages, here is the result.

Packages iterated - 15126 (~ 6% of total packages)
VCS found - git, hg
Number of packages that were not hosted on GitHub, Bitbucket or GitLab: 51

Extrapolating proportionally from this sample, around 744 of all the packages are expected to be hosted somewhere other than GitHub, Bitbucket or GitLab.
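The estimate above is a simple proportional extrapolation from the sample to the full package count. A minimal sketch of the arithmetic, using only the figures quoted in this comment and assuming the sampled packages are representative of the whole index:

```python
total_packages = 220570      # total packages reported
sampled_packages = 15126     # packages iterated by the script
not_on_major_hosts = 51      # sampled packages outside the three major hosts

# Scale the sample count up by the ratio of total to sampled packages.
estimate = not_on_major_hosts / sampled_packages * total_packages
sample_share = sampled_packages / total_packages

print(f"sample covers {sample_share:.1%} of all packages")
print(f"estimated packages hosted elsewhere: {round(estimate)}")
```

The result is only as good as the sampling: if the iterated packages were, say, the oldest or the most popular ones, the real figure could differ noticeably.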

In your analysis, did you look at the source key (which seems to represent the upstream version control repository) or the dist key (which points at the tarball/zipfile that's actually downloaded by the package manager when installing that version of the package)?

Jun 4 2019, 3:14 PM · Lister, Archive coverage
nahimilega added a comment to D1482: GNU Lister.

In the meantime, have you tried running this through the swh-docker-dev environment already?

I haven't tried it in the swh-docker-dev environment yet; I am still struggling with some overrides to make the new lister show up in the task list. I will keep trying to figure out how to run it.

If that's blocking, do not hesitate to ask questions on IRC in that regard.

Jun 4 2019, 3:05 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 4 2019, 3:03 PM
nahimilega added a comment to D1482: GNU Lister.

In the meantime, have you tried running this through the swh-docker-dev environment already?

I haven't tried it in the swh-docker-dev environment yet; I am still struggling with some overrides to make the new lister show up in the task list. I will keep trying to figure out how to run it.

Jun 4 2019, 3:00 PM
nahimilega updated the diff for D1482: GNU Lister.
  • Arranged conftest.py in core and README in alphabetical order
Jun 4 2019, 1:03 PM

Jun 3 2019

nahimilega added a comment to T1776: packagist (PHP) Lister.

One of the biggest challenges in implementing this lister is deduplicating the links it lists: most of the packages are hosted on GitHub, Bitbucket or GitLab, and a large portion of those sites is already listed and ingested.
Hence, while listing the packages in Packagist, we need to remove all the packages that are already listed by another lister.
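One rough way to approach this filtering is to compare each repository URL's host against the forges that already have listers. The sketch below is only an illustration under assumptions: `KNOWN_HOSTS`, `hosted_elsewhere` and the sample URLs are hypothetical, and a real implementation would need to agree with how origins are actually stored in the archive.

```python
from urllib.parse import urlparse

# Assumed set of forges that are already covered by other listers.
KNOWN_HOSTS = {"github.com", "bitbucket.org", "gitlab.com"}


def hosted_elsewhere(repo_urls):
    """Keep only the URLs whose host is not one of the known forges.

    Note: scp-style URLs (git@host:path) have no parsable hostname here
    and are therefore kept; they would need separate handling.
    """
    kept = []
    for url in repo_urls:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host not in KNOWN_HOSTS:
            kept.append(url)
    return kept


sample = [
    "https://github.com/example/package-a",
    "https://www.gitlab.com/example/package-b.git",
    "https://git.example.org/team/package-c.git",
]
print(hosted_elsewhere(sample))
```

Filtering by host is cheap but coarse; it does not check whether a given repository on a known forge has actually been ingested.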

Jun 3 2019, 8:17 PM · Lister, Archive coverage
nahimilega updated the diff for D1482: GNU Lister.
  • Arranged conftest.py in core in alphabetical order
Jun 3 2019, 7:25 PM
nahimilega updated the diff for D1492: CRAN Lister.
  • Change docstring and made conftest.py in core ordered alphabetically
Jun 3 2019, 7:15 PM
nahimilega added inline comments to D1482: GNU Lister.
Jun 3 2019, 6:50 PM
nahimilega updated the diff for D1482: GNU Lister.
  • swh.lister.gnu: Add function to check for file extension.
Jun 3 2019, 6:42 PM
nahimilega retitled D1492: CRAN Lister from Implemented CRAN Lister to CRAN Lister.
Jun 3 2019, 4:10 PM
nahimilega updated the diff for D1482: GNU Lister.
  • swh.lister.gnu: Add function to check for file extension.
Jun 3 2019, 1:14 PM

Jun 2 2019

nahimilega updated the task description for T1777: Rubygems Lister.
Jun 2 2019, 8:28 PM · RubyGems lister, Archive coverage
nahimilega created P413 Short snipit of the output of gem buildin API in the S1 Public space.
Jun 2 2019, 8:23 PM
nahimilega triaged T1777: Rubygems Lister as Normal priority.
Jun 2 2019, 7:23 PM · RubyGems lister, Archive coverage
nahimilega added a comment to T1724: Maven Central repository support.

I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html

Jun 2 2019, 2:39 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage

May 31 2019

nahimilega added a comment to T1724: Maven Central repository support.

Creating Your Own Mirror
The size of the central repository is increasing steadily. To save us bandwidth and you time, mirroring the entire central repository is not allowed. (Doing so will get you automatically banned.) Instead, we suggest you set up a repository manager as a proxy.

May 31 2019, 7:21 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega added a comment to T1724: Maven Central repository support.

It is not recommended that you scrape or rsync:// a full copy of central as there is a large amount of data there and doing so will get you banned. You can use a program such as those described on the Repository Management page to run your internal repository's server, download from the internet as required, and then hold the artifacts in your internal repository for faster downloading later.

May 31 2019, 4:47 PM · Maven loader, Maven lister, GSoC 2019, Archive coverage
nahimilega added inline comments to D1482: GNU Lister.
May 31 2019, 12:29 PM
nahimilega added a comment to T1722: GNU Lister.

In my view, the best way to check for a tarball is to split the filename on "." and check whether the part between the last and second-to-last "." is "tar". If it is "tar", the file is useful; otherwise the file does not contain source code.
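The rule described above can be sketched in a few lines. The function name `is_tarball` is hypothetical and this is not necessarily what ended up in the lister; the filenames come from the libiconv task output posted in D1482.

```python
def is_tarball(filename: str) -> bool:
    """Return True if the second-to-last dot-separated part is 'tar'."""
    parts = filename.split(".")
    # Need at least a name, "tar", and a compression suffix (e.g. x.tar.gz).
    return len(parts) >= 3 and parts[-2] == "tar"


for name in ("libiconv-1.6.1.tar.gz", "libiconv-1.9.1.bin.woe32.zip"):
    print(name, is_tarball(name))
```

As written, the rule misses uncompressed `.tar` files and single-suffix names such as `.tgz`, and it treats `.zip` archives as not useful, so it is a heuristic rather than a complete check.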

May 31 2019, 12:27 PM · Archive coverage
nahimilega updated subscribers of T1776: packagist (PHP) Lister.

There is a total of 220570 packages.
I made a short script to analyse the VCS and code hosting platform for packages, here is the result.

May 31 2019, 12:14 PM · Lister, Archive coverage
nahimilega triaged T1776: packagist (PHP) Lister as Normal priority.
May 31 2019, 12:04 PM · Lister, Archive coverage

May 30 2019

nahimilega updated the diff for D1482: GNU Lister.
  • Add instance variable and rebased it on latest master
May 30 2019, 5:46 PM
nahimilega added a comment to T1722: GNU Lister.

Here are the extensions which have tar in their name

May 30 2019, 5:29 PM · Archive coverage
nahimilega added a comment to T1722: GNU Lister.

Here is the list of all the different extensions present on the GNU website, with a link to one example of each. The only way I found to determine the extensions is the one mentioned here https://stackoverflow.com/a/35188296/10424705 , but because many GNU packages use . in filenames to denote version numbers, there is no reliable way to identify every extension uniquely. Although I have reduced the redundancy, you may still find some extensions appearing more than once.

May 30 2019, 5:00 PM · Archive coverage
nahimilega updated the diff for D1482: GNU Lister.
  • reworked git commits
May 30 2019, 2:00 PM
nahimilega updated the diff for D1482: GNU Lister.
  • reworked git commits
May 30 2019, 12:58 PM
nahimilega added a comment to D1482: GNU Lister.

Can you please:

  • try to reduce the samples though. There is too much data (painful to review and to maintain).

Here I am assuming samples means the JSON content in the api_response.json file. Am I correct?

May 30 2019, 11:29 AM
nahimilega added inline comments to D1482: GNU Lister.
May 30 2019, 11:22 AM

May 29 2019

nahimilega updated the diff for D1441: tutorial: How to run a new lister (within docker-dev).
  • Change "enable tox" to "register celery task"
May 29 2019, 6:56 PM
nahimilega updated the diff for D1492: CRAN Lister.
  • Add instance parameter according to unified credentials structure between listers.
May 29 2019, 6:41 PM
nahimilega updated the diff for D1482: GNU Lister.
  • Add test cases for find_tarball and remove_unnecessary_directories function
  • Add instance parameter according to unified credentials structure between listers
May 29 2019, 4:44 PM
nahimilega closed T1774: Create a lister for x.org as Invalid.
May 29 2019, 12:46 PM · Archive coverage
nahimilega added a comment to T1774: Create a lister for x.org.

@zack I agree that archiving https://www.x.org/releases/individual/ is virtually not required because it is a git repo. However, I was concerned about archiving tarballs of other projects which are only present on https://www.x.org/releases/ like x.org/releases/X11R6.8.0/.
However, as you mention about

May 29 2019, 12:45 PM · Archive coverage
nahimilega added a comment to T1734: Create a Lister for launchpad.net.

In my view, we can combine the best of both options to make the lister.
We can use the bare API to list the projects and then use launchpadlib to get all the branches for a project.
This way, we get the indexing quality of the bare API and the simplicity of launchpadlib.

May 29 2019, 10:22 AM · Lister, Archive coverage
nahimilega triaged T1774: Create a lister for x.org as Normal priority.
May 29 2019, 12:01 AM · Archive coverage

May 28 2019

nahimilega added inline comments to D1482: GNU Lister.
May 28 2019, 11:41 PM
nahimilega updated the task description for T1734: Create a Lister for launchpad.net.
May 28 2019, 8:36 PM · Lister, Archive coverage