
Aug 2 2018

zack added a comment to T421: PyPI loader.

The PyPI API already provides quite a lot of information (see P288 and P289 for examples).
For now, the current implementation leverages it.
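
As a rough illustration of what the API exposes, the sketch below walks a project document shaped like the PyPI JSON API response (`https://pypi.org/pypi/<project>/json`) and lists its source tarballs. The `iter_sdists` helper and the sample data are hypothetical and written for illustration only; this is not the loader's actual code.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_sdists(project: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (version, filename, url) for every source distribution.

    Versions are visited in plain string order, which is enough for a
    sketch but is not a proper version-aware sort.
    """
    for version, files in sorted(project.get("releases", {}).items()):
        for entry in files:
            if entry.get("packagetype") == "sdist":
                yield version, entry["filename"], entry["url"]


# Hand-written sample shaped like a PyPI JSON response (not real data).
sample = {
    "info": {"name": "example"},
    "releases": {
        "1.0": [
            {
                "packagetype": "sdist",
                "filename": "example-1.0.tar.gz",
                "url": "https://files.example.org/example-1.0.tar.gz",
            },
            {
                "packagetype": "bdist_wheel",
                "filename": "example-1.0-py3-none-any.whl",
                "url": "https://files.example.org/example-1.0-py3-none-any.whl",
            },
        ],
        "1.1": [],  # a release with no uploaded files
    },
}

sdists = list(iter_sdists(sample))
```

Each tuple found this way could then be handed to a tarball loader, one visit per release.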

Aug 2 2018, 3:45 PM · PyPI loader, Origin-Pypi
zack updated subscribers of T421: PyPI loader.

The basic loader will be the tarball loader, yes. In addition to that there are two aspects to be defined:

  1. the stack of objects to be added to the DAG
  2. the metadata to extract
Aug 2 2018, 2:37 PM · PyPI loader, Origin-Pypi

Jul 29 2018

zack added a comment to D398: [WIP] "packing" object storage design documentation.

First of all, thanks for this design document. I have read through it (although I have not verified the details of offsets, sizes, etc., mind you :-)) and it looks reasonable to me. A few questions/comments below:

Jul 29 2018, 1:52 PM · Object storage
zack added a comment to T1161: SVN loader: Create local dump of remote repository to speed up loading task.

Good catch!

Jul 29 2018, 8:19 AM · SVN Loader

Jul 26 2018

zack added a comment to T1158: hg loader: Clean up wrong snapshots/releases during hg loading of googlecode.

@zack or @olasd if you have some time to review P286 at one point in time, that would be awesome.

Jul 26 2018, 6:18 PM · Archive content

Jul 25 2018

zack added a comment to T960: draft specs for deposit with incomplete tarball.

Regarding implementation, there are no plans to implement it on the horizon; it is something to consider for the priority/yearly planning.
I can also open a review documentation subtask.

Jul 25 2018, 10:29 AM · SWORD deposit, Directory loader
zack added a comment to T422: PyPI lister.

If your consumer is actually an organization or service that will be downloading a lot of packages from PyPI, consider using your own index mirror or cache.

That's not a sustainable way. If we choose that path for all the forges we need to archive... that will be difficult in terms of infrastructure and maintenance.
Jul 25 2018, 10:28 AM · Developers, Origin-Pypi
zack added a comment to T420: mirror PyPI.

better LWN link to the actual article covering this: https://lwn.net/Articles/751458/

Jul 25 2018, 10:25 AM · Origin-Pypi
zack added a comment to T420: mirror PyPI.

Looking at the FAQ [4], they also (now?) recommend bandersnatch. Quoting it:

Jul 25 2018, 10:25 AM · Origin-Pypi

Jul 23 2018

zack added a comment to T960: draft specs for deposit with incomplete tarball.

I'll be AFK for a while, so I can't check the diff, but if you (@moranegg) can point me to the current version (on docs.s.o?, if it's deployed), I'll be happy to have a look before it's implemented.

Jul 23 2018, 9:01 PM · SWORD deposit, Directory loader

Jul 20 2018

zack added a comment to T336: "save code now".

E.g., you don't "schedule" the addition of an entire forge as a single task,

Yes, there are 2 tasks for now (incremental, full), but if we also hide that detail within T1157, then that could be a win, I think ;)

Jul 20 2018, 10:08 PM · General
zack updated subscribers of T336: "save code now".

Is adding a supported forge (e.g., a GitLab instance) considered a possible save now request?

Jul 20 2018, 5:50 PM · General
zack added inline comments to D395: swh-loader-mercurial: Fix invalid release target and add missing data.
Jul 20 2018, 5:46 PM
zack added a project to T1157: Generic scheduler task creation according to task type: Scheduling utilities.
Jul 20 2018, 5:23 PM · Scheduling utilities

Jul 19 2018

zack added a comment to T1155: Mercurial loader: release target is invalid.

Thanks for spotting. We also need a separate task to correct the
revisions that were already loaded in the archive. Can you please file
it? (tag "archive content")

Jul 19 2018, 4:09 PM · Mercurial loader
zack added a comment to T1153: deposit: Keep raw metadata received.

Good idea!

Jul 19 2018, 1:25 PM · SWORD deposit (2018-09-25-HAL-opening)
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.

With that, I do think it is important to have the metadata accessible, and to keep in mind that with the contextual URL used by HAL, the metadata is easily found!

Jul 19 2018, 1:24 PM · SWORD deposit (2018-09-25-HAL-opening)
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.
  • nobody (not even HAL, or any other depositor, including the ones concerned by the compliance use cases) can recompute this swh-id SR independently from us, because it depends not only on the metadata added, but also on the particular mangling of this metadata done during ingestion, which may well change over time. Providing only SR as a swh-id for such a deposit makes it impossible for somebody who has a copy of the same code and an article mentioning the swh-id SR to check that the code is the same without accessing SWH. That would make us a middle man, and for our long-term strategy we do not want middle men, not even us.
Jul 19 2018, 1:14 PM · SWORD deposit (2018-09-25-HAL-opening)
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.

TL;DR: by ingesting a revision and not returning its ID, we will have a protocol that — at the protocol level — loses information, and that is a bad idea.

Jul 19 2018, 1:19 AM · SWORD deposit (2018-09-25-HAL-opening)
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.

It is essential for reproducibility that the swh-id offered to researchers
to reference a deposited piece of software depend only on the software
deposited itself: if three papers use the same software tree, they must
show the same swh-id, no matter whether this software tree has been
deposited once, twice, or three times.

In the case of .zip/.tar files this is the swh-id of the root directory,
not the swh-id of the synthetic commit.
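
The point can be illustrated with a minimal sketch, assuming git-style hashing for contents and directories and restricted to a flat directory of regular files. The function names, the sample tree, and in particular `synthetic_revision_id` (a hypothetical stand-in for any hash that also covers deposit metadata) are assumptions made for illustration, not the swh-model implementation.

```python
import hashlib
from typing import Dict


def content_id(data: bytes) -> bytes:
    """Git-compatible blob hash: SHA1 over a 'blob <len>' header, NUL, data."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).digest()


def flat_directory_id(files: Dict[bytes, bytes]) -> bytes:
    """Git-compatible tree hash for a directory of regular files only.

    Entries are sorted by name and stored as '<mode> <name>', NUL, then
    the raw 20-byte content hash. Subdirectories, symlinks and
    executable bits are deliberately left out of this sketch.
    """
    body = b"".join(
        b"100644 " + name + b"\x00" + content_id(data)
        for name, data in sorted(files.items())
    )
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).digest()


def synthetic_revision_id(directory: bytes, metadata: bytes) -> bytes:
    """Hypothetical stand-in: a hash covering the tree AND deposit metadata."""
    return hashlib.sha1(b"revision\x00" + directory + metadata).digest()


tree = {b"README": b"hello\n", b"main.py": b"print('hi')\n"}

# Two independent deposits of the same tree give the same directory id...
dir_first = flat_directory_id(tree)
dir_second = flat_directory_id(dict(tree))

# ...but the synthetic revision changes with the deposit metadata.
rev_first = synthetic_revision_id(dir_first, b"deposit 1, 2018-07-18")
rev_second = synthetic_revision_id(dir_second, b"deposit 2, 2018-07-19")

swh_id = "swh:1:dir:" + dir_first.hex()
```

Anyone holding a copy of the tree can recompute the directory identifier locally, whereas the revision identifier depends on metadata that only the archive knows.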

Jul 19 2018, 1:14 AM · SWORD deposit (2018-09-25-HAL-opening)

Jul 18 2018

zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.

Can you (and/or @rdicosmo) elaborate on the rationale for this?

Jul 18 2018, 6:02 PM · SWORD deposit (2018-09-25-HAL-opening)

Jul 17 2018

zack added a comment to T1137: Deploy gitlab instance lister to infra.

great, thanks for working on this!

Jul 17 2018, 10:04 PM · Origin-GitLab
zack committed rMSLDc8c82009f870: check in OSCON 2018 slides (authored by zack).
check in OSCON 2018 slides
Jul 17 2018, 2:16 PM

Jul 12 2018

zack added a project to T1126: Move away non-gunicorn services from banco: System administration.
Jul 12 2018, 4:23 PM · System administration
zack committed rDMODad2c349864aa: refactor CLI tests to avoid duplicate assertion pairs (authored by zack).
refactor CLI tests to avoid duplicate assertion pairs
Jul 12 2018, 4:22 PM
zack closed T1135: swh-identify: follow symlink by default for paths given as args as Resolved by committing rDMOD07208f047d18: swh-identify: follow symlinks for CLI arguments (by default).
Jul 12 2018, 4:22 PM · Data Model
zack committed rDMODabffb2255753: cli.py: prefer os.fsdecode() over manual fiddling with locale.getpref... (authored by zack).
cli.py: prefer os.fsdecode() over manual fiddling with locale.getpref...
Jul 12 2018, 4:22 PM
zack committed rDMOD07208f047d18: swh-identify: follow symlinks for CLI arguments (by default) (authored by zack).
swh-identify: follow symlinks for CLI arguments (by default)
Jul 12 2018, 4:22 PM
zack triaged T1126: Move away non-gunicorn services from banco as Normal priority.
Jul 12 2018, 3:42 PM · System administration
zack committed rDMOD89f8d114b4f9: swh-identify: add support for passing multiple CLI arguments (authored by zack).
swh-identify: add support for passing multiple CLI arguments
Jul 12 2018, 3:32 PM
zack closed T1134: swh-identify: support multiple path arguments as Resolved by committing rDMOD89f8d114b4f9: swh-identify: add support for passing multiple CLI arguments.
Jul 12 2018, 3:32 PM · Data Model
zack closed T1133: swh-identify: show filename in output as Resolved by committing rDMODf53989093669: swh-identify: show filename in output (by default).
Jul 12 2018, 3:01 PM · Data Model
zack committed rDMODf53989093669: swh-identify: show filename in output (by default) (authored by zack).
swh-identify: show filename in output (by default)
Jul 12 2018, 3:01 PM
zack closed T1133: swh-identify: show filename in output, a subtask of T1136: swh-identify: support recursive checksumming of directories, as Resolved.
Jul 12 2018, 3:01 PM · Data Model
zack added a parent task for T1133: swh-identify: show filename in output: T1136: swh-identify: support recursive checksumming of directories.
Jul 12 2018, 2:19 PM · Data Model
zack added a subtask for T1136: swh-identify: support recursive checksumming of directories: T1133: swh-identify: show filename in output.
Jul 12 2018, 2:19 PM · Data Model
zack triaged T1136: swh-identify: support recursive checksumming of directories as Normal priority.
Jul 12 2018, 2:19 PM · Data Model
zack triaged T1135: swh-identify: follow symlink by default for paths given as args as Normal priority.
Jul 12 2018, 2:16 PM · Data Model
zack created T1135: swh-identify: follow symlink by default for paths given as args.
Jul 12 2018, 2:16 PM · Data Model
zack updated the task description for T1133: swh-identify: show filename in output.
Jul 12 2018, 2:04 PM · Data Model
zack triaged T1134: swh-identify: support multiple path arguments as Normal priority.
Jul 12 2018, 2:02 PM · Data Model
zack triaged T1133: swh-identify: show filename in output as Normal priority.
Jul 12 2018, 2:00 PM · Data Model

Jun 28 2018

zack committed rDSNIP3d22648a68a9: sql/swh-graph: add driver script to re-launch (authored by zack).
sql/swh-graph: add driver script to re-launch
Jun 28 2018, 4:59 PM
zack committed rDSNIP08a41c8df48d: sql/swh-graph: update script to take snapshots in accounts (authored by zack).
sql/swh-graph: update script to take snapshots in accounts
Jun 28 2018, 4:59 PM
zack updated subscribers of T1123: refuse deposit submissions that contains a single archive file (within the deposit archive).
Jun 28 2018, 10:33 AM · SWORD deposit
zack triaged T1123: refuse deposit submissions that contains a single archive file (within the deposit archive) as Normal priority.
Jun 28 2018, 10:33 AM · SWORD deposit
zack renamed T1122: properly handle ingestion of archives within archives (recursive extraction) from Decide how to handle software deposits containing double archive wrapping to properly handle ingestion of archives within archives (recursive extraction).
Jun 28 2018, 10:31 AM · General
zack triaged T1122: properly handle ingestion of archives within archives (recursive extraction) as Normal priority.

The general problem (see below for the deposit-specific case) is indeed complex to deal with (both conceptually in a pure Merkle setting and practically due to the existence of zip bombs). I think a workable solution might be to ingest the archive as is and also ingest a separate directory corresponding to the archive content, with some metadata linking the two. That way, by default we will only return what we have ingested (without recursion), but we will offer ways to dig in recursively, e.g., in the web app. There will be plenty of devils in plenty of details for this though.

Jun 28 2018, 10:31 AM · General

Jun 27 2018

zack triaged T1121: save code now API entry point as High priority.
Jun 27 2018, 10:55 AM · Web app
zack closed T940: Cannot ssh to the Unibo test VM after reboot as Resolved.
Jun 27 2018, 8:27 AM · System administration
zack triaged T1021: SWORD deposit of metadata about an existing SWH object as Normal priority.
Jun 27 2018, 8:26 AM · Core Loader, SWORD deposit
zack triaged T1100: Caches should be cleared when deploying the webapp as Normal priority.
Jun 27 2018, 8:25 AM · Web app, System administration
zack changed the visibility for F3171354: adblock-whitelist.png.
Jun 27 2018, 8:24 AM
zack added a comment to T1120: save code now moderation UI.

just as an idea for the UI for whitelist/blacklist URL patterns, here is what adblock does, which is quite nice:

Jun 27 2018, 8:24 AM · Web app
zack renamed T1119: save code now submission form from save origin now web form to save code now submission form.
Jun 27 2018, 8:22 AM · Web app
zack renamed T1120: save code now moderation UI from save origin now moderation UI to save code now moderation UI.
Jun 27 2018, 8:22 AM · Web app
zack updated the task description for T1119: save code now submission form.
Jun 27 2018, 8:21 AM · Web app
zack edited projects for T1119: save code now submission form, added: Web app; removed General.
Jun 27 2018, 8:20 AM · Web app
zack triaged T1120: save code now moderation UI as High priority.
Jun 27 2018, 8:20 AM · Web app
zack triaged T1119: save code now submission form as High priority.
Jun 27 2018, 8:15 AM · Web app
zack edited projects for T336: "save code now", added: General; removed Web app.

I've generalized the title of this task, will add sub-tasks for the specific features that are still missing to complete this.

Jun 27 2018, 8:11 AM · General
zack renamed T336: "save code now" from "save origin now" form to "save code now".
Jun 27 2018, 8:11 AM · General
zack triaged T1022: SWORD deposit requesting to save content existing on an external code hosting platform as Normal priority.
Jun 27 2018, 8:09 AM · Core Loader, SWORD deposit

Jun 26 2018

zack added a comment to T1118: browse: add identifiers resolution in search form.

I agree we need a more user friendly way of resolving IDs. (And, in passing, I think we also need an API method /resolve for programmatically resolving PIDs.)
But rather than adding a separate search form, I think we should generalize the current one, to be a Google-style, catch-all search box.

Jun 26 2018, 11:54 AM · Web app
zack added a comment to T1115: Improve error messages when resolving PURLs containing a broken/incorrect origin.

[ Aside on the actual bug here. @rdicosmo can you change the edit policy of this task to "public"? It's the default and it's generally the right one, as it allows people to do things like change task tags and the like. ]

Jun 26 2018, 9:38 AM · Web app

Jun 25 2018

zack closed T1114: question: where is the API documentation repository? as Invalid.

it's here https://forge.softwareheritage.org/source/swh-web/

Jun 25 2018, 4:05 PM

Jun 21 2018

zack added inline comments to D346: identifiers: Make invalid persistent identifier parsing raise error.
Jun 21 2018, 12:08 PM
zack added inline comments to D312: Fix scheduler listener on buster's celery version (4.1.0-4).
Jun 21 2018, 11:33 AM
zack accepted D347: Update blake2 support to be less Debian-specific.
Jun 21 2018, 11:31 AM
zack requested changes to D346: identifiers: Make invalid persistent identifier parsing raise error.
Jun 21 2018, 11:29 AM

Jun 20 2018

zack added a comment to T1104: parse_persistent_identifier() should raise a parsing exception on invalid identifiers.

I recall some remarks about the persistent identifier representation being too simple or something.

I don't know what's wrong with that simple representation as:

  • everyone can manipulate dict
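
A minimal sketch of how the two positions could be reconciled: keep the plain dict as the parsed representation, but raise a dedicated exception on malformed input. The names `parse_pid_sketch` and `InvalidIdentifier` are hypothetical, and the sketch ignores any optional qualifiers after the hash; it is not the swh-model `parse_persistent_identifier()` code.

```python
import re
from typing import Dict

# Core identifier syntax: swh:<version>:<object type>:<40 hex digits>.
_PID_RE = re.compile(
    r"swh:(?P<version>1):(?P<type>cnt|dir|rev|rel|snp):(?P<hash>[0-9a-f]{40})"
)


class InvalidIdentifier(ValueError):
    """Raised when a string is not a well-formed persistent identifier."""


def parse_pid_sketch(text: str) -> Dict[str, str]:
    """Return the identifier's parts as a plain dict, or raise."""
    match = _PID_RE.fullmatch(text)
    if match is None:
        raise InvalidIdentifier(f"invalid persistent identifier: {text!r}")
    return match.groupdict()


parsed = parse_pid_sketch("swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
```

Callers still get a dict they can manipulate freely, while typos such as a wrong prefix or a truncated hash fail loudly instead of producing a partial result.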
Jun 20 2018, 12:22 PM · Data Model

Jun 19 2018

zack edited projects for T682: Ingest Google Code Mercurial repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:30 PM · Archive coverage, Mercurial loader
zack edited projects for T592: ingest bitbucket git repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:29 PM · Archive coverage, Origin-Bitbucket
zack edited projects for T561: ingest bitbucket (meta task), added: Archive coverage; removed Archive content.
Jun 19 2018, 3:29 PM · Archive coverage, Origin-Bitbucket
zack edited projects for T419: ingest PyPI into the Software Heritage archive (meta task), added: Archive coverage; removed Archive content.
Jun 19 2018, 3:29 PM · Archive coverage, Origin-Pypi
zack edited projects for T376: ingest git.eclipse.org repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:29 PM · Archive coverage
zack edited projects for T593: ingest bitbucket hg/mercurial repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:28 PM · Archive coverage, Origin-Bitbucket
zack edited projects for T367: ingest Google Code repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:28 PM · Archive coverage, Restricted Project
zack edited projects for T617: ingest Google Code Subversion repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:28 PM · Archive coverage, Origin-GoogleCode, SVN Loader
zack edited projects for T1002: ingest Hackage, the Haskell package repository (meta task), added: Archive coverage; removed Archive content, General.
Jun 19 2018, 3:27 PM · Hackage loader, Hackage lister, Archive coverage
zack edited projects for T1086: ingest Debian's Alioth (archived) repositories (meta-task), added: Archive coverage; removed Archive content, General.
Jun 19 2018, 3:27 PM · Archive coverage
zack edited projects for T312: Gitorious import: ingest repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:27 PM · Archive coverage, Restricted Project, Origin-Gitorious, Format-Git
zack edited projects for T673: ingest Google Code Git repositories, added: Archive coverage; removed Archive content.
Jun 19 2018, 3:27 PM · Archive coverage
zack edited projects for T1111: ingest GitLab.com (meta-task), added: Archive coverage; removed Archive content.
Jun 19 2018, 3:27 PM · Archive coverage, General, Origin-GitLab
zack created Archive coverage.
Jun 19 2018, 3:24 PM
zack edited Description on Archive content.
Jun 19 2018, 3:22 PM
zack added a subtask for T1111: ingest GitLab.com (meta-task): T989: Implement GitLab lister.
Jun 19 2018, 3:21 PM · Archive coverage, General, Origin-GitLab
zack added a parent task for T989: Implement GitLab lister: T1111: ingest GitLab.com (meta-task).
Jun 19 2018, 3:21 PM · Origin-GitLab
zack triaged T1111: ingest GitLab.com (meta-task) as High priority.
Jun 19 2018, 3:21 PM · Archive coverage, General, Origin-GitLab
zack assigned T989: Implement GitLab lister to ardumont.
Jun 19 2018, 3:18 PM · Origin-GitLab
zack triaged T1110: document GitHub caseness caveats as Low priority.
Jun 19 2018, 2:51 PM · GitHub lister, Documentation
zack triaged T1109: web app: nicer error message about non-existent origins as Low priority.
Jun 19 2018, 2:37 PM · Web app
zack raised the priority of T757: Memory leak in swh.storage.api.server from Normal to High.
Jun 19 2018, 2:34 PM · Storage manager
zack committed rDMOD0d5bc1774829: add swh-identify CLI tool to compute persistent identifiers (authored by zack).
add swh-identify CLI tool to compute persistent identifiers
Jun 19 2018, 10:31 AM
zack closed D345: add swh-identify CLI tool to compute persistent identifiers.
Jun 19 2018, 10:31 AM
zack closed T1039: add swh-model CLI front-end to compute persistent identifiers as Resolved by committing rDMOD0d5bc1774829: add swh-identify CLI tool to compute persistent identifiers.
Jun 19 2018, 10:31 AM · Data Model

Jun 18 2018

zack added a comment to T1105: Web app: Remove duplicated api documentation.

yes! +1 :-)

Jun 18 2018, 1:06 PM · Web app

Jun 16 2018

zack changed the status of T1039: add swh-model CLI front-end to compute persistent identifiers from Open to Work in Progress.
Jun 16 2018, 10:46 PM · Data Model
Herald added a reviewer for D345: add swh-identify CLI tool to compute persistent identifiers: Reviewers.
Jun 16 2018, 10:42 PM
zack added a revision to T1039: add swh-model CLI front-end to compute persistent identifiers: D345: add swh-identify CLI tool to compute persistent identifiers.
Jun 16 2018, 10:42 PM · Data Model