In T421#21642, @ardumont wrote:
Aug 2 2018
The basic loader will be the tarball loader, yes. In addition to that there are two aspects to be defined:
- the stack of objects to be added to the DAG
- the metadata to extract
Jul 29 2018
First of all, thanks for this design document. I have read through it (though I have not verified the details of offsets, sizes, etc., mind you :-)) and it looks reasonable to me. A few questions/comments below:
zack added a comment to T1161: SVN loader: Create local dump of remote repository to speed up loading task.
good catch!
Jul 26 2018
zack added a comment to T1158: hg loader: Clean up wrong snapshots/releases during hg loading of googlecode.
In T1158#21531, @ardumont wrote:
Jul 25 2018
In T960#21455, @moranegg wrote:
> Regarding implementation, there are no plans to implement it on the horizon; it is something to consider for the priority/yearly planning.
I can also open a review documentation subtask.
In T422#21473, @ardumont wrote:
> If your consumer is actually an organization or service that will be downloading a lot of packages from PyPI, consider using your own index mirror or cache.

That's not a sustainable approach: if we chose that path for all the forges we need to archive, it would be hard to sustain in terms of infrastructure and maintenance.
better LWN link to the actual article covering this: https://lwn.net/Articles/751458/
In T420#21471, @ardumont wrote:
> Looking at the FAQ [4], they also (now?) recommend bandersnatch. Quoting it:
Jul 23 2018
I'll be AFK for a while, so I can't check the diff, but if you (@moranegg) can point me to the current version (on docs.s.o?, if it's deployed), I'll be happy to have a look before it's implemented.
Jul 20 2018
In T336#21437, @ardumont wrote:
> E.g., you don't "schedule" the addition of an entire forge as a single task,

Yes, there are two tasks for now (incremental, full), but if we also hide that detail within T1157, then that could be a win, I think ;)
In T336#21431, @ardumont wrote:
> Is adding a supported forge (e.g., a GitLab instance) considered a possible save-now request?
zack added inline comments to D395: swh-loader-mercurial: Fix invalid release target and add missing data.
zack added a project to T1157: Generic scheduler task creation according to task type: Scheduling utilities.
Jul 19 2018
Thanks for spotting. We also need a separate task to correct the revisions that were already loaded in the archive. Can you please file it? (tag "archive content")
Good idea!
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.
That said, I do think it is important to have the metadata accessible; and keep in mind that, with the contextual URL used by HAL, the metadata is easily found!
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.
- nobody (not even HAL, or any other depositor, including the ones concerned by the compliance use cases) can independently recompute this swh-id SR, because it depends not only on the metadata added, but also on the particular mangling of that metadata done during ingestion, which may well change over time. Providing only SR as the swh-id for such a deposit makes it impossible for somebody who has a copy of the same code, and an article mentioning the swh-id SR, to check that the code is the same without accessing SWH: that would make us a middleman, and for our long-term strategy we do not want middlemen, not even us.
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.
TL;DR: by ingesting a revision and not returning its ID, we will have a protocol that — at the protocol level — loses information, and that is a bad idea.
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.
In T1152#21324, @rdicosmo wrote:
> It is essential for reproducibility that the swh-id offered to researchers to reference a deposited piece of software depend only on the software deposited itself: if three papers use the same software tree, they must show the same swh-id, no matter whether this software tree has been deposited once, twice, or three times. In the case of .zip/.tar files this is the swh-id of the root directory, not the swh-id of the synthetic commit.
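The reason the directory swh-id depends only on the software tree itself is that swh-ids are intrinsic, Merkle-style identifiers. For file contents, the identifier is the git-compatible SHA-1 over a `blob <length>\0` header followed by the bytes, so any party holding the same bytes can recompute it independently. A minimal sketch (stdlib only; directory ids work analogously over git-style tree objects):

```python
import hashlib

def swhid_of_content(data: bytes) -> str:
    """Git-compatible intrinsic identifier of a file's content:
    sha1 over the header b"blob <length>\\0" followed by the bytes."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

# The id depends only on the bytes, so two independent depositors
# of the same file always obtain the same identifier.
assert swhid_of_content(b"hello\n") == swhid_of_content(b"hello\n")
```

This is exactly why the synthetic revision id SR cannot play the same role: its input includes ingestion-time metadata that outsiders do not control.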
Jul 18 2018
zack added a comment to T1152: deposit of tarball/zip: return as main swh-id the directory id, add the synthetic revision id as ancillary information.
Can you (and/or @rdicosmo) elaborate on the rationale for this?
Jul 17 2018
great, thanks for working on this!
check in OSCON 2018 slides
Jul 12 2018
zack committed rDMODad2c349864aa: refactor CLI tests to avoid duplicate assertion pairs (authored by zack).
zack committed rDMODabffb2255753: cli.py: prefer os.fsdecode() over manual fiddling with locale.getpref... (authored by zack).
zack committed rDMOD07208f047d18: swh-identify: follow symlinks for CLI arguments (by default) (authored by zack).
zack committed rDMOD89f8d114b4f9: swh-identify: add support for passing multiple CLI arguments (authored by zack).
zack closed T1134: swh-identify: support multiple path arguments as Resolved by committing rDMOD89f8d114b4f9: swh-identify: add support for passing multiple CLI arguments.
zack closed T1133: swh-identify: show filename in output as Resolved by committing rDMODf53989093669: swh-identify: show filename in output (by default).
zack committed rDMODf53989093669: swh-identify: show filename in output (by default) (authored by zack).
zack closed T1133: swh-identify: show filename in output, a subtask of T1136: swh-identify: support recursive checksumming of directories, as Resolved.
zack triaged T1135: swh-identify: follow symlink by default for paths given as args as Normal priority.
Jun 28 2018
zack committed rDSNIP3d22648a68a9: sql/swh-graph: add driver script to re-launch (authored by zack).
zack committed rDSNIP08a41c8df48d: sql/swh-graph: update script to take snapshots in accounts (authored by zack).
zack triaged T1123: refuse deposit submissions that contain a single archive file (within the deposit archive) as Normal priority.
zack renamed T1122: properly handle ingestion of archives within archives (recursive extraction) from Decide how to handle software deposits containing double archive wrapping to properly handle ingestion of archives within archives (recursive extraction).
zack triaged T1122: properly handle ingestion of archives within archives (recursive extraction) as Normal priority.
The general problem (see below for the deposit-specific case) is indeed complex to deal with, both conceptually in a pure Merkle setting and practically due to the existence of zip bombs. I think a workable solution might be to ingest the archive as-is and also ingest a separate directory corresponding to the archive content, with some metadata linking the two. That way, by default we will only return what we have ingested (without recursion), but we will offer ways to dig in recursively, e.g., in the web app. There will be plenty of devils in plenty of details for this, though.
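To make the zip-bomb concern concrete, here is a hedged sketch of one way an extractor could bound recursion depth and declared uncompressed size before unpacking nested archives. The limits and the helper name are hypothetical, not the actual loader implementation; stdlib only, zip-only for brevity:

```python
import zipfile
from pathlib import Path

MAX_DEPTH = 2              # do not unpack archives nested deeper than this
MAX_TOTAL_BYTES = 1 << 30  # refuse archives declaring > 1 GiB uncompressed

def extract_with_limits(archive: Path, dest: Path, depth: int = 0) -> None:
    """Unpack `archive` into `dest`, recursing into inner zips only up to
    MAX_DEPTH, and rejecting suspiciously large declared sizes up front."""
    if depth >= MAX_DEPTH:
        return  # keep the inner archive as an opaque file, do not recurse
    with zipfile.ZipFile(archive) as zf:
        if sum(info.file_size for info in zf.infolist()) > MAX_TOTAL_BYTES:
            raise ValueError(f"{archive}: declared uncompressed size too large")
        zf.extractall(dest)
    for inner in dest.rglob("*.zip"):
        extract_with_limits(inner, inner.with_suffix(".d"), depth + 1)
```

Note that `file_size` is self-declared by the archive, so a hard cap on bytes actually written out would still be needed in a production loader.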
Jun 27 2018
Just as an idea for the whitelist/blacklist URL-pattern UI, here is what Adblock does, which is quite nice:
zack renamed T1119: save code now submission form from save origin now web form to save code now submission form.
zack renamed T1120: save code now moderation UI from save origin now moderation UI to save code now moderation UI.
I've generalized the title of this task, will add sub-tasks for the specific features that are still missing to complete this.
zack triaged T1022: SWORD deposit requesting to save content existing on an external code hosting platform as Normal priority.
Jun 26 2018
I agree we need a more user-friendly way of resolving IDs. (And, in passing, I think we also need an API endpoint /resolve for programmatically resolving PIDs.)
But rather than adding a separate search form, I think we should generalize the current one into a Google-style, catch-all search box.
zack added a comment to T1115: Improve error messages when resolving PURLs containing a broken/incorrect origin.
[Aside on the actual bug here: @rdicosmo, can you change the edit policy of this task to "public"? It's the default and generally the right one, as it allows doing things like changing task tags.]
Jun 25 2018
Jun 21 2018
zack added inline comments to D346: identifiers: Make invalid persistent identifier parsing raise error.
zack requested changes to D346: identifiers: Make invalid persistent identifier parsing raise error.
Jun 20 2018
zack added a comment to T1104: parse_persistent_identifier() should raise a parsing exception on invalid identifiers.
In T1104#20616, @ardumont wrote:
> I recall some remarks about the persistent identifier representation being too simple or something.

I don't know what's wrong with that simple representation, as:
- everyone can manipulate a dict
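As a minimal illustration of both points under discussion here, a parser can raise on invalid identifiers while still returning a plain dict for valid ones. This is a hypothetical sketch, not the actual swh.model implementation, and it handles only core identifiers (no qualifiers):

```python
import re

# Core persistent identifier syntax: "swh:1:<type>:<40 hex digits>"
SWHID_RE = re.compile(
    r"^swh:1:(?P<object_type>cnt|dir|rev|rel|snp):(?P<object_id>[0-9a-f]{40})$"
)

def parse_pid(pid: str) -> dict:
    """Parse a core persistent identifier into a plain dict, raising
    ValueError (rather than returning a partial result) on bad input."""
    match = SWHID_RE.match(pid)
    if match is None:
        raise ValueError(f"invalid persistent identifier: {pid!r}")
    return {"namespace": "swh", "scheme_version": 1, **match.groupdict()}
```

Raising keeps the error visible at the call site, while the dict return value stays trivial for any caller to consume.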
Jun 19 2018
zack edited projects for T682: Ingest Google Code Mercurial repositories, added: Archive coverage; removed Archive content.
zack edited projects for T592: ingest bitbucket git repositories, added: Archive coverage; removed Archive content.
zack edited projects for T561: ingest bitbucket (meta task), added: Archive coverage; removed Archive content.
zack edited projects for T419: ingest PyPI into the Software Heritage archive (meta task), added: Archive coverage; removed Archive content.
zack edited projects for T376: ingest git.eclipse.org repositories, added: Archive coverage; removed Archive content.
zack edited projects for T593: ingest bitbucket hg/mercurial repositories, added: Archive coverage; removed Archive content.
zack edited projects for T367: ingest Google Code repositories, added: Archive coverage; removed Archive content.
zack edited projects for T617: ingest Google Code Subversion repositories, added: Archive coverage; removed Archive content.
zack edited projects for T1002: ingest Hackage, the Haskell package repository (meta task), added: Archive coverage; removed Archive content, General.
zack edited projects for T1086: ingest Debian's Alioth (archived) repositories (meta-task), added: Archive coverage; removed Archive content, General.
zack edited projects for T312: Gitorious import: ingest repositories, added: Archive coverage; removed Archive content.
zack edited projects for T673: ingest Google Code Git repositories, added: Archive coverage; removed Archive content.
zack edited projects for T1111: ingest GitLab.com (meta-task), added: Archive coverage; removed Archive content.
zack committed rDMOD0d5bc1774829: add swh-identify CLI tool to compute persistent identifiers (authored by zack).
Jun 18 2018
yes! +1 :-)
Jun 16 2018
zack changed the status of T1039: add swh-model CLI front-end to compute persistent identifiers from Open to Work in Progress.
Herald added a reviewer for D345: add swh-identify CLI tool to compute persistent identifiers: Reviewers.