
Jul 1 2021

zack requested changes to D5952: changelog: Reference first completion of sourceforge git/svn origins.
Jul 1 2021, 9:33 AM
zack added a comment to D5952: changelog: Reference first completion of sourceforge git/svn origins.

Thanks a lot for this!

Jul 1 2021, 9:32 AM
zack added a comment to T3418: Decide a consistent policy on having multiple archived objects for the same extid.

(3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions

I don't understand. Option 3 is to remove relations between extids and SWHID, so it won't be resolvable anymore.

Jul 1 2021, 9:01 AM · Storage manager, Mercurial loader

Jun 30 2021

zack added a comment to T3418: Decide a consistent policy on having multiple archived objects for the same extid.

I have the feeling that option (1) will lead, in the long run, to an explosion in the size of the mapping, which will eventually make us converge (slowly) toward option (3).

Jun 30 2021, 7:33 PM · Storage manager, Mercurial loader

Jun 21 2021

zack accepted D5899: swh-model: get SWHID from Content/Directory objects in from_disk.
Jun 21 2021, 4:53 PM
zack committed R183:6da3bade1513: add Demeyer, Mens book on Software Evolution (authored by zack).
add Demeyer, Mens book on Software Evolution
Jun 21 2021, 9:02 AM

Jun 19 2021

zack committed rDGRPHf885bdb0099e: git2graph: bugfix: traverse all nodes even when edges are not traversed (authored by zack).
git2graph: bugfix: traverse all nodes even when edges are not traversed
Jun 19 2021, 2:12 PM
zack committed rDGRPH3adff1c45b7a: tools/dir2graph: new tool to convert a local dir to nodes/edges files (authored by zack).
tools/dir2graph: new tool to convert a local dir to nodes/edges files
Jun 19 2021, 2:03 PM

Jun 18 2021

zack added a comment to D5899: swh-model: get SWHID from Content/Directory objects in from_disk.

I'll wait for @vlorentz and @olasd before writing the unit test (in case they want to use a different approach)

Jun 18 2021, 6:33 PM
zack updated subscribers of D5899: swh-model: get SWHID from Content/Directory objects in from_disk.

@olasd @vlorentz: I've noted down only nits and "classics" in the above review. If you want to chime in on the approach (where the methods are, properties, caching, etc.) please do!

Jun 18 2021, 5:13 PM
zack requested changes to D5899: swh-model: get SWHID from Content/Directory objects in from_disk.
Jun 18 2021, 5:10 PM
zack added a comment to D5899: swh-model: get SWHID from Content/Directory objects in from_disk.

LGTM in general, but needs unit tests (in addition to the two nitpicks above about docstrings)

Jun 18 2021, 5:10 PM
zack triaged T3393: add swhid() method to from_disk classes as Normal priority.
Jun 18 2021, 11:54 AM · Data Model

Jun 17 2021

zack committed rMSLD358de8ea3b7a: check in slides for GraphRM talk (authored by zack).
check in slides for GraphRM talk
Jun 17 2021, 3:12 PM
zack committed rMSLD40a6bd7d6e97: check-in (old) last-bits for telecom paris talk (authored by zack).
check-in (old) last-bits for telecom paris talk
Jun 17 2021, 3:12 PM
zack committed rDGRPH0068f61008e6: FindEarliestRevision: make timing optional with a dedicated CLI flag (authored by zack).
FindEarliestRevision: make timing optional with a dedicated CLI flag
Jun 17 2021, 10:47 AM

Jun 16 2021

zack committed R183:9ef36ff25c1f: add a bunch of entries about network studies on software (authored by zack).
add a bunch of entries about network studies on software
Jun 16 2021, 2:56 PM
zack committed R183:b6e967f148c4: add papers: mockus2009ammassing and gao2007archive (authored by zack).
add papers: mockus2009ammassing and gao2007archive
Jun 16 2021, 2:37 PM
zack added a comment to D5879: identify: Fix exclude_patterns parameter type for identify_object.

I think this also needs bumping the versioned dependency on swh-model (and a release of that).

Jun 16 2021, 12:22 PM

Jun 15 2021

zack triaged T3383: swh identify --recursive breaks --exclude, resulting in a "AttributeError: 'str' object has no attribute 'decode'" traceback as High priority.
Jun 15 2021, 4:48 PM · Data Model
zack updated subscribers of T3382: Save process seems to be stuck.

Thanks @ardumont for following up on this task.

Jun 15 2021, 4:27 PM · Save Code Now
zack closed D5551: Fix swh-scanner for python 3.7 and >= 3.8.

landed in d58bcb59a0999ae124de23db88fc9f73603d452a

Jun 15 2021, 11:11 AM
zack commandeered D5551: Fix swh-scanner for python 3.7 and >= 3.8.
Jun 15 2021, 11:11 AM
zack closed T3209: Fix swh-scanner for python > 3.7 as Resolved by committing rDTSCNd58bcb59a099: Fix swh-scanner for python 3.7 and >= 3.8.
Jun 15 2021, 11:10 AM · Code scanner
zack committed rDTSCNd58bcb59a099: Fix swh-scanner for python 3.7 and >= 3.8 (authored by aastha1999).
Fix swh-scanner for python 3.7 and >= 3.8
Jun 15 2021, 11:10 AM

Jun 11 2021

zack accepted D5825: swh-model: add recursive option.
Jun 11 2021, 2:54 PM
zack removed a reviewer for D5825: swh-model: add recursive option: vlorentz.
Jun 11 2021, 2:51 PM
zack requested changes to D5825: swh-model: add recursive option.
Jun 11 2021, 1:21 PM
zack added a project to T3374: Ingest sourceforge repositories (origins of type git, svn, hg): Archive coverage.

Note: when this is (reasonably) done, we should document the addition of SourceForge to the archive coverage page at archive.s.o and also to the archive changelog.

Jun 11 2021, 12:27 PM · System administration, Archive coverage, Origin-SourceForge
zack renamed T3349: use swh.model.merkle/from_disk instead of swh.scanner.model from consider using swh.model.merkle/from_disk instead of swh.scanner.model to use swh.model.merkle/from_disk instead of swh.scanner.model.
Jun 11 2021, 11:16 AM · Code scanner
zack added a subtask for T3349: use swh.model.merkle/from_disk instead of swh.scanner.model: T2730: scanner: should output the root SWHID as well.
Jun 11 2021, 11:16 AM · Code scanner
zack added a parent task for T2730: scanner: should output the root SWHID as well: T3349: use swh.model.merkle/from_disk instead of swh.scanner.model.
Jun 11 2021, 11:16 AM · Easy hack, Code scanner

Jun 10 2021

zack abandoned D5420: cli/identify: Add support for --recursive.

Superseded by D5825. Abandoning this one.

Jun 10 2021, 8:38 PM
zack commandeered D5420: cli/identify: Add support for --recursive.
Jun 10 2021, 8:38 PM
zack added a comment to D5825: swh-model: add recursive option.

LGTM, thanks !

Jun 10 2021, 8:38 PM
zack accepted D5825: swh-model: add recursive option.
Jun 10 2021, 8:37 PM

Jun 9 2021

zack requested changes to D5825: swh-model: add recursive option.
Jun 9 2021, 11:57 AM
zack added a comment to D5825: swh-model: add recursive option.

wouldn't it make sense to have a separate command (eg. recursive-identify instead of identify --recursive)?

Jun 9 2021, 10:08 AM

Jun 8 2021

zack added a comment to D5825: swh-model: add recursive option.

But what was the process before? Did it ignore directory entries?

It checks only the given directories, generating a from_disk.Directory object for each directory. Should it use the same logic used for the recursive option?

Jun 8 2021, 8:02 PM
zack added a comment to D5825: swh-model: add recursive option.
  1. If relevant, could you implement --verify too?

Sure, if it is useful I could open another diff for it

Jun 8 2021, 5:37 PM
zack added a project to T3350: Deploy sourceforge lister in production: Archive coverage.
Jun 8 2021, 11:45 AM · Archive coverage, System administration, Origin-SourceForge
zack added a comment to T3366: Improve the page rendering mechanism in the web UI.

I'm adding a note here about something to consider in terms of pros/cons: accessibility. As for the most part we are archiving textual information, we really want it to be accessible to all users. Right now we go further than that, ensuring that the Web UI can be browsed with a textual browser: so, for instance, w3m https://archive.softwareheritage.org/swh:1:cnt:c839dea9e8e6f0528b468214348fee8669b305b2 just works out of the box. I'm not up to date on the accessibility impact of current JS frameworks, nor do I think we should require that the archive be browsable without JavaScript enabled (by today's standards "browsable with free JavaScript" is probably good enough for us, and we have a curl-able API anyway), but accessibility per se is definitely going to be a requirement.

Jun 8 2021, 11:19 AM · Web app
zack shifted T3366: Improve the page rendering mechanism in the web UI from the Restricted Space space to the S1 Public space.
Jun 8 2021, 11:13 AM · Web app
zack renamed T3366: Improve the page rendering mechanism in the web UI from Improve the page rendering mechanism in the web to Improve the page rendering mechanism in the web UI.
Jun 8 2021, 11:13 AM · Web app
zack triaged T3366: Improve the page rendering mechanism in the web UI as Normal priority.
Jun 8 2021, 11:13 AM · Web app

Jun 7 2021

zack added a comment to T3149: Benchmark software for the object storage.

how about just collecting all raw timings in an output CSV file (or several files if needed) and compute the stats downstream (e.g., with pandas)?
that would allow changing the percentiles later on as well as compute different stats, without having to rerun the benchmarks
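
A minimal sketch of the "raw timings first, stats downstream" idea. To stay dependency-free it uses only the Python standard library rather than pandas, and the CSV layout (an `operation` and a `duration_s` column) and the `summarize` helper are assumptions made for illustration, not the benchmark's actual output format:

```python
import csv
import io
import statistics


def summarize(csv_text, column="duration_s"):
    """Compute summary statistics from raw timings stored as CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    timings = [float(row[column]) for row in rows]
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(timings, n=100, method="inclusive")
    return {
        "count": len(timings),
        "mean": statistics.fmean(timings),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical raw benchmark output: one row per measured operation.
RAW = "operation,duration_s\n" + "".join(
    f"read,{i / 100}\n" for i in range(1, 101)
)
stats = summarize(RAW)
```

Because the raw rows are kept, a different percentile only requires reading another index of `cuts`; the benchmark itself does not need to be rerun.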

Jun 7 2021, 3:21 PM · Object storage
zack renamed Save Code Now from SaveCodeNow to Save Code Now.
Jun 7 2021, 9:45 AM
zack triaged T3361: "Save code now" seems to be stuck as High priority.
Jun 7 2021, 9:44 AM · Save Code Now

Jun 4 2021

zack added a comment to D5816: loader: add an hg-specific mapping for branching.

Minor request, which can also be implemented in a subsequent commit, can we have the mapping documented somewhere? As a first approximation even a docstring would do, so that it will show up at docs.s.o. (Not sure if the files being modified here are the most relevant place for it though, it can also go at the root of the mercurial loader Python hierarchy, up to you !)

Jun 4 2021, 11:39 AM
zack resigned from D5816: loader: add an hg-specific mapping for branching.
Jun 4 2021, 11:36 AM

Jun 3 2021

zack added a comment to T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).

That explains it, and it's good enough for me, thanks :)

Jun 3 2021, 2:30 PM · Mercurial loader
zack updated the task description for T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).
Jun 3 2021, 2:30 PM · Mercurial loader
zack added a comment to T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).

My remaining question then is: how about, instead of branch-{tip,heads,closed-heads}/name we use branches/{heads,closed,tip}/name ?

Jun 3 2021, 2:20 PM · Mercurial loader
zack added a comment to T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).

They are all branch heads (git "branches" are about heads too, bookmarks too), so a heads/ prefix does not bring much.

Jun 3 2021, 1:37 PM · Mercurial loader
zack added a comment to T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).

My point here is for users looking at the structure to easily distinguish between the different mapping formats. Something based on the "visit data" and associated documentation seems quite fragile.

Jun 3 2021, 1:11 PM · Mercurial loader

Jun 2 2021

zack requested changes to D5816: loader: add an hg-specific mapping for branching.

(I'm marking this as on hold until we have reached a decision on T3352, just to avoid it getting deployed by mistake. But feel free to go ahead with the rest of the review or even override, if you think there is a better safeguard to avoid it getting deployed)

Jun 2 2021, 8:06 PM

May 31 2021

zack added a comment to T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).

Understood. To explain my thinking here, the refs/... structure is something we picked to represent git branch names as faithfully as possible, adding as little as possible on top of it. In trying to represent branch names from another VCS, as a first approximation I'd rather reuse the same *approach* than a *result* that is similar, if that makes sense. So, to pivot the question around, what is the minimal (also in the sense that it is shorter / has less cruft) naming scheme that would allow us to represent without ambiguity all the Mercurial naming aspects that you want to capture?

May 31 2021, 10:24 PM · Mercurial loader
zack added a comment to T3352: Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip).

Is the ability to recognize that a snapshot comes from Mercurial an actual goal here? I don't think we care about "clashes" between snapshots created from different VCSs, but maybe I'm missing something.

May 31 2021, 9:03 PM · Mercurial loader

May 28 2021

zack changed the status of T3349: use swh.model.merkle/from_disk instead of swh.scanner.model from Open to Work in Progress.
May 28 2021, 11:13 AM · Code scanner
zack triaged T3349: use swh.model.merkle/from_disk instead of swh.scanner.model as Normal priority.
May 28 2021, 11:13 AM · Code scanner

May 24 2021

zack added a comment to T3341: Move real-time discussion away from Freenode.

Thanks for raising this. I wanted to do so too, but couldn't find the time :)

May 24 2021, 2:15 PM · Community Building

May 21 2021

zack added a comment to T3313: Web API: per-user accounting.

@anlambert @vsellier: question about this, in order to document the status quo.
Currently, where are the django web app logs stored and for how long are they kept?

May 21 2021, 1:36 PM · System administration, Web app

May 19 2021

zack added a comment to T3202: Help new users discover the features available in the archive browsing view.

What I can do is enable the guided tour by configuration. This way we can deactivate it in production
until we have something stable and usable, while we can test the feature on staging.

May 19 2021, 5:06 PM · Web app

May 18 2021

zack committed rMSLD10a10574cede: telecom-paris talk: add R-B picture (authored by zack).
telecom-paris talk: add R-B picture
May 18 2021, 1:33 PM
zack committed rMSLD8dabda112ab6: check in slides for talk at Télécom Paris (authored by zack).
check in slides for talk at Télécom Paris
May 18 2021, 1:33 PM
zack committed rMSLD23a4262df8d7: graph compression: update refs and refresh status of WIP/future work (authored by zack).
graph compression: update refs and refresh status of WIP/future work
May 18 2021, 9:54 AM
zack committed rMSLDb0206689ac29: archive coverage slide: avoid vertical overflow (authored by zack).
archive coverage slide: avoid vertical overflow
May 18 2021, 9:51 AM
zack committed rMSLDe9c006d2d707: dataset.org: drop "soon" from Azure availability (authored by zack).
dataset.org: drop "soon" from Azure availability
May 18 2021, 9:51 AM
zack committed rMSLD95a51a7681ff: biblio.org: complete EMSE 2020 ref, add IEEE SW gender study ref (authored by zack).
biblio.org: complete EMSE 2020 ref, add IEEE SW gender study ref
May 18 2021, 9:51 AM
zack updated the task description for T3329: document ORC format dataset availability.
May 18 2021, 9:33 AM · Datasets
zack triaged T3329: document ORC format dataset availability as High priority.
May 18 2021, 9:32 AM · Datasets

May 10 2021

zack accepted D5717: templates/api: Update Rate limiting section in API documentation.
May 10 2021, 11:50 AM
zack published D5717: templates/api: Update Rate limiting section in API documentation for review.
May 10 2021, 11:50 AM
zack closed T3317: web client: known() method raise "400 Client Error" traceback as Invalid.

@zack WebAPIClient.known takes a list of strings, not a string

May 10 2021, 11:05 AM · Web client
zack triaged T3318: scanner should use the known() method of web.client as Low priority.
May 10 2021, 9:02 AM · Code scanner
zack triaged T3317: web client: known() method raise "400 Client Error" traceback as High priority.
May 10 2021, 9:00 AM · Web client

May 8 2021

zack updated the task description for T3316: SWHID v2: determine binary-to-text encoding for checksum part.
May 8 2021, 1:18 PM · Data Model
zack triaged T3316: SWHID v2: determine binary-to-text encoding for checksum part as Normal priority.
May 8 2021, 11:43 AM · Data Model
zack closed T2210: Data Model as Invalid.

Closing this as it was a vague meta-task from the 2020 roadmap (but we'll keep the actual sub-tasks, which were more clearly identified and are still relevant).

May 8 2021, 11:37 AM · Data Model, Roadmap 2020

May 7 2021

zack added a parent task for T735: SourceForge lister: T3315: archive SourceForge.
May 7 2021, 5:25 PM · Origin-SourceForge
zack added a subtask for T3315: archive SourceForge: T735: SourceForge lister.
May 7 2021, 5:25 PM · Archive coverage
zack triaged T3315: archive SourceForge as Normal priority.
May 7 2021, 5:25 PM · Archive coverage
zack triaged T3313: Web API: per-user accounting as Low priority.
May 7 2021, 9:48 AM · System administration, Web app
zack triaged T3312: web API rate limit: 10x more quota for authenticated users as High priority.
May 7 2021, 9:35 AM · Web app

May 6 2021

zack added a comment to T3311: Use .gitmodules to discover origins.

I think the only issue with (3) is not being retroactive

May 6 2021, 6:49 PM · Archive coverage, Git loader
zack added a comment to T3311: Use .gitmodules to discover origins.

This is a good idea, thanks for raising it.

May 6 2021, 6:06 PM · Archive coverage, Git loader
zack added a comment to D5704: keycloak: Set SSO Session Idle to one week, Session Max to one month.

Why 6 hours and not, say, 1 week or even 1 month?
It is very common these days to remain connected for that long, and the UX of having to log in again often is a lot worse.

May 6 2021, 3:00 PM

May 3 2021

zack renamed T3301: graph: add test for the "algo" parameter of walk() from swh-graph: No tests of the "algo" parameter of walk() to graph: add test for the "algo" parameter of walk().
May 3 2021, 6:55 PM · Easy hack, Compressed graph service
zack added a comment to D5420: cli/identify: Add support for --recursive.

@KShivendu zack's comment explains how the code should work, and gives a pointer to an existing implementation of the technique (swh-scanner); this should be enough to start.

But don't feel obligated to continue working on this; as mentioned the task is harder than we expected.

May 3 2021, 1:54 PM

Apr 30 2021

zack accepted D5657: Spool large packfiles to disk instead of consuming tons of memory.

nice hack/trade-off !

Apr 30 2021, 8:39 PM
zack accepted D5654: docs/persistent-identifiers: Add guidelines for fixing invalid SWHIDs (this time for uppercase).

great wording, thanks !

Apr 30 2021, 1:29 PM
zack committed rDTSCN3f8784c726a7: run_benchmark.sh: add missing "scan_time" column header (authored by zack).
run_benchmark.sh: add missing "scan_time" column header
Apr 30 2021, 11:14 AM

Apr 29 2021

zack accepted D5644: scanner-benchmark: add algorithms timings in results.

the fact that algo_min is treated differently than other cases is horrible :-P, but it's not new in this diff, so ok :)
also, in the requirements you should probably put what's the minimum version of swh.core that you need, but that too is not a big deal

Apr 29 2021, 3:01 PM
zack added a comment to T3298: Consider making SWHID handling case insensitive.

Ah, this is an interesting practical problem.
I'm not a fan of changing the spec of SWHID version 1 to make them case insensitive, as it seems to be a significant change (in particular for the code that checks for the syntactic correctness of IDs).
But we can totally add a "SHOULD" section to the resolvers part of the spec recommending (but not mandating) that resolvers treat core SWHIDs as case insensitive. (Of course all the contextual parts cannot be considered case insensitive.)
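
A rough sketch of what such a lenient resolver could do, assuming the version 1 syntax (a core identifier optionally followed by `;`-separated qualifiers). The function name `normalize_for_resolution` is hypothetical and this is not the swh-model implementation; it lowercases only the core part and leaves the contextual qualifiers untouched:

```python
import re

# Core SWHID version 1: scheme, version, object type, 40 lowercase hex digits.
CORE_RE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):[0-9a-f]{40}$")


def normalize_for_resolution(swhid):
    """Lowercase the core part of a SWHID; keep qualifiers exactly as given."""
    core, sep, qualifiers = swhid.partition(";")
    core = core.lower()
    if not CORE_RE.match(core):
        raise ValueError(f"not a valid core SWHID: {core!r}")
    # Qualifiers (e.g. path=...) are case sensitive and are not modified.
    return core + sep + qualifiers
```

The strict syntax check still runs, but only after normalization, so the version 1 grammar itself stays unchanged and the leniency lives entirely on the resolver side.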

Apr 29 2021, 12:17 PM · Data Model, Web app

Apr 28 2021

zack accepted D5629: graph: s/REST/RPC/.
Apr 28 2021, 8:45 AM
zack accepted D5630: vault: s/REST/RPC/.
Apr 28 2021, 8:43 AM
zack accepted D5631: lister: s/REST( API)?/API/.
Apr 28 2021, 8:42 AM
zack accepted D5632: web: s/Graph REST API/Graph RPC API/.
Apr 28 2021, 8:42 AM

Apr 27 2021

zack added a comment to T1576: document the typical cost(s) of hosting an archive mirror.

docs !

Apr 27 2021, 4:06 PM · Documentation, Mirror

Apr 26 2021

zack added a comment to T3087: Implement support for takedown notices (infra, admin tools, workflow).

So what about exports of the archive available on git-annex?

Apr 26 2021, 8:34 AM · Roadmap 2022, meta-task, Roadmap 2021, Web app