The identification part of this task has been done with documenting/implementing our PIDs, the rest is more suited for the software citation work on which @moranegg is actively working.
- Queries
- All Stories
- Search
- Advanced Search
- Transactions
- Transaction Logs
Advanced Search
Oct 4 2018
this has been done in bc30e8bc60ac3a310f91a15b5692e6b9bc6a30a3
this is now done (YAY :-)), closing the task to reflect current status
Oct 1 2018
Sep 4 2018
Aug 24 2018
A priori, at the current speed, ~7.5 days remain until the end of the gitlab origins ingestion.
Aug 3 2018
Jul 26 2018
Jul 25 2018
Jul 24 2018
Jul 21 2018
But that's listing, not loading. It's not clear to me that a user who added a forge would be interested in knowing when we're done adding its origins; that's just an implementation detail. The user will want to know when we have archived all of it at least once, which is complicated to define. It might be enough to give visibility to when the listing is done, but it'll certainly require a different user-facing explanation than saving an origin.
Jul 20 2018
In T336#21437, @ardumont wrote:
> E.g., you don't "schedule" the addition of an entire forge as a single task,
Yes, there are 2 tasks for now (incremental, full), but if we also hide that detail within T1157... then that could be a win, I think ;)
E.g., you don't "schedule" the addition of an entire forge as a single task,
In T336#21431, @ardumont wrote:
> Is adding a supported forge (e.g. a gitlab instance) considered a possible save now request?
Is adding a supported forge (e.g. a gitlab instance) considered a possible save now request?
Jul 19 2018
Jul 18 2018
Jul 17 2018
Jul 9 2018
Jul 5 2018
Some instances @olasd mentioned to me that qualify as gitlab instances (in parentheses, their current size in number of repositories):
- https://0xacab.org/api/v4/projects/ (600)
- https://framagit.org/api/v4/projects/ (8619)
- https://salsa.debian.org/api/v4/projects/ (25155)
- https://gitlab.com/api/v4/projects/ (567086)
- https://gitlab.freedesktop.org/api/v4/projects/ (254)
- https://gitlab.gnome.org/api/v4/projects/ (3247)
- https://gitlab.inria.fr/api/v4/projects/ (837)
- ...
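Listing such an instance goes through GitLab's paginated `/api/v4/projects` endpoint, which signals the next page via the `X-Next-Page` response header (empty when the listing is exhausted). A minimal sketch of the pagination plumbing — helper names are hypothetical, not the actual lister code:

```python
from urllib.parse import urlencode


def projects_page_url(instance_url, page, per_page=100):
    """Build the GitLab API v4 URL listing one page of projects.

    `instance_url` is the forge root, e.g. "https://framagit.org".
    GitLab caps `per_page` at 100 for this endpoint.
    """
    query = urlencode({
        "page": page,
        "per_page": per_page,
        "order_by": "id",
        "sort": "asc",
    })
    return f"{instance_url}/api/v4/projects?{query}"


def next_page(response_headers):
    """Return the next page number from GitLab's pagination headers,
    or None when X-Next-Page is empty (listing exhausted)."""
    value = response_headers.get("X-Next-Page", "")
    return int(value) if value else None
```

A loop driving the listing would then fetch `projects_page_url(base, page)` until `next_page(...)` returns None.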
Jul 3 2018
Jun 28 2018
The general problem (see below for the deposit-specific case) is indeed complex to deal with (both conceptually, in a pure Merkle setting, and practically, due to the existence of zip bombs). I think a workable solution might be to ingest the archive as-is and also ingest a separate directory corresponding to the archive content, with some metadata linking the two. That way, by default we will only return what we have ingested (without recursion), but we will offer ways to dig in recursively, e.g., in the web app. There will be plenty of devils in plenty of details for this, though.
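As a rough illustration of the zip-bomb concern mentioned above: the declared uncompressed sizes can be summed from the archive's index before extracting anything, and the extraction refused past a cap. A sketch with an arbitrary cap, not the actual loader code:

```python
import zipfile

MAX_UNPACKED_SIZE = 1 << 30  # arbitrary 1 GiB cap for illustration


def safe_total_size(zip_file, max_size=MAX_UNPACKED_SIZE):
    """Sum the declared uncompressed sizes in a zip (path or file
    object) before extracting anything, refusing archives that would
    expand past the cap."""
    with zipfile.ZipFile(zip_file) as zf:
        total = sum(info.file_size for info in zf.infolist())
    if total > max_size:
        raise ValueError(f"archive would unpack to {total} bytes; refusing")
    return total
```

Note the sizes come from the zip's own metadata, so a hostile archive can lie; a real implementation would also cap bytes actually written during extraction.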
Jun 27 2018
I've generalized the title of this task, will add sub-tasks for the specific features that are still missing to complete this.
Jun 25 2018
Jun 19 2018
Jun 14 2018
I completely agree that 'filename' is not enough, and adding a new piece of context each time isn't a good solution.
Both path strategies (integers vs identifiers) are interesting.
Here is a concrete proposal for the path language:
Well, there are other scenarios: like us being forced to remove content for legal reasons. But note that I'm not arguing against the path-based approach. The risk exists only for paths encoded using *integers*, because they're by construction relative to the object you traverse. You can have paths that contain the full-step information (e.g., a file/directory name, or a commit identifier), and those paths would be resolvable even if you lose access to intermediate objects. The problem with those kinds of paths is that they are much longer than the integer-based ones. That robustness-vs-compactness trade-off is the tough one I was referring to.
I see your point, but let's remember that here we want to provide a means for a user A to encode efficiently the context information necessary for another user B to be shown the same view of the archive as the one A has.
It just occurred to me that this works (in the sense that the paths will be resolvable) only if we have all the objects in the path from the snapshot down to the pointed object, which is not something we can guarantee in general — e.g., we might have archived a repository which had missing objects in the first place.
It is all contextual information, which would not make it impossible to see the final object you're pointing to. But this issue calls into question the robustness of integer-based paths for our purposes here. For instance, an fpath based on actual file/directory names will always be displayable; one based on integers will not be.
Tough trade-off…
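To make the trade-off concrete, here is a toy sketch (hypothetical data structure, not swh-model) of the two encodings: name-based steps resolve from the objects alone, while integer-based steps only make sense relative to the full, sorted entry list of each intermediate directory.

```python
# Toy directory tree: each directory maps entry names to children,
# leaves are (fake) content identifiers.
tree = {"src": {"main.c": "cnt:aaaa"}, "README": "cnt:bbbb"}


def resolve_by_name(root, steps):
    """Name-based path: each step is an entry name. Longer, but
    resolvable even if sibling entries are missing."""
    node = root
    for name in steps:
        node = node[name]
    return node


def resolve_by_index(root, steps):
    """Integer-based path: each step is an index into the sorted
    entry list. Compact, but meaningless without the intermediate
    objects' full entry lists."""
    node = root
    for i in steps:
        node = node[sorted(node)[i]]
    return node
```

Here `resolve_by_name(tree, ["src", "main.c"])` and `resolve_by_index(tree, [1, 0])` point at the same content, but the integer path breaks as soon as any intermediate directory listing is lost or altered.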
Actually, we can generalize the approach even a bit more.
Jun 13 2018
Thanks for starting this... it's an important discussion, and it goes quite beyond the need of a "filename" attribute in our family of context attributes :-)
Seems like a nice way to go: we would also need some easy-to-use interface to edit the "visibility" bit too...
The simplest approximation of this that I can see is adding a visibility column to the origin table, and tweaking that manually when we get a request.
Jun 12 2018
(tagging as General, while we discuss it)
Jun 6 2018
Jun 5 2018
closing, now that all sub-tasks have been completed
May 29 2018
May 18 2018
I agree with your proposal.
So I think the best option here is to use named parameters as optional parts in the identifiers. This will give us some flexibility for adding new ones in the future. Regarding the separator, we could use either \ or |, as they should not interfere with extracting origin urls.
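As an illustration of that proposal (hypothetical syntax; the separator choice is still open, `|` is used below), splitting such an identifier into its core part and its optional named parts could look like:

```python
def split_identifier(identifier, separator="|"):
    """Split a core "swh:1:..." identifier from optional named parts,
    e.g. "swh:1:cnt:<hex>|origin=<url>" (illustrative syntax for the
    proposal above, not a finalized format).

    Returns (core, params) where params maps names to values.
    """
    core, *extras = identifier.split(separator)
    # Each optional part is "name=value"; split on the first "=" only,
    # since values such as origin urls may themselves contain "=".
    params = dict(part.split("=", 1) for part in extras)
    return core, params
```

This keeps the core identifier parseable on its own, and unknown named parts can simply be ignored by older consumers.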
Thanks for the clear explanation.
May 17 2018
The problems I see with optional URL parameters, instead of modifying the identifiers themselves, are the following:
As I am currently implementing the task, I am wondering if adding optional parts to a swh identifier v1 is the adequate solution.
May 16 2018
Apr 28 2018
Mar 30 2018
@ardumont, I think you can resolve this one ;-)
Mar 27 2018
relevant highlights:
Mar 25 2018
Update from joeyh: there is no need for any specific hack to maintain a local mirror; it is just an undocumented feature:
Mar 24 2018
Mar 7 2018
Feb 15 2018
Feb 6 2018
swh-loader-git and swh-loader-debian have now been migrated to snapshots as well, and restarted.
Feb 2 2018
Current status on the development migration towards snapshot (branch wip/snapshot(s)) as far as I know:
Jan 14 2018
Jan 12 2018
Yeah, I was thinking about it while running earlier today :) I'm not yet sure if I'll specify the meaning of the sha1 of each object here, or just say that the sha1 is the primary key of the object and refer to swh-model; we'll see.
When writing the documentation, please be sure to be explicit about whether the content identifier is its sha1 or its salted sha1_git, because it's not clear from this discussion which one it is :)
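For reference, the two hashes in question differ only in the "salt": the plain sha1 covers the raw bytes, while sha1_git is the git-compatible blob hash, i.e. a sha1 over a `blob <length>\0` header followed by the bytes. A minimal sketch of both:

```python
import hashlib


def content_sha1(data: bytes) -> str:
    """Plain sha1 over the raw content bytes."""
    return hashlib.sha1(data).hexdigest()


def content_sha1_git(data: bytes) -> str:
    """Git-compatible blob hash: sha1 over a "blob <length>\\0" header
    followed by the raw bytes -- the "salted" variant discussed above."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

For the empty content the two values are the well-known `da39a3...` (plain sha1) and `e69de2...` (git's empty blob), which makes the difference easy to spot in practice.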
In T335#16990, @zack wrote:
> identifier = "swh" ":" scheme_version ":" obj_type ":" obj_id ;
> scheme_version = "1" ;
> obj_type = "snp"  # snapshot
>          | "rel"  # release
>          | "rev"  # revision
>          | "dir"  # directory
>          | "cnt"  # content
>          ;
> obj_id = object sha1, hex-encoded with (lowercase) ASCII characters ;
in the future, if we switch to blake2/256 (or equivalent length checksums), the examples would become something like:
concrete, tentative proposal (EBNF):
identifier = "swh" ":" scheme_version ":" obj_type ":" obj_id ;
scheme_version = "1" ;
obj_type = "snp"  # snapshot
         | "rel"  # release
         | "rev"  # revision
         | "dir"  # directory
         | "cnt"  # content
         ;
obj_id = object sha1, hex-encoded with (lowercase) ASCII characters ;
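The grammar above translates directly into a validation regex; a minimal sketch (v1, sha1-length ids, lowercase hex only, per the EBNF):

```python
import re

# Direct transcription of the tentative EBNF above.
SWH_ID_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):[0-9a-f]{40}")


def is_valid_identifier(identifier: str) -> bool:
    """True iff `identifier` matches the proposed v1 scheme."""
    return SWH_ID_RE.fullmatch(identifier) is not None
```

If the scheme later moves to blake2/256-length checksums, only the `{40}` length (and the version digit) would need to change.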