Do we really want to do this?
Jun 5 2019
Jun 4 2019
Just a couple of comments:
- the current proposal is ori instead of org as 3-letter stem
- your use cases are all valid, but would equally work with a full URL and with a hashed URL
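To make the "hashed URL" option concrete, here is a minimal sketch, assuming a plain SHA1 over the UTF-8-encoded URL (the actual serialization and hash chosen in the identifier spec may differ):

import hashlib

def origin_swhid(url: str) -> str:
    # Assumption: hash the raw URL bytes; the spec may instead hash a
    # git-like manifest built from the URL.
    return "swh:1:ori:" + hashlib.sha1(url.encode("utf-8")).hexdigest()

# Fixed-size, citable, and usable e.g. as a Cassandra key:
print(origin_swhid("https://github.com/torvalds/linux"))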
May 30 2019
May 29 2019
In T1731#32712, @vlorentz wrote:
Okay then. I'll work on updating the identifier specification.
So, again, what are the remaining issues that prevent you from just going ahead and using URI hashes as Cassandra origin IDs?
Those I listed above, which were more "philosophical" than technical. I started implementing it last Monday anyway, and it looks good.
I don't like the idea of this lister.
In T1731#32684, @vlorentz wrote:
In this case, we'll also need to have an identifier for URL + type, if they want to cite/link to the non-default one.
We could use the "contextual information" mechanism, e.g. swh:1:ori:SHA1;type=git
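For example (purely illustrative, reusing the SHA1 placeholder from the line above), the Git and Mercurial origins sharing one URL could then be cited separately as:
swh:1:ori:SHA1;type=git
swh:1:ori:SHA1;type=hg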
May 28 2019
In T1731#31900, @vlorentz wrote:
This sounds like a good idea.
But it has some weird implications for components that use the concept of "origin head" (web UI and metadata indexers), because they'll use radically different content depending on which loader visited last.
But having two VCSs at the same URL is weird in itself, so 🤷
[ moving here my feedback from D1516 ]
In D1516#34010, @olasd wrote:
I'm not convinced this is such a good idea; this machine is way more than a "db replica" server (it only has one replica, most of its databases are actually primary) and I don't think DNS provides the appropriate granularity level to record this information.
May 26 2019
May 25 2019
I think the only thing missing here is adding the NPM logo to the archive coverage page.
I think this is now done, right @anlambert?
only 3% to go in -lister and -core \o/
these catch-all meta-tasks that will grow forever are not terribly useful; the individual tasks + their subtasks should be enough
closing this catch-all meta-task, the individual subtasks are clear enough
looks like this is done now, as you're deep in the implementation already! closing
We've discussed this back then, and decided in the end to leave it at the Django layer. Closing.
has this been completed since?
can we haz this, please? :)
snapshot count is now there, closing
is this still the case?
this hasn't happened for a long while
Just checking in on this, as we are discussing moving DBs around. Do we still not have a backup for the indexer DB?
If so, priority of this one should probably be raised.
This is done, I've forked off the part about consistently documenting configuration options to T1758.
swh-storage-testdata is gone, closing
we have had this for a while now
@douardda: can I punt this to you to either further investigate or just close as Invalid? 3 years later it might no longer be relevant…
closing, we do have an SVN loader now: it has still some issues, but the bulk of the job is done
how many are left? can we close this as well as T419 now that the PyPI listers/loaders have been in production for a while?
@anlambert what's the status of ingesting very large SVN repos, now that we have put the loader in production?
the archiver is gone, closing
fixed long ago, AFAIK
@olasd recently made a lot of progress on this one.
the archiver is gone, closing
oh, the info is in T192 already
closing
is this still going? can it be closed as obsolete, maybe just noting down here the 4 failed tarballs? (we're going to do a full sweep soon anyway with the new listers/loaders)
what's the scope of this? the Web API? all our APIs?
either way, please tag it appropriately
this is now done, at least based on IP addresses; we'll need (if it doesn't exist yet) a dedicated task for how to do it differently, e.g., using API keys
my take: don't bother (see: T1716#32312)
In T1716#32249, @ardumont wrote:
Webapp/cookers migrated to use the Azure vault instance.
the more I look into this, the more I get convinced that what we should actually remove is the https://www.softwareheritage.org/archive/ page; its content should just be integrated/moved into the homepage of archive.s.o.
May 24 2019
May 23 2019
A nice related work here are the LAW datasets.
May 22 2019
Tangential, but relevant to this discussion: we have previously discussed removing origin types from our notion of origin (there might be a task about it, but I couldn't find it right now).
In D1460#33397, @haltode wrote:
- if edges are not specified, we should follow *all* edges during the visit
Yes, this is what I meant here: "Where by default we can explore the graph following all types of edges"
In D1460#33390, @haltode wrote:
I think that the src_type/dst_type in both the URL and extra_edges is a bit redundant. We could refactor the visit function into the following endpoint:
GET /graph/visit/swh_id/[?allowed_edges=["src_type/dst_type",...]][?direction={forward,backward}]
Where by default we can explore the graph following all types of edges, and restrict it if necessary.
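For illustration, a hypothetical invocation of that endpoint (the base URL, the placeholder SWHID, and the query-string serialization of allowed_edges are assumptions, not part of the diff):

import requests

resp = requests.get(
    "http://localhost:5009/graph/visit/swh:1:rev:" + "0" * 40,
    params={
        # restrict the traversal; omitting allowed_edges would follow all edge types
        "allowed_edges": '["rev/rev", "rev/dir"]',
        "direction": "forward",
    },
)
print(resp.json())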
May 20 2019
In T833#31772, @vlorentz wrote:
- sending a request for each repository would need ~2 to 3 years for a full pass over GitHub. That's with our current infrastructure, so it's not a hard limit.
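For context, a back-of-the-envelope version of that estimate (the repository count and rate limit are assumptions, not figures from the thread):

# ~100M repositories (GitHub's 2018 announcement) at the authenticated API
# rate limit of 5000 requests/hour per token:
repos = 100_000_000
requests_per_hour = 5_000
hours = repos / requests_per_hour      # 20,000 hours
years = hours / (24 * 365)             # ~2.3 years for one full pass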
May 19 2019
In D1460#33101, @haltode wrote:
Here is a revisited version; I also added a starting point parameter for the visit:
GET /graph/visit/src_type/dst_type/src_hash/[?direction={forward,backward}][?extra_edges=["src_type/dst_type",...]]
May 16 2019
It'd be great to also expose the link with the revision.
In D1460#32488, @haltode wrote:
GET /graph/visit/src_type/dst_type/[?direction={forward,backward}][?extra_edge="src_type/dst_type"]*
May 15 2019
@eddelbuettel yeah, if there isn't a standard way to go all the way back in time, it's OK to only ingest what's currently returned as available. In the medium/long term it will converge to having archived everything (w.r.t. the considered time frame) anyway. And we can always retrofit later on stuff that is archived elsewhere. But I wouldn't want to make this a blocker to start archiving what's (easily) listable now.
May 14 2019
In T1689#31540, @olasd wrote:
My suggestion was having *someone* trigger the merge with a comment on the diff, once the tests pass.