Dec 1 2017
Nov 13 2017
Nov 10 2017
Nov 6 2017
Nov 5 2017
(agreed)
Nov 4 2017
PR got merged \m/
Oct 31 2017
Follow up on this:
Oct 27 2017
The revision in question is:
Debugging further, the date generating this error is the following, and it does indeed raise the initial overflow error:
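For reference, Python's datetime conversion raises OverflowError when a timestamp falls outside the range supported by the platform, which is one plausible source for this kind of error; a minimal reproduction with a deliberately out-of-range value (the actual offending date is not reproduced here) would be:

    import datetime

    # Illustrative only: a timestamp far outside the supported range.
    # The actual offending date from the loader is not reproduced here.
    try:
        datetime.datetime.fromtimestamp(2 ** 64)
    except (OverflowError, OSError, ValueError) as exc:
        print(type(exc).__name__, exc)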
Oct 26 2017
Oct 9 2017
It's been running for a while now :)
Sep 18 2017
As a proof of concept, nginx has been manually deployed to uffizi on port 15003. It does alleviate the BadStatusLine issues the archiver was hitting under high load. This "just" needs to be deployed properly.
Sep 15 2017
looks like this is no longer of interest
we're taking a different route for this now, based on @grouss's WIP
Sep 13 2017
Sep 11 2017
Jun 6 2017
Jun 2 2017
May 29 2017
Apr 26 2017
Apr 5 2017
Mar 31 2017
OK, I've taken a closer look at the duplications. So far, duplicates only show up in origins with type 'ftp' (from the GNU injection) and 'git':
Mar 30 2017
Do not yet use the new blake2s256 column for filtering (content_add, skipped_content_add)
Mar 29 2017
Fixes according to the latest review:
- fix unique key computation (using a tuple; see the sketch below)
- fix the SQL for the missing default value (on the new column)
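As a rough illustration of the tuple-based aggregate key (a sketch only; the column set below is an assumption, not necessarily the exact key used in swh.storage):

    # Hypothetical sketch: deduplicate candidate content rows on an
    # aggregate key before insertion, using a tuple of hash columns.
    KEY_COLUMNS = ('sha1', 'sha1_git', 'sha256')

    def unique_key(content):
        """Aggregate key used to detect already-present contents."""
        return tuple(content[c] for c in KEY_COLUMNS)

    def dedupe(contents):
        seen = {}
        for content in contents:
            seen.setdefault(unique_key(content), content)
        return list(seen.values())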
Looks good with one easily solved caveat inline.
Use sql/bin/db-upgrade to generate the 103-104 sql migration script
Mar 28 2017
Just a couple of comments:
- as discussed on IRC, having the object storage fully streaming is a goal per se, no matter what the Vault needs. If the Vault needs it, its priority is just higher; but the goal remains nonetheless (please file this as a separate task, so that we can collect knowledge and TODO items about it in a dedicated space).
- I might be wrong, but it seems to me that an underlying assumption of Option 2 above is that we will not cache cooked objects. That's wrong. The Vault is, conceptually, a cache and should remain so. The reason is that we expect Vault usage to be really "spike-y". Most of the content we archive will never be requested, because it will remain available at its original hosting place most of the time. But when something disappears from there, especially if it is some "famous" content, we will have people looking for it in Software Heritage, possibly many people at the same time. To cater for those use cases we need to be sure we do the cooking only once, and serve the result multiple times afterwards at essentially zero cost (see the sketch below). Then, of course, the cache policy and how aggressive we will be about deletion is totally up for discussion and will need some data points (which we don't have yet) for tuning.
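A minimal sketch of that cook-once / serve-many behaviour, assuming a hypothetical cache interface and cook_bundle function (not the actual Vault API):

    # Hypothetical sketch of the Vault-as-cache behaviour described above:
    # a bundle is cooked at most once, then served from the cache at
    # essentially zero cost until the cache policy evicts it.
    def get_bundle(cache, cook_bundle, obj_type, obj_id):
        key = (obj_type, obj_id)
        bundle = cache.get(key)
        if bundle is None:
            bundle = cook_bundle(obj_type, obj_id)  # expensive, done once
            cache.set(key, bundle)                  # eviction policy is separate
        return bundle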
Ok, dropping the unique indexes and creating a simple index then.
Improve test on skipped_content_add
Mar 27 2017
- swh.storage: Use aggregate key to filter on missing skipped contents
- swh.storage: Extract key variable for insertion
- swh.storage: Use upsert scheme on (skipped_)content_add function
- Revert "swh.storage: Use upsert scheme on (skipped_)content_add function"
- db version 104: Update schema properly
There is some confusion in our current schema about what the unicity expectations are. This diff adds some on top, so we should clear them up before moving any further.
I agree that metadata exports need to keep meaningful intrinsic identifiers as well.
Fix sql formatting which was off
Mar 25 2017
Mar 24 2017
Mar 23 2017
From IRC with permission.
16:46:39 seirl ╡ olasd: i don't remember if you had an opinion on that too
16:49:32 olasd ╡ my opinion is that we should try very hard to avoid doing long-running stuff without checkpoints
16:50:59       ╡ I don't think it's reasonable to expect a connection to stay open for hours
16:53:37       ╡ this disqualifies any client on an unreliable connection, which is maybe half the world?
16:54:14 seirl ╡ okay, i'm not excluding trying to find a way to "resume" the download
16:55:06       ╡ that way we can just store the state of the cookers, which is pretty small
16:55:29       ╡ also, people on unstable connections tend to not want to download 52GB files
16:55:38 olasd ╡ except when they do
16:55:45 nicolas17 ╡ o/
16:56:03       ╡ I don't mind downloading 52GB
16:56:32 olasd ╡ swh doesn't intend to serve {people on stable, fast connections}, it intends to serve people
16:56:41 nicolas17 ╡ but if it's bigger than 100MB and you don't support resuming then I hate you
16:56:44 seirl ╡ okay there's a misunderstanding by what I meant by that
16:56:53       ╡ assuming we DO implement checkpoints
16:57:04       ╡ (and resuming)
16:57:15       ╡ people with unstable connections are usually people with slow download speeds
16:57:47       ╡ so they won't be impacted a lot by the fact that streaming the response while it's being cooked has a lower throughtput
16:57:55 olasd ╡ I still don't think streaming is a reasonable default
16:58:09 seirl ╡ okay
16:59:13 olasd ╡ however, I think making the objstorage support chunking is a reasonable goal
16:59:34       ╡ even if it's restricted to the local api for now
16:59:55 seirl ╡ oh, i hadn't thought of chunking the bundles
17:00:02 nicolas17 ╡ if I start downloading and you stream the response, and the connection drops, what happens? will it keep processing and storing the result in the server, or will it abort?
17:00:29 seirl ╡ nicolas17: i was thinking about storing the state of the processing (which is small) somewhere
17:00:34       ╡ in maybe an LRU cache
17:00:48       ╡ if the user reconnects, the state is restored and the processing can continue
17:01:15 nicolas17 ╡ would this be a plain HTTP download from the user's viewpoint?
17:01:21 seirl ╡ yeah
17:01:27 nicolas17 ╡ would the state be restored such that the file being produced is bitwise identical?
17:01:33 seirl ╡ that's the idea
17:01:45       ╡ we can deduce which state to retrieve from the Range: header
17:02:06 nicolas17 ╡ great then
17:03:04 olasd ╡ nicolas17: mind if I paste this conversation to the forge ?
17:03:15 nicolas17 ╡ go ahead
17:03:17     * ╡ olasd is lazy
17:03:23 seirl ╡ that said i perfectly understand that wanting the retrieval to be fast and simple for the users is an important goal, if we're not concerned about the storage and we can easily do chunking that might be a good way to go
17:03:42 nicolas17 ╡ the bitwise-identical thing is important or HTTP-level resuming would cause a corrupted mess :P
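To make the resuming idea above concrete, here is a rough sketch; the state store, the cook_from generator and the bundle_id attribute are hypothetical names, not an existing implementation. The byte offset from the client's Range header selects the checkpoint to restart from, and cooking has to be deterministic so that the bytes served after a reconnect are bitwise identical.

    # Hypothetical sketch: resume a streamed bundle download from an HTTP
    # Range header by restoring the cooker's (small) saved state.
    def resume_download(request, state_store, cook_from):
        # "Range: bytes=12345-" -> resume from byte offset 12345
        range_header = request.headers.get('Range', 'bytes=0-')
        offset = int(range_header.split('=')[1].split('-')[0])

        # Restore the checkpointed cooker state closest to that offset
        # (e.g. kept in an LRU cache keyed by the bundle being cooked).
        state = state_store.closest_checkpoint(request.bundle_id, offset)

        # Re-stream from there; cooking must be deterministic so the
        # output is bitwise identical across reconnections.
        for chunk in cook_from(state, offset):
            yield chunk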
Currently the cookers store their bundles in an objstorage. The current design of the objstorage requires having the whole object in RAM, and it would take significant changes to be able to "stream" big objects to the objstorage. This is a big problem for cooking requests on big repositories.
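For illustration, a chunked add to the objstorage could look roughly like the sketch below; the add_stream name, the chunk size and the chunk-id scheme are assumptions, not the current swh.objstorage API.

    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024  # illustrative chunk size

    def add_stream(objstorage, fileobj):
        """Store a large object as a sequence of chunks instead of
        loading it whole in RAM (hypothetical sketch)."""
        chunk_ids = []
        while True:
            chunk = fileobj.read(CHUNK_SIZE)
            if not chunk:
                break
            chunk_id = hashlib.sha1(chunk).hexdigest()
            objstorage.add(chunk, obj_id=chunk_id)  # assumed add(content, obj_id) call
            chunk_ids.append(chunk_id)
        return chunk_ids  # manifest of chunks standing for the big object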
Mar 6 2017
Ack on the principle. But noting down a caveat for use case (3).
Mar 3 2017
There are limits to the current implementation as per the todo description in the code (swh.storage.content_update).
Mar 2 2017
Rebased to master
Fixup commit for some bad SQL comments
- storage.content_update: Simplify sql update implementation
- storage.content_update: Move altering schema tests in their own class
I didn't dive deep into the docstring, but at least I gave a stab at the SQL query.
Feb 24 2017
I have a working POC for this, which uses swh-journal as a basis.
Feb 22 2017
My current thinking on the general topic of "what are origins for distributions/package manager environments" is that in those contexts an origin should be a pair <distributor, package>. So, for instance, <pypi, django>, or <debian, ocaml>.
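A tiny sketch of that <distributor, package> shape, purely to illustrate the data model (not an existing swh type):

    from collections import namedtuple

    # Hypothetical representation of an origin for package-manager contexts.
    PackageOrigin = namedtuple('PackageOrigin', ['distributor', 'package'])

    examples = [
        PackageOrigin('pypi', 'django'),
        PackageOrigin('debian', 'ocaml'),
    ]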
There exist some duplicated origins (well, at least regarding the loader-tar's origins):
softwareheritage=> select * from origin where type='ftp' limit 10;
   id    | type |              url              | lister | project
---------+------+-------------------------------+--------+---------
 4423668 | ftp  | rsync://ftp.gnu.org/gnu/3dldf |        |
 4423671 | ftp  | rsync://ftp.gnu.org/gnu/3dldf |        |
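A grouping query along these lines surfaces such duplicates; the sketch below uses psycopg2, and the connection string is an assumption:

    import psycopg2

    # Hypothetical sketch: list origins sharing the same (type, url),
    # which should arguably be unique.
    query = """
        SELECT type, url, array_agg(id) AS ids, count(*) AS n
        FROM origin
        GROUP BY type, url
        HAVING count(*) > 1
        ORDER BY n DESC
    """

    with psycopg2.connect('service=softwareheritage') as db:
        with db.cursor() as cur:
            cur.execute(query)
            for type_, url, ids, n in cur:
                print(n, type_, url, ids)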
Feb 18 2017
softwareheritage=> select distinct(perms) from directory_entry_file;
 perms
--------
  33200
  33248
  33276
  33261
 100644
  16877
 120000
  33152
  32768
  33188
  33216
  33225
  33060
      0
  40960
 295332
  33268
 100755
  33184
  33252
  33272
  33279
  33196
  33277
  33189
  16888
  16895
  33260
  33256
  33204
  16893
  41471
  16832
  33206
(34 rows)
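Converting those decimal values to octal makes them easier to read: some map to the usual git tree modes (e.g. 33188 is 0o100644, 16877 is 0o40755, 40960 is 0o120000), while entries such as 100644 or 120000 look like octal modes that were stored as decimal numbers. A quick check, using values taken from the output above:

    # Values copied from the query output above; oct() shows which ones are
    # plausible file modes and which look like octal modes stored as decimal.
    for perms in (33188, 33261, 16877, 40960, 100644, 120000, 0, 295332):
        print(perms, oct(perms))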
....
Feb 17 2017
Feb 15 2017
After all those fixes, we were down to 8911 releases with improper checksums, all of them synthetic (from swh-loader-tar).
Their checksums were computed with a wrong algorithm (appending a newline to the stored message, and treating an integral timestamp as a floating point value), and have now been fixed.
All releases should now have a proper identifier.
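To illustrate why those two mistakes changed the identifiers: appending a newline to the message, or rendering an integral timestamp as a float, changes the serialized bytes and therefore the hash. A toy example with a made-up serialization (not the actual release manifest format):

    import hashlib

    def toy_id(message, timestamp):
        # Made-up serialization, only to show that the bytes (and thus the
        # hash) differ; the real release manifest format is different.
        payload = b'message %b\ntimestamp %b' % (message, str(timestamp).encode())
        return hashlib.sha1(payload).hexdigest()

    print(toy_id(b'Release 1.0', 1487000000))      # correct form
    print(toy_id(b'Release 1.0\n', 1487000000))    # extra newline -> different id
    print(toy_id(b'Release 1.0', 1487000000.0))    # float timestamp -> different id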
I've looked at the 31k releases with improper checksums.
Feb 14 2017
This issue has been solved and the fix deployed everywhere.
I just actually stopped the SVN loaders :)