Sure.
Jun 15 2021
Perfect. Thanks for the support!
It recently finished:
Jun 15 13:10:29 worker02 python3[2461343]: [2021-06-15 13:10:29,513: INFO/ForkPoolWorker-1] Task swh.loader.git.tasks.UpdateGitRepository[4b1cdb75-952f-4949-b95f-67259c5bfb62] succeeded in 22746.730732845142s: {'status': 'eventful'}
Jun 14 2021
Issue has been resolved and tarballs hosted on the Internet Archive can now be properly loaded in production (see example).
Jun 11 2021
Great, it seems we are getting there :-)
Fix implemented in D5859 works \o/
It is possibly also the only reasonable place where we can have heuristics to
de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repos URLs.
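As a rough illustration of such a heuristic (a sketch only; the helper name and the normalization rules are assumptions, not actual swh code), non-canonical GitHub URLs could be folded into one canonical form like this:

```python
from urllib.parse import urlsplit


def canonical_github_url(url: str) -> str:
    """Sketch of de-duplicating non-canonical GitHub repo URLs:
    http scheme, www. prefix, trailing slash, .git suffix and mixed
    case all map to one canonical https URL. Non-GitHub URLs are
    left untouched."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != "github.com":
        return url  # not a GitHub URL, leave as-is
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # GitHub redirects case-insensitively, so lowercasing is safe here
    return f"https://github.com{path.lower()}"
```

For example, `canonical_github_url("http://www.GitHub.com/Org/Repo.git/")` and `canonical_github_url("https://github.com/org/repo")` would then point at the same origin.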
I just figured out that the data we are missing (Content-Length, Last-Modified) from tarballs archived by the Internet Archive is in fact available in the x-archive-orig-* HTTP response headers.
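For instance (a minimal sketch; the helper is hypothetical and the header values below are illustrative, not taken from a real response), those original headers can be filtered out of a response's headers like this:

```python
def original_metadata(headers: dict) -> dict:
    """Extract the original tarball metadata that the Internet Archive
    preserves in x-archive-orig-* response headers, stripping the
    prefix and lowercasing the header names."""
    prefix = "x-archive-orig-"
    return {
        name[len(prefix):].lower(): value
        for name, value in headers.items()
        if name.lower().startswith(prefix)
    }


# Illustrative headers, not copied from a real archive.org response:
headers = {
    "Content-Type": "application/x-tar",
    "X-Archive-Orig-Content-Length": "1024",
    "X-Archive-Orig-Last-Modified": "Tue, 15 Jun 2021 13:10:29 GMT",
}
```

With these headers, `original_metadata(headers)` recovers the missing `content-length` and `last-modified` values while ignoring the unrelated `Content-Type`.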
This task led to an internal improvement, so thanks for the heads up.
Deployed first yesterday on staging (manually) to check the behavior.
Everything was fine.
In T3365#66091, @anlambert wrote:
That's great news!
Jun 10 2021
LGTM
still not a big fan of the usage of random in the tests ;), but otherwise, it matches what you explained to me this morning
Jun 8 2021
I guess we could avoid walking subtrees in that case and just take the most recent
date on the first level of an archive's content.
Then, if that is not present, we could fall back to walking the whole directory tree to determine the most
recent date (but I have no idea how long that could take for big archives)
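A sketch of that heuristic over a tarball, assuming file mtimes are meaningful (the function name is hypothetical, not part of the loader):

```python
import tarfile
from datetime import datetime, timezone
from typing import Optional


def guess_archive_date(path: str) -> Optional[datetime]:
    """Heuristic sketch: take the most recent mtime among the
    top-level entries of a tarball; if there are none, fall back to
    walking every member (potentially slow for big archives)."""
    with tarfile.open(path) as tar:
        members = tar.getmembers()
    # top-level entries have no "/" in their (normalized) name
    top_level = [m for m in members if "/" not in m.name.strip("/")]
    pool = top_level or members  # fall back to the whole tree
    if not pool:
        return None
    return datetime.fromtimestamp(max(m.mtime for m in pool), tz=timezone.utc)
```

The fallback only triggers for archives whose top level is empty, so the common case stays cheap.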
Yes, that sounds like something achievable from afar.
@ardumont, maybe we could use the timestamps of the files extracted from an archive to compute the author/committer date when it is not available from the HEAD response?
What could be a decent heuristic to handle this kind of situation?
Jun 7 2021
Thanks @ardumont for investigating this. The fact that the IA does not provide the Last-Modified information may make sense in their specific case (they may not have kept the Last-Modified info from the original location).
I did not realize immediately that the archive urls used for the ingestion (submitted by
the users) are the ones from the internet archive! Neat trick!
ah. It's an edge case from the start, nice!
It's deployed now.
fwiw, the urls mentioned are now browsable.
Thanks for the explanation. The strange thing here is that only the repositories I requested to be archived are not yet browsable. Other requests (even much newer ones) are browsable for me.
If you give it a bit of time, that should eventually be browsable.
Internally, the webapp uses a replicated database.
Ok. Thanks a lot. In the dashboard most entries now are marked as "succeeded". However, I'm not able to access the recently archived repositories
Thanks for the bump! It was stuck and now it got unstuck.
Jun 2 2021
- The fix was deployed on webapp1 and moma
- The refresh script was manually launched:
root@webapp1:~# /usr/local/bin/refresh-savecodenow-statuses
Successfully updated 140 save request(s).
The previous requests were correctly refreshed and are now displaying the right status.
Will be deployed with version v0.0.310 of the webapp (build in progress)
May 28 2021
Now what's missing here (not sure how hard it is) is the mean and max ingestion time
of save code now requests (the time between a request being accepted and its loader
task completing)
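Such a metric could be computed along these lines (a sketch only; the field names `accepted` and `task_ended` are made up for illustration, not the actual request model):

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def ingestion_times(
    requests: Iterable[dict],
) -> Tuple[Optional[timedelta], Optional[timedelta]]:
    """Mean and max ingestion time of save code now requests, i.e. the
    delay between a request being accepted and its loader task
    completing. Requests whose task has not finished are skipped."""
    deltas = [
        r["task_ended"] - r["accepted"]
        for r in requests
        if r.get("task_ended") is not None
    ]
    if not deltas:
        return None, None
    # sum() with a timedelta start value keeps the arithmetic exact
    return sum(deltas, timedelta()) / len(deltas), max(deltas)
```

Two requests taking 10 and 30 minutes would yield a mean of 20 minutes and a max of 30 minutes, with still-pending requests left out of both figures.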
May 27 2021
Improve the current way of fetching the save code now requests to update. Those
requests, even once updated with their latest status (which is derived from the
scheduler), will keep being fetched for update until the information retrieved from
the main archive matches as well ("replication lag subsides").
Your diff sounds good enough.
May 26 2021
Or even make it the archive loader's default behavior (append previously seen branches
from the earlier snapshot/visit). As discussed, I'm wondering whether an archive loader
(gnu or cran [1]) would not benefit from always displaying previously seen branches
(whether they are still present in the currently listed/visited origin or not [2]).
This could be implemented by adding a new option to the loader.
May 10 2021
May 8 2021
May 3 2021
Apr 29 2021
Deployment is done.
Deployment in progress.
Apr 28 2021
Needs deployment now.
Apr 27 2021
I'll keep it open until the docker env is ok as well (see the diff D5615).
Well, try out the following in your browser (console or devtools):
Deployed (production/staging)
Apr 26 2021
Apr 23 2021
Closing this now as the gist of this is done and the remaining fix is to be dealt with in another task [1].