Software Heritage

Reschedule googlecode svn origins from scratch
Closed, Resolved · Public

Description

As T847/T876 revealed, the bug fixed in loader-core's misbehaving flushing step could have resulted in missing data.
Those tasks only revealed missing occurrence targets, though.

It is unfortunately possible that other objects are missing as well (contents, directories, etc.).

Since we have fixed quite a few bugs in loader-svn anyway, and even though the affected origins should already have been rescheduled, it seems more reasonable to reschedule all origins to make sure.

At worst, it won't do anything.
At best, it will:

  • fill in the missing data;
  • give a proper listing of origins with external ids;
  • let us ascertain that no bugged origins remain.

Note:
Only loader-svn should be impacted by this missing data: loader-svn was historically the first loader and the one using the flushing mechanism, and it is only recently that all loaders derive from it.

Event Timeline

ardumont updated the task description. Dec 11 2017, 11:01 AM
ardumont changed the task status from Open to Work in Progress.
ardumont raised the priority of this task from Normal to High. Dec 11 2017, 11:03 AM

Scheduled back from saatchi (as I needed the producer credentials to access the queue properties):

$ cat /srv/storage/space/mirrors/code.google.com/sources/INDEX-svn-dumps.reverse-sorted-by-size.txt | tail -n +2 | ./schedule_with_queue_length_check.py --queue-name svndump --threshold 1000 --waiting-time 60 | tee scheduling-svn

So this will schedule up to 1000 tasks in the loader-svn queue every 60 seconds.
The state of what has been scheduled is in the scheduling-svn file.
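The `schedule_with_queue_length_check.py` script itself is not shown in the task, but its throttling logic presumably amounts to something like the following sketch (the function and parameter names here are illustrative assumptions, not the actual implementation):

```python
import time


def pending_room(queue_length, threshold):
    """How many more tasks can be scheduled without exceeding the
    queue threshold (0 when the queue is already full)."""
    return max(0, threshold - queue_length)


def schedule_all(lines, get_queue_length, schedule,
                 threshold=1000, waiting_time=60):
    """Drip-feed `lines` into the scheduler: top the queue up to
    `threshold` tasks, then wait `waiting_time` seconds before
    checking the queue length again. Returns the number scheduled."""
    i = 0
    while i < len(lines):
        room = pending_room(get_queue_length(), threshold)
        for line in lines[i:i + room]:
            schedule(line)
        i += min(room, len(lines) - i)
        if i < len(lines):
            time.sleep(waiting_time)
    return i
```

With `--threshold 1000 --waiting-time 60`, this corresponds to keeping at most 1000 pending tasks in the `svndump` queue and re-checking every 60 seconds.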

After discussion with the team, it was decided to exclude from the rescheduling the svn dumps whose compressed size exceeds 2 GiB.
This mirrors the decision taken for git repositories.

The list of compressed dumps whose size exceeds the 2 GiB threshold is stored on uffizi:
/srv/storage/space/mirrors/code.google.com/sources/INDEX-svn-dumps-with-size-superior-to-2gib.txt
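Assuming the index files carry the compressed dump size in their first column (as the sorting note in this task suggests; the exact file layout is not documented here), the split into "inferior" and "superior to 2 GiB" lists could be sketched as:

```python
GIB = 1024 ** 3  # 2 GiB is the threshold agreed on for the rescheduling


def split_index_by_size(index_lines, threshold=2 * GIB):
    """Split '<size> ...' index lines into (small, huge) lists,
    keyed on the compressed dump size in the first column."""
    small, huge = [], []
    for line in index_lines:
        size = int(line.split()[0])
        (small if size <= threshold else huge).append(line)
    return small, huge
```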

Recreated the scheduling input list (filtering out those huge dumps) and rescheduled using it:

tail -n +187239 /srv/storage/space/mirrors/code.google.com/sources/INDEX-svn-dumps-with-size-inferior-to-2gib.txt | awk '{print $3" "$2}' | ./schedule_with_queue_length_check.py --queue-name svndump --threshold 1000 --waiting-time 120 | gzip -c - >> scheduling-svn.txt.gz

Note:
Both the old and the new input files are sorted in the same ascending order on their first column, the compressed dump size:

  • old: /srv/storage/space/mirrors/code.google.com/sources/INDEX-svn-dumps.reverse-sorted-by-size.txt
  • new: /srv/storage/space/mirrors/code.google.com/sources/INDEX-svn-dumps-with-size-inferior-to-2gib.txt
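For reference, the `awk '{print $3" "$2}'` step in the command above reorders each index line into a two-column pair for the scheduling script (what columns 2 and 3 hold is not documented in this task); a Python equivalent:

```python
def to_scheduler_input(index_line):
    """Equivalent of awk '{print $3" "$2}': emit whitespace-separated
    columns 3 and 2, in that order."""
    cols = index_line.split()
    return f"{cols[2]} {cols[1]}"
```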
ardumont changed the status of subtask T896: Clean up wrong origins from Open to Work in Progress. Dec 14 2017, 12:06 PM
ardumont added a comment. Edited Feb 2 2018, 1:44 PM

This is in stand-by during the snapshot migration.

ardumont changed the status of subtask T947: googlecode import: Some dumps are just empty repository from Open to Work in Progress. Feb 5 2018, 1:45 PM
ardumont closed this task as Resolved. Sep 19 2018, 1:56 PM

That's been done for a while now.