Page MenuHomeSoftware Heritage

Make the archiver use content_archive.num_present

Authored by qcampos on Aug 5 2016, 4:27 PM.



content_archive.num_present act like a cache for the number of copies
where the content is present. This allow to make a much faster request
when retrieving the contents that have fewer copies than expected.

This diff works with the commit 9fdbadd0c48c4e76b0be2f9a32a09ef41a38d3d4,
and they follow a previous discussion in D86 about this optimization.

Diff Detail

rDSTO Storage manager
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

qcampos retitled this revision from to Make the archiver use content_archive.num_present.
qcampos updated this object.
qcampos edited the test plan for this revision. (Show Details)
qcampos added a reviewer: olasd.

Note that in the current deployment the column is not indexed and we should make sure it is before deploying this code.

zack requested changes to this revision.Aug 10 2016, 10:12 AM
zack added a reviewer: zack.
zack added a subscriber: zack.
zack added inline comments.
114 ↗(On Diff #312)

looks like this hunk shouldn't be here, it seem to be including a #backup_file# in the diff


there seems to be a couple of naming inconsistencies in this function:

  1. the plural, which is not used in other functions that yield multiple contents (e.g., get_unarchived_content_batch)
  2. the lack of trailing _id, whereas the function does return IDs and not actual content

so perhaps it should be renamed to _get_unarchived_content_id (?)

Note: I haven't checked the complete version of the file, so it might be in need of a more general swipe of naming uniformation.

  • "starting after" -> "starting at"
  • you should specify if previous_content is included or excluded when restarting from a previous point
  • given we're talking paging, you should specify in which order the contents are returned

More generally, I suggest to use a more consist terminology for pagination: the SQL LIMIT/OFFSET paradigm seems to be a good example to follow here. How about: limit=1000, offset=None (i.e., also dropping the trailing "_content" from the offset argument)?

55 ↗(On Diff #312)

this seems to be a left over


same comments about the function prototype as above

This revision now requires changes to proceed.Aug 10 2016, 10:12 AM
qcampos edited edge metadata.
qcampos marked 3 inline comments as done.

Perform some minor corrections.
There is still some work to do about the prototypes.

qcampos added inline comments.

Won't the term offset be misleading about the type of the variable ? As the current previous_content variable is a sha1, and offset may sound like an integer.

zack requested changes to this revision.Aug 12 2016, 10:56 AM
zack edited edge metadata.

Looks good, only a minor terminology change requested, then this could go in.


Good point.
Then, as a last change (pun, intended), I propose to rename "previous_content" to "last_content". It's more appropriate, and note how you describe the meaning of the "previous_content" argument invariable as "*last* content retrieved" :-)

This revision now requires changes to proceed.Aug 12 2016, 10:56 AM
qcampos edited edge metadata.
qcampos marked 2 inline comments as done.

Rename previous_content to last_content.

qcampos edited edge metadata.

Merge with origin

Correct the named argument to use last_content.

zack edited edge metadata.
This revision is now accepted and ready to land.Aug 16 2016, 10:31 AM
qcampos edited edge metadata.

Update the diff due to rebase on top of origin.

This revision now requires review to proceed.Aug 16 2016, 4:54 PM
olasd edited edge metadata.
This revision is now accepted and ready to land.Aug 16 2016, 4:58 PM
This revision was automatically updated to reflect the committed changes.