
Datasets exported from Spark are missing some rows
Closed, Wontfix, Public


I ran some queries on Spark to export tables I had already exported on Amazon, and I found many unexplained discrepancies. The data exported from Spark is systematically missing some rows compared to the corresponding dataset exported from Amazon.

1st example: exporting all the nodes in a single query that does a UNION of all the relevant tables yields the following counts on Spark:

4671443206 cnt
4422303776 dir
9907464 rel
1125083793 rev
57144153 snp

The counts are correct for every table except content, which is missing exactly 410820000 rows.

2nd example: exporting the "edges" by unnesting the directory layer:

Spark:  434459032
Amazon: 481829426
(47370394 missing -- 9.831%)

Spark:  45488558029
Amazon: 48316253919
(2827695890 missing -- 5.852%)

Spark:   91186016707
Amazon: 112363058067
(21177041360 missing -- 18.847%)
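
The missing-row figures can be rechecked from the Spark and Amazon totals: the difference between the two counts, expressed as a share of the Amazon count. The snippet below is only a minimal sketch of that arithmetic. The names `Discrepancy`, `missing_rows` and `EDGE_COUNTS` are made up for illustration and are not part of any export tooling, and it assumes the Amazon export is the reference.

```python
from typing import NamedTuple


class Discrepancy(NamedTuple):
    missing: int    # rows present in the Amazon export but absent from Spark
    percent: float  # missing rows as a percentage of the Amazon count


def missing_rows(spark_count: int, amazon_count: int) -> Discrepancy:
    """Compare two row counts, treating the Amazon export as the reference."""
    missing = amazon_count - spark_count
    return Discrepancy(missing, 100.0 * missing / amazon_count)


# (Spark, Amazon) row counts for the three edge exports listed above.
EDGE_COUNTS = [
    (434459032, 481829426),
    (45488558029, 48316253919),
    (91186016707, 112363058067),
]

for spark, amazon in EDGE_COUNTS:
    d = missing_rows(spark, amazon)
    print(f"Spark: {spark}  Amazon: {amazon}  "
          f"({d.missing} missing -- {d.percent:.3f}%)")
```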

Event Timeline

seirl triaged this task as Normal priority. Jun 11 2019, 11:52 PM
seirl created this task.

We no longer export edges from Spark

zack changed the task status from Resolved to Wontfix. Jun 3 2020, 4:20 PM