
license dataset: use a consistent file format for CSV-like files
Closed, Migrated

Description

Currently the dataset is very inconsistent in terms of separators and column headers:

$ for f in *.csv.zst ; do echo "* $f" ; zstdcat $f | head -n 1 ; done
* blobs-earliest.csv.zst
obj_swhid	earliest_swhid	earliest_ts	rev_occurrences
* blobs-fileinfo.csv.zst
sha1,mime_type,encoding,line_count,word_count,size
* blobs-nb-origins.csv.zst
swh:1:cnt:27b1c5c45ec61842dc7fb8ba1d47a4e0c29c06b2	1
* blobs-origins.csv.zst
swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15	https://github.com/0l0r1n/messagebird-apex
* blobs-scancode.csv.zst
sha1,license,score
* license-blobs.csv.zst
swhid,sha1,"name"

We should make things uniform.

I propose using the following format for all CSV-like files:

  • comma-separated, with opportunistic quoting only when needed
  • first line is a header with column names

This is the most common CSV dialect out there, and it can be imported out of the box, with no customization, into tools like Pandas, sqlite, and even spreadsheets.
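To make the proposed dialect concrete, here is a minimal sketch using Python's standard csv module. It assumes "opportunistic quoting" corresponds to csv.QUOTE_MINIMAL. The helper names write_rows and read_rows are made up for illustration, and the second data row (including its example.org URL) is hypothetical and not taken from the dataset.

```python
import csv
import io


def write_rows(header, rows):
    """Serialize rows as comma-separated text, header line first."""
    buf = io.StringIO()
    # QUOTE_MINIMAL only quotes fields that contain the delimiter,
    # the quote character, or a line break.
    writer = csv.writer(
        buf, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text):
    """Parse the text back, using the first line as column names."""
    return list(csv.DictReader(io.StringIO(text)))


sample = write_rows(
    ["swhid", "url"],
    [
        # Row taken from the blobs-origins.csv.zst excerpt above.
        [
            "swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15",
            "https://github.com/0l0r1n/messagebird-apex",
        ],
        # Hypothetical row: a URL containing a comma, to trigger quoting.
        [
            "swh:1:cnt:0000000000000000000000000000000000000000",
            "https://example.org/a,b",
        ],
    ],
)
print(sample)
```

In this sketch only the URL containing a comma ends up quoted, and the default settings of csv.DictReader read the result back without any dialect options.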

Once done, it will also mean that we can factor out the format description in the README, rather than duplicating it for each table and documenting the variants.

To get to that point here are the actionable "diffs" w.r.t. the current state:

  • blobs-earliest.csv.zst: switch from TAB-separated to comma-separated (quotes should not be needed anywhere)
  • blobs-fileinfo.csv.zst: (no changes needed)
  • blobs-nb-origins.csv.zst: switch to comma-separated (no quotes needed), add columns header: swhid,count
  • blobs-origins.csv.zst: switch to comma-separated (quotes possibly needed, due to URLs), add columns header: swhid,url
  • blobs-scancode.csv.zst: (no changes needed)
  • license-blobs.csv.zst: (no changes needed)
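
The TAB-to-comma conversions listed above could be sketched as follows, using only Python's standard library. This is an illustration, not the actual migration script: the function name tsv_to_csv and its header parameter are hypothetical, and it assumes the TAB-separated inputs contain no quoting or embedded TABs. It works on decompressed text streams, so zstd (de)compression is left to the surrounding pipeline.

```python
import csv
import io


def tsv_to_csv(src, dst, header=None):
    """Copy TAB-separated rows from src to dst as comma-separated rows.

    If header is given, it is written first (for inputs that lack a
    header line). Otherwise the input's own first line is passed
    through like any other row, so an existing header is preserved.
    """
    # Assumption: the TSV input is unquoted, so quote characters are
    # treated as ordinary data.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(
        dst, delimiter=",", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    if header is not None:
        writer.writerow(header)
    for row in reader:
        writer.writerow(row)


# Headerless input, as in blobs-nb-origins.csv.zst: add "swhid,count".
nb_src = io.StringIO("swh:1:cnt:27b1c5c45ec61842dc7fb8ba1d47a4e0c29c06b2\t1\n")
nb_dst = io.StringIO()
tsv_to_csv(nb_src, nb_dst, header=["swhid", "count"])

# Input with its own header, as in blobs-earliest.csv.zst.
earliest_src = io.StringIO("obj_swhid\tearliest_swhid\tearliest_ts\trev_occurrences\n")
earliest_dst = io.StringIO()
tsv_to_csv(earliest_src, earliest_dst)

print(nb_dst.getvalue(), end="")
print(earliest_dst.getvalue(), end="")
```

For real files the same function could be called with sys.stdin and sys.stdout, with zstdcat feeding the input and zstd recompressing the output.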