Currently the dataset is very inconsistent in terms of separators and column headers:
    $ for f in *.csv.zst ; do echo "* $f" ; zstdcat $f | head -n 1 ; done
    * blobs-earliest.csv.zst
    obj_swhid earliest_swhid earliest_ts rev_occurrences
    * blobs-fileinfo.csv.zst
    sha1,mime_type,encoding,line_count,word_count,size
    * blobs-nb-origins.csv.zst
    swh:1:cnt:27b1c5c45ec61842dc7fb8ba1d47a4e0c29c06b2 1
    * blobs-origins.csv.zst
    swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15 https://github.com/0l0r1n/messagebird-apex
    * blobs-scancode.csv.zst
    sha1,license,score
    * license-blobs.csv.zst
    swhid,sha1,"name"
We should make things uniform.
I propose using the following format for all CSV-like files:
- comma-separated, with opportunistic quoting only when needed
- first line is a header with column names
This is the most common CSV dialect out there, and it can be imported out of the box, with no customization, into tools like Pandas, sqlite, and even spreadsheets.
Once done, it will also mean that we can factor the format description out into the README, rather than duplicating it for each table and documenting the variants.
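As an illustration of the proposed dialect, here is a minimal sketch using Python's standard `csv` module. The helper name `write_rows` and the second data row (an all-zero SWHID with a URL containing a comma) are made up for the example; only the first row comes from the listing above. `csv.QUOTE_MINIMAL` quotes a field only when it contains the delimiter, a quote character, or a line break, which is one reasonable reading of "opportunistic quoting".

```python
import csv
import io


def write_rows(header, rows):
    """Serialize rows as comma-separated CSV with a header line.

    Hypothetical helper for illustration, not part of the dataset tooling.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


sample = write_rows(
    ["swhid", "url"],
    [
        # Row taken from the blobs-origins listing.
        ["swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15",
         "https://github.com/0l0r1n/messagebird-apex"],
        # Made-up row: the comma in the URL forces quoting.
        ["swh:1:cnt:0000000000000000000000000000000000000000",
         "https://example.org/a,b"],
    ],
)

# Reading back needs no dialect options: the defaults already match.
records = list(csv.DictReader(io.StringIO(sample)))
```

Only the made-up URL ends up quoted in `sample`; the real GitHub URL is written as-is, and `csv.DictReader` recovers both rows keyed by the header names.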
To get to that point here are the actionable "diffs" w.r.t. the current state:
- blobs-earliest.csv.zst: switch from TAB-separated to comma-separated (quotes should not be needed anywhere)
- blobs-fileinfo.csv.zst: (no changes needed)
- blobs-nb-origins.csv.zst: switch to comma-separated (no quotes needed), add column header: swhid,count
- blobs-origins.csv.zst: switch to comma-separated (quotes possibly needed, due to URLs), add column header: swhid,url
- blobs-scancode.csv.zst: (no changes needed)
- license-blobs.csv.zst: (no changes needed)
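
The changes listed above could be applied with something like the following sketch. It rests on a few assumptions: the function name `to_uniform_csv` is hypothetical, the two headerless files are taken to be TAB-separated like blobs-earliest (the listing does not confirm this), and decompression and recompression are left to `zstdcat` and `zstd` around the script, so the code only deals with text streams.

```python
import csv
import io


def to_uniform_csv(src, dst, sep="\t", header=None):
    """Copy `sep`-separated lines from src to dst as comma-separated CSV.

    If `header` is given it is written first (for files that lack one);
    otherwise the first input line is assumed to already be the header.
    Fields are quoted only when they need to be.
    """
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    for line in src:
        line = line.rstrip("\r\n")
        if line:  # skip blank lines
            writer.writerow(line.split(sep))


# blobs-earliest: already has a header, only the separator changes.
earliest_out = io.StringIO()
to_uniform_csv(
    io.StringIO("obj_swhid\tearliest_swhid\tearliest_ts\trev_occurrences\n"),
    earliest_out,
)

# blobs-nb-origins: add the swhid,count header (TAB input is an assumption).
nb_origins_out = io.StringIO()
to_uniform_csv(
    io.StringIO("swh:1:cnt:27b1c5c45ec61842dc7fb8ba1d47a4e0c29c06b2\t1\n"),
    nb_origins_out,
    header=["swhid", "count"],
)

# blobs-origins: add the swhid,url header (TAB input is an assumption).
origins_out = io.StringIO()
to_uniform_csv(
    io.StringIO(
        "swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15\t"
        "https://github.com/0l0r1n/messagebird-apex\n"
    ),
    origins_out,
    header=["swhid", "url"],
)
```

For the real files, `src` and `dst` would be `sys.stdin` and `sys.stdout`, with the script placed between `zstdcat` and `zstd` in a pipeline. Splitting on the separator by hand, rather than parsing the input with `csv.reader`, avoids misinterpreting any quote characters that might appear in the current unquoted data.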