license dataset: use a consistent file format for CSV-like files
Closed, Migrated

Authored By

	zack
	Nov 14 2022, 3:05 PM

Description

Currently the dataset is very inconsistent in terms of separators and column headers:

$ for f in *.csv.zst ; do echo "* $f" ; zstdcat $f | head -n 1 ; done
* blobs-earliest.csv.zst
obj_swhid	earliest_swhid	earliest_ts	rev_occurrences
* blobs-fileinfo.csv.zst
sha1,mime_type,encoding,line_count,word_count,size
* blobs-nb-origins.csv.zst
swh:1:cnt:27b1c5c45ec61842dc7fb8ba1d47a4e0c29c06b2	1
* blobs-origins.csv.zst
swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15	https://github.com/0l0r1n/messagebird-apex
* blobs-scancode.csv.zst
sha1,license,score
* license-blobs.csv.zst
swhid,sha1,"name"

We should make things uniform.

I propose to use for all CSV-like files the following format:

comma-separated, with opportunistic quoting only when needed
first line is a header with column names

This is the most common CSV dialect out there, and it can be imported out of the box, with no customization, in tools like Pandas, sqlite, and even spreadsheets.
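As a rough illustration of the proposed dialect, here is a minimal sketch using Python's standard `csv` module. It assumes that "opportunistic quoting" corresponds to `csv.QUOTE_MINIMAL`; the helper name `write_rows` and the second sample row (a made-up URL containing a comma) are hypothetical and only there to show when quotes appear.

```python
import csv
import io


def write_rows(header, rows):
    """Serialize rows as comma-separated text, quoting a field only when needed."""
    buf = io.StringIO()
    # QUOTE_MINIMAL adds quotes only around fields that contain the
    # delimiter, the quote character, or a line break.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)  # first line: column names
    writer.writerows(rows)
    return buf.getvalue()


sample = write_rows(
    ["swhid", "url"],
    [
        # Row taken from the blobs-origins.csv.zst excerpt in the description.
        ["swh:1:cnt:2d710544b1dd29f9f2394bd2d2638b64ae1c1f15",
         "https://github.com/0l0r1n/messagebird-apex"],
        # Hypothetical row: a URL with a comma, to trigger quoting.
        ["swh:1:cnt:0000000000000000000000000000000000000000",
         "https://example.org/a,b"],
    ],
)

# Reading back needs no dialect customization: the defaults already match.
parsed = list(csv.DictReader(io.StringIO(sample)))
```

Only the field that contains a comma ends up quoted; a plain URL is written as-is, which is why quotes are "possibly needed" for URLs but not for SWHIDs or counts.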

Once done, it will also mean that we can factor out the format description in the README, rather than duplicating it for each table and documenting the variants.

To get to that point here are the actionable "diffs" w.r.t. the current state:

blobs-earliest.csv.zst: switch from TAB-separated to comma-separated (quotes should not be needed anywhere)
blobs-fileinfo.csv.zst: (no changes needed)
blobs-nb-origins.csv.zst: switch to comma-separated (no quotes needed), add columns header: swhid,count
blobs-origins.csv.zst: switch to comma-separated (quotes possibly needed, due to URLs), add columns header: swhid,url
blobs-scancode.csv.zst: (no changes needed)
license-blobs.csv.zst: (no changes needed)
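The TAB-to-comma "diffs" above could be applied with something like the following sketch. It is not the conversion actually used for the dataset: the function name `tsv_to_csv` is hypothetical, it works on already-decompressed text streams (zstd handling is left out), and it assumes the TAB-separated inputs contain no quoting of their own.

```python
import csv
import io


def tsv_to_csv(src, dst, header=None):
    """Copy TAB-separated rows from src to dst as comma-separated CSV.

    If header is given, it is written first, for files that lack a header
    line. Returns the number of rows copied (the added header is not counted).
    """
    # Assumption: the TSV input has no quoting, so quote characters are
    # treated as ordinary data.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Example with the first line of blobs-nb-origins.csv.zst from the description.
src = io.StringIO("swh:1:cnt:27b1c5c45ec61842dc7fb8ba1d47a4e0c29c06b2\t1\n")
dst = io.StringIO()
n = tsv_to_csv(src, dst, header=["swhid", "count"])
result = dst.getvalue()
```

For blobs-earliest.csv.zst, which already has a header line, the same function would be called without `header`, so the existing first line is converted like any other row. For blobs-origins.csv.zst the header would be `["swhid", "url"]`, and any URL containing a comma would be quoted automatically.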

Related Objects

		Status	Assigned	Task
		Migrated	gitlab-migration	T4685 license dataset: add logic to convert/import dataset into a SQL database
		Migrated	gitlab-migration	T4683 license dataset: use a consistent file format for CSV-like files

Event Timeline

zack triaged this task as Low priority. Nov 14 2022, 3:05 PM

zack created this task.

zack added a project: Datasets. Nov 14 2022, 3:09 PM

zack mentioned this in T4685: license dataset: add logic to convert/import dataset into a SQL database. Nov 14 2022, 4:49 PM

zack added a parent task: T4685: license dataset: add logic to convert/import dataset into a SQL database.

blobs-fileinfo.csv.zst: (no changes needed)

It's the only one to use a sha1, so I'll replace it with a swhid

This task has been migrated to GitLab.
