Page MenuHomeSoftware Heritage

parse_persistent_identifier() should raise a parsing exception on invalid identifiers
Closed, MigratedEdits Locked

Description

In [1]: from swh.model.identifiers import parse_persistent_identifier
In [2]: parse_persistent_identifier('foo')
Out[2]: {'namespace': 'foo', 'scheme_version': {}}

Event Timeline

zack triaged this task as Normal priority.
zack created this task.
zack added a subscriber: ardumont.

@ardumont I'm tentatively assigning this to you as I think the code is yours, but feel free to reassign as needed!

@ardumont I'm tentatively assigning this to you as I think the code is yours, but feel free to reassign as needed!

Now, now, for the me, the code is the team's code. So it's "mine" because i'm part of the team.
But, indeed, I initially authored it.

Regarding that code, i simply added the checks to make parsing errors raise (in accord with the current task).


I recall some remarks about the persistent identifier representation being too simple or something.

I don't know what's wrong with that simple representation as:

  • the actual representation is also simple enough
  • we don't have type anyway
  • everyone can manipulate dict

Thanks for the enlightenment.

Note: I'm adding @anlambert in the loop as he is the one manipulating those the most.
He also helped improve the current implementation. So he might be interested by the discussion.

I recall some remarks about the persistent identifier representation being too simple or something.

I don't know what's wrong with that simple representation as:

  • everyone can manipulate dict

(a bit orthogonal to this specific task, but while we're at it)

here's what's wrong with dicts, in general:

  • you cannot type (in the sense of type checking) them more precisely than "it's a dict", and if you end up having dicts everywhere (which is our case) that's not very useful
  • they're prone to typo-based errors in their keys, both when accessing and when updating them
  • they're mutable, so when you pass them around you just have to pray client code will not modify them
  • accessing them requires hashing the key, which has a cost (often negligible, but still)

They're useful in a bunch of situations, like when the set of keys you use it's not close-ended (e.g., all the "metadata" situations we have in the archive).
But when the list of fields is fixed, which is the case for persistent identifiers, named tuples could be much better, and they address all the above issues.

Ok, thanks!

I suppose that by named tuples, you are talking about those [1] which i need to read further.
In the mean time, i created the T1112 about improving such representation.

[1] https://docs.python.org/3/library/collections.html#collections.namedtuple