Persistent Identifiers

The question of persistent identifiers arises in OSRR in section 2 of our RFP…

2.11) Contribute content

2.11.1) Registered users may contribute content to the site in two ways: by loading a file or by entering the content in a web form provided by the software.

2.11.2) Each item uploaded shall have the following information attached to it:

[ … many items deleted … ]

  • short URL: it shall be a name used to access content using a short and easy to remember URL. The system must control that these URLs are persistent and not repeated.

The first thing to note is that, although this requires that a “short URL” be present and be persistent, it does not require that users be able to define their own short URL. This is a system-provided short URL, not a user-defined one.

Short URL

What do we mean by short? Something with very little hierarchy and of typeable length. For example:

http://archive.feautor.org/id/XS235B21
http://archive.feautor.org/item/2983-2718-1239
http://archive.feautor.org/go/DJC1212441

Each of these is short enough to type easily, and each has only one level of hierarchy. Each consists of a persistent identifier (everything after the last slash) embedded in a short URL that can be maintained as long as the Feautor project owns its feautor.org domain and provides a resolution service behind the “id” or “item” or “go” directory.

Persistent

What do we mean by persistent? That’s a bit trickier than short. Essentially “persistent” means that the users should be able to resolve the URL not only today, but 10 years from now when Feautor is on a new platform.

To facilitate such persistence, a few principles are essential:

  • Longevity: Feautor must assign identifiers to each item that will not be changed over the course of the archive’s life.
  • Unique: The identifiers must be unique, yet not reference anything semantic (anything with “meaning”) from the item itself.
  • Independent: The identifiers must not be wed to the internals of the database system in which Feautor is implemented.
  • Indirect: Feautor must reference these identifiers with enough indirection that later on an additional layer of redirection could be applied if needed.

Longevity

We hope, and must plan, for the Feautor archive of resources to live for a very long time. In order for authors to reference their contributions with confidence and users to find those contributions for years to come, the identifiers for items must not change. In computing it is common to think of “long term” as “for the life of the system” or “for two to three years”. But we expect the “long term” for Feautor to be decades, not years. We must design the current system so that these identifiers can be moved to new systems in the future.

Unique

Each identifier must be unique. No identifier should ever be reused for a new item once it has been assigned, even if the item to which it was originally assigned is removed from the Feautor archive for some reason.

Feautor will have categories of contributions, channels, authors, groups, and many other meaningful constructs. It will be tempting to assign item URLs that reference these in some way. For example, some might think it cool to include the group name in some way in the item URL. This sort of meaningful component of the “short URL” would be a mistake. The “short URL” must have nothing that implies any sort of organization, grouping, or format.

One good way to get away from meaningful identifiers is to insist on numerical identifiers. Unfortunately numerical identifiers tend to be long, since we can only take advantage of ten characters. If we include letters in the identifier in order to facilitate a shorter URL, then we should take care that those letters can’t accidentally form a meaningful word. One strategy might be to never use more than two or three consecutive letters.
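The letter-run strategy above can be sketched in a few lines of Python. This is only an illustration, not a decided format: the function names, the eight-character length, and the limit of two consecutive letters are all assumptions chosen for the example.

```python
import secrets
import string

DIGITS = string.digits
LETTERS = string.ascii_uppercase
MAX_LETTER_RUN = 2  # assumption: allow at most two letters in a row


def random_identifier(length=8):
    """Build a mixed letter/digit identifier that cannot spell a long word."""
    chars = []
    letter_run = 0
    for _ in range(length):
        if letter_run >= MAX_LETTER_RUN:
            # The run limit is reached, so force a digit to break it up.
            ch = secrets.choice(DIGITS)
        else:
            ch = secrets.choice(DIGITS + LETTERS)
        letter_run = letter_run + 1 if ch in LETTERS else 0
        chars.append(ch)
    return "".join(chars)


def longest_letter_run(identifier):
    """Return the length of the longest run of consecutive letters."""
    longest = run = 0
    for ch in identifier:
        run = run + 1 if ch.isalpha() else 0
        longest = max(longest, run)
    return longest
```

By this measure the sample identifier XS235B21 has a longest letter run of two, while DJC1212441 has a run of three.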

We must avoid grouping or format references in the id since, over the “long term”, we expect that our notion of groups may change and that the formats in which material is accessed will be transformed. The identifier has one job only: it serves as a reference to the object.

Independent

It will be tempting to use a database sequence number as a system identifier. This is dangerous and should be avoided. The database system identifiers already have a job to do in keeping database operations efficient. They are often used as keys between related records, for instance. It is not very hard to come up with scenarios where a system efficiency demand would make a new system identifier a reasonable choice, and it would be a shame to design the system in such a way that such system choices could not be made.

It is better to treat the persistent identifier as a piece of metadata associated with the item. It can then move to new systems as needed over the years.
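As a rough sketch of what “a piece of metadata” could mean in practice, the example below keeps the database's own key and the persistent identifier in separate columns. It uses Python's built-in sqlite3 module purely for illustration; the table name, column names, and sample row are hypothetical and say nothing about how Feautor is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE item (
        row_id        INTEGER PRIMARY KEY,   -- internal key, free to change
        persistent_id TEXT NOT NULL UNIQUE,  -- public identifier, never changes
        title         TEXT NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO item (persistent_id, title) VALUES (?, ?)",
    ("XS235B21", "Example item"),
)


def find_by_persistent_id(conn, persistent_id):
    """Look an item up by its persistent identifier, not its row id."""
    return conn.execute(
        "SELECT row_id, title FROM item WHERE persistent_id = ?",
        (persistent_id,),
    ).fetchone()
```

Because nothing outside the database refers to row_id, a future system could renumber or replace that column entirely and carry the persistent_id values across unchanged.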

Indirect

While the current system may well have some way of directly responding to a query bearing the persistent identifier, it would be best if that query at least appears to go through an indirect path. To keep the URL short, a directory, whether actual or virtual, can serve this purpose. This gives us a place to hang a redirect order or some other form of resolution when systems change in the future.

For example, it might be simple to provide a URL like this from a given system…

http://www.feautor.org/web/archive.php/contribution/show/id/4262

Such a URL (even if the id were not a system id) has too many elements that are unique to today’s system to be resilient enough for this purpose. Things like references to specific scripts are especially to be avoided.

Alternatively…

http://www.feautor.org/4262

This is shorter and less wed to the particular system. However, using the “www” host and giving the identifiers the run of the “root” namespace will make this hard to maintain if we move to a new web host or need to manage more than one collection from a variety of systems in the future.

This is more like what we need…

http://archive.feautor.org/id/4262

Being at an alternate hostname gives us a lot of flexibility for doing other things with the main “www” host in the future. Using the “id” directory allows us to put a script in that directory that can, for now, redirect queries to the full URL in the first example above. Over the long term, that same script may redirect to another system or even direct traffic based on identifiers of different types.
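The script behind the “id” directory could be as small as the following Python sketch. It is an assumption-laden illustration: the lookup table, the function name, and the pairing of XS235B21 with system id 4262 are hypothetical, and a temporary (302) redirect is chosen here only because the target is expected to change when the platform does.

```python
from urllib.parse import urlsplit

# Where items live on today's system (the long URL from the first example).
CURRENT_TARGET = "http://www.feautor.org/web/archive.php/contribution/show/id/{system_id}"

# Hypothetical lookup from persistent identifier to today's system id.
SYSTEM_IDS = {"XS235B21": 4262}


def resolve(url):
    """Return (status, location) for a short URL of the form /id/<identifier>."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if (
        parts.hostname != "archive.feautor.org"
        or len(segments) != 2
        or segments[0] != "id"
    ):
        return 404, None
    system_id = SYSTEM_IDS.get(segments[1])
    if system_id is None:
        return 404, None
    # Only this line needs to change when the archive moves to a new platform.
    return 302, CURRENT_TARGET.format(system_id=system_id)
```

With this in place, resolve("http://archive.feautor.org/id/XS235B21") yields a redirect to the long URL on the current system, and the short URL itself never has to change.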

Construction of the Persistent Identifier

If we can’t use the system identifier as the persistent identifier, then what can we use? It could just be another unique and incrementing number for now. It could be a random number checked for uniqueness. It could be a date-related number, though that is dangerous if the date is too understandable to humans (maybe the Unix epoch date in seconds, plus some random number for uniqueness).
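The “random number checked for uniqueness” option might look something like the Python sketch below. The class name, the ten-digit default, and the in-memory set are assumptions for illustration; a real registry would need durable storage. Identifiers are never removed from the issued set, so one is not reused even if its item is later withdrawn.

```python
import secrets


class IdentifierRegistry:
    """Hands out random numeric identifiers and never repeats one."""

    def __init__(self):
        # Every identifier ever issued, including those of withdrawn items.
        self._issued = set()

    def mint(self, digits=10):
        """Draw random candidates until one is found that was never issued."""
        while True:
            candidate = "".join(secrets.choice("0123456789") for _ in range(digits))
            if candidate not in self._issued:
                self._issued.add(candidate)
                return candidate

    def was_issued(self, identifier):
        """True if the identifier has ever been handed out."""
        return identifier in self._issued
```

The retry loop is cheap as long as the identifier space is much larger than the number of items; with ten digits there are ten billion possible values.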

It would also be nice to add a check-digit to the persistent identifier so that we could later tell the difference between malformed ids and legal ids that (for some reason) we’ve lost track of. A simple check-digit can be a big help and is not provided by a system identifier.
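One well-known check-digit scheme for purely numeric identifiers is the Luhn algorithm, sketched below in Python. The text above does not name a scheme, so Luhn is offered only as one possibility, and the function names are made up for the example.

```python
def luhn_check_digit(payload):
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every second digit, starting with the
    # rightmost payload digit (the one that will sit next to the check digit).
    for position, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def add_check_digit(payload):
    """Append the check digit to form the full identifier."""
    return payload + luhn_check_digit(payload)


def is_well_formed(identifier):
    """True if the identifier is all digits and its last digit checks out."""
    if len(identifier) < 2 or not (identifier.isascii() and identifier.isdigit()):
        return False
    return luhn_check_digit(identifier[:-1]) == identifier[-1]
```

An identifier that fails is_well_formed was mistyped or corrupted; one that passes but cannot be found is a legal id that the archive has lost track of, which is exactly the distinction described above.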