This week I attended the International Repositories Infrastructure Workshop, which was sponsored by JISC, DRIVER and SURFfoundation. The goal of the workshop was to identify shared agendas for action and coordination between major national and international stakeholders, for the purpose of developing an international federated network of repositories.
Other blog posts about this event can be found here http://digitallibrarian.org/?p=44 and here http://digitalcuration.blogspot.com/2009/03/international-repositories.html . Tweets about the event can be found here http://twitter.com/search?q=#repinf09
In this post I will write about identifiers and the Identifier workshop I attended.
The Identifier workshop was chaired by Andrew Treloar (Australia, ANDS project), and he did a great job of bringing the group to consensus. First of all, we have to accept that many identifier systems already exist and that no one is planning to abandon their beautifully built identification mechanisms. However, when talking about reliable, interoperable infrastructures that serve scholarly communication workflows, we have to be able to communicate across the silos we have so beautifully crafted. In the workshop we came to the conclusion that what scholarly communication workflows need is not yet-another-identification-mechanism, but a meta service that builds bridges across these identifier mechanisms. A similarity/equivalence service is highly recommended in order to bring global scholarly communication workflows a step forward. Such a service tells you: “this thing from this identifier system is the same as that thing from the other identifier system” (without getting into any philosophical details).
In practice this means that, for example, a researcher who moves from one country to another to work at another research institute can be identified as the same person. It can then be said of this person that he/she has worked on these research projects that are registered in these separate systems, has published these scholarly works in these separate journals from these different publishers, has written these web log items, has contributed to these repositories and has produced these datasets.
For the action plan that is presented to the funders (a link will be provided as soon as the report is finished) we concentrated on four categories of identifiers that need serious uptake in order to support a global scholarly communication infrastructure. These categories are identifiers for “organisations”, “repositories”, “objects” and “people”. An equivalence service states whether two things within one identifier category are equivalent.
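To make the idea a bit more concrete, here is a minimal sketch of such an equivalence service. It is only my illustration, not something from the action plan: the class name, the method names and the identifier values are all made up, and I assume that equivalence is transitive and is only ever asserted within one category. It uses a small union-find structure per category.

```python
CATEGORIES = {"organisations", "repositories", "objects", "people"}


class EquivalenceService:
    """Toy in-memory equivalence registry; all names here are hypothetical."""

    def __init__(self):
        # One union-find forest per category: identifier -> parent identifier.
        self._parent = {category: {} for category in CATEGORIES}

    def _find(self, category, ident):
        """Return the representative of the group that `ident` belongs to."""
        if category not in self._parent:
            raise ValueError(f"unknown identifier category: {category!r}")
        parent = self._parent[category]
        parent.setdefault(ident, ident)
        while parent[ident] != ident:
            parent[ident] = parent[parent[ident]]  # path halving
            ident = parent[ident]
        return ident

    def assert_same(self, category, a, b):
        """Record that identifiers `a` and `b` denote the same thing."""
        root_a, root_b = self._find(category, a), self._find(category, b)
        if root_a != root_b:
            self._parent[category][root_a] = root_b

    def same(self, category, a, b):
        """True if `a` and `b` are known to be equivalent (transitively)."""
        return self._find(category, a) == self._find(category, b)


# Identifiers are (system, value) pairs; the values below are made up.
service = EquivalenceService()
service.assert_same("people", ("dai", "example-123"), ("isni", "example-456"))
service.assert_same("people", ("isni", "example-456"), ("scopus", "example-789"))
print(service.same("people", ("dai", "example-123"), ("scopus", "example-789")))  # True
```

Keeping one structure per category means a statement about a person can never accidentally make two objects or two organisations equivalent.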
[iframe http://prezi.com/17905/view/ 500 400]
More on the presentation:
- “Organisations” can have identifiers; we considered the DNS registry as a starting point. More identifier systems might exist, and we use the equivalence service to bind the organisation identifiers together. Organisations might emerge, dissolve, split and merge with one another, and the equivalence service must take that into account.
- “Repositories” can also have identifiers; we considered using ROAR or DOAR as a starting-point registry. However, for complete coverage of the scholarly communication workflow we must build a registry that does not only contain Open Access repositories (as ROAR and DOAR do). Furthermore, this registry should run on self-populating and automatically de-populating mechanisms. And just like organisations, repositories might emerge, dissolve, split and merge with one another, and the equivalence service must take that into account.
- For “Objects” many registries also exist, like DOI, Handle, URN:NBN, ARK, etc. On the level of the bitstream (Manifestation level in FRBR terms) an MD5 hash match might do the trick to establish equivalence across identifier systems. E.g. this publication in that repository using Handles is the same as that publication in that repository using URN:NBN. This is phase 1 in the action plan. Phase 2 is to establish equivalence on the Expression level in FRBR terms. For textual publications this can be done with methods used in plagiarism detection software, where the statistical proximity is measured. E.g. the version of this publication at the publisher is the same as this author version in this repository. (A possible service could let the end-user choose which of these versions he/she would prefer to read / gain access to.)
- The “People” identifier was originally called the “Author” identifier, but we decided to make it more general and to treat the role as a property of a person. We did this because a researcher might take part in the research process (contributing data to a dataset using measurement equipment) without always writing something as an author. People cannot be merged or split, but they can have many identifiers when they participate in different systems, take on different roles and use different personas. For example, a researcher has a Thomson Researcher ID and writes under different personas depending on the journal he/she is writing for. He/she might also use different names because of marital naming conventions that differ between countries. In addition he/she has a Scopus ID, a CrossRef ID, a Dutch DAI, an ISNI (ISO Name ID), a LinkedIn account, an OpenID account with different personas, a campus login, a national federated login (SURFfederatie), an ID in several CRIS systems because he/she works part-time at different universities under different roles, etc. A meta-service must be built to ensure the global equivalence and non-equivalence of a person, so that systems know whether they are dealing with the same person. The service is self-populating: the person can say “this is also who I am” and “this is certainly not who I am”. Services like this already exist, like www.danyid.org, where a person can claim or tell the system which identities he/she has. Since Dandy ID is a popular web2.0 service for social networks, it is possible to add identity management services, which is commercially very interesting for these social networks.
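For the “Objects” category, the two phases could look roughly like the sketch below. Phase 1 groups identifiers whose bitstreams have the same MD5 checksum. For phase 2 I use Jaccard similarity over word shingles as a simple stand-in; real plagiarism detection software uses more refined measures, so take it as an assumption. The function names and the identifiers in the comments are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """MD5 checksum of a bitstream as a hex string."""
    return hashlib.md5(data).hexdigest()


def group_by_bitstream(holdings):
    """Phase 1: group identifiers that point at byte-identical files.

    `holdings` is an iterable of (identifier, bytes) pairs. Only groups
    with more than one identifier are returned.
    """
    groups = defaultdict(set)
    for identifier, data in holdings:
        groups[md5_hex(data)].add(identifier)
    return [ids for ids in groups.values() if len(ids) > 1]


def shingles(text: str, k: int = 3) -> set:
    """All runs of k consecutive lower-cased words in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Phase 2 stand-in: shared shingles divided by all shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

A publisher version and an author version would then count as the same Expression when `jaccard` is above some threshold; what that threshold should be is an open question that the workshop did not settle.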
Thoughts: The way I see it, the meta-identifier structure is a loosely coupled structure of globally distributed RDF stores, where each store tells a part of the story. For example, in one store the ID of a person’s login on Campus A is bound to the ISNI of that person. Another store contains the bindings between the ISNI and LinkedIn accounts. And yet another store holds the bindings between the Dutch DAI and the LinkedIn account. So one can produce a list of publications that are bound to a Dutch DAI using the login of Campus A. Is there any persistence? Well, if there are many stores making a lot of different bindings, the path can be re-routed; if not, we have a problem. In order to create a stable infrastructure, instruments like an LTP (long-term preservation) policy, contracts and service level agreements should be used… (the Knowledge Exchange project, see below, should provide a partial answer to that.)
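The routing idea can be sketched as a path search over the union of all stores. This is a simplification under my own assumptions: each store is just a list of identifier pairs instead of a real RDF store, every binding is treated as symmetric, and the identifier strings are invented.

```python
from collections import deque


def find_path(stores, start, goal):
    """Breadth-first search over the bindings held by all stores together.

    Returns the list of identifiers from `start` to `goal`, or None when
    the stores that are available do not connect them.
    """
    graph = {}
    for store in stores:
        for a, b in store:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Three stores, each telling a part of the story (made-up identifiers).
campus_store = [("campus-a:login-jdoe", "isni:example-1")]
isni_store = [("isni:example-1", "linkedin:example-jdoe")]
dai_store = [("dai:example-9", "linkedin:example-jdoe")]

print(find_path([campus_store, isni_store, dai_store],
                "campus-a:login-jdoe", "dai:example-9"))
# ['campus-a:login-jdoe', 'isni:example-1', 'linkedin:example-jdoe', 'dai:example-9']
```

If `isni_store` disappears, the same call returns `None`, which is exactly the persistence problem: the path can only be re-routed when some other store holds an alternative binding.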
Thoughts: What we left out of scope is the broader perspective in which an equivalence service is nothing more than a relationship service, where the relationship between two things is named. In this perspective the predicate “is equal to” is just one such relationship, stating that two things represent the same thing or concept. Made more generic, the service could contain many more predicates, like “is cited by”, “works at”, “is owned by”, etc. A generic approach could express semantic relationships not only within an identifier category, but between identifier categories as well. The people of the citation workshop might be interested in utilising this service, where “cited from” relationships can be stored across identifier systems on a meta-level in a globally interoperable fashion.
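In this generic view every statement is a subject, a named predicate and an object. A minimal sketch, with hypothetical class names and invented identifiers, in which “is equal to” is just one predicate among others:

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str


class RelationshipService:
    """Toy store of named relationships between identifiers."""

    def __init__(self):
        self._statements = set()

    def add(self, subject, predicate, obj):
        self._statements.add(Statement(subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return matching statements, sorted; None acts as a wildcard."""
        return sorted(
            s for s in self._statements
            if (subject is None or s.subject == subject)
            and (predicate is None or s.predicate == predicate)
            and (obj is None or s.obj == obj)
        )


relations = RelationshipService()
relations.add("dai:example-9", "is equal to", "isni:example-1")
relations.add("dai:example-9", "works at", "org:example-university")
relations.add("hdl:example/1", "is cited by", "doi:example/3")

for statement in relations.query(subject="dai:example-9"):
    print(statement.predicate, "->", statement.obj)
```

Note that the second statement links a person to an organisation, so it crosses identifier categories, which the pure equivalence service does not allow.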
Knowledge Exchange – URN:NBN based Persistent Identifier Infrastructure pilot
On Monday afternoon I gave a presentation about the Knowledge Exchange project, which promotes and implements a robust identifier and resolution infrastructure for permanent access to knowledge assets in science and cultural heritage, and one that is sustainable for the long term.
[iframe http://prezi.com/17406/view/ 500 400]
Yes, this is just yet another identifier mechanism, but the special thing about this project is that it is a joint effort of national players who already have URN:NBN mechanisms in place and want to team up. This project is not about technology, because that is already there. It is about policy making: how to create an infrastructure that is robust, has a sustainable organisation model and provides access to scientific and cultural heritage over a long period of time.
The outcome of the project is an LTP policy for global registration and resolution of URN:NBNs. The project group will define a set of roles and a set of responsibilities that must be fulfilled by these roles. This policy will most likely adopt something like www.datasealofapproval.org.
More about this project can be found here: www.surfgroepen.nl/sites/surfshare/public/pid/
Just some thoughts:
Although the project has not started yet, I can imagine that a policy rule could be: “If you want to use URN:NBN numbers to identify your knowledge assets, you must have a working LTP strategy in place”. In practice this means that if you have a repository in the Netherlands and you want to join the global URN:NBN identification and resolution network, you have to have all your URN:NBN-identified documents stored in the National Library LTP eDepot. The National Registration Agency is the only party that can distribute and register URN:NBN prefixes. Once the repository is registered, it has to provide an OAI-PMH feed that contains the URLs of the documents in the repository and their URN:NBN identifiers. The URL-URN bindings are stored in the national resolver and the files are copied to the eDepot. The URL of the eDepot file is bound to the URN in the resolver as a “backup” location.
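The resolver side of this could be sketched as follows. None of this is an existing resolver API: the class, the method names and the `.example` URLs are hypothetical, and I assume the OAI-PMH feed has already been harvested into plain (URN, URL) pairs, so no OAI-PMH parsing is shown.

```python
class NationalResolver:
    """Toy URN:NBN resolver with a repository location and an eDepot backup."""

    def __init__(self):
        # urn -> {"primary": repository URL, "backup": eDepot URL or None}
        self._bindings = {}

    def register(self, urn, repository_url):
        """Store the URL-URN binding taken from a repository feed."""
        if not urn.lower().startswith("urn:nbn:"):
            raise ValueError(f"not a URN:NBN: {urn!r}")
        entry = self._bindings.setdefault(urn, {"primary": None, "backup": None})
        entry["primary"] = repository_url

    def register_backup(self, urn, edepot_url):
        """Bind the eDepot copy to an already registered URN."""
        self._bindings[urn]["backup"] = edepot_url

    def resolve(self, urn, is_reachable=lambda url: True):
        """Return the repository URL, or the backup when it is unreachable."""
        entry = self._bindings[urn]
        for url in (entry["primary"], entry["backup"]):
            if url is not None and is_reachable(url):
                return url
        raise LookupError(f"no reachable location for {urn}")


def ingest_feed(resolver, records, edepot_base):
    """Register harvested (urn, url) pairs and their assumed eDepot copies."""
    for urn, url in records:
        resolver.register(urn, url)
        resolver.register_backup(urn, f"{edepot_base}/{urn}")


resolver = NationalResolver()
ingest_feed(resolver,
            [("urn:nbn:nl:example-1", "https://repository.example/doc/1")],
            "https://edepot.example")
print(resolver.resolve("urn:nbn:nl:example-1"))
# https://repository.example/doc/1
```

The `is_reachable` argument is only there to show the fallback; a real resolver would need its own way of deciding when the repository location no longer works.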
When I was talking to Jonathan Rees (Science Commons), the idea popped into my mind that this mechanism can be extended to implement the LOCKSS principle (Lots of copies keep stuff safe). The mechanism that copies files to the eDepot can also be used to copy the files to other Dutch repositories. One URN identifier then maps to many URLs of the mirror copies at other repository locations. This too can be a mandatory policy rule that in the end needs to be enforced in a technical manner, and it is already quite possible to build this in the Netherlands.
The only catch is that the Dutch eDepot has an LTP strategy that follows two LTP methods: 1. strip the text to simple ASCII text, and 2. keep migrating the files to the most current version of the format. This is expensive, and the repositories have too small a budget to use similar methods. So after this thought experiment, the LOCKSS principle on its own is not a very safe way to guarantee readability over a long period of time. (bing!) Except when the eDepot synchronizes the repositories by feeding back the most current version of the data format. The advantages are that 1. the end-user can always read the most current version of the data format, and 2. the end-user does not have to use the slow, tape-based eDepot access to read the most current data format, but can use the high-performing repository systems to gain faster access to the most current version. All thanks to the URN:NBN resolver that redirects the user to the most appropriate *default* location.
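How the resolver might pick that default location can be sketched like this. The rule is my own assumption, not an agreed policy: prefer the copy with the most current format version, and on a tie prefer a repository over the slower eDepot. The `Location` fields and the URLs are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Location:
    url: str
    format_version: int      # higher means a more current data format
    is_edepot: bool = False  # the eDepot is assumed to be the slow copy


def default_location(locations):
    """Pick the most current copy, preferring repositories on a tie."""
    if not locations:
        raise ValueError("no locations registered for this URN")
    return max(locations, key=lambda loc: (loc.format_version, not loc.is_edepot))


def synchronise(locations):
    """Model the eDepot feeding its most current format back to all mirrors."""
    newest = max(loc.format_version for loc in locations)
    return [replace(loc, format_version=newest) for loc in locations]


copies = [
    Location("https://edepot.example/doc", format_version=3, is_edepot=True),
    Location("https://repo-a.example/doc", format_version=1),
    Location("https://repo-b.example/doc", format_version=1),
]

print(default_location(copies).url)               # https://edepot.example/doc
print(default_location(synchronise(copies)).url)  # https://repo-a.example/doc
```

Before synchronisation only the eDepot holds a readable current version; afterwards the fast repositories win, which is the advantage described above.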
And just a side note: a presentation by Juha Hakala about the 7 levels of identification:
Libraries, Collections, Authors, Works, Manifestations, Components, Queries. More on http://pid.ndk.cz/dokumenty/zakladni-literatura/Persistent_identifiers_elag2005.ppt