PACE — Ping-back for Academic Citation Enhancements

16 June 2011

Connecting datasets and publications automatically
Wouldn’t it be great if a scientific data archive knew which publications made use of its data sets? Pingback mechanisms, as used in blog systems, send citation notifications automatically. Could the same apply to online journal systems, so that they notify each other and data archives about citations? It all comes down to agreeing on a really simple standard.

Map of science derived from clickstream data

MESUR.org: Related journals, based on citation clickstreams. From: Bollen J, Van de Sompel H, Hagberg A, Bettencourt L, Chute R, et al. (2009) Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE 4(3): e4803. doi:10.1371/journal.pone.0004803


This blog article gives a rough draft of how the Ping-back mechanism (http://en.wikipedia.org/wiki/Pingback) used in blogs can be applied to repositories, data archives and journal systems.

This idea is only meant to give an impression of how to tackle one of the problems in the scholarly community, where it is common practice to create bi-directional citation links in retrospect, which is very labor intensive. Journals like PLoS typically provide a link citing the dataset. Yet the data center has no link back to the article, because it is unaware of the citation. The reality is that journals have no mandate to send notifications, and notifying is labor intensive if not automated.

If a notification partnership between journals and data centers becomes reality, why not automate it?
The technology already exists: it was developed in the weblog community as the Ping-back mechanism. Blogs that cite one another automatically send notification messages, so that the cited blog can refer back to the originating blog.

The steps of how it works, and how it can work for journals and data centers, are explained below.

For the example we use the following ingredients:

  • The article A in journal system X has a DOI 10.x/a
  • The dataset B in data archive Y has a DOI 10.y/b
  • The journal system posts information about the article on its website, including citation information referring to the dataset.
  • Both the journal system and the data archive use Pingback mechanisms according to the specification.

The illustration on the right follows the steps described below.

An editor is finishing a publication in an online journal system. The editor mints the DOI for article A and fills in the metadata, including the citation information, where the DOI of dataset B is typed in.

At the moment article A is published on the web, the Ping-back mechanism kicks in. It essentially sends an XML-RPC Pingback notification to dataset B’s web address, http://dx.doi.org/10.y/b.
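As a rough illustration of that notification, the sketch below builds the XML-RPC request body with Python's standard library. The method name pingback.ping and its two parameters (source URI, target URI) come from the Pingback specification; the function names and the endpoint argument are assumptions made for this example, and a real client would first have to discover the archive's Pingback endpoint.

```python
import xmlrpc.client

ARTICLE_A = "http://dx.doi.org/10.x/a"  # source: the citing article
DATASET_B = "http://dx.doi.org/10.y/b"  # target: the cited dataset


def build_pingback_request(source_uri: str, target_uri: str) -> str:
    """Serialize a pingback.ping call with (sourceURI, targetURI) as parameters."""
    return xmlrpc.client.dumps((source_uri, target_uri), methodname="pingback.ping")


def send_pingback(endpoint: str, source_uri: str, target_uri: str) -> str:
    """Send the ping to an already discovered endpoint (hypothetical URL, not called here)."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.pingback.ping(source_uri, target_uri)


payload = build_pingback_request(ARTICLE_A, DATASET_B)
print(payload)
```

Only build_pingback_request runs here, so the sketch needs no network access; send_pingback shows where the actual call would go.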

Because the data archive where dataset B resides understands Pingback RPC requests, it automatically checks the HTML at the web address where article A resides on the journal system, http://dx.doi.org/10.x/a. The data archive expects to find in the HTML of the article a hyperlink to the dataset’s address, http://dx.doi.org/10.y/b.

This way the dataset automatically becomes aware of the articles in which it is cited, used or aggregated.
A fully installed mechanism works bi-directionally: when a dataset makes an assertion about an article, the article likewise becomes aware of that assertion.
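The check on the receiving side can be sketched in a few lines. This is a minimal example under stated assumptions: the sample HTML is made up for illustration, the names LinkCollector and source_links_to are hypothetical, and fetching the article page over HTTP is left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.add(value.strip())


def source_links_to(source_html: str, target_uri: str) -> bool:
    """Return True if the source page contains a hyperlink to the target."""
    collector = LinkCollector()
    collector.feed(source_html)
    return target_uri in collector.hrefs


# Made-up article page, standing in for the HTML served at http://dx.doi.org/10.x/a
SAMPLE_ARTICLE_HTML = """
<html><body>
  <h1>Article A</h1>
  <p>Data used: <a href="http://dx.doi.org/10.y/b">dataset B</a></p>
</body></html>
"""

print(source_links_to(SAMPLE_ARTICLE_HTML, "http://dx.doi.org/10.y/b"))
```

Only when this check succeeds would the data archive register the article as a citing publication.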

Advanced Pingback in RDF

Perhaps we would like to add more information about the nature of the link between the two locations. Ideally this is done in RDF, since in the Linked Data mindset it is RDF documents that link to each other, not HTML documents.
This leads to the following advanced Pingback check. It falls outside the Pingback specification and is therefore not a standard.

The 3TU data center already expresses its data set descriptions in RDF. Example: http://data.3tu.nl/repository/resource:location-49760597/object/ORE
The following could easily be done for them.

The example continues:
Metadata is given to article A, including the dataset it has used, by means of the RDF statement ore:aggregates in a ResourceMap for the article. This results in an RDF/N-Triples expression (subject, predicate, object): 10.x/a aggregates 10.y/b

<http://dx.doi.org/10.x/a>
<http://www.openarchives.org/ore/terms/aggregates>
<http://dx.doi.org/10.y/b>

Now, at the moment article A is published on the web, the Ping-back mechanism kicks in. It essentially sends an XML-RPC Pingback notification to dataset B’s web address, http://dx.doi.org/10.y/b.

Because the data archive where dataset B resides understands Pingback RPC requests, it automatically checks the RDF at the web address where article A resides on the journal system, http://dx.doi.org/10.x/a. The data archive expects to find in the RDF of the article something that looks like this triple:


<http://dx.doi.org/10.x/a>
<http://www.openarchives.org/ore/terms/aggregates>
<http://dx.doi.org/10.y/b>

If the data archive finds this triple with the data set id as the object and the ore:aggregates term as the predicate, it grabs the id of the subject of the triple and adds it to its own ResourceMap. That looks like:


<http://dx.doi.org/10.y/b>
<http://www.openarchives.org/ore/terms/isAggregatedBy>
<http://dx.doi.org/10.x/a>
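The RDF variant of the check can be sketched as follows. This is only an illustration, not part of any specification: it assumes the article's ResourceMap is available as N-Triples text containing plain URI triples, it uses a simple regular expression instead of a real RDF parser, and the function names are hypothetical.

```python
import re

ORE = "http://www.openarchives.org/ore/terms/"

# Matches N-Triples statements whose subject, predicate and object are all URIs.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")


def aggregating_subjects(ntriples: str, dataset_uri: str) -> list:
    """Return every subject that states <subject> ore:aggregates <dataset_uri>."""
    return [
        subject
        for subject, predicate, obj in TRIPLE.findall(ntriples)
        if predicate == ORE + "aggregates" and obj == dataset_uri
    ]


def inverse_statements(ntriples: str, dataset_uri: str) -> list:
    """Build the isAggregatedBy triples the data archive adds to its own ResourceMap."""
    return [
        f"<{dataset_uri}> <{ORE}isAggregatedBy> <{subject}> ."
        for subject in aggregating_subjects(ntriples, dataset_uri)
    ]


# The article's statement, as it would be found at http://dx.doi.org/10.x/a
ARTICLE_RDF = (
    "<http://dx.doi.org/10.x/a> "
    "<http://www.openarchives.org/ore/terms/aggregates> "
    "<http://dx.doi.org/10.y/b> .\n"
)

for line in inverse_statements(ARTICLE_RDF, "http://dx.doi.org/10.y/b"):
    print(line)
```

A production system would more likely use an RDF library, since ResourceMaps may also contain literals and blank nodes that this pattern ignores.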

More specific towards citation

The examples above use the Object Reuse and Exchange (ORE) ontology, which makes the relation between the two very generic. We use this standard because it is widely used. However, when we want to be more specific about the fact that this article cites the dataset, additional assertions can be made using the Semantic Publishing and Referencing (SPAR) ontologies. This leads to the following triple:


<http://dx.doi.org/10.x/a>
<http://purl.org/spar/cito/cites>
<http://dx.doi.org/10.y/b>

The RPC of the Ping-back mechanism can automatically create the inverse relation at the data center side:


<http://dx.doi.org/10.y/b>
<http://purl.org/spar/cito/isCitedBy>
<http://dx.doi.org/10.x/a>

Even more specific citation of a dataset

Even neater is to make an assertion specific to the fact that a data source is cited, where the predicate is a sub-property of “cites” in the ontology.


<http://dx.doi.org/10.x/a>
<http://purl.org/spar/cito/citesAsDataSource>
<http://dx.doi.org/10.y/b>

And the inverse:


<http://dx.doi.org/10.y/b>
<http://purl.org/spar/cito/isCitedAsDataSourceBy>
<http://dx.doi.org/10.x/a>
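The three predicate pairs used in this article can be kept in a small lookup table, so that the receiving side creates the matching inverse for whichever predicate it finds. A minimal sketch; the table only contains the pairs named above, and the names INVERSE_OF and invert are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
CITO = "http://purl.org/spar/cito/"

# Predicate found in the article -> predicate asserted at the data center side.
INVERSE_OF = {
    ORE + "aggregates": ORE + "isAggregatedBy",
    CITO + "cites": CITO + "isCitedBy",
    CITO + "citesAsDataSource": CITO + "isCitedAsDataSourceBy",
}


def invert(triple):
    """Return the inverse (subject, predicate, object) triple, or None if the predicate is unknown."""
    subject, predicate, obj = triple
    inverse = INVERSE_OF.get(predicate)
    if inverse is None:
        return None
    return (obj, inverse, subject)


article_statement = (
    "http://dx.doi.org/10.x/a",
    CITO + "citesAsDataSource",
    "http://dx.doi.org/10.y/b",
)
print(invert(article_statement))
```

Predicates that are not in the table yield None, so the data archive ignores links whose meaning it does not know.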

Standard for Pingback in RDF embedded in HTML

Now it all comes down to creating a simple standard.

For the sake of simplicity I presume that Microdata will be the standard to use for RDF integration in HTML5 publishing and authoring tools. I base this on the fact that Google, Bing and Yahoo have come up with Schema.org to set a standard vocabulary for enriching HTML. (I am not going into the discussion of whether this is a good thing, ruling out a whole hard-working community, etc.)
So when a publishing tool posts a page where citation is involved, this is what the Ping-back mechanism at the data center side should pick up and be able to process.

Below is an example that I have reused and changed a bit from http://www.schema.org/Article
The HTML text:

How to Tie a Reef Knot
by John Doe
This article is based on data set http://dx.doi.org/10.y/b

The enriched version

How to Tie a Reef Knot by This article is based on data set http://dx.doi.org/10.y/b

The Pingback-RDF mechanism at the data center side would first parse the RDF that is embedded in this HTML file. Next it would check whether the global identifier of the data set appears as being cited. If so, it can extract the metadata of the article and publish a ‘citedBy’ link on its own page.

I am curious what you think of it, so please leave some comments.
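Those three steps can be sketched with Python's standard HTML parser. Everything about the markup here is an assumption made for illustration: the sample page, the use of the Schema.org properties name, author and citation, and the names MicrodataProps and cited_by_link. The parser is deliberately simplified and does not handle nested items.

```python
from html.parser import HTMLParser


class MicrodataProps(HTMLParser):
    """Collect itemprop values: the href if present, otherwise the element text."""

    def __init__(self):
        super().__init__()
        self.props = {}    # itemprop name -> list of values
        self._open = None  # (tag, itemprop) whose text is being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if attrs.get("href"):
            self.props.setdefault(prop, []).append(attrs["href"])
        else:
            self._open = (tag, prop)
            self.props.setdefault(prop, []).append("")

    def handle_data(self, data):
        if self._open:
            self.props[self._open[1]][-1] += data

    def handle_endtag(self, tag):
        if self._open and self._open[0] == tag:
            self._open = None


def cited_by_link(article_html, article_uri, dataset_uri):
    """Return the metadata for a 'citedBy' link, or None if the dataset is not cited."""
    parser = MicrodataProps()
    parser.feed(article_html)
    if dataset_uri not in parser.props.get("citation", []):
        return None
    return {
        "citedBy": article_uri,
        "title": parser.props.get("name", [""])[0].strip(),
        "author": parser.props.get("author", [""])[0].strip(),
    }


# Hypothetical enriched page; the itemprop choices are assumptions, not the original markup.
ENRICHED_HTML = """
<div itemscope itemtype="http://schema.org/Article">
  <span itemprop="name">How to Tie a Reef Knot</span>
  by <span itemprop="author">John Doe</span>
  This article is based on data set
  <a itemprop="citation" href="http://dx.doi.org/10.y/b">http://dx.doi.org/10.y/b</a>
</div>
"""

print(cited_by_link(ENRICHED_HTML, "http://dx.doi.org/10.x/a", "http://dx.doi.org/10.y/b"))
```

The returned dictionary holds what the data center would need to render a ‘citedBy’ entry on the dataset's landing page.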

 

 

References

ResourceMaps http://www.openarchives.org/ore/1.0/primer.html

Enhanced Publications http://www.surffoundation.nl/enhancedpublications

Ping-back http://www.hixie.ch/specs/pingback/pingback

DataCite http://datacite.org/whycitedata

Journal System http://pkp.sfu.ca/?q=ojs

Data Archive http://datacentrum.3tu.nl/

Semantic Publishing and Referencing http://purl.org/spar/

Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Introducing the Semantic Publishing and Referencing (SPAR) Ontologies | by David Shotton

How to use Citation Typing Ontology (CiTO) in your blog posts | by Martin Fenner
