OAI-PMH (The Open Archives Initiative Protocol for Metadata Harvesting) and Koha

For those of you who don’t know, OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting. In other words, it’s an “application-independent interoperability framework” used for harvesting and disseminating metadata.

For more information, see this link:

http://www.openarchives.org/OAI/openarchivesprotocol.html

 

While Koha has for years had the capacity to act as an OAI-PMH server (i.e. it’s been able to parse OAI-PMH requests and serve back records in MARC and DC formats), it hasn’t had a mechanism for harvesting records. However, with my latest project, I’m hoping to change that.

At Prosentient Systems, we often host Koha (the LMS) alongside DSpace (the digital repository), and interconnect them using home-grown scripts. Since these scripts are home-grown, they are neither easy to share nor easy for non-technical people to implement. As a result, we’ve been looking at better ways of connecting the two systems, and OAI-PMH seems to be the solution.

I’m told that DSpace can act both as an OAI-PMH harvester and as a disseminator (i.e. a repository), so we would be sending OAI-PMH requests to DSpace, transforming the records it returns from Dublin Core to MARC21 using XSLT, then importing them into Koha using established mechanisms.

That’s the plan!

At the moment, I’m still in the proof-of-concept stages, but I have set up a harvester (using the Perl module HTTP::OAI::Harvester), which retrieves records from Koha in Dublin Core format and prints them out.

That might not sound that impressive, but it took some time to become familiar with the HTTP::OAI modules and to realize that the example code on CPAN wasn’t 100% accurate. After some trial and error, I was able to get down to the XML::LibXML::Document object that I wanted, so that I could print that out as a string of Dublin Core XML.
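To make the harvesting step a little more concrete, here is a minimal sketch of the same idea: build a ListRecords request and pull the Dublin Core XML out of the response. It is written in Python with only the standard library, not the Perl HTTP::OAI code described above, and the function names are made up for illustration. It also ignores resumption tokens, which a real harvester would need to follow for large result sets.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, set_spec=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return one dict per <record> in a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        # Deleted records carry a header but no <metadata> element.
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "deleted": header.get("status") == "deleted",
            "dc_xml": ET.tostring(dc, encoding="unicode") if dc is not None else None,
        })
    return records
```

Fetching the URL (for example with urllib.request.urlopen) and passing the response body to parse_records would give a list of records whose dc_xml value is the Dublin Core XML as a string.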

Next steps?

Store the harvesting configuration (e.g. baseurl, last harvested date, sets, metadataPrefix, etc.) in the Koha MySQL database, store the Dublin Core XML in the database, and ensure that the harvesting script can run successfully as a cronjob.
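As a rough idea of what that storage could look like, here is a small Python sketch. It uses SQLite purely as a stand-in so that it runs anywhere; the real thing would live in Koha’s MySQL database, and the table and column names below are hypothetical, not part of any existing Koha schema.

```python
import sqlite3

# Hypothetical tables: one row per repository to harvest from,
# and one row per harvested record.
SCHEMA = """
CREATE TABLE IF NOT EXISTS oai_harvest_config (
    id INTEGER PRIMARY KEY,
    base_url TEXT NOT NULL,
    metadata_prefix TEXT NOT NULL DEFAULT 'oai_dc',
    set_spec TEXT,
    last_harvested TEXT
);
CREATE TABLE IF NOT EXISTS oai_harvest_record (
    identifier TEXT PRIMARY KEY,
    datestamp TEXT NOT NULL,
    dc_xml TEXT
);
"""

CONFIG_COLUMNS = ("base_url", "metadata_prefix", "set_spec", "last_harvested")


def open_store(path=":memory:"):
    """Open the database and create the tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_config(conn, base_url, metadata_prefix="oai_dc", set_spec=None):
    """Register a repository to harvest from and return its row id."""
    cur = conn.execute(
        "INSERT INTO oai_harvest_config (base_url, metadata_prefix, set_spec) "
        "VALUES (?, ?, ?)",
        (base_url, metadata_prefix, set_spec),
    )
    conn.commit()
    return cur.lastrowid


def get_config(conn, config_id):
    """Load one configuration row as a dict."""
    row = conn.execute(
        "SELECT base_url, metadata_prefix, set_spec, last_harvested "
        "FROM oai_harvest_config WHERE id = ?",
        (config_id,),
    ).fetchone()
    return dict(zip(CONFIG_COLUMNS, row))


def mark_harvested(conn, config_id, timestamp):
    """Record when the last successful harvest ran."""
    conn.execute(
        "UPDATE oai_harvest_config SET last_harvested = ? WHERE id = ?",
        (timestamp, config_id),
    )
    conn.commit()
```

A cron-driven script could read the stored last harvested date, use it as the from argument of the next request, and call mark_harvested only once the run has completed successfully.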

After that, I will focus on writing the best Dublin Core => MARC21 XSLT ever, so that the most accurately translated records will make it into Koha from the original datasource (in Prosentient’s case: the client’s DSpace).
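To show the shape of that translation, here is a toy version of the crosswalk. Note that it is a Python sketch and not XSLT, and it covers only five elements. The field choices loosely follow the Library of Congress Dublin Core to MARC crosswalk for unqualified Dublin Core and should be checked against it before being relied upon. A complete MARC record would also need a leader and control fields, which are left out here.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
MARC_NS = "http://www.loc.gov/MARC21/slim"

# Assumed mapping: DC element -> (MARC tag, ind1, ind2, subfield code).
CROSSWALK = {
    "title": ("245", "0", "0", "a"),
    "creator": ("720", " ", " ", "a"),
    "subject": ("653", " ", " ", "a"),
    "description": ("520", " ", " ", "a"),
    "publisher": ("260", " ", " ", "b"),
}


def dc_to_marcxml(dc_xml):
    """Convert an oai_dc XML string into a (partial) MARCXML record element."""
    dc_root = ET.fromstring(dc_xml)
    prefix = "{" + DC_NS + "}"
    fields = []
    for child in dc_root:
        if not child.tag.startswith(prefix):
            continue
        mapping = CROSSWALK.get(child.tag[len(prefix):])
        text = (child.text or "").strip()
        if mapping is None or not text:
            continue  # unmapped or empty elements are skipped in this sketch
        fields.append((mapping, text))

    # MARC fields are conventionally ordered by tag; sorted() is stable,
    # so repeated elements keep their original relative order.
    fields.sort(key=lambda item: item[0][0])

    record = ET.Element("{%s}record" % MARC_NS)
    for (tag, ind1, ind2, code), text in fields:
        datafield = ET.SubElement(
            record, "{%s}datafield" % MARC_NS,
            {"tag": tag, "ind1": ind1, "ind2": ind2},
        )
        subfield = ET.SubElement(datafield, "{%s}subfield" % MARC_NS, {"code": code})
        subfield.text = text
    return record
```

Calling ET.tostring(dc_to_marcxml(dc_xml), encoding="unicode") gives the result as a string. Anything not listed in the mapping (dc:rights, dc:format, and so on) is silently dropped, which is exactly the sort of loss a carefully written stylesheet would aim to avoid.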

I’m really excited about this project. I think it will help with linking Koha to external data providers. Client-owned digital repositories are one thing, but I noticed that the National Library of Medicine in the USA also has an OAI-PMH server for PubMed Central (http://www.pubmedcentral.nih.gov/oai/oai.cgi). More information can be found here: http://www.ncbi.nlm.nih.gov/pmc/tools/oai/.

Wouldn’t it be nice if your Koha users could search through PubMed records from within Koha, so that they could see what you have both locally and available electronically through PubMed?

Admittedly, having local copies of remote records might not always be ideal, if only for the fact that it could increase the size of your database substantially. However, aren’t we taught that data redundancy is a good thing? Updated records in an OAI-PMH repository (I just noticed that repository is the official term for an OAI-PMH server) can be detected through their datestamps, and the local version can be updated as a result. There is also a “deleted” record status, so you can remove local copies of deleted records that might not point to an actual resource anymore.
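Keeping the local copies in sync could work roughly like the Python sketch below. It is only an illustration: the local store is a plain dictionary, the function name is made up, and the harvested records are assumed to be dictionaries with identifier, datestamp, deleted and dc_xml keys.

```python
def apply_harvest(local_records, harvested):
    """Apply harvested records to a local store and report what changed.

    local_records maps identifier -> {"datestamp": ..., "dc_xml": ...}
    and is modified in place.
    """
    summary = {"added": 0, "updated": 0, "deleted": 0, "unchanged": 0}
    for rec in harvested:
        ident = rec["identifier"]

        if rec.get("deleted"):
            # The repository reports the record as deleted: drop our copy.
            if local_records.pop(ident, None) is not None:
                summary["deleted"] += 1
            continue

        existing = local_records.get(ident)
        if existing is None:
            summary["added"] += 1
        elif rec["datestamp"] > existing["datestamp"]:
            # OAI-PMH datestamps are UTC ISO 8601 strings, so plain string
            # comparison works as long as both use the same granularity.
            summary["updated"] += 1
        else:
            summary["unchanged"] += 1
            continue

        local_records[ident] = {
            "datestamp": rec["datestamp"],
            "dc_xml": rec["dc_xml"],
        }
    return summary
```

One caveat: the protocol lets a repository declare how it tracks deletions (no, transient or persistent), so a harvester can only rely on the “deleted” status where the repository actually supports it.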

In any case, I think that this should be a fun and educational project for me, and I’m sure that lots of people will find a use for this in the future!
