OAI-PMH (The Open Archives Initiative Protocol for Metadata Harvesting) and Koha

For those of you who don’t know, OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting. In other words, it’s an “application-independent interoperability framework” used for harvesting and disseminating metadata.

For more information, see this link:

http://www.openarchives.org/OAI/openarchivesprotocol.html
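
In practice, an OAI-PMH request is just an HTTP GET against a repository’s base URL, with a “verb” parameter (one of Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, or GetRecord) plus arguments such as metadataPrefix, set, from, and until. For example (the base URL here is just a placeholder):

http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2013-01-01

The response is an XML document containing the matching records, along with a resumptionToken if the result set is split across several responses.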


While Koha has for years had the capacity to act as an OAI-PMH server (i.e. it’s been able to parse OAI-PMH requests and serve back records in MARC and DC formats), it hasn’t had a mechanism for harvesting records. However, with my latest project, I’m hoping to change that.
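
As an aside, if you want to poke at the server side of Koha yourself: once the OAI-PMH system preference is enabled, Koha typically answers requests through the OPAC at a URL along the lines of the one below (the hostname is a placeholder, and the exact path can vary between installs and versions).

http://your-opac.example.org/cgi-bin/koha/oai.pl?verb=ListRecords&metadataPrefix=oai_dc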

At Prosentient Systems, we often host Koha (the LMS) alongside DSpace (the digital repository) and interconnect them using home-grown scripts. Because these scripts are home-grown, they are neither easy to share nor easy for non-technical people to implement. As a result, we’ve been looking at better ways of connecting the two systems, and OAI-PMH seems to be the solution.

I’m told that DSpace can act both as an OAI-PMH harvester and as a disseminator (i.e. a data provider), so we would send OAI-PMH requests to DSpace, transform the returned Dublin Core records into MARC21 using XSLT, and then import them into Koha using established mechanisms.

That’s the plan!

At the moment, I’m still at the proof-of-concept stage, but I have set up a harvester (using the Perl module HTTP::OAI::Harvester) that retrieves records from Koha in Dublin Core format and prints them out.

That might not sound very impressive, but it took some time to become familiar with the HTTP::OAI modules and to realize that the example code on CPAN wasn’t 100% accurate. After some trial and error, I was able to get down to the XML::LibXML::Document object that I wanted, so that I could print it out as a string of Dublin Core XML.
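
For the curious, the proof of concept boils down to something like the following. This is a minimal sketch rather than my actual script: the base URL is a placeholder, error handling is thin, and I’m assuming HTTP::OAI behaves the way its documentation describes (in particular, that next() transparently follows resumptionTokens and that metadata() hands back an object wrapping an XML::LibXML document).

#!/usr/bin/perl
use strict;
use warnings;
use HTTP::OAI;

# Point the harvester at an OAI-PMH repository (placeholder URL).
my $harvester = HTTP::OAI::Harvester->new(
    baseURL => 'http://your-opac.example.org/cgi-bin/koha/oai.pl',
);

# Ask for all records in unqualified Dublin Core.
my $response = $harvester->ListRecords( metadataPrefix => 'oai_dc' );
die $response->message if $response->is_error;

# Loop over the records; next() keeps fetching via resumptionTokens.
while ( my $record = $response->next ) {
    next unless $record->metadata;    # deleted records carry no metadata

    # The metadata object wraps an XML::LibXML::Document, which we can
    # serialise back out as a string of Dublin Core XML.
    print $record->identifier, "\n";
    print $record->metadata->dom->toString(1), "\n";
}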

Next steps?

Store the harvesting configuration (e.g. baseURL, last harvested date, sets, metadataPrefix) in the Koha MySQL database, store the harvested Dublin Core XML in the database as well, and make sure that the harvesting script can run successfully as a cronjob.
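
To make the shape of that a bit more concrete, here is a rough sketch of what the cronjob might look like. The table and column names (oai_harvest_config, oai_harvest_records, and so on) are invented for illustration, they are not real Koha tables, and the database credentials are placeholders.

#!/usr/bin/perl
# Rough sketch of the planned harvesting cronjob. An example crontab
# entry might be: 15 2 * * * /path/to/oai_harvest.pl
use strict;
use warnings;
use DBI;
use HTTP::OAI;

# Connect to the Koha MySQL database (credentials are placeholders).
my $dbh = DBI->connect( 'DBI:mysql:database=koha', 'koha_user', 'secret',
    { RaiseError => 1 } );

# Read the stored harvesting configuration (hypothetical table).
my ( $config_id, $baseurl, $prefix, $last_harvest ) = $dbh->selectrow_array(
    'SELECT config_id, baseurl, metadataprefix, last_harvest
       FROM oai_harvest_config LIMIT 1'
);

# Only ask the repository for records added or changed since the last run.
my $harvester = HTTP::OAI::Harvester->new( baseURL => $baseurl );
my $response  = $harvester->ListRecords(
    metadataPrefix => $prefix,
    ( $last_harvest ? ( from => $last_harvest ) : () ),
);

# Store the raw Dublin Core XML for each harvested record.
my $insert = $dbh->prepare(
    'INSERT INTO oai_harvest_records (identifier, datestamp, metadata)
     VALUES (?, ?, ?)'
);
while ( my $record = $response->next ) {
    next unless $record->metadata;
    $insert->execute( $record->identifier, $record->datestamp,
        $record->metadata->dom->toString );
}

# Remember where we got up to, stored as YYYY-MM-DD so it can be reused
# as the OAI-PMH "from" argument next time.
$dbh->do( 'UPDATE oai_harvest_config SET last_harvest = CURDATE()
            WHERE config_id = ?', undef, $config_id );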

After that, I will focus on writing the best Dublin Core => MARC21 XSLT ever, so that the most accurately translated records make it into Koha from the original data source (in Prosentient’s case, the client’s DSpace).
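
Applying the XSLT is the easy part, since Perl already has good bindings for libxslt. Something along these lines should do it (the stylesheet filename is a placeholder for whatever crosswalk we end up writing, and the Dublin Core record is read from standard input just for the sake of the example):

#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Load and compile the Dublin Core => MARC21 (MARCXML) stylesheet.
my $stylesheet = XML::LibXSLT->new()->parse_stylesheet(
    XML::LibXML->load_xml( location => 'DC2MARC21.xsl' )
);

# Read a harvested Dublin Core record from STDIN and transform it.
my $dc_xml = do { local $/; <STDIN> };
my $result = $stylesheet->transform( XML::LibXML->load_xml( string => $dc_xml ) );

# The MARCXML output can then go through Koha's usual import mechanisms.
print $stylesheet->output_as_bytes($result);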

I’m really excited about this project. I think it will help with linking Koha to external data providers. Client-owned digital repositories are one thing, but I noticed that the National Library of Medicine in the USA also has an OAI-PMH server for PubMed Central (http://www.pubmedcentral.nih.gov/oai/oai.cgi). More information can be found here: http://www.ncbi.nlm.nih.gov/pmc/tools/oai/.

Wouldn’t it be nice if your Koha users could search PubMed Central records from within Koha, and see both what you hold locally and what is available electronically through PubMed Central?

Admittedly, keeping local copies of remote records might not always be ideal, if only because it could substantially increase the size of your database. However, aren’t we taught that data redundancy is a good thing? Updated records in an OAI-PMH repository (“repository”, I just noticed, is the official term for an OAI-PMH server) can be detected, and the local versions updated accordingly. There is also a “deleted” record status, so you can remove local copies of records that no longer point to an actual resource.
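
For reference, a deleted record in an OAI-PMH response comes back as a bare header with a status attribute and no metadata, roughly like this:

<record>
  <header status="deleted">
    <identifier>oai:example.org:123</identifier>
    <datestamp>2013-06-01</datestamp>
  </header>
</record>

so a harvester can tell the difference between a record it should update and one it should remove (or suppress) locally.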

In any case, I think this should be a fun and educational project for me, and I’m sure that lots of people will find a use for it in the future!

Digital Preservation (in Australasia)

I was just browsing through the June/July 2013 issue of ALIA’s “Incite” when I chanced across an article titled “Digital Heritage Collections”, which discusses the ongoing efforts to collect and preserve digital materials at National & State Libraries Australasia (NSLA).

Apparently, the Australian government is going to change legal deposit legislation to include digital materials, which is both exciting and challenging. It seems like the NSLA has, and will continue to have, a lot of work to do. You can read more about its ongoing efforts here: http://www.nsla.org.au/projects/digital-preservation

I find digital preservation to be a very intriguing aspect of modern libraries and archives. Professionally, I have very little hands-on experience with digital preservation, although I know people involved in teaching preservation, practising conservation, helping with digital preservation in universities, and working on digital preservation software (primarily Archivematica: https://www.archivematica.org/).

I have taken a full-length course in print and digital preservation, and I have done research into digital preservation, including writing a paper on the optimal file formats to choose for born-digital materials, but that’s about as far as I’ve gone at this point.

On the one hand, everyone should be thinking about digital preservation: not only library and archives professionals, but anyone with any interest in the short- or long-term preservation of their digital materials. How many of us can still access text documents that we wrote 10 or 20 years ago? I probably have book reports from primary school on a floppy disk somewhere, but floppy drives are becoming harder to find (although where there is a will there is a way, I’m confident of that), and how many programs can still read those old formats? I don’t even know what format or what program I used back in the day (I think it was called WPP, but that isn’t a very useful search term). Sure, you might think, those old book reports might not be important. However, how many people wrote their memoirs or other creative and intellectual efforts using that same program at that time? Probably quite a few. What has been lost already, and what could still be lost?

On the other hand, who has the time or energy for digital preservation? As mentioned in the “Digital Heritage Collections” article, “the preservation of digital assets is an active process”. There is a reason why people are employed full-time to do this work: it is ongoing, never-ending work. There are many questions to ask:

1. Assessment: What is preserved and what is not preserved?

2. Format: Do you continually reformat (i.e. migrate file formats), or do you use software emulation to reproduce the material in its original environment and original format? (Where do you draw the line with emulation: the application, the operating system, the hardware, the bugs?)

3. Integrity: Are the materials retaining their integrity over time? Are they slowly corrupting or losing data? Did a format migration partially or completely corrupt the binary data? Who (or what) is going to check the integrity of all the materials, especially when collections can number in the thousands or millions of items? Is the format of the material considered stable and long-lasting? (There is a small fixity-checking sketch after this list.)

4. Authenticity: Are the preserved materials true and authentic representations of the original? Are they experienced in the same way as the original? (Both questions are very important in archives.)

5. Storage: Where are these digital materials stored? In a database of their own? On a file system? Physical servers and storage media are themselves subject to chemical degradation and mechanical failure over time. Do you back up your database and/or file system? Do you copy your storage across multiple systems to ensure data redundancy?

6. Access: Once you’ve stored your files, how do you make them accessible? Do you make access copies (often PDFs, in the case of text documents)? In the case of emulation, how do you expose the emulated system to end users in a way that ensures authenticity?
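
On question 3, the usual answer is fixity checking: record a checksum for each file when it is ingested, re-compute it on a schedule, and flag any mismatches. A minimal sketch in Perl (the file path and the stored checksum are placeholders):

use strict;
use warnings;
use Digest::SHA;

# The path and the checksum recorded at ingest time are placeholders.
my $file     = '/archive/aip/book-report-1995.doc';
my $recorded = 'd2c4...';    # SHA-256 hex digest stored at ingest

# Re-compute the SHA-256 digest of the file as it exists today.
my $current = Digest::SHA->new(256)->addfile( $file, 'b' )->hexdigest;

# Any mismatch means the file has changed (or been corrupted) since ingest.
warn "$file failed its fixity check\n" if $current ne $recorded;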

For more information on these questions and how to answer them, research digital preservation, ask digital preservation experts, and maybe take a look at some of these links:

http://www.dpconline.org/docs/lavoie_OAIS.pdf

http://www.digitalpreservationeurope.eu/preservation-training-materials/files/oais-reference-model.ppt

http://en.wikipedia.org/wiki/Digital_preservation

The OAIS (Open Archival Information System) reference model, and in particular its functional model, is a useful way of thinking about the process of digital preservation.

There’s also much more to digital preservation than what I’ve written here. I haven’t talked at all about metadata attachment or encapsulation. Review the links above (and the Archivematica website) for more information on the rabbit hole that is digital preservation.

It’s a fascinating area of library and archives work, and while it might not be very practical for the everyday person, it’s an absolute necessity at the institutional and national level.

Vocabulary:

In your reading, you’ll likely encounter the acronyms SIP, AIP, and DIP.

The Submission Information Package (SIP) is what people give to the librarians and archivists.
The Archival Information Package (AIP) is what is preserved and stored.
The Dissemination Information Package (DIP) is what is shown to end users.

Visit the following links for much more comprehensive and thorough definitions of SIP, AIP, DIP, and other important digital preservation vocabulary:
http://www.lib.umich.edu/preservation-and-conservation/digital-preservation/digital-preservation-glossary
http://www.paradigm.ac.uk/workbook/introduction/oais-information.html