OAI-PMH (The Open Archives Initiative Protocol for Metadata Harvesting) and Koha

For those of you who don’t know, OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting. In other words, it’s an “application-independent interoperability framework” used for harvesting and disseminating metadata.

For more information, see this link:

http://www.openarchives.org/OAI/openarchivesprotocol.html

 

While Koha has for years had the capacity to act as an OAI-PMH server (i.e. it’s been able to parse OAI-PMH requests and serve back records in MARC and DC formats), it hasn’t had a mechanism for harvesting records. However, with my latest project, I’m hoping to change that.

At Prosentient Systems, we often host Koha (the LMS) alongside Dspace (the digital repository), and interconnect them using home-grown scripts. Since these scripts are home-grown, they are not easy to share, nor are they easy for non-technical people to implement. As a result, we’ve been looking at better ways of connecting the two systems, and OAI-PMH seems to be the solution.

I’m told that Dspace can act both as an OAI-PMH harvester and a disseminator, so we would send OAI-PMH requests to Dspace, transform the returned Dublin Core records into MARC21 using XSLT, then import them into Koha using established mechanisms.

That’s the plan!

At the moment, I’m still in the proof-of-concept stages, but I have set up a harvester (using the Perl module HTTP::OAI::Harvester), which retrieves records from Koha in Dublin Core format and prints them out.

That might not sound very impressive, but it took some time to become familiar with the HTTP::OAI modules and to realize that the example code on CPAN wasn’t 100% accurate. After some trial and error, I was able to get down to the XML::LibXML::Document object that I wanted, so that I could print it out as a string of Dublin Core XML.
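To give you an idea, here is a minimal sketch along the lines of my proof-of-concept script (the base URL is just a placeholder, and the real script does a little more error handling):

#!/usr/bin/perl
use strict;
use warnings;
use HTTP::OAI;

# Placeholder base URL; point this at your Koha (or Dspace) OAI-PMH endpoint
my $harvester = HTTP::OAI::Harvester->new(
    baseURL => 'http://your-koha-server/cgi-bin/koha/oai.pl',
);

# Request every record in unqualified Dublin Core
my $response = $harvester->ListRecords( metadataPrefix => 'oai_dc' );
die $response->message if $response->is_error;

while ( my $record = $response->next ) {
    print "Identifier: " . $record->identifier . "\n";
    print "Datestamp:  " . $record->datestamp . "\n";

    # The metadata payload is an XML::LibXML::Document,
    # so it can simply be serialized back out as Dublin Core XML
    if ( defined $record->metadata ) {
        print $record->metadata->dom->toString(1) . "\n";
    }
}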

Next steps?

Store the harvesting configuration (e.g. baseURL, last harvested date, sets, metadataPrefix, etc.) in the Koha MySQL database, store the harvested Dublin Core XML in the database as well, and ensure that the harvesting script can run successfully as a cronjob.
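As a very rough sketch of the configuration storage I have in mind (the table and column names below are purely hypothetical, since the real schema hasn’t been designed yet):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'DBI:mysql:database=koha', 'koha_user', 'password',
    { RaiseError => 1 } );

# One row per remote repository that we harvest from
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS oai_harvest_config (
        id             INT AUTO_INCREMENT PRIMARY KEY,
        baseurl        VARCHAR(255) NOT NULL,
        oai_set        VARCHAR(255),
        metadataprefix VARCHAR(30) NOT NULL DEFAULT 'oai_dc',
        last_harvested DATETIME
    )
});

# One row per harvested record, keeping the raw Dublin Core XML
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS oai_harvest_records (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        config_id  INT NOT NULL,
        identifier VARCHAR(255) NOT NULL,
        datestamp  DATETIME,
        metadata   LONGTEXT
    )
});

The cronjob would then read its baseURL, set, metadataPrefix, and last harvested date from the config table, harvest anything new, and write the Dublin Core XML into the records table.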

After that, I will focus on writing the best Dublin Core => MARC21 XSLT ever, so that the most accurately translated records will make it into Koha from the original data source (in Prosentient’s case: the client’s Dspace).
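On the Perl side, applying that stylesheet should be fairly straightforward with XML::LibXML and XML::LibXSLT. Something like this (the file names are placeholders, since the XSLT itself hasn’t been written yet):

use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Placeholder file names: the Dublin Core => MARC21 stylesheet and a harvested record
my $stylesheet = XML::LibXSLT->new()->parse_stylesheet_file('DC2MARC21slim.xsl');
my $dc_doc     = XML::LibXML->load_xml( location => 'harvested_record.xml' );

# Transform the Dublin Core document into MARCXML and print it out
my $marcxml = $stylesheet->transform($dc_doc);
print $stylesheet->output_string($marcxml);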

I’m really excited about this project. I think it will help with linking Koha to external data providers. Client-owned digital repositories are one thing, but I noticed that the National Library of Medicine in the USA also has an OAI-PMH server for PubMed Central (http://www.pubmedcentral.nih.gov/oai/oai.cgi). More information can be found here: http://www.ncbi.nlm.nih.gov/pmc/tools/oai/.

Wouldn’t it be nice if your Koha users could search through PubMed Central records from within Koha, so that they could see what you hold locally alongside what is available electronically through PubMed Central?

Admittedly, having local copies of remote records might not always be ideal, if only for the fact that it could increase the size of your database substantially. However, aren’t we taught that data redundancy is a good thing? Updated records in an OAI-PMH repository (I just noticed that “repository” is the official term for an OAI-PMH server) can be detected, and the local versions can be updated as a result. There is also a “deleted” record status, so you can remove local copies of records that no longer point to an actual resource.
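In HTTP::OAI terms, an incremental harvest that respects the deleted status might look something like this (again, only a sketch; the base URL and the from date are placeholders):

use strict;
use warnings;
use HTTP::OAI;

my $harvester = HTTP::OAI::Harvester->new(
    baseURL => 'http://your-repository/oai/request',
);

# Only ask for records added, changed, or deleted since the last harvest
my $response = $harvester->ListRecords(
    metadataPrefix => 'oai_dc',
    from           => '2013-01-01',
);

while ( my $record = $response->next ) {
    my $status = $record->header->status;
    if ( defined $status && $status eq 'deleted' ) {
        # Remove (or suppress) the local copy of this record
        print "Deleted upstream: " . $record->identifier . "\n";
    }
    else {
        # Add a new local copy or update the existing one
        print "New or updated: " . $record->identifier . "\n";
    }
}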

In any case, I think that this should be a fun and educational project for me, and I’m sure that lots of people will find a use for it in the future!

Cataloguing Series…MARC 490 and MARC 830

When cataloguing an item from a series, are you ever confused by why you need to put the “series title” in both the MARC 490 and MARC 830 fields?

I remember being told to “just do it” by teachers and library staff, but in actuality…these fields have very different purposes. I was reminded of this today when someone asked if Koha supported MARC Authorities for series. At first, I was confounded. Authority records for series? That seemed bizarre. Then I realized…they meant: does Koha support MARC Authorities for uniform titles?

Here is the general info from the Library of Congress:

490 – Series Statement (R)

http://www.loc.gov/marc/bibliographic/bd490.html

830 – Series Added Entry-Uniform Title (R)

http://www.loc.gov/marc/bibliographic/bd830.html

 

The key to it all resides in a footnote on the 490 page:

Indicator 1 – Series tracing policy
1 – Series traced [REDEFINED, 2008]
Prior to 2009, series for which the transcribed form and the traced form were the same were in field 440, and field 490 was not used. If the transcribed form and the traced form were different, the transcribed form was in field 490 and Indicator 1 had value “1” (Series traced differently). The traced form was in an 8XX field. Beginning in 2009, field 440 is not used and the transcribed form of the series name is in field 490 with the traced form in 8XX, even if the names are the same.

In other words, if you have an authority record for the uniform title of a series, the title from that authority record (MARC 130 – Heading-Uniform Title (NR) in the authority record) goes in the 830 field, and the transcribed form (i.e. the series title as it appears on the title page or other source of information) goes in the 490 field. You then set the first indicator of the 490 to 1 (as mentioned above) and you’re all good!
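For example (an invented record, purely for illustration):

490 1# $a Studies in library history ; $v no. 7
830 #0 $a Studies in library history (London, England) ; $v no. 7.

The 490 carries the series statement exactly as it appears on the piece, while the 830 carries the authorized form from the authority record, including any qualifier.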

Serial Cataloguing

Over the past 6 years, I’ve seen and experienced a few different approaches to cataloguing serials using MARC. After all, typically when you enter an organization, you adopt whatever cataloguing conventions they practise there. This makes a certain amount of sense, since there is always organizational acculturation whenever you begin a new job. Standards are often considered to be guidelines more than rules.

I’m in a very different position now than I have been over the past few years, though. At this point, I have people posing questions to me about how they’re “supposed” to catalogue. This is quite another animal altogether, and as a result it is actually fairly difficult to answer.

If you consult the following links, you will notice a few different ideas about how people are “supposed” to catalogue serials.

Serial Cataloguing

http://special-cataloguing.com/node/1403

CONSER Cataloging Manual

http://www.itsmarc.com/crs/mergedprojects/conser/conser/contents.htm

University of Illinois at Urbana-Champaign: Serials Cataloging

http://www.library.illinois.edu/cam/procedures/serguide.html

Arizona State Museum: Instructions for Serials Cataloging

http://www.statemuseum.arizona.edu/library/cataloging_manual/serialscat.shtml

OK…but how are you ACTUALLY supposed to do it?

Well, it seems to me that the MARC 362 field (http://www.loc.gov/marc/bibliographic/bd362.html) is supposed to be used to record the “beginning” and “end” dates of a publication. This may also include the sequential designation (e.g. Vol., No., etc.) in the case of periodical publications.
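For example (an invented serial, purely for illustration):

362 0# $a Vol. 1, no. 1 (Jan. 2001)-

The open-ended hyphen indicates that the serial is still being published; a ceased title would also record the final designation after the hyphen.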

Then, the MARC 863 field (http://www.loc.gov/marc/holdings/hd863865.html) is supposed to be used to record the actual holdings. In some cases, this might involve multi-part items that are not periodicals, but that’s outside the scope of this post. In regards to serials, there are various levels of enumeration, which allow you to specify your holdings at various levels of detail. Perhaps you just want one 863 entry per year. Perhaps you want one every month. I suppose this is where a certain amount of localized convention comes into the picture.
What I would like to point out is that the 853 (Captions and Pattern) and 863 (Enumeration and Chronology) fields are directly linked through their $8 subfields, and thus subfields should be used for their linked purpose. If you are marking an item as missing, use an $x or $z subfield to write that information out as a “note”. That’s where it belongs. Follow the examples specified in the Library of Congress webpage I have linked to above.
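Here is a small invented example of what I mean, with one 853 caption/pattern field and two linked 863 fields, the second of which uses $z to note a missing issue:

853 20 $8 1 $a v. $b no. $i (year) $j (month)
863 40 $8 1.1 $a 12 $b 1 $i 2012 $j 01
863 40 $8 1.2 $a 12 $b 2 $i 2012 $j 02 $z issue missing

The $8 subfields are what tie each 863 back to the 853 that defines its captions.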

Without a doubt, serial cataloguing is a complicated beast, but hopefully this post will elucidate things a bit and prompt further research on the part of those doing serials cataloguing.