Wednesday, August 12, 2009

Sibyl Schaefer on generation of EAD and MARC from shared database at NYU

Going to talk about a problem we recognize at NYU and are working to solve. We have an ILS where users can find library materials. However, they don’t usually find finding aids there, because not all special collections have MARC records associated with them. We have an entirely separate finding aid search system based on Apache Solr. If you go to the library homepage, you have to specifically select special collections to search them.

We have three special collections bodies that contribute to this one EAD search tool. (Based on Archivists’ Toolkit to generate data.) Data loads from there into the NYU publishing system, gets spit back out in preview, Solr indexes it, then it goes online.

If users go to the BobCat ILS, they will not get these finding aids. The only definite reason a collection has a MARC record is if a box has to be barcoded to send to offsite storage. Otherwise a record is not required.

In addition – lots of duplication of work. The MARCXML record that comes out of AT is not being used. The EAD finding aid is being handed off to catalogers to generate new MARC from scratch. [DOH!]

Another issue (problem/opportunity): we just signed onto Ex Libris’ new discovery tool Primo, which lets you pipe in information from different sources. The question is how to get more info from our databases into the library catalog to enable more power from one search.

AT has been adopted for use within all NYU libraries within the last year, so all EAD is now being generated from the toolkit. We have three different instances going, for various institutional/political reasons.

To address these problems, we set up the AT working group. It consists of me (AT Specialist), the digital library person, the AT programmer (since development is housed at NYU), three different catalogers (head of cataloging, plus two others who deal with special collections), the tech services librarian, the electronic resources librarian, and point people from the three special collections.

Had a kickoff meeting in May/June. We came up with a vision of AT generating both EAD and MARC at once. Both of these enter their respective systems, then eventually hit Primo and get deduped so that the MARC and the EAD don’t both show up in the final search results.

Working through challenges on this now. First, legacy data needs cleanup. Once you put your data into a specific format, you realize that junk in means junk out. If you let junky data stay there, it kind of perpetuates. If an archival collection links to a name that isn’t the authorized form, that name gets spit out again in finding aids and ends up in other places. So we’ve done training classes for grad student assistants to help with searching for the authorized form of names, how to enter them into AT, how to clean up existing names, etc.

Starting to tackle subjects now. This is tricky because the meaning of the different parts of a heading (i.e., the MARC subfields) is not preserved in EAD headings. So you end up with someone having to revisit each heading to get it into MARC. We’re working on how to handle this. One idea is to force the dashes in the headings to serve as some sort of delimiter. We have a vendor who is going to be cleaning up authority records, so we’re hoping they might be able to indicate subfields. The other option is to implement improved subject heading handling in AT, which will help put the semantic meaning of terms into AT itself.
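
To make the dash-as-delimiter idea concrete, here is a minimal Python sketch. It is not NYU's or AT's code. It assumes headings use a double dash ("--") between parts, and the function name `split_heading` is hypothetical. It also shows the limitation described above: the dashes say where the parts are, not what kind of subdivision each one is, so every part after the first is given a placeholder subfield code.

```python
def split_heading(heading, delimiter="--"):
    """Split a dash-delimited subject heading into (subfield_code, text) pairs.

    Assumption: the first part is the main heading ($a). Later parts are all
    labelled $x as a placeholder, because the flat EAD heading does not say
    whether a part is topical ($x), chronological ($y), geographic ($z),
    or form ($v). A person or an authority lookup would still have to decide.
    """
    parts = [p.strip() for p in heading.split(delimiter) if p.strip()]
    if not parts:
        return []
    subfields = [("a", parts[0])]
    subfields.extend(("x", part) for part in parts[1:])
    return subfields


pairs = split_heading("United States -- History -- Civil War, 1861-1865")
for code, text in pairs:
    print(f"${code} {text}")
```

In this example the last part is a date range that a cataloger would probably code as a chronological subdivision, which is exactly the kind of distinction the delimiter alone cannot supply.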

The next problem is that MARCXML exports include funky punctuation. So we’re looking at changing some of the AT export code to handle this. But we need to ensure that we don’t break EAD display when we do this.
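
One way to picture a fix that cannot affect EAD display is to clean up the exported MARCXML after the fact, rather than changing the shared data or the export code. The Python sketch below is only an illustration of that alternative, not the AT change being discussed. It assumes the "funky punctuation" is stray trailing punctuation in subfield values, which may not match the real problem; the sample record and the function name `strip_trailing_punctuation` are made up.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace

# Hypothetical sample record, not real AT output.
SAMPLE = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">Jane Doe papers, ;</subfield>'
    '</datafield></record>'
)


def strip_trailing_punctuation(marcxml, chars=" ,;:/"):
    """Parse a MARCXML string and trim unwanted trailing characters
    from every subfield value. Returns the parsed root element."""
    root = ET.fromstring(marcxml)
    for subfield in root.iter(f"{{{MARC_NS}}}subfield"):
        if subfield.text:
            subfield.text = subfield.text.rstrip(chars)
    return root


cleaned = strip_trailing_punctuation(SAMPLE)
for subfield in cleaned.iter(f"{{{MARC_NS}}}subfield"):
    print(subfield.get("code"), subfield.text)
```

Because this only touches the exported MARCXML, the EAD output is left alone. The tradeoff is an extra step in the workflow that has to be maintained alongside the export.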

Also, the location of information. Currently, if you have to encode barcodes at the box level, you have to include them for every single folder. There is a plugin being implemented at Yale to solve that and make location info go into a standard location.

And there is a problem with deduping: Primo uses different fields to dedupe, but right now the titles aren’t matching up.
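
A quick way to see why two titles for the same collection fail to match is to reduce each one to a normalized key and compare. The Python sketch below is a diagnostic illustration only. It does not describe how Primo actually dedupes, and the function name `title_key` and the sample titles are hypothetical.

```python
import re
import unicodedata


def title_key(title):
    """Reduce a title to a comparison key: strip accents and other
    non-ASCII characters, lowercase, replace punctuation with spaces,
    and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    no_punct = re.sub(r"[^a-z0-9\s]", " ", ascii_only.lower())
    return " ".join(no_punct.split())


marc_title = "Jane Doe Papers, 1900-1950."   # as it might appear in a MARC record
ead_title = "Jane Doe papers 1900-1950"      # as it might appear in a finding aid

print(title_key(marc_title) == title_key(ead_title))
```

If the keys match but the raw titles do not, the mismatch is only punctuation or capitalization. If the keys still differ, the two records really are titled differently and the data itself needs attention.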

Q: Why won’t XML go into Aleph? A: It does work via conversion with MarcEdit, but the problem is there is no connection between the authority databases, so if a cataloger tries to fix a problem in the ILS, it doesn’t flow back. And right now they don’t have access to AT.

Q: Is this just the collection level, or lower levels? A: Right now it is only collection level, with a link to the finding aid from the ILS MARC record. Both go into Primo, which should search both. One issue we have is that you can’t really offer that level of detailed searching without showing users where the terms are in the actual finding aid. We have this worked out for the EAD search, but now have to look at it at a larger level.

Q: Is this public yet? A: No, not yet. Still doing cleanup and planning.

Q: You said you dedupe records to prevent retrieval of two records for the same thing. So which do you show them? A: Probably the MARC record with a link to the full finding aid. But that brings up the problem I just discussed of how you show where their search term is in the finding aid.

Q: So why wasn’t the MARCXML being used? A: That’s kind of the question that precipitated this working group.

Q: We have a very similar process at Duke for combining EAD and MARC using Endeca. Going to launch soon. We display the full container list in a separate tab, and provide highlighting there. It’s a bit clunky for really large finding aids, but…

Q: What is the difference between Primo and Aleph? A: Primo sits on top of Aleph.
