Wednesday, August 12, 2009

Sibyl Schaefer on generation of EAD and MARC from shared database at NYU

Going to talk about a problem we recognize at NYU and are working to solve. We have an ILS where users can find library materials. However, they don’t usually find finding aids, because not all special collections have MARC records associated with them. We have an entirely different finding aid search system based on Apache Solr. If you go to the library homepage, you have to specifically select special collections to search them.

We have three special collections bodies that contribute to this one EAD search tool. (Based on Archivists’ Toolkit (AT) to generate data.) Load from there into NYU publishing system, gets spit back out in preview, Solr indexes it, then it goes online.

If users go to the BobCat ILS, they will not get these. The only definite reason a collection has a MARC record is if a box has to be barcoded to send to offsite storage. Otherwise not required.

In addition – lots of duplication of work. The MARCXML record that comes out of AT is not being used. The EAD finding aid is being handed off to catalogers to generate new MARC from scratch. [DOH!]

Another issue (problem/opportunity) – We just signed onto Ex Libris’ new discovery tool Primo, which lets you pipe information from different areas. How to get more info from databases into library catalogs to enable more power from one search.

AT has been adopted for use within all NYU libraries within the last year, so all EAD is now being generated from the toolkit. Have three different instances going, for various institutional/political reasons.

To address these problems, we set up the AT working group. Consists of me (AT Specialist), the digital library person, the AT programmer (since development is housed at NYU), 3 different catalogers (head of cataloging, plus two others who deal with special collections), tech services lib, electronic resources lib, and point people from three special collections.

Had kickoff meeting in May/June. We came up with vision of AT generating both EAD and MARC at once. Both of these enter their respective systems, then eventually hit Primo and get deduped so that the MARC and the EAD don’t both show up in the end result search.

Working through challenges on this now. First, legacy data needs cleanup. Once you put your data into a specific format, you realize that junk goes in, then junk comes out. If you let junky data stay there, it kind of perpetuates. You have an archival collection linking to something that isn’t the authorized form of the name, then it gets spit out again in finding aids and ends up in other places. So we’ve done training classes for grad student assistants to help with searching for the authorized form of names, how to enter them into AT, how to clean up existing names, etc.

Starting to tackle subjects now. This is tricky because the meaning of the different parts of a heading (i.e., MARC subfields) is not preserved in EAD headings. So you end up with someone having to revisit the heading to get it into MARC. We’re working on how to handle this. One idea is to force the dashes in the headings to serve as some sort of delimiter. We have a vendor who is going to be cleaning up authority records, so we’re hoping they might be able to indicate subfields. The other option is to implement improved subject heading handling in AT, which will help put semantic meaning of terms into AT itself.
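To make the dash-as-delimiter idea concrete, here is a minimal Python sketch. It is not AT code or anything shown in the talk. The function names, the `--` delimiter, and the sample heading are assumptions for illustration. It treats the first term as the main heading (`a`) and deliberately marks every later term with `?`, since the dashes alone cannot say whether a term is topical, geographic, chronological, or form.

```python
def split_heading(heading: str, delimiter: str = "--") -> list[str]:
    """Split a dash-delimited subject heading into its component terms."""
    return [part.strip() for part in heading.split(delimiter) if part.strip()]


def to_subfields(heading: str) -> list[tuple[str, str]]:
    """Pair each term with a placeholder subfield code.

    Assumption: the first term is the main heading ("a"). Later terms get
    "?" because the delimiter carries no semantic type; a cataloger or
    vendor would still have to supply the real subfield code.
    """
    parts = split_heading(heading)
    return [("a" if i == 0 else "?", term) for i, term in enumerate(parts)]


# Hypothetical heading, for illustration only.
example = "Labor unions -- New York (State) -- History"
subfields = to_subfields(example)
print(subfields)
```

The `?` placeholders are the point of the sketch: splitting is easy, but the subfield meaning still has to come from somewhere else, which is why the vendor cleanup or better subject handling in AT is on the table.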

The next problem is that MARCXML exports include funky punctuation. So we’re looking at changing some of the AT export code to handle this. But we need to ensure that we don’t break EAD display when we do this.

Also, the location of information. Currently, if you have to encode barcodes at the box level, you have to include them for every single folder. There is a plugin being implemented at Yale to solve that, and make location info go into a standard location.

And there is a problem because Primo uses different fields to dedupe, but right now titles aren’t matching up.
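The talk did not describe how Primo actually compares records, so the following is only a sketch of one common approach to title mismatches: reducing each title to a normalized comparison key before matching. The function name `title_key` and both sample titles are hypothetical.

```python
import re
import unicodedata


def title_key(title: str) -> str:
    """Reduce a title to a lowercase, punctuation-free comparison key."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse every run of non-alphanumeric characters into one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_marks.lower())
    return cleaned.strip()


# Hypothetical titles: a MARC-style title with trailing punctuation
# and an EAD-style title with different case and spacing.
marc_title = "John Doe papers, 1920-1965."
ead_title = "John Doe Papers,  1920-1965"
print(title_key(marc_title) == title_key(ead_title))
```

A key like this only papers over differences in case, spacing, and punctuation. Titles that differ in actual wording would still need to be reconciled at the source.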

Q: Why won’t XML go into Aleph? A: It does work via conversion with MarcEdit, but the problem is there is no connection between the authority databases, so if a cataloger tries to fix a problem in the ILS, it doesn’t flow back. And right now they don’t have access to AT.

Q: Is this just the collection level, or lower levels? A: Right now it is only collection level, with a link to the finding aid from the ILS MARC record. Both go into Primo, which should search both. One question we have is that you can’t really offer that level of detailed searching without showing users where the terms are in the actual finding aid. We have this worked out for the EAD search, but now have to look at it at a larger level.

Q: Is this public yet? A: No, not yet. Still doing cleanup and planning.

Q: You said you dedupe records to prevent retrieval of two records for the same thing. So which do you show them? A: Probably the MARC record with a link to the full finding aid. But that brings up the problem I just discussed of how you show where their search term appears in the finding aid.

Q: So why wasn’t the MARCXML being used? A: That’s kind of the question that precipitated this working group.

Q: We have a very similar process at Duke for combining EAD and MARC using Endeca. Going to launch soon. We display the full container list in a separate tab, and provide highlighting there. It’s a bit clunky for really large finding aids, but…

Q: What is the difference between Primo and Aleph? A: Primo sits on top of Aleph.


FAÇADE: Future-proofing Architectural Computer-Aided Design

Presented by Tom Rosko, MIT, Head of Institutional Archives and Special Collections, at a meeting of the SAA Architectural Records Roundtable.

At MIT we have the Stata Center, as well as the Media Lab, architecture school, etc. Applied for an IMLS grant a few years ago on how to handle these new digital architectural records.

Current architectural data is being lost, particularly 3D data.

Staff includes head of the architecture library and several other people. I was not formally part of it; I was brought in to consult.

Challenge: to develop long term archival strategy for digital archival records, particularly 3D ones. Also to develop strategies for using DSpace to do this. And ways to capture and present data.

The use of 3D CAD has made things increasingly complex in the architectural world. The BIM (Building information modeling) concept has also increased complexity – increases interrelationships between different types and formats of data. And there are not a lot of standards for how these things interoperate.

As project progressed, realized how interrelated data is, and how just preserving the 3D models alone may not be enough.

Architectural firms tend to think more about getting the project done than about long term archiving and reuse. Q: What about the as-builts and other deliverables to clients? Discussion: Contracts are starting to spell this out, including what kind of digital files to deliver, but practice is inconsistent. Some also require hard copies, with the idea of scanning those if the digital files become unreadable.

Tom: Firms are starting to recognize this issue.

The project also looked at potential audiences in addition to Practice (architects, designers, engineers): for example, researchers, historians, scholars, instructors, students, and the general public.

Developed use cases for what uses we thought each type of user might make of the materials. Created advisory board and consulted other audiences.

Content for this project: 3 data sets. Frank Gehry, Stata Center at MIT (2004, CATIA); Moshe Safdie, US Institute of Peace (2009, Revit); and Thom Mayne, Caltrans (2004, MicroStation).

These provided 100+ file formats, tens of thousands of gigabytes, almost no metadata, etc. The dataset was massive for each of the projects. They were complete project files, not just the final documents or end products. And some audiences want that stuff, so that played into the mindset of how to develop this.

Geometry – different ways of storing data. For example mesh versus arcs and curves. Parametric allows users to refer to features rather than underlying geometry.

The different software used varied in how they modeled and the geometry methods used.

Looked at open standards for model and geometry info (STEP, IFC, IGES, VRML, STL), as well as for display formats, including 3D PDF.

Various industry exchange data formats.

If CAD software only exports “inert geometry”, it doesn’t truly represent the complexity of the underlying 3D model. Does that matter? To the targeted audiences, it didn’t seem to matter that much. But it may matter to us.

How to manage all this data. Intellectually, went with the BIM idea – that unless you incorporate the relationships between different types of info, you will lose information. So used RDF XML ontology to model relationships. Developed Project Information Model, which links together all types of info in a relationship map. (see slide 35)
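The notes don’t capture the actual ontology, so this is only a rough Python sketch of the general idea of a relationship map: statements stored as subject/predicate/object triples, with a lookup for everything connected to one item. All identifiers and predicate names below are invented for illustration and are not taken from the Project Information Model.

```python
Triple = tuple[str, str, str]

# Hypothetical statements linking different kinds of project files.
triples: list[Triple] = [
    ("model_A", "depicts", "roof_system"),
    ("drawing_12", "derivedFrom", "model_A"),
    ("email_7", "discusses", "drawing_12"),
]


def related(item: str, graph: list[Triple]) -> list[Triple]:
    """Return every statement in which the item is subject or object."""
    return [t for t in graph if item in (t[0], t[2])]


for subject, predicate, obj in related("model_A", triples):
    print(subject, predicate, obj)
```

Preserving only `model_A` would lose the two statements above that tie it to a building system and to a derived drawing, which is the kind of loss the BIM argument is about.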

Slide 36 – properties on objects. Every file gets five properties: Project Phase, Building Zone/System, Architectural Discipline, Document Type, File Format
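As a way to picture the five basic tags, here is a small Python sketch of a per-file record. The field names follow the five properties listed above. The class name and all sample values are assumptions, since the project’s real vocabularies are not in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTags:
    """The five basic properties the notes say every file receives."""
    project_phase: str
    building_zone_system: str
    architectural_discipline: str
    document_type: str
    file_format: str


# Hypothetical values, for illustration only.
record = FileTags(
    project_phase="Design development",
    building_zone_system="Roof",
    architectural_discipline="Structural",
    document_type="3D model",
    file_format="CATIA",
)
print(record)
```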

These were basic tags. More important specially curated documents would get more tags. 3D models and 2D drawing sets, client presentations, etc.

Developed concordance of information. What formats existed, how many of each, and an initial appraisal attempt. (slide 39)
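A first pass at a format concordance could be as simple as counting files by extension. This Python sketch assumes the extension is a usable proxy for the format, which is rough at best. The function name and file names are hypothetical, and this is not the project’s tooling.

```python
from collections import Counter
from pathlib import PurePath


def format_concordance(filenames: list[str]) -> Counter:
    """Count files per lowercased extension; files without one count as ''."""
    return Counter(PurePath(name).suffix.lower() for name in filenames)


# Hypothetical file names, for illustration only.
files = ["model.CATPart", "plan.dwg", "section.DWG", "memo.doc", "README"]
counts = format_concordance(files)
for ext, n in counts.most_common():
    print(ext or "(none)", n)
```

The appraisal step would still need a person: the counts say what is there and how much, not what is worth keeping.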

Slide 40, Curators Workbench – a custom tool that allowed someone to go in and view some of the material, make decisions, and add additional appraisal info. Had grad students from school of architecture working on this. Students also converted file formats where needed, and library staff helped with some of the metadata.

DSpace used for preservation, dissemination, and access control. Also used FACADE UI external to DSpace, bulk ingest tools (curator’s workbench, DSpace packager importer). Format registry used for tracking file formats.

Good diagrams on slides 42ish re: data processes.


For presentation used SIMILE Exhibit and Timeline tools, and the Longwell RDF-based faceted browser. Presentation shows screenshot.

[Outcomes. Demonstrates use of open source software to solve this sort of problem. The ontology developed for this project may be applicable to other environments.]

Challenges from an archivist’s view: When this was just the 3D files, we thought about what the rights to the material are (intellectual property). Not just use of the material, but sharing it with the public. When we go beyond drawings to all documentation, there is correspondence in there that may have its own IP issues.

Then there’s the display end - what will work, and how will it work.

And a lot of the other issues we’re routinely dealing with on digital files.

Grant finishing up in Sept, so winding down now. Looking at what to do for follow-up now --- including grant proposal for 3D CAD, and setting up another pilot instance this fall.

Opportunities: tools like curator’s workbench can help us address large “data” files. And interface developments (how to view and search materials) also applicable.

The major hurdles are scalability and IP. How big can it go and how can you sustain it, and what can you show to others – how much will we have to restrict because we just don’t have the answers.

Several questions relating to security implications of having this much construction data searchable in digital repository. Not really addressed in this project.

Q: Seems antithetical to MPLP because you are creating massive amounts of new metadata. A: [Can automate a lot, but there is still a lot of work. ]

Q: Would a better solution be an active records management system, so that the complexity can be handled more on the front end?

Q: Is there any data not currently being supplied that should be folded into national CAD standards? This would be a good thing to think about – the AIA is involved in this.

Q: There is similar work in “pedesink” (sp) that oversees STEP standard for longterm retention of CAD, CAM, CAE.
