Sunday, November 16, 2008

Opening IT Up: Using Open Source Software

Carla Schroer, Cultural Heritage Imaging (a California nonprofit corp.). 20 years of open-source experience. Carla at c-h-i dot org, http://www.c-h-i.org

About me: 20 years in Silicon Valley, 13 at Sun., etc.

Open Source is a licensing model for software. It tells you nothing about the actual software, how it was developed, whether support is available, etc.

Open source licenses have generally been approved by the OSI (Open Source Initiative). Approval requires free distribution, available source code, and that people are allowed to modify the source code. There are several major types of license:

Permissive Licenses (BSD, MIT, Apache)
Copyleft (Mozilla, GNU LGPL, European Union Public License (EUPL))

Strong copyleft licenses - GNU General Public License (GPL).

What distinguishes them? What you get to do with the code downstream. Under a permissive license, I can take the code and relicense it under another license, etc.

Under a copyleft license, by contrast, I have to release any modifications under the same license. Can't go closed/into commercial thing. There are two flavors. In lightweight, it's file based. So I could fix bugs in one file and release it. (But could release a plugin under any license I chose.)

Strong copyleft licenses mean that it is viral. Anything that touches licensed code required to have same license. Strong copyleft licenses are project based and can affect code beyond the original licensed code.

Update: Carla wrote me a note clarifying her point on this:

There is one thing in your description of the licenses part that isn't quite right, and makes a difference to me. This is in the area of strong copyleft licenses. You say:

Strong copyleft licenses mean that it is viral. Anything that touches licensed code required to have same license.

While I think a lot of people believe this to be true, I strive really hard not to go that far in talking about the copyleft effect of the GPL licenses. My slide, and I hope my presentation, said that under some circumstances the copyleft effect can go beyond the original files. I didn't have time to get into the vagaries of this in a short overview, and I recommended some resources for folks that wanted to understand the boundaries a bit further. I think the GPL licenses are really valuable for some situations, and that folks are unduly scared away from it due to fears about the extent of the copyleft effect. It is true that it can affect files that are not part of the original GPL code, but I wouldn't go so far as to say that it affects everything that touches it. (the license talks about "linking with" and also the act of distributing code together can trigger the effect). There are different legal interpretations on what circumstances trigger the requirement, though some things are quite clear.


Permissive licenses allow much use with few restrictions. Copyleft licenses ensure that work based on the licensed code remains open.

These are not all compatible -- you can't combine some of them. The Software Freedom Law Center writes a lot about this. There are very few legal precedents, so some of it is still up in the air.

Choosing software: You should ask the same questions you would ask for Open Source as for non-open source. An important thing is what is the cost of switching if it doesn't work out. Also what's the TCO? The cost of the license is one factor. But you still have all the other costs -- training, sustaining, maintenance, etc.

Probably the most controversial thing I'll say this AM. I believe that open source can help mitigate concerns about adopting new file formats and standards. The ability to adopt a new technique is mitigated because the code is open source -- because it's open, it's more likely to be available and usable in future.

CHI Ongoing Collaborations (include Worcester Art Museum, etc.) I'm going to talk about a specific project where we used a leadership grant from IMLS and built a software tool. Worked with a team from UC Santa Cruz. Their standard license for having grad students work requires the U. to have all IP rights. Took some serious negotiation. By contrast, the Italian National Research Council probably wouldn't have been involved unless it was an open source model. An important thing is that everyone knows what the terms will be from the beginning.

Another issue is copyright ownership. So we ended up creating a joint copyright with the people who write the code, which gives us the rights to license and still lets them do stuff with it.

Open Source can be a tool to get people working together for the common good. License terms need to be agreed upon up front. And you need to make sure you choose collaborators with same goals.

Open source is a tool, not a religion. There are a range of licenses, good for different things.
Carla at c-h-i.org, http://www.c-h-i.org

Christopher J. Mackie, Associate Program Officer, Research in Information Technology Program, the Andrew W. Mellon Foundation.

Why Open Source? (My views don't represent foundation, if I say something egregiously stupid, it doesn't represent the foundation!)

We think we are the largest NGO funder of open source software. (We'd love to see someone bigger doing that.) We have 30+ projects with users and developers on every continent. http://rit.mellon.org. All major media sites picked up some visualization tools for historical timelines we made for historians and used them in their coverage. All is open source, most is community source.

Open Source isn't always a bunch of people working in their garage in the evenings. It can be big business.

Open source is a licensing scheme (or sometimes a religious belief); it's not

a guarantee of freedom or success
a sustainability model
a software architecture (in itself)
a single organizational model
a (good) technology strategy in and of itself.

So why use it?

Open source is governed by and for users. Can't be bought, closed. No one can buy project and foreclose it or require upgrade. Ownership is key.

Some people want open source for cost savings. This can happen if you try, but not always primary motivation. Biggest reason for adoption is risk management. It diffuses risk of new development across many institutions. (As long as project is well-run and viable.)

Why not? The Mickey and Judy model. Local optimization/Competency trap. (My mom's got costumes, your dad has a barn, let's put on show!) Bad tech strategy is bad strategy --doesn't matter if open source.

OS tends to cause developer centrism instead of user centrism. Stakeholders want to know what it does for me, and many OS projects have trouble answering that.

OS software brought in badly can subvert an overall technology strategy. Must be smart about it.

OSS Business Model. Most projects don't have one. Some have dual licensing. There is good (dual commercial/noncommercial). If vendor is well intentioned, these can be valuable. Evil is platinum edition, where the OS version is just a crippled version of the good one, which is proprietary. That's not really open source. There's the services model, where software is OSS but you pay someone to support it. There's the appliance model, where you are sold a box. There's hosting, where the vendor does everything. And then there's software as a service. Like hosting, but the vendor sets up internal infrastructure differently.

Varieties of OSS. Traditional is developer driven. Can be a terrific model if customers are developers. But may fail if developers are not final customers.

Single vendor driven - many companies allow you to download, but then try to become a monopoly vendor for it.

Most of the benefits of OSS only come if you can fire the vendor without firing the software.

And then there's what we support, which is community source, functionally driven. Collection Space is one of our projects like that.

Community source software is designed and built by and for the functional specialists for their community. We have ways of having these people get together and design a blueprint. Community owned -- when the Collection Space project is done, it will turn over IP to a foundation. It's community sustained - by contributions, and/or by a healthy vendor ecosystem. And it's state of the art in governance, tech, sociology... and sustainability.

We've done XX many projects, many of which are out of funding, and none of which have died.

What does it take? Not wealth. Vendors make it affordable. Ongoing support costs are 40-60% less than commercial, and vendors allow people to buy support instead of hiring developers. Mostly it takes organizational buy in and a strategic technology plan.

Strategic Plans and critical mass -- CriticalMASS (Mission, agility, sovereignty and sustainability) No strat. tech plan is a plan, but a really bad one.

Strategic Software: Gain buy in from stakeholders and executives. Contextualize CriticalMASS values for your institution. Evaluate resources holistically. If you do all this stuff, we think you'll end up with community source. But if you don't, that's fine.


Carl Goodman, Senior Deputy Director, Museum of Moving Image
Cgoodman at movingimage dot us

PI for Collections Space project
www.collectionspace.org

Getting comfortable with OSS. You're not on your own if you use OSS - you may have more support. The Palmolive principle - you're already swimming in OSS everywhere. WordPress, Firefox, Drupal, programming languages, shopping carts (osCommerce), course management (Moodle).

Crisis breeds will to try new approaches. People who were more conservative may be willing to try new things.

Our project is not your father's open source project. This is not people in their free time creating spaghetti code.

Next generation web-based collections information, management and access platform that happens to be open source. But open source is not the defining point of the software. A number of partners on the project, and funded by Mellon.

Grew out of our work on a homegrown collections system called OpenCollection. Won an award for that. Led to larger grant. It became clear that we needed to step back and reinvent the software and work to create culture and community around it. Based on idea that managing, acquiring, disseminating collections is a core activity, as is getting them online, but difficult.

We are developing an alternative to commercial or homegrown CMS.

We want to leverage our university partners' culture of research and innovation, and their recognition that they have museums as constituents. Some museums do have developers, but they are booked. And most museums don't.

We're also trying to put UI design up front. The Fluid project, which we're working with, is working to create UI elements for incorporation into OSS. We in collections management deserve usable software!

Brought in 40 people representing 20 organizations to think about what a new collections system would look like. Transparency. This is hard for museums. We are doing this in a way so that the community can be involved at any stage and see all debates, discussions, and mistakes. Project website and wiki are very detailed. We're taking a coordinated and highly structured approach to distributed software development. We have 12 people working full time on it, on a 2.5 year project.

We're decoupling various aspects of the project from each other so that they can innovate and still stay in touch with each other. The functional team is working on the needs requirements for the first phase of the project. There is a design UX team looking at user interfaces, use cases, etc.

The technical team is working on underlying architecture and tools.


Timeframe for all this. Tech platform Dec. 08. Development begins Jan. 09. Tire kicker March 09.

Sustainability- we are working on this as well.

There is a lot of OSS inside the system. Java/PHP or Python or Rails, Fedora, etc. ImageMagick, etc. And we hope parts of our project end up inside other projects. For example project OLE at Duke, or Bamboo. Omeka, Pachyderm. Other projects are further along, and have been helpful to us even though in a different domain.


Brad Westbrook, Archivists' Toolkit Project Manager, and X at UC San Diego. MLS from UCLA, and MA in English from SUNY Albany.

Overview of AT and how it became OSS, and some of our lessons learned. Mary contrasted us with Collection Space, and noted that we're more mature. True to some degree. But also more immature. They are already into sustainability, etc. That's something we came to belatedly.

It's an OSS RDB collection system. Purposely a staff-side tool - not targeted at external clients. Start up funding in 2002 by Digital Library Federation, and two development cycles funded by Mellon Foundation. Three public releases to date. Dec 2006, last January, and the last one was Wednesday night from this hotel. We are pleased with uptake. We have 40 or so institutions that have implemented it as a production tool and are available to other users as a resource.

Some institutions include Getty, Museum of Flight, Vermont Folklife Center, Bates College, Princeton, UCLA, etc.

Designed to address key problems in archives domain. Serialized processing tools. One task done by one tool, another by another. A lot of redundant data entry and siloing, as well as inefficiency and increased training cost.

Also resulted in data with low interoperability. The Online Archive of California found this out in 2001 when they looked at the EADs that had been submitted. There was tremendous variability despite EAD being a national standard.

Also thought the tool could help reduce growing archival backlogs.

Solution was to build a program that would promote standardization (DACS, ISAAR (authority recs)).
Supports export standards like EAD, MARCXML, METS, MODS, DC.

Also promote efficiency by integrating functions, enabling repurposing of data, automating encoding and reporting, and providing customization features. In the end, we thought it could decrease the cost of processing and improve sharing across community.

So why Open Source? We wanted it to be affordable. We wanted it to be on an enterprise quality database. Oracle and SQL Server were cost barriers for organizations. So chose to go with MySQL. Also, we wanted flexibility and adaptability. One of the complaints is that a vendor will give you SW that does something, but you can't modify it -- you take what you get. So if it meets most of your needs, you take it and adapt. We wanted a tool that orgs could modify. We liked the community volunteer model. Giving the users oversight for development priorities and feature requests, documentation, etc.

We're released under the Educational Community License 1.0. Did this belatedly. Finally after wrestling with Apache or GPL. But felt ECL was favorably received by funder (we thought) and allowed certain commercial opportunities that might not be there with GPL. Used a lot of third party OSS to build out, and we began transition to user governance. We've shifted from sharing requirements with a small group of specialists to doing it with the community at large.

We worked with SAA to establish workshops for training. There have been 5 in 2008 so far. We'll begin usability testing using AT users in NYC area. And we petitioned SAA to create an AT roundtable, which will hopefully become the seat for users to take over governance.

What we've learned? Match between product type and open source strategy. We probably at the beginning thought we'd have developers climbing out of windows hoping to contribute. That hasn't happened, and we've finally come to the conclusion that this is because it's targeted at a very small group of users, and contributing requires domain knowledge. So we will probably not have many developers contributing.

Another thing is to start sustainability planning and think about license much earlier in process. We ended up with a mixture of code with some GPL and some not, which limits what we can do.


Q: How do you measure sustainability and suitability? A: There are some indexes out there like business readiness index. But these only rate big projects, are subjective, etc. So hard to use. Important to look at how many people involved, how active community is, etc. Ideally you want a community with a diversity of vendors. A project where one institution does most of the work has a single point of failure. Etc.

Q: (Ari) Is there an equivalent for Code4Lib for archives/museums, and/or could this be expanded to code for cultural heritage. To help with developer domain knowledge problem? Bill - aware of code4lib. We've had a presence there due to Mark McKenzie. We've started announcing releases there. How we can take that further.

My comment: Importance of a good plugin architecture, documentation and example code to getting community development. Much harder if you have to understand entire project architecture just to make a small change.

Q: If Firefox dies, you lose your bookmarks. If your collections system dies, you lose everything? A: Disaster planning is important. You have to think about that. One thing people say is "with commercial I can always call someone." But that assumes that they will answer the phone at 5 am on sunday, that they will have an answer quickly, etc. Ideally your disaster plan should be somewhat beyond and independent of just calling someone.

Q: Ari - we started using ATK, but were flummoxed when we realized it didn't have built in integration with Fedora. What kind of planning is going into this sort of integration? A: We've put some thought into this, particularly with Fedora and DSpace. Haven't done a ton of planning beyond that - but would welcome community contributions on that. A: CollectionSpace is taking a lot of pains to make sure that we will be open to this type of integration. One of our use cases is Fedora at UC Berkeley. So instead of thinking of it as a system we're trying to see it as granular and open to these types of combinations.

Q: You talked about single vendor projects. Can you give an example? A: One good example is Zimbra. If you want to run alongside an Exchange server (which is how most people use it) you have to buy from them. 35b went into OSS over last few years, and much of it went into these single vendor solutions.

There's another strategy where marketing material says it's open, and what they really mean is that they have a proprietary API you can write to.

Q: Adobe Flex? Q: Like PDF is open source? A: PDF is an open standard, but not open source in that Acrobat is not open. But the standard is open. That's an important distinction.

Q: I've found that OSS software is held to higher standard than commercial - should be cheaper, offer more flexibility easily, etc. How do you manage this and help with integration and acceptance? A: There tends to be an overselling to leadership. A lot of that is our own fault. People in their zeal to make case overpromise. There isn't a substitute for educating your own leadership. They have to understand what is and isn't deliverable. People go in and say things that they won't be able to follow through on. The easy answer is don't do that. But it's hard, because you're facing the uphill struggle of adopting this new model. The way to short circuit this whole dynamic is to start with a strategic technology plan. If you do that and they're making informed decisions, I expect in most cases with rational people they will understand.

Q: People talk about free as in beer or free as in speech. Well, OSS is free as in kittens. That's a good model to follow.

Q: I come from the other side - our leadership is saying "yeah we need OSS because it's free." But that's a simplistic attitude. Need more strategic planning.

Q: Why isn't strategic technology planning more common? A: Technologists aren't always trained to do this. And at a large organization you learn how to do this. But as CTO in small museum, you may stumble on this, but you're less likely to be trained to do it. And a lot is about opportunity. One reason OSS projects are powerful is that they allow small institutions that don't have resources internally to find resources at the community level and bootstrap themselves. Big orgs often come in for their own reasons -- to demonstrate need for internal developers, etc. Smaller organizations without these resources come in later, but save more because they don't have to do everything internally.

Good end note: need for strategic technology plan!


Omeka: Bringing Collections to the Web

Sharon Leon, CHNM, George Mason University


We noticed that people (especially small museums and historical societies) were struggling to bring their collections online. We wanted to build a web publishing system that was targeted at curators, small organizations, etc.

Anyone who is familiar with WordPress will get the basic idea. The software is based on Dublin Core (unqualified), and supports themes, etc. Last week released the .10 beta with new, redesigned site. We have a growing development community that we'd love for you to join. The API is now set, so those who want to add plugins can now do that. There's a geolocation plugin, an iPaper plugin, a contributed content plugin, etc.

We'd also love people to contribute themes. The system ships with 11 core layouts for exhibit builder. My colleague Sheila Brennan is sitting at the table and has a live demo.


Collection Space: A Next Generation Collections Management System

Megan Forbes, Collection Information and Access Manager, Museum of the Moving Image

A collaboration between a variety of institutions with the goal of creating an open source web-based collections management system. Began with user design workshops earlier this year. The current project team includes Museum of Moving Image, UC Berkeley, Cambridge U., and U. Toronto.

We are having a very open development process -- all work is done publicly, posted on our wiki, etc. Patrick Schmitz, a developer, will also be at the table for questions.


Redesigning your intranet using open source software (MediaWiki)

Erin Weinman, IT Application Manager, NMAI

Our problem: A few years ago we had an intranet that was 75% out of date and seldom used. And we had a major need for communication between our two museums in NY and DC and our collections center in Suitland, Md.

We had a short timeframe (6 months) and limited cash (less than 80k). We had an on-site contractor suggest open source - specifically MediaWiki. Put a prototype together in 8 weeks. Liked what he did and told him to keep going. Meanwhile, we did an internal survey and realized we didn't need to do everything. We did it for about 55k (including internal staff and contractor hours).

We used MediaWiki on Windows 2003 with PHP/MySQL and XHTML, CSS, JavaScript, and AJAX. It's called Ohana, which means family in Hawaiian.

It is integrated into the existing intranet; we redid the MW icons and applied a design. We liked a tabbed navigation, so we added that, as well as a management dashboard. One of these runs off of WebTrends. The others run off Excel spreadsheets (for things like visitor stats). Our project list: we created a template for this to make it easy for our IT staff and executive office project managers. They fill in the form, and then it comes up on the project list.

Our current effort, version 3.0, looking to release in early 2009. We're adding a calendar that would pull from multiple systems. We built a way to pull into one view on Intranet. Add possible webcam of museum activity, etc.

More popular pages are the staff directory, the feature story, and forms. 85% of staff have logged in, not just used it anonymously. We did a survey and got a 52% response. 60% use it as browser homepage, 70% check daily, etc.

Staff directory -- each person can add/update their own photo and directory info. (We and HR keep an eye on it.) There's a feature story that is updated every few days. Philosophy is participation, not publication. Anyone can post anything without approval or permission. We've never had a problem. We've never had to take anything down. It's secure, we have login.

New content and update alerts are sent to project manager, and if there's an issue everything is recoverable.

The intranet gets better the more people see content and add content.

Average 16 staff hours per week to manage content. (Responsibility of a GS11 staff member, about 50k salary.) An average of 8 contractor hours/week (40.year). Plus about 120/year of web design support.

Some observations: You can teach non-technical people MediaWiki code pretty easily. Some of our strongest users are "non-techies."

BUT... Does take time to break resistance to change. People cling to old ways of info sharing like mass e-mails with attachments, and use of public drives for document sharing. There is a need for someone to provide oversight and training.

Serves as portal for DAM system, collections system (for curators), etc. Did want to bring online collections to intranet, but priority went to getting it up on external website. Next we'll do an internal view that includes things that couldn't be presented externally, but can be presented internally.

Not out of box: design is custom. There is some customization on the page where the editing tools are. They give you five or six tools. We wanted to add icons for bullets, numbers, tables. We used existing code and just put it behind the buttons.

Forms - they fill in the form, then it rewrites the content into MediaWiki pages. Project listing pages are locked, so that prevents editing there.

Project has been up for 2 years. Basic Mediawiki design for first 5 months, then redesigned. Now working on redesigned version. (We will be getting rid of tabs because people told us that they didn't realize things were underneath them.) The cost of the redesign, which will come out early next year, is 40-45 hours of web designer time, as well as some time for the custom coding for the calendar integrator gizmo, the slideshow thingy, etc.

When we first put up the phone directory, it kept falling out of date. Because HR was responsible. Now everyone is responsible, and it stays far more up to date. Photos are a huge help - because people look up the pictures of who they're going to meet with. If we were doing it well it would be great to make that round trip.


steve.museum

(came in late)

Jennifer Trant. Archives and Museum Informatics


Could tagging and folksonomy improve access to art museum collections online?

When someone wants a particular painting and doesn't know details. We tend to use the "things you happen to know in your head" retrieval method. But the requester already knows a lot about it -- just not the name or how to find it.

What JP knows or the Met Curator knows -- there's only one word worth of intersection. Our question was whether volunteer supplied tags could help bridge this gap.

How do users search online collections? For example, if you search for "shark" in the Met's catalog, you'd only find one, and it's not the one you saw in the gallery. Homer's sharks painting is well cataloged. But the curatorial analysis is all about the price the piece commanded when it was sold. A very significant historical fact, but not what visitors remember. We need a way to get to the sharks!

Curators and art catalogers don't think the way the public thinks. They don't describe what's in front of them because they know that already.

The Steve experiment had 1784 works from 11 museums. An attempt to be as diverse as possible in terms of art museums and range of media. 1700 works tagged 36k times. There were two installations: one at steve.museum, and a single-institution installation at The Met. People who just stumbled on it at steve.museum did 22 tags each. People invited by the Met to help on its site did 82 per. So there's a sense of the loyalty that can help the organization.

Need to look more at what makes people do the tagging and what makes them stick around. There's a whole question of motivation that we didn't really address at all in the research question.

Different types of work attract different quantities of tags. Some things are just harder to tag. A square white canvas is far harder to tag than a representational painting.

Tags were 84% novel and 16% redundant. This was surprising. Most tagging research shows a quick clustering. But there were 36981 tags, and 311,944 distinct. But when you line up words with tags, there are 31k. [Not sure I understand what she means by this.]

We wanted to know whether tags differ from museum documentation. The middle of the Venn diagram. 86% of the tags were new terms! This is a really significant number. Less than 15% overlap with what we know already.

We looked at where they did match, and it's in object type. People agree that something is a photograph. But not much beyond that. We're not seeing taggers supply creator names.

Top 100 not matched in descending order: subject (39), genre, color, geocultural, material, style/period (few)

Are tags in other forms of museum documentation? Label text? Docent talking points? Scholarly articles? Etc. So we got those in various genres. We found frequent false matches -- the tag would match, but not in the right context. (E.g., an article would mention "house he grew up in", not the house in the painting.) We didn't find much difference between scholarly sources and public sources like press releases.

What we came away with was that art historians write humanistic texts that are difficult to mine. They are not necessarily the right words.

So to help our colleagues understand whether they would like to include tags in their indexes, we looked at whether they would be useful in supporting search. Every team in every participating museum looked at every term-work pair and evaluated whether it was good. But we didn't want to be judgmental. So we asked "If you found this work using this term in a query would you be surprised?"

88% of the terms were seen as useful. And 46% of the users ALWAYS contributed useful terms. We wondered if some users were always just crap. About 6% of users never contributed a useful term. But the majority were useful. (The evaluation was also affected by corporate culture, because some curators probably graded harder than others.)

For shark painting, there were only two terms judged not useful. And one turned out to be a dutch language term that was appropriate.

Any term that was assigned more than once was up to 97% useful. Any assigned three times was almost always useful. So if two people say the same thing about a work, then it's almost definitely worth using.
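
The repeat-assignment finding suggests a simple filter. A minimal sketch in Python, assuming tag assignments are available as (work, user, tag) triples; the function name, the sample data, and the threshold of two users are illustrative assumptions, not part of the steve.museum tooling:

```python
from collections import defaultdict

def repeated_tags(assignments, min_users=2):
    """Return {work_id: set of tags} for tags assigned by at least min_users distinct users."""
    users_by_pair = defaultdict(set)
    for work_id, user_id, tag in assignments:
        # Normalize case and whitespace so "Shark" and "shark " count as the same tag.
        users_by_pair[(work_id, tag.strip().lower())].add(user_id)

    kept = defaultdict(set)
    for (work_id, tag), users in users_by_pair.items():
        if len(users) >= min_users:
            kept[work_id].add(tag)
    return dict(kept)

# Made-up sample data, for illustration only.
sample = [
    ("work-1", "u1", "shark"),
    ("work-1", "u2", "Shark"),
    ("work-1", "u2", "shark"),  # the same user repeating a tag is counted once
    ("work-1", "u3", "boat"),
    ("work-2", "u1", "boat"),
]
print(repeated_tags(sample))
```

Counting distinct users rather than raw assignments keeps one enthusiastic tagger from promoting a term alone.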

Need to think about this in terms of interface design, though. Because if people see a term is already there, then they may not type it again. Need to think about the implications of interface on algorithmic evaluation.

Are tags in museum vocabulary sources? AAT and ULAN? 70.2 percent matched on some level to AAT. But when we did distinct match, only 37%. And when we looked at the ones that matched they were all in certain facets (materials, styles), etc. A lot of things match in materials, but it's only a few terms used over and over again.
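
One reading of the gap between the two percentages is that the first counts every tag assignment and the second counts each distinct term once. A minimal sketch of that difference, assuming that reading; the tag list and the two-term vocabulary are made up and stand in for the real tag data and AAT:

```python
from collections import Counter

def match_rates(tags, vocabulary):
    """Return (share of tag assignments in vocabulary, share of distinct tags in vocabulary)."""
    counts = Counter(t.strip().lower() for t in tags)
    if not counts:
        return 0.0, 0.0
    vocab = {v.strip().lower() for v in vocabulary}
    total = sum(counts.values())
    matched_total = sum(n for term, n in counts.items() if term in vocab)
    matched_distinct = sum(1 for term in counts if term in vocab)
    return matched_total / total, matched_distinct / len(counts)

# Made-up data: a few material terms repeated often, subject terms used once each.
tags = ["oil", "oil", "oil", "oil", "canvas", "canvas",
        "shark", "storm", "rescue", "waves"]
vocabulary = ["oil", "canvas"]

by_assignment, by_distinct = match_rates(tags, vocabulary)
print(f"{by_assignment:.0%} of assignments, {by_distinct:.0%} of distinct tags")
```

A handful of heavily repeated matching terms lifts the per-assignment rate well above the per-term rate, which fits the observation that the matches are only a few terms used over and over.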

Matching with ULAN was worse, and matching was really problematic. Lots of matches for people named White -- but the tag "white" wasn't getting at artist name. So that's not good. So ULAN was a real loss. We didn't find a bunch of closet art historians out there identifying works.

The last question - are these useful for searching? We wanted to know how the tags related to search terms. We didn't know how they related to documentation either, and that's complicated because we don't have completed sets of data for all our works. We looked at search logs from two collections Minneapolis and SFMOMA.

Surprisingly few terms matched search terms. Mostly artists. There are some problems with the log data because it's searches of the entire site. Everyone wants a job at the museum, so that's the top search. And how to find SFMOMA is also a top search. We need to find a cleaner data set that's just collections searches.

At SFMOMA it was mostly artists. At MIA that is not the case. Contemporary art is more egocentric. When we categorized the searches we saw more intersection.

(laptop battery died, so didn't take notes on end of session.)


The Logistics of Extensive Data Standardization Projects

Carrie Beauchamp, National Museum of Natural History, Dept of Anthropology


We have 2.5 million specimens, 557k catalog records, from all over the world.

Cataloging started when SI was founded in mid 1800s. First system is series of books - numbers start at 1, go up to 600k. Write down number, where it's from, and drew nice little pictures. Later they started a card file system with the rudimentary fields.

Computerization started in the 70s, and basically was just typing in cards. So lots of idiosyncrasies.

Fields in old database:

HRAF code (basic region, an early standard)
Storage area (plains indian, central asian, etc.)
Culture 1 (Area)
Culture 2 (Culture)
Culture 3 (Subculture)
Culture 4 (Band/clan)

In new database, top three fields were dropped into culture 1, with the idea that it would be more standard, and would be cleaned up. New database also pulled them into an authority file, which is helpful when working on standardization. Deciding whether a term exists is different than deciding whether to apply it to a given object.

Current yes/no is one of the qualifiers on listing of culture/ethnicity in new system (Emu).

When different culture names were merged, duplicates were created, because there are all these different standards mixed together. Ex. 23 culture names for Zuni and Hopi. Should maybe be 6. We also have regular problems: dates and sites mixed in, synonyms/alternate spellings, same term, different level.

6k culture records at start and 60-70% shouldn't be there!!! When we talk internally, we call it cleanup, but to curators we say "data standardization." Because cleanup implies that the IT guys run a magic script, whereas standardization labels it as an intellectual exercise.

Why? Looks better, improves search efficiency, improves data entry efficiency. Already a problem when people had to call me to get answers because of difficulty of searching. And this is a bigger problem now that data is on web!

In some cases you can define a subset. In this case, it's basically all of our records. So our goals are to eliminate the typos and dupes, identify synonyms, verify terms, and establish a 2-tier classification system. The goal is NOT to ensure accurate identifications, and not to fill out catalog records that are missing data. We have 500k records, so that would be a much more massive project.

Resources: Published works, content experts. There is not really a good published thesaurus for this -- we used XXX, but the last edition was published in 1983. Some others, including Encyclopedia of World Cultures. Our content experts, but they don't cover all areas of the world, so will have to consult with outside experts as well. Our content staff had various levels of, um, technological facility. So for each portion of the world had to tailor the way they did the work.

You need to think about staff -- who will actually be working in the database, and need to understand the software - what tools can you use to make this easier? How are you going to get out what you need? And of course budget - which is zero! It's hard to fund things that aren't exciting, sexy, on web, even though this is what makes searching on web possible. How is your audience going to find things if you can't even find them reliably?

First question: clean as you go, or focused project? Clean as you go is tempting. But focused project is more holistic, creates an accountable timeline, and there is more energy and things are less likely to fall through the cracks or have people lose interest. People like to have a specific task that is their portion and turn it in and be done, and not have to follow project through long, convoluted process.

What we did -- and this is just one example, maybe not the best -- create worksheets in Excel (our staff doesn't tend to work directly in CMS), then a first pass/culling (fix obvious problems), give to curator, they mark up in pencil, give it back. (I don't recommend this, but it's how things are done in our department.) It's the simplest possible way to do this. I've seen all these other presentations where people are sucking data here and there. We're not doing that. I feel very low tech.

Ends up being kind of like a thesaurus - original term, and what we think it should be. Other considerations: documented versus attributed, and do you infer cultures based on localities?
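The worksheet described here (original term, and what it should be) works like a lookup table. A minimal sketch of applying such a table follows, assuming it has been typed back in from the marked-up sheets. The rows are invented spelling variants, not actual entries from the department's worksheets, and the names `normalize`, `build_thesaurus`, and `standardize` are hypothetical.

```python
def normalize(term):
    """Collapse whitespace and ignore case so trivial variants share one key."""
    return " ".join(term.split()).casefold()

def build_thesaurus(rows):
    """Turn (original term, preferred term) worksheet rows into a lookup table."""
    return {normalize(original): preferred for original, preferred in rows}

def standardize(term, thesaurus):
    """Return (preferred term, True) if the worksheet covers it, else (term, False)."""
    key = normalize(term)
    if key in thesaurus:
        return thesaurus[key], True
    # Unlisted terms are passed through and flagged for a curator, never guessed at.
    return term, False

# Hypothetical worksheet rows: (original term, term the curator approved).
WORKSHEET_ROWS = [
    ("Zuni", "Zuni"),
    ("Zunni", "Zuni"),
    ("Zuni Indians", "Zuni"),
    ("Hopi", "Hopi"),
    ("Hopi Indians", "Hopi"),
]

thesaurus = build_thesaurus(WORKSHEET_ROWS)
for legacy in ["  ZUNNI ", "Hopi Indians", "Unlisted term"]:
    print(repr(legacy), "->", standardize(legacy, thesaurus))
```

Returning a found/not-found flag keeps the intellectual decision with the curator: the script only applies choices already written on the worksheet.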

Culture area - do you use natural language, or codes? Natural language is easily understood. But codes are good because they are obviously just classification and not authoritative curatorial judgements. Alternate terms/spellings. Ideally we should have a thesaurus, but we don't. So do we pick one and only one term? If so, who picks it? Do we use the popular term, or the one people in the tribe like to apply to themselves? We tend to use popular, because we want people to find it, but that's a politically-charged question. Also, what to do with diasporas, religions. If the tag says Jewish, what is the culture area? Is an African-American part of African cultural area? Not really.

Kara Lewis, Collections Information program manager at NMAI
(and in absentia, Collections manager Patricia Nietfeld, NMAI)


1999-2004 - Moved collection from NY to MD, 2004 moved NMAI to mall, 2006, moved records into CIS, 2006, created 5-year plan. Comprehensive training, bringing in other departments, intranet, prototype collections site. A few weeks later, CIS team was told that getting info on web was top priority -- but 60-70% of data was incorrect or didn't exist. It wasn't standardized or searchable. So the expectations were high, and the three aspects we had to consider were time (as quick as possible), resources (as cheap as possible) and quality (as nice as possible). And everyone disagreed as to which of these was most important. In the end, we decided to compromise between time and quality. We wanted to get as much as possible as quickly as possible, but without putting up anything embarrassing. In the long run our ideal is to go out to all these communities. But we can't do that now. So we decided to take a bit longer than management really wanted, but not to go whole hog. We got some grant funding from an internal grant, and from a trust fund from our predecessor museum.

After discussion and surveys, we compromised: 5000 records on first launch, only records with images. (Most survey responders weren't interested in non-image records.) No effort to do new photography. (Most of collection has images already because of move.) We would not publish anything sensitive, and would use readily available data from publications and exhibit scripts. We would only include basic information (except addition of collection histories, because people want to trace provenance). Verify info with images. And make data searchable -- a great effort.

Even if data isn't perfect, we want to at least be able to find it. (Bowl, pot, jar, vessel) (slide very similar to my silverware slide) Wanted to make consistent.

This process had to move quickly, so I found myself adapting my own work habits depending on who I was working with. Some throw themselves into it (workers), some just want to answer specific questions (advisors), and some want to be in control of intellectual process (directors). Some want data in a way they can manipulate, others want it digested before they get it. You can't assume that everyone is going to fit with the way you're comfortable working.

A particularly good example: The cleanup of geographical data. Pat would structure the data and decide how she wanted to clean up. I gave it to her in a spreadsheet. Then the contractors would follow her process.

Background on the data. When we implemented Emu, most data came from two databases. All of the data was migrated into catalog module. Descriptions, measurements, links, etc. This module has links to other modules or tables with more detailed info like loans, events, exhibitions, etc. In our version, all geographic info went into sites module. The plus side was that you could record in depth info about particular places. But the downside is that you have to manage that elsewhere from your catalog records, and a slight bit more overhead in searching.

We knew when we migrated that much data was nonstandard. But we decided to wait until after migration to take advantage of some tools in new system, and to avoid having to clean data in two different legacy systems.

The data migrated differently due to an oversight in our migration process. Photo archives records each got a matching site record, even if it used the same term. Also had duplicate data in wrong fields, misspellings, strange punctuation, etc.

When looking at data cleanup for web, made decision to hit sites separately. It was huge, yet finite. We had someone in Pat who knew collection really well, and preferred working through this methodically. Plus we had all the card info migrated and they were easy to check against. Having one person do all of this meant that, whether right or wrong, we were at least moving into a totally consistent setup.

Produced an Excel spreadsheet with all unique combinations. There was no magic bullet. It was a long slog over years, including weekends. It's a testament to how much she cared about this. She wrote a 15 page summary about process. Conventions included no abbreviations, names in language of country, alternate versions in parens, etc. Use first level political subdivisions, use current names, extraterritorial possessions=countries. Central America was hard - has no readily accepted boundaries.

Etc.

Did not make a lot of effort on US state archaeological site numbers. If there, went in, but not new effort. This was cleanup, not verification.

Started with spreadsheet, then split into smaller sheets by state or country. Then sorted by highest level term. Took orphan terms without higher levels and fit them where they belonged. Went through each sheet.
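The spreadsheet steps (all unique combinations, then smaller sheets by top-level term, with orphans set aside) could be sketched roughly as below. This is only an illustration of the idea, since the actual work was done by hand in Excel. The field names `country`, `state`, and `locality`, the sample records, and the "(orphan)" label are assumptions.

```python
from collections import defaultdict

def unique_combinations(site_records, fields=("country", "state", "locality")):
    """Return the sorted distinct combinations of the given fields."""
    combos = {
        tuple((rec.get(f) or "").strip() for f in fields)
        for rec in site_records
    }
    return sorted(combos)

def split_by_top_level(combos):
    """Group combinations into one 'sheet' per top-level term.

    Combinations with no top-level term go on an "(orphan)" sheet so someone
    can fit them where they belong by hand.
    """
    sheets = defaultdict(list)
    for combo in combos:
        sheets[combo[0] or "(orphan)"].append(combo)
    return dict(sheets)

# Hypothetical site records (place names real, site labels invented).
records = [
    {"country": "USA", "state": "Arizona", "locality": "Site A"},
    {"country": "USA", "state": "Arizona", "locality": "Site A"},  # exact duplicate
    {"country": "USA", "state": "New Mexico", "locality": ""},
    {"country": "", "state": "", "locality": "Site B"},            # orphan
    {"country": "Canada", "state": "Ontario", "locality": ""},
]

combos = unique_combinations(records)
sheets = split_by_top_level(combos)
for name, rows in sorted(sheets.items()):
    print(name, rows)
```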

Sources: Wikipedia, statoids.com, ITMB, National Geographic maps, countries' official websites, archaeological websites, indigenous peoples' websites, government agencies, malandia.com, Google, JSTOR, MAI publications.

Provenience data (most specific info) [term? arch site?]

Found TGN to not be helpful - did not have enough place names to be worth searching on. For US, used USGS GNIS. For Canada, Geographical Names Search Service (GNNS).
Others: GEOnet Names Server (GNS).

The worksheets, 83 in total. Added notes on when she made assumptions in notes field. When finished, she put these in a server folder for contractors. The contractors did not need to be content experts. I made the decision to make new records and then relink them, not to try to overwrite old data. Before doing anything in Emu, they practiced in our training environment.

Had contractors go through and fix groups of records. Involved working in two modules.
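The "make new records and relink, don't overwrite" decision can be sketched as follows. This is not how it was done in Emu, where contractors worked by hand across two modules. The dictionaries, the function name `relink_sites`, and the sample names are all hypothetical stand-ins for the catalog and sites modules.

```python
from itertools import count

def relink_sites(catalog, sites, cleaned_name_for):
    """Create clean site records and repoint catalog records at them.

    catalog:          catalog id -> legacy site id
    sites:            legacy site id -> legacy site name (never modified)
    cleaned_name_for: legacy site name -> standardized name from the worksheets
    Returns (new_sites, new_catalog); the inputs are left untouched.
    """
    next_id = count(max(sites, default=0) + 1)
    new_sites = {}     # new site id -> standardized name
    id_for_name = {}   # standardized name -> new site id
    new_catalog = {}
    for cat_id, old_site_id in catalog.items():
        clean = cleaned_name_for.get(sites[old_site_id])
        if clean is None:
            # No worksheet decision yet: keep the old link for later review.
            new_catalog[cat_id] = old_site_id
            continue
        if clean not in id_for_name:
            id_for_name[clean] = next(next_id)
            new_sites[id_for_name[clean]] = clean
        new_catalog[cat_id] = id_for_name[clean]
    return new_sites, new_catalog

# Hypothetical legacy data: three spellings of one place, plus one other.
sites = {1: "Ariz.", 2: "Arizona", 3: "arizona ", 4: "N. Mex."}
catalog = {"A1": 1, "A2": 2, "A3": 3, "A4": 4}
cleaned = {
    "Ariz.": "USA: Arizona", "Arizona": "USA: Arizona",
    "arizona ": "USA: Arizona", "N. Mex.": "USA: New Mexico",
}

new_sites, new_catalog = relink_sites(catalog, sites, cleaned)
print(new_sites)
print(new_catalog)
```

Because the legacy records are never edited, a wrong mapping can be undone by relinking again, which fits the preference for being able to watch the process and catch mistakes.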

Results: down from 90k site records to 15k, and module is now tightly controlled.

To get funding - show empty record and say "this is what you will get on web without funding" and then show complete rec and say this is what you'll get with funding.

My Q: Why not automate the worker bee part of the process? A: Hmm. Would have meant working with the vendor (kaching) and plus liked the control of being able to watch the process and catch mistakes.

Kara Lewis (Collections Information Program Manager), lewiskm at si dot edu, Patricia L. Nietfeld, Collections Manager, nietfeldp at si dot edu.












Back to the Future for Conservation Documentation

Introduced by Chuck Patch

Lynn Falin (sp) and Dave Thompson
Museum of Fine Arts, Houston



Grant from IMLS 3 years to develop collection documentation system. IMLS-NLG ACD (Art Conservation Database) Project, 2007-2010.

There are several projects going on in this field, and they are all helping each other.

Our goal was to develop a web application for conservation documentation purposes that enables the creation, management and dissemination of conservation data.

Photography Conservation Database (PCD): A SQL Server web application developed in house with a contract developer. Used by Photo Cons. lab since 2005. 10,124 entries. Core functions: Condition documentation, treatment documentation, preventive conservation recs, record search, and create reports.

At the same time this was going on, the Heritage Health Index report was published by Heritage Preservation. Spoke to the extreme need for more conservation, particularly in institutions that had no immediate access to art conservation. The IMLS response was the Connecting to Collections initiative. We realized that if we could develop a database that could be used by anyone involved with collections care, it would be a clear communication tool between individuals and conservation, even if they were miles away. And would be a headstart for a CAP report, because they would already have tombstone info on report, as well as general info on collection.

The IMLS recommended that we have a national advisory committee. Advisers included Murtha Baca, and two others, who introduced us to protocols like CCO, CDWA, AAT. There are many more documents out there, obviously, that can help establish clear communication.

Our first recommendation was to take time planning and look at current practice. A bit widespread. Paintings lab reports are very much text based. Even paintings conservator would say "I don't know how many times I used this adhesive." Without a database that's hard to do. She was desperate, so tried TMS. Unfortunately, the conservation module (at least in the version we had) was awkward and time consuming.

As we know from Boston MFA and Brooklyn Museum of Art, it is possible to customize TMS to come up with user friendly conservation documentation. But you need programmers. Our aim was to be as unproprietary as possible -- keep with the open code, open source mantra. So we can't be tied to any particular collections management system.

One lab (sculpture) does have a database, CDS-Documentation, by John Watson, conservator at Williamsburg Foundation. Is easy to use, but less controlled than we'd like. Mainly for generation of reports.

Then there was tracker -- you'll learn more about this today.

Our team is made up of conservators, registrars, and IT people.

Dave Thompson, Database Administrator, was brought on to help with this. We're very early on in this process. We don't have much concrete to share. So wanted to share more about our process and the Systems Development Lifecycle.

Planning phase was already done when I came in. We had moved on to analysis phase. The idea was to come up with system proposal. From here we'll go on to the design phase, and then can move on to implementation phase.

[Systems development/systems development stuff omitted here]

Unified Modeling Language (UML) provided standard diagramming techniques and commonly accepted notation.

Purpose of system isn't to tell conservators how to do their jobs -- it is to document what they do do. Some systems try to force them to work in a different way, and that's not what we want to do.

We're now in the process of trying to get our heads around what the system is supposed to do. Report centric model tended to force people to think in a certain way. We've tried to be object-centric.

Shooting to complete system proposal and RFP by end of March.


Ziad Alsukairy, Manager of Application Development at Harvard Art Museum since October, 2006.


Has been developing browser-based conservation documentation. Here Via GotoPC.

Project is our system, which serves as container for one or more artworks. Each can have a consultation, investigation, and treatment. Each of these can contain images and analytical documents. Also have the ability to create condition reports, which contain objects lab, paintings lab, paper lab.

System also tracks processes and workflows through several steps - proposal-work in progress - completed.
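The structure described above (a project containing artworks, each with consultation, investigation, and treatment records that hold images and documents, plus a workflow status) might be modeled along these lines. This is a sketch in Python for illustration only. The real system is described as Java/Spring, and every class and field name here is an assumption, as is the strictly forward-only workflow.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    PROPOSAL = 1
    WORK_IN_PROGRESS = 2
    COMPLETED = 3

@dataclass
class Activity:
    kind: str                                      # "consultation", "investigation", or "treatment"
    images: list = field(default_factory=list)     # image references
    documents: list = field(default_factory=list)  # analytical documents

@dataclass
class Artwork:
    object_number: str
    activities: list = field(default_factory=list)

@dataclass
class Project:
    title: str
    artworks: list = field(default_factory=list)
    status: Status = Status.PROPOSAL

    def advance(self):
        """Move to the next workflow step (assumed to be one-way)."""
        if self.status is Status.COMPLETED:
            raise ValueError("project is already completed")
        self.status = Status(self.status.value + 1)
        return self.status

# Hypothetical usage
project = Project("Example project")
work = Artwork("0000.00")
work.activities.append(Activity("treatment", images=["before.tif"]))
project.artworks.append(work)
project.advance()
print(project.status.name)  # WORK_IN_PROGRESS
```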

(Straus Center for Conservation)

There is a test system being used in production capacity, but some of the tools are still in test mode (like spell check, images).

Web/Ajax based, but server side business layer in Java/Spring Java Framework. Some tools like iTEXT, jSpell, PDFtextStream, MagicZoom. Data layer - there is a separate database for conservation, integrated with collection management (TMS) and the digital repository service at library. Right now the images are being stored on a filesystem, but eventually need to migrate them into Digital Repository.

Will create new project in live system to show you. First search collections management database. First search for owner/billing client in TMS. (There is an expenses module as well.)

Then add artworks to proposal. Search for an artwork in CMS via web interface. Queries, returned. Also searches conservation images and analytical documents. Can also add images from staging folder that the conservators have taken. At this point these images are associated with this project and the artwork that was just added. Can classify each as "before, during, or after treatment." These records are stored in the digital repository, but records in the database associate them with the art object. Metadata is also read off of image on ingestion.

Tool for magnifying parts of image. Also annotation view, where you can add notes on sections, like Flickr.


Chuck - Now going to see a system that has been in use for a few years at Philadelphia Museum of Art

Nancy Ash, Senior Conservator of Works of Art on Paper, Philadelphia Museum of Art. (Also programmer Thomas Murphy, who is here today.)

Title: The Conservation Tracker System - A Powerful and Versatile Database for Creation, Storage and Access of Conservation Documentation.

Project began in 2000, launched in 2005. Now have thousands of records (27k records, 3k individual objects). Not going to show you the old version we're using -- going to show the new one we're just finishing. Why revamp? We identified refinements that would make it more beneficial to us and other users. Redevelopment was sponsored by Mellon Foundation.

To develop old and new version, worked with consulting programmer Thomas Murphy. Built using Express App Framework. This is the basis of the new system. Not in true open source code, but gave much greater flexibility to nonprogrammers to customize.

Our original goals. We used another homegrown system in Filemaker first. Key goals were flexibility, incorporating word processing features and images, etc. It was essential that the system not introduce new work for conservators, and give them improved access to their records in object files. Goal was not to replace hard copy files. The database had to be able to interface with TMS. Also security was a basic concern.

Advisors talked about grouping reports around organizing events.

Conservation Summary Screen - Contains automated information about an object -- a summary, and will serve as conduit to TMS when we can import info back.

Windows based, using SQL Server, but could be easily adapted to Oracle or MySQL. DB interface programmed with Visual Studio 2008 along with some third party stuff. Also integrates with Word for some functions (templates, etc.). Uses data dictionary that enables customization of screens, etc. without programming.

Basic tombstone info is imported from TMS, and then becomes static. This is conscious, so that we can archive how the data existed when object was received.




