The Semantic Web In Practice
Koven J. Smith and Don Undeen
Metropolitan Museum of Art, New York
Interpretive Technologies Team
Tasked with building data infrastructure that would answer questions about collections, not just return lists of documents. In a system we looked at (didn't get name), results are statements of fact, not documents.
Application, when ready, will give unprecedented access into the museum's brain. To do this, need to present large amounts of data from multiple sources: CMS, DAM, bibliographic records, etc. But also Word docs, archival materials, artist letters, publications, etc. Need to present all sources as a unified whole.
So we create a list of assertions about resources. "Madame X: is a portrait. Depicts Virginie Amélie Avegno Gautreau (VAG), wife of Pierre. First shown at Paris Salon. etc."
Semantic Web is a network in which nodes are linked at data level rather than presentation level.
Goals: Find a place to store all of our unstructured content and harvest usable data from it. And pull records together from multiple sources into a single usable data store.
Started to look at Semantic MediaWiki to solve this problem. Allows you to add properties to links. For example, "AcquiredConcurrentlyWith".
Creating this sort of annotation in SMW is the process of creating a triple -- a simple data element containing a subject, a predicate (or property), and an object.
Subject: Madame X
Predicate (Property): AcquiredConcurrentlyWith
Object: Elijah in the Fiery Chariot
These can be expressed using RDF (Resource Description Framework).
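A minimal sketch of what that triple could look like once written out, using the N-Triples serialization of RDF. The `http://example.org/met/` namespace and the helper names are assumptions made for illustration; the talk did not give real URIs.

```python
from typing import NamedTuple

# Hypothetical namespace; the talk did not give real URIs.
BASE = "http://example.org/met/"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def uri(name: str) -> str:
    """Turn a human-readable label into a URI under the assumed namespace."""
    return BASE + name.replace(" ", "_")


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a line of N-Triples (all three terms as URIs)."""
    return f"<{uri(t.subject)}> <{uri(t.predicate)}> <{uri(t.obj)}> ."


acquired = Triple("Madame X", "AcquiredConcurrentlyWith", "Elijah in the Fiery Chariot")
line = to_ntriples(acquired)
print(line)
```

Every term becomes a URI, which is what lets triples from different sources refer to the same thing.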
But we have thousands of documents and no time to do all this annotation by hand. Started trying to use natural language processing to do this, but it was beyond our technical and financial means. So looked at a web service called ClearForest, now called Calais. Free, takes unstructured text and returns it as triples. Uses sophisticated NLP and machine learning techniques to return intelligent metadata.
So SMW and Calais give us a framework for storing and "understanding" unstructured text. But then we're left with the problem of how to relate data from multiple sources.
When you have a Semantic Web network, you also have a way to indicate classes of things. Madame X is a painting, John Singer Sargent is an artist, 1884 is a year. These are called classes. Also, you can subclass. So Painting subClassOf Artwork. Same thing with Painter/Artist. Properties work the same way: PaintedBy is a subproperty of MadeBy.
So now we can make the statement that Madame X was "MadeBy" John Singer Sargent, even though this is never explicitly asserted. It follows from Madame X being PaintedBy Sargent and PaintedBy being a subproperty of MadeBy.
Madame X isA Painting, and Painting subClassOf Artwork, therefore Madame X isA Artwork. This is an inferred triple. This allows you to generate new types of relationships.
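The two inference rules described here can be sketched in a few lines. This is a toy forward-chaining loop, not the inference engine the team used. The explicit "Madame X PaintedBy John Singer Sargent" triple is assumed as the starting assertion.

```python
def infer(triples):
    """Forward-chain two RDFS-style rules until no new triples appear."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                # Rule 1: (x isA C), (C subClassOf D) => (x isA D)
                if p == "isA" and p2 == "subClassOf" and o == s2:
                    new.add((s, "isA", o2))
                # Rule 2: (x P y), (P subPropertyOf Q) => (x Q y)
                if p2 == "subPropertyOf" and p == s2:
                    new.add((s, o2, o))
        if new <= known:
            return known
        known |= new


asserted = {
    ("Madame X", "isA", "Painting"),
    ("Painting", "subClassOf", "Artwork"),
    ("Madame X", "PaintedBy", "John Singer Sargent"),  # assumed starting assertion
    ("PaintedBy", "subPropertyOf", "MadeBy"),
}

inferred = infer(asserted) - asserted
for triple in sorted(inferred):
    print(triple)
```

Only four triples were asserted; the two printed ones exist only because of the class and property hierarchies.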
This entire set of tools is an ontology: a structured set of information. A CMS is also an ontology. So are other structured sources. Desire was to begin putting these different types of ontologies together.
Ex. MARC, MediaBin DAM, TMS collections management. All three have titles somewhere. But we only know this because we understand the formats and data structures. So the first step is to get these formats into a triple network. There is a standard called SemanticXML for this. Element names become classes. Individual elements become instances of those classes. Parent elements are connected to children via a child property. Etc. So the XML format is now represented as a network. We're not necessarily taking data out of the original XML format -- we're just using a layer to represent it as triples.
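A rough sketch of the XML-to-triples mapping just described, using only the Python standard library. It is not SemanticXML itself. The node identifiers, the `text` property for element content, and the sample record are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import count


def xml_to_triples(xml_text):
    """Walk an XML tree and emit (subject, predicate, object) tuples."""
    ids = count(1)
    triples = []

    def visit(element, parent_id=None):
        node_id = f"node{next(ids)}"
        # The element name is the class; this element is an instance of it.
        triples.append((node_id, "isA", element.tag))
        if parent_id is not None:
            # Parent is connected to child via a "child" property.
            triples.append((parent_id, "child", node_id))
        text = (element.text or "").strip()
        if text:
            # Assumed property name for the element's text content.
            triples.append((node_id, "text", text))
        for sub in element:
            visit(sub, node_id)

    visit(ET.fromstring(xml_text))
    return triples


# Hypothetical record, not a real MARC or MediaBin export.
record = "<record><title>Madame X</title><date>1884</date></record>"
triples = xml_to_triples(record)
for t in triples:
    print(t)
```

The XML stays where it is; the function is only a layer that reads it as a network.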
For a relational database, it's similar. Tools like D2RQ (free) make it possible to translate a SQL database to a triple network in real time. Very similar to how we did it with XML. Tables become classes. Rows become instances. Relational keys become properties connecting instances.
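The same idea for a database, sketched with an in-memory SQLite database. This is not D2RQ and does not use its mapping language. It is a one-shot dump rather than real-time translation, and the table and column names are made up for the example.

```python
import sqlite3


def rows_to_triples(conn, table, key, foreign_keys):
    """Map each row of `table` to triples.

    `foreign_keys` maps a column name to the table it references.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        values = dict(zip(columns, row))
        subject = f"{table}/{values[key]}"
        triples.append((subject, "isA", table))  # table becomes the class
        for column, value in values.items():
            if column == key or value is None:
                continue
            if column in foreign_keys:
                # A relational key becomes a link to another instance.
                value = f"{foreign_keys[column]}/{value}"
            triples.append((subject, column, value))
    return triples


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE artwork (id INTEGER PRIMARY KEY, title TEXT,
                          artist_id INTEGER REFERENCES artist(id));
    INSERT INTO artist VALUES (1, 'John Singer Sargent');
    INSERT INTO artwork VALUES (10, 'Madame X', 1);
""")

triples = rows_to_triples(conn, "artwork", "id", {"artist_id": "artist"})
for t in triples:
    print(t)
```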
So now we have three different networks, but it's not clear that we have the same types of things.
CIDOC: the International Committee for Documentation of ICOM (International Council of Museums). Its Conceptual Reference Model (CIDOC CRM) is a museum standard - a triple-based ontology for representing museum collections. The system is very event based - a painter participates in an event to produce a painting. So one of our goals is to use subclassing to interface all of our data ontologies under CIDOC.
MARC, for example. CIDOC gives us a class called E31.Document. So a MARC record is a subclass of E31.Document. Subfield in this case is placeholder for title. CIDOC gives us E35.Title. Also there is P102F.hasTitle. That's inferred because we just saw we had a title. This allows us to infer new triples as well. (Inference engine is run whenever you do a query.)
You can do the same types of CIDOC mapping for the other data sources. So we have another E31.hasTitle property, etc.
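One way to picture the mapping step: each source keeps its own title property, and a small table declares each of them a subproperty of the CIDOC one. The source-level property names and sample values below are hypothetical, not the real field names in MARC, MediaBin, or TMS.

```python
# Shared CIDOC property that every source's title field is mapped to.
CIDOC_TITLE = "P102F.hasTitle"

# Hypothetical source-level property names, each declared a
# subproperty of CIDOC_TITLE.
SUBPROPERTY_OF = {
    "marc:titleSubfield": CIDOC_TITLE,
    "mediabin:assetTitle": CIDOC_TITLE,
    "tms:objectTitle": CIDOC_TITLE,
}


def lift_to_cidoc(triples):
    """Return the input triples plus one CIDOC-level triple per mapped property."""
    lifted = [(s, SUBPROPERTY_OF[p], o) for s, p, o in triples if p in SUBPROPERTY_OF]
    return list(triples) + lifted


# Made-up sample data, one triple per source system.
source_triples = [
    ("marc/rec1", "marc:titleSubfield", "Sample catalogue title"),
    ("mediabin/img7", "mediabin:assetTitle", "Madame X (photograph)"),
    ("tms/obj10", "tms:objectTitle", "Madame X"),
]

unified = lift_to_cidoc(source_triples)
titles = sorted(o for s, p, o in unified if p == CIDOC_TITLE)
print(titles)
```

The original triples are kept as they are. The mapping only adds a shared vocabulary on top, which is why the work is mostly a matter of deciding which term maps to which.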
So now all of our sources are expressed as a common ontology that is a documented standard. So now that we've done this, we can run queries on it.
This is a SPARQL query. Like SQL for RDF. So we can now query from a lot of sources without having to scoot data around into different repositories and
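The query on the slide was not captured in these notes. As a stand-in, here is a toy pattern matcher that behaves like a SPARQL basic graph pattern: terms starting with `?` are variables, and every pattern must match with consistent bindings. The sample triples are invented, and a real deployment would use a SPARQL engine rather than this.

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Return all variable bindings that satisfy every pattern (a join)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for t in triples
            if (extended := match(pattern, t, b)) is not None
        ]
    return bindings


# Invented data standing in for the merged triple network.
triples = [
    ("marc/rec1", "isA", "E31.Document"),
    ("marc/rec1", "P102F.hasTitle", "Sample catalogue title"),
    ("tms/obj10", "isA", "Painting"),
    ("tms/obj10", "P102F.hasTitle", "Madame X"),
]

# Roughly: SELECT ?doc ?title WHERE { ?doc isA E31.Document . ?doc hasTitle ?title }
results = query(
    [("?doc", "isA", "E31.Document"), ("?doc", "P102F.hasTitle", "?title")],
    triples,
)
print(results)
```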
Resources: Good book: Semantic Web for the Working Ontologist by Allemang and Hendler. Other resources are on slide. Include Halo extension for SMW, D2RQ, TopBraid Composer (not free), Protege (free), Jena, SPARQL, RDF, OWL, Sesame, Mulgara.
Semantic Museum discussion group: http://groups.google.com/grou/semuse
Semantic Museum Wiki: http://semuse.org
These slides: http://kovenjsmith.com/pres/mcn_2008.ppt
Q: Isn't Calais more optimized for pulling out the types of relationships that Reuters cares about? (For example, it knows corporate names better than artist names.) A: We contacted them, and they were very open to working with us to better train their algorithm to understand our type of data. It's a first step. We'd love to develop a much more full-featured NLP solution. But for now it's a good first pass. Also, Calais will start rolling out "user contributed vocabularies" soon.
Q: Store for Word docs and stuff? A: We're exploring using SMW as the store. It builds its own triple store.
Q: So if someone does a query in "real life" do they see the backend wiki? A: We're not that far yet. We have the idea that we will build some sort of an application for presentation eventually. I can't imagine that we would show the raw MediaWiki interface publicly. It's ugly for an uninitiated user. We'll probably be looking at ways of showing that information in a more graphical way.
Q: So this was about mapping data into this semantic format. So what are the next steps? What will be the time-consuming bit of scaling it? A: Definitely mapping all the fields into CIDOC. The neat thing about this is that everything is expressed as a triple. I like to think about it as a conversation amongst domain experts about what this information means. Feels more conversational than coding procedures to do this, because it's just mapping one term to another term.
Original investigation of SMW was looking at it for conservation documentation.
Q: Are you finding other staff to be receptive to this idea? A: So far the response has been surprisingly positive. It's been in the lab for a while. Our first real exposure was when we did this presentation to a large swath of our staff recently, and the response was fairly positive. When we were investigating SMW for conservators, they liked that they could get data out without databasey overhead.
Labels: MCN2008
