Stanford Linked Data Workshop Notes (#sldw)
 
Lightning Talks:
 
Sigfrid Lundberg
  • use of IETF RFCs in digital library architecture: Atom (RFC 4287), Languages (RFC 4646)
  • "RESTafarianism at a Taliban Scale" :-)
 
Stefano Mazzocchi (Google/Freebase)
  • The importance of human curation to Linked Data
 
Bill Dueber (University of Michigan)
  • using REST and URLs in the HathiTrust API
  • licensing issues around metadata and book data
 
Joan Smith (Emory University)
  • Open Access Repository Project
  • Need to learn from other people's mistakes, and to establish some Linked Data patterns for people to use.
  • Voluntary Open Access Initiative at the University (first customer is likely to be scientists, not the expected humanities folks)
  • VIVO: linked data for researchers and their publications
  • Visual Shelf Browser (using OpenLibrary for cover images)
 
Eetu Mäkelä (Aalto University)
  • Providing national content platform for publishing vocabularies and data, and linking between resources.
  • dealing with multitude of metadata schemas
  • homemade triplestore, 10s of billions of triples
  • simple JavaScript integration for getting concepts from ONKI using a key (similar to Freebase Suggest functionality): places, people, organizations, class ontologies (see the sketch after this list)
  • some reconciliation has been done between datasets, but it has been hard
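 
A minimal sketch of the Suggest-style ONKI lookup mentioned above, written in TypeScript. The endpoint URL, parameter names, and response shape are illustrative assumptions, not the actual ONKI API:

    // Suggest-style concept lookup against an ONKI-like vocabulary service.
    // NOTE: endpoint, parameters, and response shape are assumed for illustration.
    interface ConceptHit {
      uri: string;       // concept URI in the vocabulary
      prefLabel: string; // preferred label in the requested language
    }

    async function suggestConcepts(
      query: string,
      vocabulary: string,
      apiKey: string,
      lang = "en"
    ): Promise<ConceptHit[]> {
      const url =
        `https://onki.example.org/api/${encodeURIComponent(vocabulary)}/search` +
        `?q=${encodeURIComponent(query)}&lang=${lang}&key=${apiKey}`;
      const res = await fetch(url);
      if (!res.ok) throw new Error(`lookup failed: ${res.status}`);
      return (await res.json()) as ConceptHit[];
    }

    // Example: look up Finnish place concepts as the user types.
    suggestConcepts("Helsinki", "places", "demo-key", "fi")
      .then((hits) => hits.forEach((h) => console.log(`${h.prefLabel} -> ${h.uri}`)));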
 
Stephen Abrams
  • interested in data quality issues; preventing a data equivalent of Gresham's Law (bad data driving out good data)
  • also interested in enabling network effects: being able to get things done collaboratively that would be impossible alone
  • Merritt curation repository: metadata expressed internally as RDF; model free / extensibility seen as a benefit
  • modeling provenance and change is important
  • how to discover vocabularies?
  • best practices for publishing (identifiers) and technologies (triplestores?) that should be used
 
Adam Soroka (University of Virginia)
  • need to work within the context of legacy systems (ILSes centered on MARC)
  • community vocabularies needed, and need to be easily findable
  • introducing linking into legacy "flat" MARC is a challenge
  • anti-pattern: republishing of entities and vocabularies
  • possible to share workflow tools
 
  • preservation requires more copies
  • problem with linked data: less than 10% of linked data sources provide any rights metadata at all; less than 25% provide provenance metadata
  • if you are going to preserve linked data, you need to know if you have the right to keep a copy, and republish it
  • publishing industry is in crisis, and there is the danger that we will see consolidation, not more sharing
  • there is a push in the web-preservation space to preserve things that are not at risk, potentially will see the same problem in linked data space
 
Jamie Taylor (Google/Freebase) http://www.freebase.com/view/en/jamie_taylor
  • 22 million curated entities
  • strong identifiers for entities (Harrison Ford coming down the pipe? I don't think so; httpRange-14: who cares)
  • assert rights/license about data even if it isn't open - helps communicate how to use
  • Question needs to be "how do we present our data clearly so that others can make the best use" - allow users to understand and take it apart
  • having a single place where reconciliation happens reduces the cost of doing it
  • Where are the really useful applications of linked open data? 
  • Need more focus on users doing stuff, not just about machines
  • HTML5 microdata has a lot of potential to bring linked data to the masses of web publishers (see the sketch after this list)
  • interested in working with institutions to bring them into the data map that is Freebase
  • Freebase uses language tags, about to release this functionality
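 
A minimal sketch of consuming the HTML5 microdata mentioned above, in TypeScript, assuming a browser environment (DOMParser). The sample markup is illustrative only:

    // Extract item/property pairs from HTML5 microdata.
    // Each itemprop/text pair is effectively a triple about the item.
    const html = `
      <div itemscope itemtype="https://schema.org/Book">
        <span itemprop="name">Operating Manual for Spaceship Earth</span>
        <span itemprop="author">R. Buckminster Fuller</span>
      </div>`;

    const doc = new DOMParser().parseFromString(html, "text/html");

    for (const item of Array.from(doc.querySelectorAll("[itemscope]"))) {
      console.log(`item of type ${item.getAttribute("itemtype")}`);
      for (const prop of Array.from(item.querySelectorAll("[itemprop]"))) {
        console.log(`  ${prop.getAttribute("itemprop")} = ${prop.textContent?.trim()}`);
      }
    }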
 
  • problem of lots of silos within the institution
  • linking to full texts, too
  • seen as a complement to Gallica, central place where things can be referenced
  • not a prototype, but fully functional from the beginning
  • there are links to items outside of BnF
 
Lars Svensson (DNB)
  • goal is to publish bibliographic and authority data as linked data
  • slightly restrictive license currently, which will go away
  • linking w/ VIAF, Rameau, DBPedia and LCSH
  • approximately 2 million "persons" 
  • working in an environment where data from libraries, archives and museums needs to be integrated
  • building blocks for a linked data ecosystem: persistent identifiers, long-term archives, robust services
 
Hugh Glaser (Seme4.com)
  • lots of resources, with multiple URIs: anarchy is actually kind of fun, need to embrace it
  • "Everything is a URI"
  • Solution is to have an authoritative URI
  • Data-publishers publish co-reference assertions
  • provenance and rollback functionality 
  • plan is to return results not just for owl:sameAs but any predicate
  • sameas.org as a discovery tool for finding out who is linking to you (see the sketch after this list)
  • even with a curated data set there are often multiple URIs (two for Southampton in data.gov.uk)
  • need machine processable way to determine provenance/trust to combat the scenario where one bad player corrupts the data
  • Liberal vs. Conservative characteristics of data providers (*don't know if this is right*)
  • Linked Data Browser is boring, and won't be the "killer app"
  • rkbexplorer for seeing the relations between academic researchers, their publications and their projects
  • rkbexplorer explains how entities are related
  • British Museum: data silos within the organization; converted science and curation databases so that they can be interlinked (uses sameas.org), and enhanced the user experience by creating views that synthesize data that was previously only available separately
  • http://www.researchspace.org/ - uses CIDOC CRM; allows annotations and notes to be made about objects; will support data exchange between CollectionSpace and ConservationSpace
  • worrying about linking shouldn't stop you from publishing linked data
  • build systems that accept the way the world is, not what you would like it to be (hear this ontologists!)
  • Discovering differences - disambiguation - very valuable - more valuable than asserting "sameness"
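 
A minimal sketch of querying a co-reference service like sameas.org for the discovery described above, in TypeScript. The JSON endpoint and response shape are assumptions about the service's general design, not a documented API:

    // Ask a sameas.org-style service for the equivalent URIs of a resource.
    // NOTE: endpoint and response shape are assumed for illustration.
    async function equivalentUris(uri: string): Promise<string[]> {
      const res = await fetch(`https://sameas.org/json?uri=${encodeURIComponent(uri)}`);
      if (!res.ok) throw new Error(`co-reference lookup failed: ${res.status}`);
      // Assumed shape: an array of bundles, each listing co-referent URIs.
      const bundles = (await res.json()) as Array<{ duplicates: string[] }>;
      return bundles.flatMap((b) => b.duplicates);
    }

    // Example: discover who else has coined a URI for Southampton.
    equivalentUris("http://dbpedia.org/resource/Southampton")
      .then((uris) => uris.forEach((u) => console.log(u)));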
 
  • Preservation vocabulary will be open soon
  • Name Authority is pending: some scalability problems given the number of names
  • work on persistent URLs for bibliographic resources; a system is ready to go but waiting for approval
  • working with Spanish libraries to cross-reference LCSH to Spanish terms
  • Not much linking out at the moment; very time-consuming to produce outgoing links
  • Internal linking being done, e.g. ISO 639-1, 639-2, and 639-5 (see the sketch after this list)
  • new links done in id.loc.gov do not get reflected back into the MARC files
  • There will eventually be URIs for individual component parts of pre-coordinated LCSH
  • Focus is on benefits for users and institutions
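 
A minimal sketch of resolving one of those language identifiers against id.loc.gov, in TypeScript. The URI pattern follows id.loc.gov's conventions, but treat the ".json" representation and the response shape as assumptions:

    // Resolve an ISO 639-2 code to the RDF published at id.loc.gov.
    async function fetchLanguageConcept(code: string): Promise<unknown> {
      // e.g. "eng" -> http://id.loc.gov/vocabulary/iso639-2/eng
      const res = await fetch(`https://id.loc.gov/vocabulary/iso639-2/${code}.json`);
      if (!res.ok) throw new Error(`resolution failed: ${res.status}`);
      return res.json(); // RDF about the concept, serialized as JSON
    }

    fetchLanguageConcept("eng").then((data) => console.log(JSON.stringify(data, null, 2)));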
 
Ed Summers (LC)
  • information about Digital Public Library of America Interoperability meeting
  • need to do web 1.0 better (sitemaps; see the sketch after this list)
  • Hugh Glaser: the BBC decided to curate their music metadata at MusicBrainz, thus contributing to the community and then consuming the data back, instead of keeping it all to themselves
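 
A minimal sketch of the "web 1.0 done better" sitemap idea, in TypeScript; the record URLs are made up:

    // Emit a sitemap so crawlers can find every record page.
    function buildSitemap(urls: string[]): string {
      const entries = urls.map((u) => `  <url><loc>${u}</loc></url>`).join("\n");
      return (
        `<?xml version="1.0" encoding="UTF-8"?>\n` +
        `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
        `${entries}\n</urlset>`
      );
    }

    console.log(buildSitemap([
      "https://example.edu/catalog/record/1",
      "https://example.edu/catalog/record/2",
    ]));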
 
Richard Boulderstone (BL)
  • bibliographic data: http://www.bl.uk/bibliographic/datasamples.html
  • Vision for 2020: catalogue less than now and instead link out to other datasources
  • challenge to talk about opening up bib data in tough times when the BL makes half a million pounds selling the data, but the BL is going to do it
  • currently they link to LCSH, Dewey etc; want to extend that to other datasources
  • Talk about trusted sites (what does the trust model look like?)
  • staff skills might be an issue (lack of linked-data engineers)
  • amount of Linked Data seems extremely large, and somewhat worried about how scalable the technology is over time
  • lots of staffing associated with maintaining legacy systems (350 separate systems)
 
Potential Breakout Topics:
 
  • Identifier reconciliation services (Google WebIDs / sameas.org, culturegraph.org)
  • Importance of provenance, not necessarily perfection of library data (Adam Soroka)
  • triple reification
  • named graphs
  • We have the technology-- we need to establish a praxis, an ontology and share them
  • provenance is one example of paradata/metametadata. also confidence, "trustability", rights/IP, etc. we have to solve the problem generally, but not the general problem.
  • working with multiple metadata schemas (CultureSampo, Freebase)
  • Strategies for linking
  • Strategies that work inside of legacy "flat" systems (RDA inside MARC)
  • Reconciliation is really hard (Stefano, others)
  • Can we reduce the need for it by sharing ontologies earlier? (Eero Hyvönen: "Intellectuals solve problems, geniuses prevent them" -- Einstein)
  • Can we reduce the need for it by reducing the number of situations in which we make demands for conformant behavior across multiple datasources?
  • Concentric circle business case approach (with Venn diagrams to add other organizations (overlapping goals)?)
 
  •  "the best use of your data will be done be someone else" (Rufus Pollack OKF) so how can we expose LOD and provide the environment/tools to allow the "best use"????
 
Tuesday, June 28
 
Breakout groups
 
Group 1 - (only group that did not use pirate pad :-))
  • What is the application, or is it only plumbing?
  • No feedback loop from the community
  • Director of British Museum published his business case online
  • In public libraries, patrons start to ask questions that cannot be answered by current flat data
  • not just about creating new data, but about transforming / augmenting / enriching / breaking down current data
  • add more granularity
  • Over-modelling the data might make it more difficult to use: it needs to be dumbed down
  • We need both humans and machines creating / curating data: where's the sweet spot?
  • Current business case justification is too long - we need a shorter elevator pitch
  • "more challenges than opportunities"
  • balance between centralization and decentralization
  • balance between human- and machine-generation of (linked) data (how do you get to 100% with machines, or where do you stop?)
  • is it okay for the balance to be different in LOD from what's acceptable in current GLAMs metadata practice?
  • meta-meta-data (provenance, trust, rights)
  • meta-meta-metadata
  • meta-meta-meta-metadata
  • meta-meta-meta-meta-metadata
  • {grin}
  • how do you link prototypes across institutions?
  • value proposition: when you see more of your stuff by linking to other people's data
  • expand community to people with little or no IT resources
  • or large but fully-loaded resources
  • how to reconcile forks of data?
  • how to create a linked data community around something that isn't the focus of corporate or popular interests?
  • maybe begin with a pre-existing community around that interest? (e.g. what's happening in bioinformatics)
  • how will linkages work in an international setting across different cultural contexts?
  • How to create useful communities and tap volunteers?
  • How to engage with vendors and OS communities?
  • Demonstrate usage, benefits, effort and thus ROI!
  • What value does library metadata bring to the LD world?
  • Idea:
  • integrate 8-10 libraries
  • Collections in large institutions seem to develop across the same lines, so 8-10 large institutions will together contribute a lot!
  • Mike Keller points out that 51% of the unique titles in the original Google Books scanning set were sourced from a single institution. That's both a lot of common ground and a lot of unique institutional flavor.
 
Group 2 (Niz)
  • Where are the consuming (flagship) applications? The business case is different from Web 1.0, where it was enough just to put things out there, now the bar is higher
  • Do we need a general purpose application, or is it more efficient to have domain-specific ones?
  • Academic libraries have good access to the research community, but are those our obvious customers?
  • Is the library's case to just publish data for anyone to consume, or do we want to consume data from e.g. researchers and put it into our own system? We'll probably have to build the first applications ourselves
  • Libraries can sponsor an application in order to find out what people need.
  • Use scenario might be to develop a plug-in for Blacklight (http://projectblacklight.org/), generalizing disparate systems to a linked data level and then pull them into Blacklight
  • "If load is your problem, you've fulfilled your mission" (because then the service is succesful)
  • If the data is valuable to someone (and they see an opportunity in having it), the'll come to you ans ask for more
  • Killer application might be a discovery browser, most people are comfortable using keyword searching, though (coming from Google and others of that ilk)
  • Researcher profiles, using library data to link to other researchers and their publications
  • The personal data exposure on the web differs between disciplines (humanities people tend to be much less visible online than, e.g., biomedical researchers)
  • One reason for slow uptake of portals could be that the people benefiting from them are not the ones actually curating the data
  • "Reading list picker"
  • Are ORCID looking at linked data, and can we make them connect to library authority files? (+1)
  • Libraries should care about microdata (in spite of nose-holding in LD community) and take an active part
  • Provenance about assertions is critical (one definite challenge now)
  • business case: findability; BBC is a success story
  • "It's not about the plumbing, but about the water": People (end users?: yes, end users (and people like data curators)) shouldn't really know about how the application works, nor if it's based on linked data
 
  • Conclusions:
  • Can we place more bets just by lowering the costs?
  • Corollaries are: provenance and trust
  • So far, libraries have been concerned with the supply side, not caring about consumption. The question is whether we have enough domain knowledge to help with writing those applications
  • new challenges: many entities just produce junk triples, what are the incentives for creating quality data?
  • legal/rights issues + performance are out of scope: if that's our main problem next year, we've been successful
  • Scope is cross-institution. You don't do linked data for the sake of doing it, but because you see a benefit
  • Microdata is one way of serializing this information and we might pivot on that
  • Business case is increased findability and visibility
 
 
Group 3
  • What is the killer app?
  • There is a difference between users who are linked data publishers (exposing linked data) and those who are seeking linked data to build enriched/linked resources.
  • Legal issues will affect workflow decisions
  • do not pass data around: how can we reference/co-reference/work with the data on the web?
Challenges/Opportunities:
  • the simple user experience
  • co-opetition with Google
  • complicated collaboration with metadata consortia
  • differentiating content vs. description
  • how to build something for people to complain about
  • pick and choose when to use LOD vs. purposeful workflow
  • Changes to library workflow - what is the benefit?
  • gap between linked data sets and what users are trying to achieve
  • what is the context? 
  • Linked data provides opportunity to create global context to local collections and vice versa
  • ***LOD provides the library the opportunity to provide infinite potential (paths) for scholarly inquiry***
  • make LOD app usable on an iPhone
  • flexible way for user to edit search/find results
  • Provenance/Trust++
  • sameAs, seeAlso, differentFrom
  • Simplicity vs. fit for purpose (good vs. good enough)
  • allows users to discover and create their own useful context (dynamic browse, creating context as they go)
  • make it easy to find useful stuff in the long tail - "I want to find a Hilton hotel in Paris, not Paris Hilton in a hotel"
 
Not one killer app - but how do we enable a thousand flowers to bloom? Provide a simple platform, tools, and a primer on "how to do this" to empower people to experiment with LOD. Enable failure in order to move forward.
 
 
Scope:
  • Audience: academic research libraries (administration and staff) and their end users (researchers, scholars, students)
  • *need to differentiate between services that support libraries publishing linked data and libraries consuming linked data*
 
 Business Case:
 Providing a better user experience to create a rich environment that supports and inspires intellectual curiosity and scholarship. (LOD creates infinite paths; current metadata strings in libraries are dead ends.)
     LOD provides more content / links / services - potential to support research libraries' relevance in a linked information ecology
     
     Reduced cost of data production through data reuse
     Quality of data (the quality of data in current RDF triple stores is low)
 
 
Group 4 (Ed, Phil, Joan, Bill, Madgy, Eetu)
 
Problems / Challenges:
  • no demand
  • Transformation into a true linked data model from MARC or other traditional formats is HARD
  • Inconsistencies in use both inside and outside libraries, particularly "structured" data in fields as well as interdependent fields
  • people are publishing but not linking to each other
  • someone needs to do reconciliation
  • Established authority
  • Own mappings <- need good tools, plus staff
  • need use cases: who are the users, and how will they benefit from linked data?
  • lack of guidance on how to do it
  • need for non-prescriptive framework, and to allow differences to exist
  • difficult to link when there aren't shared ontologies and vocabularies
  • permanence of URLs; only able to commit to 25 years
  • licensing issues, profit motives can invoke distrust
  • data overload: OK, so we can link to stuff, but what are we going to do with that?
  • reconciliation is hard to do, it's costly
  • social problems: accepted authorities need to emerge
  • cost models for tracking millions of things
  • barriers to collaboration: giving money to other organizations, PBS campaign
 
Business Case / Opportunities:
  • lack of a solid business case :-)
  • open pool of shared data, with limited amount of duplication: ONIX ($)
  • co-citation / co-authorship maps; recreate Thomson's citation indexes
  • university netids could be unified, or mapped
  • 20-30 different datasets, aggregated: topics without regard to content type
  • linked data makes traveling between contexts easier (+1)
  • unifying collections: books and articles
  • publishing databases on the web "have your silo and share it too"
  • collections are distributed, not necessarily in one institution
  • opportunity to work together: sfx knowledgebases ($)
  • get rid of local records (+1)
  • you have better data than you did before
  • more eyes on the data will improve it
  • if people publish data in a common format it's easier to integrate
  • institutional repositories: born digital
  • publishing on the web enables a long tail of users
  • Search Engine Optimization: schema.org, etc. (see the sketch after this list)
  • New functionalities: better search, browsing and data exploration than for example federated search 
  • "just a little easier, with RDF and schema mappings", "a little better, with ontologies and ontology mappings"
  • Reduced duplicate work
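 
A minimal sketch of the schema.org SEO idea above: decorating a catalog record page with microdata so search engines can pick up structured data. Written in TypeScript; the property names follow schema.org's Book type, and the record itself is made up:

    interface CatalogRecord {
      title: string;
      author: string;
      isbn: string;
    }

    // Render a record as schema.org/Book microdata for embedding in a page.
    function toSchemaOrgMicrodata(rec: CatalogRecord): string {
      return `
    <div itemscope itemtype="https://schema.org/Book">
      <h1 itemprop="name">${rec.title}</h1>
      <p>by <span itemprop="author">${rec.author}</span></p>
      <meta itemprop="isbn" content="${rec.isbn}">
    </div>`.trim();
    }

    console.log(toSchemaOrgMicrodata({
      title: "Nine Chains to the Moon",
      author: "R. Buckminster Fuller",
      isbn: "978-0-00-000000-0", // placeholder ISBN
    }));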
 
Scope:
  • compelling use case that could show off benefits of linked data (+1)
  • presentations of how linking can help you learn
  • shareable tools as well as shareable data (+1)
  • information sharing
  • best practices
 
How would a Linked Data Project help your organization achieve its goals:
  • Provide high quality and deep metadata about content we hold.
  • Unlock and share data assets on the Web to continue their relevance.
  • Enable the organization to focus on unique materials by spending less time on common materials.
What are the business cases?
  • Increased visibility of the organization and its unique collections
  • Increased use of unique material makes them more 
What are the use cases?
  • If we digitize the Buckminster Fuller archive we would make the documents available on the web as well as their descriptions (finding aids).
  • Focusing on 
 
 
Tuesday Afternoon notes:
 
Goals, Business & Use cases:
 
1. How would a Linked Data Project help your organization achieve its goals?
 
2. What are the business cases needed to propel enthusiasm and funding for your linked data project?
 
3. What are the use cases?
    Civil War Data 150 http://www.civilwardata150.net/
 
 
Random Practical Ideas:
  • Tools that enable better linking to provide better context for users and for search engines (something like Reuters OpenCalais http://www.opencalais.com/, or Muddy Boots http://muddy.it/ ?) but for cultural heritage organizations? (see the sketch after this list)
  • Tools that allow structured data to be published more easily, e.g. Omeka plugins for publishing linked data, that is already available in the database. 
  • "centrality metric" of collections - 
  • LinkedData alternatives to Z39.50-like services that are unintentionally hiding our resources.
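 
A minimal sketch of the "OpenCalais for cultural heritage" tool idea, in TypeScript: scan free text for known names and propose URIs for them. The gazetteer is a toy in-memory map with placeholder URIs; a real tool would draw on authority files or a vocabulary service:

    // Naive dictionary-based entity linker. URIs are placeholders.
    const gazetteer = new Map<string, string>([
      ["Buckminster Fuller", "https://example.org/authority/fuller"],
      ["Stanford", "https://example.org/authority/stanford"],
    ]);

    function linkEntities(text: string): Array<{ name: string; uri: string }> {
      const found: Array<{ name: string; uri: string }> = [];
      for (const [name, uri] of gazetteer) {
        if (text.includes(name)) found.push({ name, uri });
      }
      return found;
    }

    console.log(linkEntities("The Buckminster Fuller archive is held at Stanford."));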
 
Linked Open Data resources:
 Linked LCCN Permalink service: http://lccn.loc.gov/
 
 
Concrete projects
  • General project structure
  • Take data, see what to link to to get added value either among project partners or in the LOD cloud
  • Convert to "linked" data using available tools (FinnONTO has "alpha/beta-quality" converters for MARC, and experience with ~20-30 formats total; see the sketch after this list)
  • Link thesauri, authority files, etc. using available tools (Freebase/Google probably has the best tools; FinnONTO has some, plus lots of experience, particularly in cultural heritage; Seme4 has some tools)
  • Create demonstration system ("killer app") demonstrating added value using available tools (FinnONTO has end-user tools for scalable search, recommendation & exploration as well as experience with demonstration systems for ~5 domains)
  • Responses to problems
  • FinnONTO has killer end-user apps whose functionality can easily be adapted to new data
  • FinnONTO also has environments for publishing thesauri/ontologies as well as widgets to integrate with editing environments / a whole editing environment for prototyping in general 
  • FinnONTO tools have been tested to scale to 300 million items and 4 billion triples on a single desktop machine (Freebase/Google are probably better in scaling in general, but we probably have some advantages in using inferencing in search, for example)
  • FinnONTO has scalable methods to experiment with attribution/authority
  • FinnONTO has experience with QC problems
  • also specifically with problems in authority&gazetteer reconciliation
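 
A minimal sketch of the "convert to linked data" step, in TypeScript: turning a heavily simplified MARC-like record into Turtle triples. Real converters such as the FinnONTO tools mentioned above handle far more fields and edge cases; the field-to-predicate mapping and base URI here are illustrative assumptions:

    interface MarcRecord {
      controlNumber: string;                    // MARC 001
      fields: { tag: string; value: string }[]; // e.g. 245 (title), 100 (author)
    }

    const FIELD_TO_PREDICATE: Record<string, string> = {
      "245": "http://purl.org/dc/terms/title",
      "100": "http://purl.org/dc/terms/creator",
    };

    // Emit one Turtle triple per mapped field.
    function marcToTurtle(rec: MarcRecord, baseUri: string): string {
      const subject = `<${baseUri}${rec.controlNumber}>`;
      return rec.fields
        .filter((f) => f.tag in FIELD_TO_PREDICATE)
        .map((f) => `${subject} <${FIELD_TO_PREDICATE[f.tag]}> "${f.value}" .`)
        .join("\n");
    }

    console.log(marcToTurtle(
      {
        controlNumber: "12345",
        fields: [
          { tag: "245", value: "Operating Manual for Spaceship Earth" },
          { tag: "100", value: "Fuller, R. Buckminster" },
        ],
      },
      "https://example.edu/bib/"
    ));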
 
Manifesto for Linked Libraries
 
We are uncovering better ways of publishing, sharing and using information by doing it and helping others do it. Through this work we have come to value:
  • Publishing our data on the web for others to use over specific share/use contracts
  • Continuous improvement of data over perfect specifications
  • Machine-processable data over human-parseable data
  • Providing well-curated facts about entities to a common distributed web over exposing monolithic flat records
  • Referencing over copying
  • Services based on fields of interest over services based on institution boundaries
  • URIs over strings (URI for every entity, not just books)
  • Broadly used standards over library-specific standards
  • Responding to change over following a plan
  • Common open licenses over specific licenses
  • Shared metrics over flying blind
 
That is, while there is value in the items on the right, we value the items on the left more.