Open Data #ioe12

Tim Berners-Lee talks in his TEDtalk about how he originated the World Wide Web. His boss read the memo he wrote proposing it and scrawled “Vague but exciting” on it – something only discovered after his boss’s death. I’m thinking of adopting that as one of my own taglines, given the number of ideas I propose that receive little interest and even less understanding. (I’m not saying my own ideas are of Tim Berners-Lee’s scale of impact.)

It was a grassroots movement that launched the Web and it was this community that excited Tim. He asked people to put their documents onto the Web, and people did. In his words, “It’s been a blast.” Now he asks us all to put our Data onto the Web.

He refers to Hans Rosling, who also says that it’s important to have a lot of data available. I’d previously seen Hans Rosling’s TEDtalk and also his presentation on a BBC television programme. Tim goes further, however, with his concept of Linked Data, where everybody puts everything on the web, so that virtually everything you could imagine is on the web. He asks for three things to be observed in this process:

  1. Everything has an HTTP ‘name’ – events, products, things, people, etc.
  2. When someone fetches something by its HTTP name, it returns some standard data (information) in a format people will find useful – something about the event, that person, etc.
  3. The data returned shows relationships; importantly, these related data also carry HTTP names of their own, so they can be looked up in turn.
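The three rules can be sketched in code. Here is a toy, in-memory model of them – every URI and data record below is invented purely for illustration, and a real Linked Data client would of course fetch RDF over HTTP rather than read a dictionary:

```python
# A toy model of Linked Data: HTTP 'names' map to data records (rule 1),
# fetching a name returns useful data (rule 2), and records link to other
# HTTP names that can be looked up in turn (rule 3).
# All URIs and values here are invented for illustration.

TOY_WEB = {
    "http://example.org/person/alice": {
        "type": "Person",
        "name": "Alice",
        "attended": "http://example.org/event/opened2012",  # rule 3: a link
    },
    "http://example.org/event/opened2012": {
        "type": "Event",
        "name": "Openness in Education course",
        "organiser": "http://example.org/person/alice",
    },
}

def dereference(http_name):
    """Rule 2: fetching an HTTP name returns some standard, useful data."""
    return TOY_WEB[http_name]

def follow_links(http_name):
    """Rule 3: related data carry HTTP names, so they can be looked up too."""
    record = dereference(http_name)
    return {key: dereference(value)
            for key, value in record.items()
            if isinstance(value, str) and value.startswith("http://")}

linked = follow_links("http://example.org/person/alice")
print(linked["attended"]["name"])  # the event record, reached via its HTTP name
```

The point of the sketch is the shape of the thing: data, names for data, and names inside data that lead to more data.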

Google, Microsoft and Yahoo have agreed to use standardized formats for data, as outlined at Schema.org.
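Schema.org vocabularies can be expressed in several syntaxes; as one hedged example, a minimal event description using the JSON-LD syntax might look like the following (the event details are my own invention):

```python
import json

# A minimal schema.org "Event" description expressed as JSON-LD.
# The event name, date and location are invented for illustration.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Open Data talk",
    "startDate": "2012-02-01",
    "location": {"@type": "Place", "name": "University of Sheffield"},
}

print(json.dumps(event, indent=2))
```

Because the `@type` and property names come from a shared vocabulary, a search engine can recognise this as an event rather than just a blob of text.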

Another interesting phrase to come out of the videos is ‘Database Hugging’, coined by Hans. Tim demonstrated it as though he were actually hugging something with his arms. The idea is that people, governments and institutions don’t want to release their data until they’ve created a lovely website to display it. Tim says, by all means make a beautiful website, but first give us the unadulterated data – a point he emphasized with the chanted phrase ‘Raw Data Now’.

‘Linked Data’ useful links:

The site that draws together useful information about Linked Data is linkeddata.org
There you’ll find, amongst other things:

  • Guides and Tutorials
    • Key Reference Documents
    • Textual Guides/Tutorials
    • Video Tutorials
    • Introductory Slide Sets
    • Frequently Asked Questions
  • Tools
    • Linked Data Publishing Platforms/Frameworks
    • Linked Data/RDF Editors and Validators
    • Tools for Consuming Linked Data
    • Linked Data Applications for End Users

A point that Tim makes in his video at t=11m30s is about the sharing of data to enable Open Science activities to occur, which links in nicely to the previous topic in the Openness in Education course about Open Science.

Tim Berners-Lee returns to do a short TEDtalk ‘updating’ the situation with Open Data:
http://www.youtube.com/watch?v=3YcZ3Zqk0a8

It is interesting to see that over this time we have witnessed more of a willingness by governments across the world to share data openly, in what has become known as Open Government or ‘opengov’. So in the US you have sites like the one in the topic readings, and in the UK you have ones including data.gov.uk from HM Government; a BBC interview with Tim Berners-Lee on the project that brought about data.gov.uk is available.

Parts of the media are making it easier and more convenient to use open data, allowing individuals to interrogate and interpret the data to produce information useful to them. Examples include The New York Times in the US and The Guardian in the UK (which enables questions to be asked of government data from around the world). The Guardian also has a section online dedicated to the journalistic use of open data, called the Data Store.

Managing & Sharing Research Data Part 1

I attended a presentation this morning given by Martin Donnelly of the Digital Curation Centre (DCC), University of Edinburgh, covering ‘Managing & Sharing Research Data: Good practice in an ideal world … in the real world’, held at The University of Sheffield and promoted by the Research Ethics Committee there. It was a two-hour session: the first part was a presentation, and the second a demonstration of an online resource produced by the DCC, the Data Management Planning (DMP) Tool, which enables easy production of DMPs to meet research funding council requirements.

I attempted to make notes during the presentation in the form of this blogpost, so the following is just that: my notes. But you might find some use in them.

Background

DCC was founded in 2004 for UK HE & FE sectors. Its major funder is the JISC. It provides support for JISC projects as well as producing tools, providing guidance, case studies, consultancy, etc.

Body of Presentation

When considering data management there are a number of areas to focus on:

  • Ensuring the physical integrity of the files
  • Ensuring the safety of the content (readable and understood by your target audience but not accessible by other people / Data Protection / file format / etc.)
  • Describing the data (metadata), and what’s been done to the data
  • Providing access at the right time – e.g. making data available only after publication (embargo)
  • Transferring custody of data from the field to storage, archiving and possibly on to destruction (this process needs managing and is not necessarily done by the data collector)
  • Research ethics & integrity.
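On the first point – the physical integrity of files – a common practice is to record a checksum when data is deposited and re-check it later to detect corruption. A minimal sketch using Python’s standard library (the filename in the comment is invented):

```python
import hashlib

def file_checksum(path, algorithm="sha256"):
    """Compute a checksum of a file so its integrity can be verified later."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large data files don't have to fit in memory.
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the checksum at deposit time, then re-run and compare to detect
# corruption or accidental modification, e.g.:
#   stored = file_checksum("survey_2012.csv")
```

If the recomputed value differs from the stored one, the file has changed since it was deposited.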

However, there is also the concept of Openness – Open Science, Open Data – that needs to be considered. Martin touched on the Panton Principles with respect to Open Science. These were drafted in Cambridge in July 2009 and officially launched in February 2010. Originating in the discipline of chemistry, the concept as taken from their website is:

Science is based on building on, reusing and openly criticising the published body of scientific knowledge.

For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.

By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.

[Aside: I shall be returning to this, not least for the ioe12 course.]

Martin also pointed to an article in The Guardian, ‘Give us back our crown jewels‘, Arthur & Cross, 9 March 2006.

Our taxes fund the collection of public data – yet we pay again to access it. Make the data freely available to stimulate innovation, argue Charles Arthur and Michael Cross

The Research Councils UK (RCUK) is the strategic partnership of the UK’s seven Research Councils. It has produced Common Principles on Data Policy, which Martin summarised as having these key messages:

  1. Data is a public resource
  2. Adhere to standards & best practice
  3. Metadata for ease of discovery and access
  4. Constraints on what data to release
  5. Embargo periods delaying data release
  6. Acknowledgement of / compliance with Terms & Conditions
  7. Data management & sharing activities should be explicitly funded

There are an increasing number of things influencing the management of research data, some of which I managed to jot down:

  • Research outputs are often based on the collection, analysis, etc of data
  • Some data is unique (e.g. date & time specific weather conditions data) and can’t be reproduced
  • Data must be accessible and comprehensible
  • There’s a greater demand for open access to publicly funded data
  • Research today is technology enabled and data intensive
  • Data is a long-term asset
  • Data is fragile and there is a cost to digital data; curate to reuse and preserve
  • Data sharing and research pooling might be more cost-effective: cross-disciplinary and increased global partnership
  • Costs of technology and human infrastructures
  • Increasing pressure to make a return on public investment

Most (but not all) Research Councils are broadly the same in their approach to data management. They are generally requiring a Data Management Plan prior to funding being granted. The NERC Research Council has a Data Policy & Guidance (pdf), and also provides data centres for managing funded research data.

EPSRC is the odd one out; it requires all institutions to provide a roadmap for data management by 1st May 2012, to be implemented by 1st May 2015.

RCUK has a Policy and Code of Conduct on the Governance of Good Research Conduct (available as a pdf).

Martin highlighted how some universities have got into difficulty with regard to Freedom of Information (FOI) requests. He mentioned Queen’s University Belfast and a request about Irish tree rings that was made under FOI. He also described how Stirling University had received a request from a tobacco company for data on the take-up of smoking amongst teenagers – useful data for a tobacco company.

The University of Edinburgh has developed a Research Data Management Policy.

The question Martin then put was: why? Why do this? He outlined the incentives in the form of carrots and sticks.

It’s a good thing

  • Data as a public good (the RCUK common principles)
  • Others can build on your work (Isaac Newton: “If I have seen farther it is by standing on the shoulders of giants.”)
  • Passing on custody, thereby making effective use of resources.

Direct incentives to researchers are:

  • Increased impact of your work
  • Making publications available online increases citations

These are covered more fully in:

More incentives:

  • Increased citations help with the REF
  • Research councils are increasingly rejecting applications on the grounds of poor data management plans
  • You receive more funding if you do this right

And the ‘Sticks’:

There is a concern often raised by academic researchers about how their data will be used or misconstrued once it is out in the open. Martin emphasised the importance of appropriate metadata to try to prevent this, though he did say that if the data is going to be misconstrued it will be anyway. Files need to be labelled in an understandable, meaningful, standard and appropriate fashion, including the project title and date. It would also be useful to maintain a separate log describing the data, to include:

  • research context
  • data history
  • where & how to access the data
  • access rights
  • etc.
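As one possible sketch of such a convention – the field names, project and naming scheme below are my own invention, not the DCC’s recommendation – a labelled file name plus a simple log entry might look like:

```python
import json
from datetime import date

def labelled_filename(project, description, when=None, ext="csv"):
    """Build a file name that includes the project title and a date."""
    when = when or date.today()
    safe = lambda text: text.lower().replace(" ", "-")
    return f"{safe(project)}_{safe(description)}_{when.isoformat()}.{ext}"

# A hypothetical log entry describing one data file.
log_entry = {
    "file": labelled_filename("Tree Rings", "raw measurements", date(2012, 3, 1)),
    "research_context": "Dendrochronology survey (illustrative)",
    "data_history": "As collected in the field; no cleaning applied",
    "where_and_how_to_access": "Project shared drive",
    "access_rights": "Project team only",
}
print(json.dumps(log_entry, indent=2))
```

The exact fields matter less than the habit: every data file gets a meaningful, dated name and a log entry saying where it came from and who may use it.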

Backup is also a consideration, and it is different from archiving. Backup is about the loss, damage and recovery of data during the research process (archiving is about retaining and providing access at the end of the research process). There should be some means of off-site backup – ideally an implemented, automatic backup process at the University, Faculty or School level. If not, then a manual backup process is required, with set repeat reminders.
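At its simplest, a manual backup of the kind described can be a timestamped copy of the project folder to another location. A minimal sketch (the paths in the comment are invented, and a real setup would copy to a genuinely off-site destination):

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup(source_dir, backup_root):
    """Copy source_dir to a new timestamped directory under backup_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"{Path(source_dir).name}-{stamp}"
    shutil.copytree(source_dir, destination)
    return destination

# Hypothetical usage, run on a regular reminder:
#   backup("project-data", "/mnt/offsite-backups")
```

Because each run creates a fresh timestamped copy, earlier versions survive if a file is later damaged or deleted – which is exactly the loss-and-recovery role that distinguishes backup from archiving.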

Archiving is a case of depositing data for the long term. It does, however, require things like checking copyright, consent and data protection. You should use the appropriate archive for your subject discipline. It’s also important to publicise your archived data for increased citations. The point was made that there isn’t yet a standard for data referencing, and that some work needs to be done in this area. The other concern – use of your data without your knowledge – is much the same as having your published work plagiarised.

Rachel Kane from RIS in Sheffield highlighted that specific Sheffield resources will be made available soon. She also provided some useful examples of what people were doing at the University, including:

  • Prof. Steve Banwart’s approach to open data in Civil and Structural Engineering
  • Dr Bethan Thomas in Geography SASI
  • HRI Digital – data management services – from application to archiving stages – consultancy