Open Data #ioe12

Tim Berners-Lee talks in his TEDtalk about how he originated the World Wide Web. The memo he wrote about it was read by his boss, and after his boss’s death it was found to have the words “Vague but exciting” written on it. I’m thinking of adopting that as one of my own taglines because of the number of ideas I propose that receive little interest and even less understanding. (I’m not saying my own ideas are of Tim Berners-Lee’s scale of impact.)

It was a grassroots movement that launched the Web and it was this community that excited Tim. He asked people to put their documents onto the Web, and people did. In his words, “It’s been a blast.” Now he asks us all to put our Data onto the Web.

He refers to Hans Rosling, who also says that it’s important to have a lot of data available. I’d previously seen Hans Rosling’s TEDtalk and also his presentation on a BBC television programme. Tim is going further, however, with his concept of Linked Data, where everybody is putting everything on the web, and therefore virtually everything you could imagine is on the web. He asks for three things to be observed in this process:

  1. Everything has a http ‘name’ – events, products, things, people, etc.
  2. When someone fetches something with a http name it returns some standard data (information) in a format that people will find useful – something about the event, that person, etc.
  3. The data returned shows relationships; importantly the data has relationships and these related data also carry http names that allow them to be looked up, etc.
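The three rules above can be sketched with a toy, in-memory ‘web’. This is only an illustration under stated assumptions: the example.org names, the field names and the helper functions are all hypothetical, and a real Linked Data client would fetch each name over HTTP and parse a standard format such as RDF rather than read from a Python dictionary.

```python
# A toy, in-memory "web" of Linked Data.
# Every URI and field below is hypothetical and exists only for illustration.
WEB = {
    # Rule 1: everything (a person, an event, a place) has an http name.
    "http://example.org/people/alice": {
        "type": "Person",
        "name": "Alice",
        "attended": "http://example.org/events/open-data-talk",
    },
    "http://example.org/events/open-data-talk": {
        "type": "Event",
        "name": "Open Data Talk",
        "location": "http://example.org/places/sheffield",
    },
    "http://example.org/places/sheffield": {
        "type": "Place",
        "name": "Sheffield",
    },
}


def fetch(uri):
    """Rule 2: looking up an http name returns useful data about the thing."""
    return WEB[uri]


def linked_names(record):
    """Rule 3: values that are themselves http names are relationships to follow."""
    return {
        key: value
        for key, value in record.items()
        if isinstance(value, str) and value.startswith("http://")
    }


def crawl(start):
    """Follow relationships from one http name to everything reachable from it."""
    seen, queue = [], [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        queue.extend(linked_names(fetch(uri)).values())
    return seen
```

Starting from the hypothetical name for Alice, `crawl` reaches the event she attended and then the place where it was held, which is the ‘relationships that can be looked up’ idea in miniature.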

Google, Microsoft and Yahoo have agreed to use standardized formats for data as outlined at Schema.org

Another interesting phrase to come out of the videos is ‘Database Hugging’, coined by Hans. Tim demonstrated it as though he was actually hugging something with his arms. The idea is that people, governments and institutions don’t want to release their data until they’ve created a lovely website to display it. Tim says, by all means make a beautiful website, but first give us the unadulterated data, emphasized with the chanted phrase – ‘Raw Data Now’.

‘Linked Data’ useful links:

The site that draws together useful information about Linked Data is linkeddata.org
There you’ll find, amongst other things:

  • Guides and Tutorials
    • Key Reference Documents
    • Textual Guides/Tutorials
    • Video Tutorials
    • Introductory Slide Sets
    • Frequently Asked Questions
  • Tools
    • Linked Data Publishing Platforms/Frameworks
    • Linked Data/RDF Editors and Validators
    • Tools for Consuming Linked Data
    • Linked Data Applications for End Users

A point that Tim makes in his video at t=11m30s is about the sharing of data to enable Open Science activities to occur, which links in nicely to the previous topic in the Openness in Education course about Open Science.

Tim Berners-Lee returns to do a short TEDtalk ‘updating’ the situation with Open Data:
http://www.youtube.com/watch?v=3YcZ3Zqk0a8

It is interesting to see that over this time we have witnessed more of a willingness by governments across the world to share data openly in what has become known as Open Government or ‘opengov’. So in the US you have sites like the one in the topic readings, and in the UK you have ones including data.gov.uk from HM Government; a BBC interview with Tim Berners-Lee on the project that brought about data.gov.uk is available.

Areas of the media are making it easier and more convenient to use open data, allowing individuals to interrogate and interpret the data to provide information that is useful to them. Examples are The New York Times in the US and The Guardian in the UK (which enables questions to be asked of government data from around the world). The Guardian also has a section online that is dedicated to the journalistic use of open data, called the Data Store.


Open Science #ioe12

Michael Nielsen in his TEDtalk begins by talking about Tim Gowers¹, a renowned mathematician and Cambridge (UK) professor, who asked on his blog in 2009 whether science could be done collectively out in the open. There was a mathematical problem that he wanted to solve, and so he set out via his blog to make all his workings open and invite contributions from anyone and everyone, in the expectation that many people working collectively, expressing their ideas and studying the workings and ideas of others, would reach a solution. This experiment was the Polymath Project. Michael says that he observed the blog at the time and was amazed by the speed of activity; how ideas would quickly develop and be elaborated upon by others, and sometimes be discarded. It took 37 days for the core problem, and even a harder generalization of it, to be solved.

Michael believes what the Polymath Project demonstrated is the potential of the internet to enable us to expand our ability to solve some of the most intellectually challenging problems. It follows from this that there can be an expansion in the range of scientific problems we can go on to tackle. It means that the rate of scientific discovery can be increased. And Michael suggests that it means ‘a changing in the way we construct knowledge itself’.

There are challenges and problems with this approach. One area is development of the community and the lack of contributions by others. Many times wikis have been suggested and developed to encourage the sharing of knowledge and problem solving in different scientific areas, only for them to falter due to lack of participation. Similarly, social networks along the same lines have failed. Primarily, the current reward structure for researchers in higher education institutions is focused on the publication of academic papers in journals, and consequently researchers are much more likely to put their efforts there than into contributing to a collective, community project. So even though the concept might be appealing, and you might think that it would advance scientific endeavour more rapidly, the reward structure doesn’t allow, or actively discourages, participation.

Arguably the Polymath Project succeeded because, even though it was carried out in an unusual way, it was still inherently conservative because academic papers would be ultimately published as a consequence.

Michael suggests that the Open Science movement wants to change the perception that data should be locked away even though it could potentially be useful to others, and to change attitudes to the hoarding of scientific ideas, and even the hoarding of descriptions of problems that researchers believe to be interesting.

The movement is intent on changing this culture of science so that there is greater motivation to share; to share all these different kinds of knowledge. They want to change the values of individual scientists so that they start to see it as part of their job to be sharing their data, to be sharing their code, to be sharing their best ideas and their problems.

Michael Nielsen

This can then lead to changes in the system that then incentivize this kind of activity. It’s not easy, but there are things that scientists and non-scientists can do; as Michael outlines at the end of the video.

http://youtu.be/DnWocYKqvhw#t=13m30s

All the ideas expressed above are extracted from the video and attributed to Michael Nielsen.

The Open Science Project is ‘dedicated to writing and releasing free and open source software for scientific use’. The blog for the project is a very useful source of information. In one particular post it quotes an informal definition of Open Science provided by Michael Nielsen:

Open science is the idea that scientific knowledge of all kinds should be openly shared as early as is practical in the discovery process.

One element in the work of scientific researchers is the Lab Notebook. I’m familiar with the concept from my own scientific training in my days as a science undergraduate and postgraduate researcher. It’s not the first time that I’ve been interested in the concept of Open Notebook Science, which makes up part of the reading for this topic in the course. I wrote a blog post about it back in 2008 when I was experimenting with running multiple blogs (a bit of a daft idea), and received a response from Jean-Claude Bradley, the originator of the concept. Open Notebook Science certainly fulfils the ideal of openly sharing as early as is practically possible.

Since 2004 the Creative Commons has been looking to expand Creative Commons Licensing to the area of science. It had a section known as Science Commons from 2005 to 2010, which is now called Science at Creative Commons. By licensing in this way, science has a greater chance of being practised more openly. There are interesting links to organisations that have adopted a Creative Commons Licence to enable Open Science activities to happen.

¹ I’ve already encountered Tim Gowers more recently with regard to Open Access, and I’ll be writing about that in more detail later.

In the writing of this post, I’ve also seen writings by Michael Nielsen that are relevant to the Open Access section of the course, and I hope to bring those into my later posts.

Managing & Sharing Research Data Part 1

I attended a presentation this morning given by Martin Donnelly of the Digital Curation Centre (DCC), University of Edinburgh, covering ‘Managing & Sharing Research Data: Good practice in an ideal world … in the real world’, held at The University of Sheffield and promoted by the Research Ethics Committee there. It was a two-hour session, with the first part made up of a presentation and the second of a demonstration of an online resource produced by the DCC called the Data Management Planning (DMP) Tool, which enables easy production of DMPs to meet research funding council requirements.

I attempted to make notes during the presentation in the form of this blogpost, so the following is just that: my notes. You might still find some use in them.

Background

DCC was founded in 2004 for UK HE & FE sectors. Its major funder is the JISC. It provides support for JISC projects as well as producing tools, providing guidance, case studies, consultancy, etc.

Body of Presentation

When considering data management there are a number of areas to focus on:

  • Ensuring the physical integrity of the files
  • Ensuring safety of the content (read and understood by your target audience but not accessible by other people / Data Protection / file format / etc.)
  • Describing the data (metadata), and what’s been done to the data
  • Access at the right time – make data available only after publication (embargo)
  • Transferring custody of data from the field to storage, archiving and possibly on to destroying (this process needs managing and is not necessarily done by the data collector)
  • Research Ethics & Integrity.

However, there is also the concept of Openness, Open Science, Open Data that needs to be considered. Martin touched on the Panton Principles with respect to Open Science. These Principles were drafted in Cambridge in July 2009 and officially launched in February 2010. Originally based out of the discipline of chemistry, the concept of the Principles as taken from their website is:

Science is based on building on, reusing and openly criticising the published body of scientific knowledge.

For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.

By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.

[Aside: I shall be returning to this, not least for the ioe12 course.]

Martin also pointed to an article in The Guardian, ‘Give us back our crown jewels’, Arthur & Cross, 9 March 2006.

Our taxes fund the collection of public data – yet we pay again to access it. Make the data freely available to stimulate innovation, argue Charles Arthur and Michael Cross

Research Councils UK (RCUK) is the strategic partnership of the UK’s seven Research Councils. It has produced Common Principles on Data Policy, which Martin summarised as having these key messages:

  1. Data is a public resource
  2. Adhere to standards & best practice
  3. Metadata for ease of discovery and access
  4. Constraints on what data to release
  5. Embargo periods delaying data release
  6. Acknowledgement of / compliance with Terms & Conditions
  7. Data management & sharing activities should be explicitly funded

There are an increasing number of things influencing the management of research data, some of which I managed to jot down:

  • Research outputs are often based on the collection, analysis, etc of data
  • Some data is unique (e.g. date & time specific weather conditions data) and can’t be reproduced
  • Data must be accessible and comprehensible
  • There’s a greater demand for open access to publicly funded data
  • Research today is technology enabled and data intensive
  • Data is a long-term asset
  • Data is fragile and there is a cost to digital data; curate to reuse and preserve
  • Data sharing and research pooling might be more cost-effective: cross-disciplinary and increased global partnership
  • Costs of technology and human infrastructures
  • Increasing pressure to make a return on public investment

Most (but not all) Research Councils are broadly the same in their approach to data management. They generally require a Data Management Plan before funding is granted. NERC has a Data Policy & Guidance (pdf), and also provides data centres for managing funded research data.

EPSRC is the odd one out; it requires all institutions to provide a roadmap for data management by 1st May 2012 and to have it implemented by 1st May 2015.

RCUK has a Policy and Code of Conduct on the Governance of Good Research Conduct (available as a pdf).

Martin highlighted how some universities have got into difficulty with regard to Freedom of Information (FOI) requests. He mentioned Queen’s University Belfast and a request about Irish tree rings that was made under FOI. He also described how Stirling University had received a request from a tobacco company about the take-up of smoking amongst teenagers, useful data for a tobacco company.

The University of Edinburgh has developed a Research Data Management Policy.

The question Martin then put was Why? Why do this? And he outlined the incentives in the form of carrots and sticks.

It’s a good thing

  • Data as a public good (the RCUK common principles)
  • Others can build on your work (Isaac Newton: “If I have seen farther it is by standing on the shoulders of giants.”)
  • Passing on custody so making effective use of resources.

Direct incentives to researchers are:

  • Increased impact of your work
  • Making publications available online increases citations

These are covered more fully in:

More incentives:

  • Increased citations help with the REF
  • Research councils are increasingly rejecting applications on the grounds of poor data management plans
  • You receive more funding if you do this right

And the ‘Sticks’:

There is a concern often raised by academic researchers about how their data will be used or misconstrued if it is out in the open. Martin emphasised the importance of appropriate metadata to try to prevent this. However, he did say that even then, if the data is going to be misconstrued, it will be anyway. Files need to be labelled in an understandable, meaningful, standard and appropriate fashion that includes the project title and date. It would also be useful to maintain a separate log describing the data, to include

  • research context
  • data history
  • where & how to access the data
  • access rights
  • etc.
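The labelling and logging advice above could be sketched roughly as follows. This is a minimal illustration and not something shown in the presentation: the naming pattern, the function names and the log fields are my own assumptions, chosen only to mirror the items in the list.

```python
import json
import re
from datetime import date


def slug(text):
    """Lower-case the text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def data_filename(project_title, collected_on, description, extension):
    """Build a file name that carries the project title and an ISO date.

    The pattern (title_date_description.ext) is an assumed convention,
    not a standard.
    """
    return (
        f"{slug(project_title)}_{collected_on.isoformat()}"
        f"_{slug(description)}.{extension}"
    )


def log_entry(filename, research_context, data_history, access_location, access_rights):
    """Describe one data file using the log fields suggested in the notes."""
    return {
        "file": filename,
        "research_context": research_context,
        "data_history": data_history,
        "access_location": access_location,
        "access_rights": access_rights,
    }


# Hypothetical example values, purely for illustration.
name = data_filename("Tree Rings Study", date(2012, 3, 1), "Site A cores", "csv")
entry = log_entry(
    name,
    research_context="Pilot survey of ring widths",
    data_history=["collected in the field", "transcribed to CSV"],
    access_location="project shared drive",
    access_rights="project team only until publication",
)
log_text = json.dumps([entry], indent=2)  # text that could be saved next to the data
```

Keeping the log as a plain JSON (or even plain text) file alongside the data means it travels with the files when custody is passed on.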

Backup is also a consideration. It is different from archiving. Backup is about loss, damage and recovery of data during the research process. (Archiving is about retaining and providing access at the end of the research process.) There should be some means of off-site backup. There should be an implemented, automatic backup process at the University, Faculty or School level. If not, then a manual backup process is required with set repeat reminders.
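Where no automatic process exists, even a manual backup step can check that the copy is intact. The following is a minimal sketch under my own assumptions (the function names and the choice of a SHA-256 checksum were not part of the presentation): it copies a file to a backup folder and compares checksums of the original and the copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy one file into backup_dir and verify the copy against the original."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target
```

In practice `backup_dir` would point at off-site or networked storage; here it is just a path supplied by the caller.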

Archiving is a case of depositing data for the long term. However, it does require things like checking copyright, consent and data protection. You should use the appropriate archive for your subject discipline. It’s also important to publicise your archived data for increased citations. The point was made that there isn’t yet a standard for data referencing, and that some work needs to be done in this area. The other concerns, about data being used without your knowledge, are just the same as if your published work is plagiarised.

Rachel Kane from RIS in Sheffield highlighted that specific Sheffield resources will be made available soon. She also provided some useful examples of what people were doing at the University, including:

  • Prof. Steve Banwart’s approach to open data in Civil and Structural Engineering
  • Dr Bethan Thomas in Geography SASI
  • HRI Digital – data management services – from application to archiving stages – consultancy

Experiment and discussion

I’ve recently written a couple of blog posts that are receiving a bit of attention: the first was about the changing role of education and the second about Nurphy, a new online service for conversations. I’ve decided to see if I can combine them by asking questions about one on the other and seeing what happens. It’s a bit of an experiment really.

So, here goes. I’ve posted the following up as a conversation that anyone can join, once registered with Nurphy. Will people be willing to sign up for an untested service at this early stage? I’ll find out. The conversation starts here.

Whatever, I’d still like people’s opinions about the following.

Is the rise of the Professional Amateur (Pro-Am), the increase in open educational resources (OER), personal learning environments (PLE), and greater significance of informal learning and research going to lead to a move away from an emphasis on institutional, formal learning?

As people are able to continually express their skills, abilities and achievements via social media, will formalized accreditation, with potentially out-dated assessment systems, be less relevant?

Or are formal learning and research institutions able to adapt quickly enough to the new requirements of society?

Open research – Professional Amateurs – Science in Action

I recently wrote a post that touched upon openness of and elitism in education. I just wanted to express a few more quick thoughts on this, though it is something I intend to return to with a more in-depth look at open education and resources.

I feel the elitism of universities doesn’t lie with who is allowed to become a student; it is more related to the fact that resources are securely tied up within universities, making those resources inaccessible to the majority. Resources in this context could be books or journals (hard copies or online with paid-for institutional subscriptions), the academic discourse, the talents of faculty, the research equipment and facilities, past Ph.D. theses, etc. In addition, it relates to the subjects and specific topics that are deemed to be worthy of teaching or researching, or what the funders deem so.

Universities deal in the currency of degrees, a passport in society. Why in times of recession, such as at present, should it be that otherwise capable individuals are denied their chance of a degree passport because the government puts a squeeze on the number of places available in order to balance the books? A further point is the question of assessment, and is it really a useful measure, or is the ongoing presentation of someone’s work, either within a university or indeed outside it (informal learning), a better reflection of their capabilities and abilities? Indeed, evidence is beginning to accumulate indicating that those who present their work using social media place themselves in a more advantageous position for employment. And shouldn’t publicly funded research be in the public domain anyway? I’ve previously written about Open Notebook Science.

I can envisage how much of this could be opened up to greater access, but I was having a problem with scientific equipment and facilities and how that might be liberated.

There have been some interesting examples where institution-based science projects have reached out to the public for assistance. There was SETI, where you signed up and your computer was utilized while it was on (and you weren’t using it) to process data to search for extra-terrestrial life. Then I recall a project where public volunteers were called for to look for new astronomical bodies in tens of thousands of photographs of space; these were provided online, and after doing a test to see how accurately you could assess the images you could process the live data. It was discovered that humans were much better at seeing differences in the data than electronic processing with image recognition.

Therefore, in a rather detached way people were participating in scientific research.

However, I then heard the repeat of the Friday 25 September Science in Action programme* on BBC World Service at 4:32 GMT on Sunday morning. (Sometimes I’m awake in the night or wake up early.) Listen to the programme. The significant part where this blog post is concerned is the DIYbio article. The article talked about people who are undertaking scientific research, bio-engineering in this case, in their own homes using inexpensive equipment, some bought secondhand on eBay for a fraction of its cost new to a research lab. They are able to design and create new biological parts, devices and systems. Integral to this approach is the support from online communities, DIYbio.org for example, sometimes with professionals voluntarily assisting these communities.

Clay Shirky has talked about the increase in mass amateurization, without being amateurish. This is the breaking down of the dichotomy between ‘experts’ and amateurs, with the creation of a new category – the Professional Amateurs or Pro-Ams. Charles Leadbeater in his book We-Think talks about how mass creativity has seen sites including Wikipedia and YouTube, and the Linux operating system, rise in prominence and signal a shift in the way we and society can organise ourselves; participation becoming the key element.

All of this, for me, raises the question, “Are universities, education systems and society more generally getting ready for the future of learning and research?”

* This particular programme doesn’t seem to be archived, though you can usually listen to the two most recent episodes, so I guess you’ve probably got a couple of weeks to hear it before the link is broken.