Page 115 - AC/E's Digital Culture Annual Report 2015
in a transaction, the bibliographic reference for a book in a library, or the demographics of the users who register on a website).

Data can be proprietary or open. Proprietary data is generally collected by organisations from their own websites and platforms, and often kept private. Open data [11], for its part, is publicly available for anyone to use. This data, often gathered by academic researchers, public bodies and the government [12], can be downloaded from the web or accessed via Application Programming Interfaces (APIs).

When we start to collect and combine data, we can end up with Big Data. This popular concept was first used by technology consultants Gartner in 2001 [13] to refer to datasets that are hard to manage using traditional database and analytical technologies. The reasons for this are that these datasets:

• Have high volume: they are too large to be stored on a single computer, or even on a single server. Instead, they need to be distributed across a company’s data infrastructure, using multiple computing clusters.

• Have high variety: these datasets often bring together data from many different sources, including ‘unstructured’ data like video, audio and text, which are very different from ‘well structured’ tables of data (like financial information). “Messy” data has to be cleaned before one can work with it, and can be hard to store in ‘relational’ database architectures such as SQL databases.

• Have high velocity: these datasets are generated at high speed (remember those 40,000 Google searches every second I mentioned before), and in order to create value, they need to be analysed and acted upon “in real time” (faster than any human could manage, which requires automated analysis).

Nowadays, the term “Big Data” is used to talk about the ever-expanding collection of technologies that help organisations deal with large volumes, velocities and varieties of data. They include Hadoop [14] (a framework for distributed data processing), Cassandra [15] (a distributed database system) and Hive [16] (a data warehouse system for big data), among many others.

So far I have focused on data inputs and the technologies to manage them. However, if data is going to generate benefits, it has to be analysed. This involves a variety of methods coming under the umbrella of Data Analytics (see Booz Allen Hamilton’s Field Guide to Data Science for a summary [17]), including:

• Statistics to summarise and test hypotheses about the data, and see how generalizable anything we find is beyond the data we already have.

• Text mining to explore unstructured information, for example through “sentiment analysis” [18] that can determine whether the

Juan Mateos García