The Data Life Cycle
Let’s take a step back from data science and look at the larger picture of the data life cycle.
The cycle starts with the generation of data. People generate data: Every search query we perform, link we click, movie we watch, book we read, picture we take, message we send, and place we go contributes to the massive digital footprint we each generate. Walmart collects 2.5 petabytes of unstructured data from 1 million customers every hour. (One petabyte is equivalent to 20 million filing cabinets.) Sensors generate data: More and more sensors monitor the health of our physical infrastructure, e.g., bridges, tunnels, and buildings; provide ways to be energy efficient, e.g., automatic lighting and temperature control in our rooms at work and at home; and ensure safety on our roads and in public spaces, e.g., video cameras used for traffic control and for security protection. As the promise of the Internet of Things plays out, we will have more and more sensors generating more and more data. At the other extreme from small, cheap sensors, we have large, expensive, one-of-a-kind scientific instruments, which also generate unfathomable amounts of data. The Large Hadron Collider generates tens of petabytes a year. The Large Synoptic Survey Telescope is expected to generate 1.28 petabytes a year. The latest round of the Intergovernmental Panel on Climate Change (IPCC) will produce 100 petabytes of data.
After generation comes collection. Not all data generated are collected, either by choice, because we do not need or want them, or for practical reasons, because the data stream in faster than we can process them. Consider how data are sent from expensive scientific instruments, such as the IceCube neutrino detector at the South Pole. Since there are only five polar-orbiting satellites, there are only certain windows of opportunity to transmit restricted amounts of data from the ground to the air. Suppose we drop data between the generation and collection stages: Could we possibly miss the very event we are trying to detect? Deciding what to collect defines a filter on the data we generate.
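To make the idea of collection as a filter concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `collect`, the `threshold` rule, and the `budget` (standing in for a limited transmission window) are all hypothetical and do not describe how IceCube or any real instrument decides what to send.

```python
from typing import Iterable, List


def collect(readings: Iterable[float], threshold: float, budget: int) -> List[float]:
    """Keep readings at or above `threshold` until `budget` items are kept.

    `threshold` and `budget` are hypothetical stand-ins for a selection rule
    and a limited transmission window.
    """
    kept: List[float] = []
    for value in readings:
        if len(kept) >= budget:
            break  # the window is full; everything generated after this is dropped
        if value >= threshold:
            kept.append(value)
    return kept


# Made-up readings; the last one is the largest signal in the stream.
generated = [0.1, 0.9, 0.4, 1.2, 0.95, 3.0]
collected = collect(generated, threshold=0.9, budget=3)
print(collected)
```

In this toy run the largest reading, 3.0, arrives after the budget is used up and is never collected, which is exactly the risk raised above: the filter we choose can drop the very event we care about.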
After collection comes processing. Here I mean everything from data cleaning, data wrangling, and data formatting to data compression, for efficient storage, and data encryption, for secure storage.
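As a rough sketch of what this stage can look like, the Python below cleans a few made-up records, puts them in a uniform format, and compresses the result with the standard `zlib` module. The record fields (`name`, `value`) and the function names `process` and `restore` are assumptions for illustration only. Encryption is left out because the Python standard library does not include a general-purpose cipher; in practice it would be a further step applied to the compressed bytes.

```python
import json
import zlib


def process(records):
    """Clean, format, and compress a list of hypothetical sensor records."""
    # Cleaning: drop records whose value is missing.
    # Formatting: trim and lowercase names, store values as floats.
    cleaned = [
        {"name": r["name"].strip().lower(), "value": float(r["value"])}
        for r in records
        if r.get("value") is not None
    ]
    payload = json.dumps(cleaned, sort_keys=True).encode("utf-8")
    # Compression: shrink the payload before it is stored.
    return zlib.compress(payload)


def restore(blob):
    """Invert the compression and formatting steps of `process`."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Made-up input with inconsistent names, mixed types, and a missing value.
raw = [
    {"name": "  Sensor-A ", "value": "3.5"},
    {"name": "sensor-b", "value": None},
    {"name": "Sensor-C", "value": 7},
]
blob = process(raw)
restored = restore(blob)
print(restored)
```

Compression here is lossless, so `restore` returns exactly the cleaned records; the record with the missing value, however, is gone for good, because cleaning is itself a decision about what to keep.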
After processing comes storage. Here the bits are laid down in memory. Today we think of storage in terms of magnetic tape and hard disk drives, but in the future, especially for long-term, infrequently accessed storage, we will see novel uses of optical technology and even DNA storage devices.
After storage comes management. We are careful to store our data in ways both to optimize expected access patterns and to provide as much generality as possible. Decades of work in database systems have led us to highly optimized systems for managing relational databases, but the kinds of data we generate are not always a good fit for such systems. We now have structured and unstructured data, data of many types (e.g., text, audio, image, video), data that arrive at different velocities, and data that stream in continuously and in real time. We need to create and use different kinds of metadata for these dimensions of heterogeneity to maximize our ability to access and modify the data for subsequent analysis.
Now comes analysis. When most people think of data science, they mean data analysis. Here, I include all the computational and statistical techniques for analyzing data for some purpose: the algorithms and methods that underlie data mining, machine learning, and statistical inference, whether to gain knowledge or insights, build classifiers and predictors, or infer causality. For sure, data analysis is at the heart of data science. Large amounts of data power today’s machine learning algorithms. The recent successes of the application of deep learning to different domains, from image and language understanding to programming to astronomy, continue to astound me.
But it is not enough to analyze data and spit out an answer. We need data visualization to help present the answer in a clear and simple way that a human can readily understand. Here a picture is worth not a thousand words (that comes later) but a thousand petabytes! It is at this stage in the data life cycle that we need to consider, along with functionality, aesthetics and human visual perception to convey the results of data analysis.
It is also not enough just to show a pie chart or bar graph. Through interpretation, we provide the human reader an explanation of what the picture means. We tell a story explaining the picture’s context, point, implications, and possible ramifications. It was only after I came to Columbia and began talking to my colleagues in the journalism school that I understood the importance of storytelling for the end user.
Finally, we have the human. The human could be a scientist who, through data, makes a new discovery. The human could be a policymaker who needs to make a decision about a local community’s future. The human could be in medicine, treating a patient; in finance, investing client money; in law, regulating processes and organizations; or in business, making processes more efficient and reliable to serve customers better.
In the diagram, I omitted the arrows that show that it is a cycle and that there are many intermediate cycles. Inevitably, after we present some observations to the user based on data we generated, the user asks new questions, and these questions require collecting more data or doing more analysis.
I also omitted from the diagram the importance of using data responsibly at each phase in the cycle. We must remember to consider ethical and privacy concerns throughout, from privacy-preserving collection of data about individuals to ethical decisions that humans or machines will need to make based on automated data analysis. This dimension of the data life cycle is so important that I plan to devote future posts to it.
And so, from raw bits to a rich story, we have the data life cycle.
Jeannette M. Wing is the Data Science Institute's director and a professor of computer science at Columbia University. For more data commentary and analysis, visit datascience.columbia.edu/voices.