Robert Grossman, Founder of Open Data Group, sat down with WashingtonExec to discuss the emerging discipline of data science in a time of constrained budgets, and whether big data analytics and data mining are disruptive technologies.
Grossman also talked about his new book, The Structure of Digital Computing: From Mainframes to Big Data, the mistakes he’s made over his long career, and the 12 rules he follows when starting a new project.
WashingtonExec: What are the differences between the way that Google and Facebook view big data technology and the way that the government does?
Robert Grossman: There are some important differences between how Google and Facebook deal with big data and how a government agency might. Compliance and security are much easier on the commercial side, and those most successful at using big data, like Google and Facebook, often have greenfield implementations. For government agencies, it is much more challenging because the metric for success is not profit but mission impact, and mission impact is harder to quantify than profit. Compliance for government big data applications is also extremely complex – many of the applicable laws were written for the technology of a different generation. Government big data applications also typically interoperate with more legacy systems than their commercial counterparts do.
————————————————————————————-
“There is pretty good practice out there of how to build data warehouses. There is not a lot of good practice or knowledge out there about how to build statistical models over big data.”
————————————————————————————-
WashingtonExec: Do you consider big data, data mining or data analytics a disruptive technology? Google started using data mining in ’95…would you consider today’s big data technology “new”?
Robert Grossman: That’s a great question. I’m not an expert in disruptive technology, so please take my response with a grain of salt.
By the general definition of disruptive, applying big data and predictive analytics to products and services that have not used these technologies before can be quite disruptive.
Clayton Christensen defines a disruptive innovation as a process by which a new product or service takes root initially in simple applications at the bottom of a market and then moves ‘up market’ to eventually displace established competitors. Disruptive innovations typically create new markets.
In contrast, the term sustaining innovation is sometimes used to describe the improvement of a product or a technology in an existing market.
From this perspective, for those products and services that already use predictive analytics over the available data, it may be better to think of data mining, predictive analytics and big data as enabling sustaining innovations. These technologies date back over forty years and progress has been evolutionary; in this sense they are not new, but are constantly being improved. All that has changed are the names: from computationally intensive statistics in the 1980s, to data mining in the 1990s, to predictive analytics in the 2000s.
What is different today is that if a product or service is not already using analytics, or is using analytics over only some of the relevant data, then current technologies make it much easier to create applications that leverage analytics, and indeed to create applications that leverage analytics over all the data. Some of these applications can be disruptive innovations.
Examples include:
- Big data and real-time predictive analytics applied to advertising have created the field of computational advertising, which has led to several new billion-dollar markets, including search advertising.
- Big data and predictive analytics applied to network security are starting to have an impact on security products and services. Log files are one of the poster children of big data and log files are ubiquitous in network security applications.
Google’s data stack of a decade ago, consisting of the Google File System, MapReduce and BigTable, showed that data analysis could scale out to the size of a data center. Although it was standard among experts at that time to use Beowulf clusters and scale out to analyze large datasets, the scale at which Google did it and the integrated stack that they used changed the way that many people approached analyzing large datasets and set the stage for today’s broader interest in big data.
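To make the scale-out idea concrete, here is a minimal sketch of the MapReduce programming pattern in Python. It is an illustration only, not Google’s implementation: the function names and the word-count-style example over log lines are invented, and a real framework would distribute the map, shuffle, and reduce steps across a cluster rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain

# Illustrative sketch of the MapReduce pattern, not Google's implementation.
# A distributed framework would spread the map, shuffle, and reduce steps
# across a cluster; here everything runs in a single process.

def map_fn(record):
    # Map step: emit (key, value) pairs; here, one pair per token in a log line.
    for token in record.split():
        yield token, 1

def reduce_fn(key, values):
    # Reduce step: combine all values seen for a key; here, a simple count.
    return key, sum(values)

def map_reduce(records):
    grouped = defaultdict(list)
    # "Shuffle": group the mapped values by key.
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    log_lines = ["GET /index.html 200", "GET /index.html 404", "POST /login 200"]
    print(map_reduce(log_lines))  # {'GET': 2, '/index.html': 2, '200': 2, ...}
```

The appeal of the pattern is that the analyst writes only the map and reduce functions; the surrounding framework handles partitioning the data and scaling the computation out across the data center.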
The application of this technology to new markets has created and will continue to create disruptive innovation, as the two examples above illustrate.
WashingtonExec: Speaking of the government, what do you think of OSTP’s Big Data Initiative, and what can the Administration do to make bigger strides in big data?
Robert Grossman: I wasn’t involved with it, but I was at the event at which it was launched. I think it’s an extremely important initiative. It calls attention to the emerging discipline of data science. The challenge is that most of the agencies have constant or shrinking dollars. Initiatives, such as the Big Data Initiative, have to be balanced against an agency’s entire portfolio of projects and initiatives, and there are some hard choices to make. If you look historically, we are in the middle of a transition in how we make online data useful. Several years ago, when you wanted data, you basically had to screen-scrape it or get it from a PDF file. Things are obviously much better now. On the other hand, although the federal budget, for example, is available online, it is not available in a format that makes it easy to draw inferences or to integrate it automatically with other datasets. I’d like to see a time when data is not only available but available in such a way that we can access it easily with an API and create distributed environments that support data discovery.
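As a hypothetical illustration of that difference, the sketch below assumes a made-up, machine-readable budget endpoint that returns JSON; the URL, field names, and functions are invented for the example. The point is how little code is needed to pull and aggregate data once it is published behind an API rather than locked in PDF files.

```python
import json
from urllib.request import urlopen

# Hypothetical example: the endpoint, field names, and functions below are
# invented for illustration. Data published behind an API in a machine-readable
# format can be pulled and aggregated programmatically instead of being
# screen-scraped from web pages or PDF files.
BUDGET_URL = "https://api.example.gov/budget/2012?format=json"

def fetch_budget(url=BUDGET_URL):
    # Assume the endpoint returns a JSON list of line items,
    # e.g. [{"agency": "NASA", "amount": 17800000000}, ...]
    with urlopen(url) as response:
        return json.load(response)

def total_by_agency(line_items):
    totals = {}
    for item in line_items:
        totals[item["agency"]] = totals.get(item["agency"], 0) + item["amount"]
    return totals
```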
WashingtonExec: You’re writing a book on big data based on your experiences. What was the biggest mistake you made and what did you learn from it?
Robert Grossman: Whenever I do a big data project, I usually follow about 12 rules. Occasionally, I make a judgment call that in a particular case I can violate one of these rules and it usually is a disaster. The three most important rules that I have are: Can you get access to the data? Do you have an environment where you can build models? Do you have an environment where you can deploy the models you build into operational systems? For example, if the operational environment is not ready to receive a model, you can build the world’s best model but if you can’t deploy it, it doesn’t matter. Today, building models over big data is about where we were with data warehousing approximately ten or twenty years ago. There is pretty good practice out there of how to build data warehouses. There is not a lot of good practice or knowledge out there about how to build statistical models over big data. My biggest mistake is just violating one of those first three rules.
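As a rough sketch of what the second and third rules mean in practice (assuming scikit-learn and joblib are available; the features, labels, and file name here are illustrative, not from an actual project), the model-building environment trains and serializes the model, and the operational environment only needs a small scoring function that loads and applies it:

```python
# Build environment: train a model and serialize it as a hand-off artifact.
# Assumes scikit-learn and joblib; the features, labels, and file name are
# illustrative, not from an actual project.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_model(features, labels, path="risk_model.joblib"):
    model = LogisticRegression()
    model.fit(features, labels)      # features: 2-D array, labels: 0/1
    joblib.dump(model, path)         # artifact handed to the operational side
    return path

# Deployment environment: the operational system only needs to load the
# serialized model and score incoming records.
def score(record, path="risk_model.joblib"):
    model = joblib.load(path)
    return model.predict_proba(np.asarray(record).reshape(1, -1))[0, 1]
```

If the operational side cannot run even that small scoring step, the best-built model never matters, which is the point of the third rule.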
WashingtonExec: In your book, The Structure of Digital Computing, you talk about the “data gap.” How large or significant is the gap between the amount of data available and the number of people able to analyze it (the trained workforce)?
Robert Grossman: The gap is indeed significant, and this is a major problem. It has always been a challenge to train enough people to analyze data and to build predictive models over it. Analyzing big data and building statistical models over it requires, in addition, specialized knowledge of how to work with big data, and this “double burden” further limits the workforce.
On the bright side, since the knowledge is somewhat specialized, there are great opportunities for those who do have this knowledge.
In addition to training people to analyze data and to build analytic models over it, there is the related problem of how to manage and govern this process so that the right analytic models are built and deployed efficiently into operational systems.
Over the next ten years, I expect we will see a growing body of knowledge around the management and governance of analytics. Analytics today is where software was in the 1960s, before software development was thought of as a process that needed to be managed and governed. I am one of those trying to work out best practices not only for building and deploying analytic models, but also for managing and governing these processes. I am currently writing a book about this.
WashingtonExec: Can you tell me a little bit about being on the NASA Advisory Council Information Technology Infrastructure Committee?
Robert Grossman: NASA is one of the lead agencies in the federal government for cloud computing, and cloud computing is a wonderful platform for analyzing, managing, storing, and building models over big data. As NASA has moved ahead on some of the technology in cloud computing, it’s been exciting to watch the progress. The science side of big data is beginning to be called data science. As that discipline develops, it is going to be transformational for the science-oriented agencies, including NASA. One of the challenges with constant or shrinking dollars is that spending on big data has to be balanced against all of the other computing research initiatives. It’s going to be an interesting challenge for everyone, as big data emerges, to balance it within the research portfolio.
WashingtonExec: What is your favorite part of being a professor at the University of Chicago?
Robert Grossman: I lead a big data research group and I’ve talked a little bit about the emerging discipline of what’s being called data science. It’s really exciting to be around for the birth of a new science. It doesn’t happen all that often. It happened around the turn of the last century with quantum mechanics and I think it is happening at the beginning of this century with data science. I’m just excited to be working in a field that is still formulating the core concepts, the foundations, and the basic algorithms. This is the most fun I’ve had in quite a while.
WashingtonExec: Are you at all worried about the new open data initiative spearheaded by Todd Park, Federal CTO? Do you think the security standards need to be updated with all of this new big data analytics, or do you think we are pretty much on track?
Robert Grossman: I think we are on track, but I also think big data creates an opportunity to improve security and compliance. With big data you typically have to build new infrastructure. Whenever you build new infrastructure, you can ask: knowing what I know today about best practices, how can I build in security? How can I build in compliance? How can I build in privacy? For example, I’m working on a project that is building what we call a biomedical cloud containing petabytes of integrated biology, medicine and healthcare data. We are using the opportunity to put in place an infrastructure that can easily support best practices for security, compliance and privacy.
WashingtonExec: If you think the term big data waters down its actual meaning, what other term do you think should be used to describe the kinds of things we’ve talked about today?
Robert Grossman: A more accurate name might be data-driven decision support, but that is certainly not a name that’s going to catch on. For specific applications, such as advertising, we don’t talk about data-driven decision support for advertising; instead we usually use the term computational advertising. For scientific applications, the term used is data science. Each field typically creates its own name. Unfortunately, in general, there really isn’t a better term I can suggest.
WashingtonExec: Is this hyperconnected world more or less safe than the fragmented one we used to live in?
Robert Grossman: There are certainly more opportunities for bad players to create problems in a hyperconnected world; bad players need fewer resources and can easily do harm remotely. On the other hand, big data and analytics can be used to improve how we protect our data, our web interactions, our systems, and our networks. As an example, the larger companies in the financial services industry make use of these technologies to protect the services they offer.