Google and IBM say we need to train more supercrunchers

There was an article in the New York Times today about the effort that companies like Google and IBM are making to give university students access to very powerful computing environments, so that engineers and scientists can learn to plow through massive data sets. Their argument is that students are being trained right now to think on a gigabyte scale (if they’re lucky enough to be trained how to analyze real data at all), when all the breakthroughs are happening with datasets on the terabyte and petabyte scales.

I couldn’t agree more with this analysis. If people are serious about analyzing those “very rare events”, “long tails” or whatever else can make the difference between profit and loss, success and failure, or even life and death, then we can’t keep running on assumptions just because the model fits 80% of the time and that level of analysis is too hard to do anyway. We all saw what happened with that idea.

When I was working at Lincoln, we created a highly accurate model of U.S. near mid-air collisions. We did this by analyzing about 5 terabytes’ worth of radar data from across the country (about 8 months’ worth). Nobody had ever done this before on anything close to that scale.

As a result, we had orders of magnitude more data on near mid-air collisions (a very rare event) than the last model, built in the early 1990s. Without this data, and the high-powered systems available at Lincoln that we used to analyze it, our model would have suffered from the same assumptions and modeling errors as previous attempts. That is just not good enough for developing something as important as the next generation of collision avoidance systems for manned and unmanned aircraft, which people are now doing at Lincoln, largely as a result of that effort.
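To put a rough number on why rare events demand so much data, here is a back-of-the-envelope sketch. It is not the Lincoln model: it assumes each observation is an independent yes/no trial (a simple binomial picture, which real radar tracks would not satisfy), and the event probability used below is a made-up illustration, not a measured near mid-air collision rate.

```python
import math


def relative_standard_error(p, n):
    """Relative standard error of an estimated event rate.

    Assumes n independent observations, each with true event
    probability p (binomial model). Hypothetical helper for illustration.
    """
    return math.sqrt((1.0 - p) / (n * p))


def observations_needed(p, target_rse):
    """Smallest n (up to rounding) whose relative standard error is <= target_rse."""
    return math.ceil((1.0 - p) / (p * target_rse ** 2))


if __name__ == "__main__":
    p = 1e-6  # illustrative probability only, not a real collision rate
    for n in (10**6, 10**8, 10**10):
        print(f"n = {n:>14,d}  relative error ~ {relative_standard_error(p, n):.3f}")
    print("observations for ~10% relative error:", observations_needed(p, 0.10))
```

Under these assumptions the relative error shrinks only with the square root of the data volume, so each extra digit of precision on a rare event costs about a hundred times more observations.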

The ability to analyze massive data sets has been proven again and again to be a competitive advantage in biotech, finance (for those who do it correctly), the internet, and even marketing, earning the companies that developed those competencies hundreds of billions of dollars.

Is it then a stretch to say that the next lucrative opportunity in operations management will be to develop the capabilities to harness the massive amounts of data companies already generate every day? I’m talking about everything from inventories to machine control outputs and even intra-company emails. There are signals in that data, just as there are signals in everything from our DNA to the stock markets, if you look hard enough.

To be honest, I don’t know (I’m new to this stuff!), but that’s why several of my classmates and I are trying to start a new track for LGOs in the EECS department this year called Information and Decision Systems. The focus of this track is to develop the theoretical, practical and communication skills of students who want to take on this operations challenge in the real world, for real companies. That means not just studying and learning the algorithms, but also getting a design background in the networking, database and parallel computing systems that are critical enablers of this type of work. It also means developing specialized communication skills to explain the opportunities and the results because, as the NYT article said, most people have not been trained to think on this scale before.

I could talk for pages more about this topic, but let’s just leave it at that for now. I just had to write something because I’m obsessed with this idea, and this article got me all excited. I’m definitely going to look into Hadoop…

About chewbarod
I am generally interested in the application of machine learning to aid decision making, and specifically interested in developing probabilistic models of systems in order to aid modelling and simulation. Recently, I have done research in the area of airspace safety, working in the Surveillance Systems group at MIT Lincoln Laboratory. I mainly worked in the TCAS and UAV mission areas, specifically on collision avoidance systems. In the past, I have researched the application of machine learning algorithms to the care of patients with chronic diseases such as diabetes and hypertension. As an LGO fellow, I hope to research how machine learning can improve efficiency and lower risk in manufacturing and/or supply chain operations.
