Steven Hillion, Vice President of Analytics at EMC Greenplum

As the mounds of data pile up from every direction, businesses are going to be differentiated increasingly by how they use that data. The essential skill set needed for the enterprise of the 21st century is that of the “data scientist,” a role dedicated to understanding and making use of data to help a business or other organization. But, just like “big data,” “consumerization” and other new trends, the “data scientist” has many definitions. I’ve been asking leading practitioners about their take on the emerging role, in the hopes of arriving at a conclusive and constructive definition. (See this problem statement for my larger research agenda: Growing Your Own Data Scientists.)

Steven Hillion is vice president of analytics at EMC Greenplum, and has led engineering at firms such as Siebel, KANA and QRS. His educational background includes a major in number theory at the University of California, Berkeley, which got him thinking about moving software engineering closer to mathematical applications. At EMC Greenplum, he is in charge of the design of the company’s Chorus product, which aims to support collaborative analysis of data, and also helps customers improve their data analysis practices. (See “Does Big Data Always Require Big Money” for more on the need for agility in data analysis.)

To Hillion, data scientists are “analytically-minded, statistically and mathematically sophisticated data engineers who can infer insights into business and other complex systems out of large quantities of data.”

The skill set of the data scientist goes beyond the capabilities of what many would call “traditional business intelligence (BI).” Traditional BI is interested in the “what and the where,” while data scientists are interested in the “how and why,” Hillion says. “They’re interested in inferring things that are not already present in the data.”

For example, traditional BI, and the skill set of people who have grown up with it and been trained on it, will tell you how many widgets you sold in a region compared to last year. A data scientist can tell you why sales plummeted in the Northwest compared to every other region - or at least would have a hypothesis. He thinks of the people on his team at EMC Greenplum as “data craftsmen,” he says, “because what they do is take the raw material of the data and skillfully work it into structures that are useful and striking. Like the Craftsman style [of architecture], for better or worse, the process is often manual and laborious, and is not just the repeated application of standard templates and techniques.”

Data scientists are not a common breed, which is why Hillion (and I) think they will be in demand. They are equal parts engineer, statistician and investigative journalist / forensic reporter.

“Data scientists are examples of those rare professionals who bring in talents from a lot of different areas,” Hillion says. “The first thing they need to be able to do is understand the business. They need to listen to people, understand what questions they’re asking, but then sort of read between the lines. So, if business tells you they want to understand year-on-year sales, you want to dig a little deeper and actually figure out it’s because sales are plummeting in the Northwest that they’re asking that, and then you want to be able to ask them about their hypotheses. ‘Why do you think that might be happening?’ So you have to be a good interviewer to be a data scientist.”

Additionally, data scientists need to be strong domain experts, with skills honed to the industry in which they work. A data scientist working for a medical insurance company needs to understand that domain, just as a data scientist working for a retail company needs to have a keen understanding of how prices and promotions work, Hillion says.

Skills in mathematics, statistics, modeling and data mining are of course essential. The best data scientists, according to Hillion, come from physics, bioinformatics and other applied fields, because modeling and experimentation with real data sets are important in these fields. For the time being, the hunt for data scientists will involve combing through graduates in a number of related fields, because few universities yet offer degrees in data science. There are early signs that this is changing - the software company SAS has partnered with North Carolina State University to create the Institute for Advanced Analytics, which offers a master of science in analytics degree. But it’s still early days for data science in education, Hillion says.

“I’m sure in 30 years’ time, there will be lots and lots of degrees in data science and that’s where [data scientists will] come from, but right now it’s coming from all these different buckets,” Hillion says.

And, just as early computing was born in the garages of Silicon Valley do-it-yourselfers, data science is likely to develop first in an ad hoc, hands-on way.