What is a Data Scientist?

25 May, 2016

It turns out that this question is harder to answer than you might think.

I’ve just performed a quick Google search, and the top hit was a page giving 14 different definitions of what a data scientist is and what they do. As someone who purports to be a Data Scientist, I’m constantly irked by the liberal use and abuse of the job title, so I thought it was high time I waded into the debate.

Apparently (Wikipedia research alert), the term “data science” has been around since the sixties, when it was used interchangeably with “computer science” to describe the work of anyone who did any machine-aided processing of data. Of course, these days the field of computer science is very far removed from just the processing of data (at least that’s what computer scientists like to think) and the title of data scientist has been repurposed to describe people who do something to do with data… And, no, I’m not talking about statisticians!

Data? Statistics? What’s the Difference?

This does raise the interesting question of what the difference is between a data scientist and a statistician. Many definitions of a data scientist suggest that they derive insight from data and present it to others in a form which is more readily accessible. Arguably, this is exactly what an applied statistician does, so what makes data scientists so special?

A data scientist is a statistician but they also have other skills

In my opinion, this confusion between statisticians and data scientists is precisely why the definition of a data scientist as “someone who derives insights from data” is literally not sufficient (that was a statistics pun for those of you who weren’t paying attention!). In my book, a data scientist is a statistician but they also have other skills and knowledge that set them apart. For me, a data scientist has three main attributes:

1) Knowledge of statistics and statistical methods

2) Computer programming experience

3) Business analysis skills

I suspect that the first two are fairly uncontroversial. In my experience, though, very little is made of the fact that data scientists constantly have to analyse processes and procedures, both to understand the context of the data that they’re working with and to work out how any insights or models that they produce can actually be incorporated to improve or complement the existing processes. Propensity modelling on a website is a classic example of this.

Suppose a data scientist is tasked with designing a propensity model to suggest which clothes a customer might be interested in purchasing next. A statistician’s approach would be to gather all the available data and then design a model which, for a given set of behaviours, gives a prediction of what the customer would next go on to buy. A data science approach to the same problem could be to use the same data but then develop a model which operates interactively, while the customer is browsing the website, to try to influence the purchase decision towards a favourable outcome for the business. The difference can be quite subtle, and may lead to the same conclusions, but I would suggest that the mind-set is quite different.
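To make the contrast a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy browsing data, the function names and the naive co-occurrence counting are hypothetical stand-ins, not a real propensity model. The same fitted counts are used in two ways: once to produce a single static prediction from a finished browsing record, and once to re-score after every click so that the website could react while the customer is still browsing.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (categories viewed in a session, category bought).
HISTORY = [
    (["jeans", "shirts"], "shirts"),
    (["jeans", "jackets"], "jackets"),
    (["shirts", "jackets"], "jackets"),
    (["jeans"], "jeans"),
]


def fit_propensity(history):
    """Count how often viewing a category was followed by each purchase."""
    counts = defaultdict(Counter)
    for viewed, bought in history:
        for category in viewed:
            counts[category][bought] += 1
    return counts


def score(counts, viewed):
    """Return purchase propensities (summing to 1) for the views so far."""
    totals = Counter()
    for category in viewed:
        totals.update(counts.get(category, {}))
    n = sum(totals.values())
    return {item: c / n for item, c in totals.items()} if n else {}


def batch_prediction(counts, viewed):
    """Static use: one prediction from a completed browsing record."""
    scores = score(counts, viewed)
    return max(scores, key=scores.get) if scores else None


def interactive_recommendations(counts, click_stream):
    """Interactive use: re-score after every click so the site can react."""
    viewed = []
    for category in click_stream:
        viewed.append(category)
        yield category, batch_prediction(counts, viewed)


counts = fit_propensity(HISTORY)
print(batch_prediction(counts, ["jeans", "jackets"]))
for click, suggestion in interactive_recommendations(counts, ["jackets", "jeans"]):
    print(click, "->", suggestion)
```

The model is identical in both cases; what changes is when it is called and what is done with the answer, which is the difference in mind-set described above.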

So What Does A Data Scientist Do?

For me, a data scientist proves their worth by being an active consumer of data, not just by providing a static analysis of the data at hand and then producing a report. A successful data scientist has to be able to do all of the following:

Analyse the data in context to develop insights from it.
Prototype models to show why a particular outcome occurred.
Design algorithms to exploit the underlying insights from the data.
Develop validation strategies and tests to prove the design works.
Understand the business and technical challenges involved in implementing their design.
Understand the limitations and assumptions of the final implementation.
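As one illustration of the validation item in the list above, here is a minimal sketch of k-fold cross-validation in plain Python. The data, the function names and the choice of mean absolute error are all assumptions made for the example, not a prescribed method. A candidate model (a straight-line fit) is compared with a deliberately naive baseline (always predict the mean), and each is scored only on data it was not fitted to.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each item is held out exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(xs, ys, fit, predict, k=5):
    """Average the mean absolute error over k held-out folds."""
    errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_error = sum(abs(predict(model, xs[i]) - ys[i]) for i in test_idx)
        errors.append(fold_error / len(test_idx))
    return sum(errors) / len(errors)


def fit_mean(xs, ys):
    """Baseline: ignore x and remember the average y."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data that follows y = 2x + 1 exactly.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
baseline_error = cross_validate(xs, ys, fit_mean, predict_mean)
line_error = cross_validate(xs, ys, fit_line, predict_line)
print(f"baseline MAE: {baseline_error:.3f}, line MAE: {line_error:.3f}")
```

Scoring against a baseline on held-out data is only one possible strategy; in practice the choice of error measure and the way the folds are drawn would depend on the business context.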

Most of what I’ve talked about so far has been to do with analysing the data and developing models, but this is only half of the job. There is no point in developing a clever model which solves all of a business’s problems if it can’t actually be incorporated into the business’s procedures, IT infrastructure, software design, etc. This is why it is important that a data scientist is able to collaborate with developers, business analysts, and technical architects to understand the physical limitations and non-functional requirements that their algorithm has to work with. In extreme situations this may mean that the original solution is unworkable and has to be completely redesigned, but this is exactly why data scientists shouldn’t just develop a model and then throw it over the wall to some developers to implement. This is another reason why data scientists are more than just statisticians.

It is widely acknowledged that there are issues when it comes to developers implementing algorithms that have been designed by data scientists. Often this problem is solved by the data scientist restricting themselves to simple methods that, while not as effective as a more sophisticated technique, should be straightforward for a developer without a data science background to implement. I have frequently heard that data scientists end up implementing linear regression models because at least they know that these aren’t going to be problematic to implement. For me, this is the wrong way of going about things.

Ideally, the developers who are tasked with implementing the latest creation from a data scientist should have some understanding of the methods involved and, by the same token, the data scientist should have enough understanding of the developers’ environment to be able to work with them to modify the algorithms to fit requirements surrounding execution time, data structure, data timeliness, etc. If this effort is not made, there is a danger that the potential solutions are too limited or, in the worst case, that the developer implements something that doesn’t actually do what is intended.

The Cerberus of Data

Okay, so I admit that one person who is an expert in all of the different areas I’ve just outlined is hard to come by, so what do you do in reality? Fortunately, for sophisticated projects at least, you wouldn’t just have one person who is responsible for anything that might have some sort of data science aspect to it; the different facets are shared out within a data science team. By ensuring that your team has a diverse set of these skills, different people with different specialities can take up the different tasks involved and hopefully fulfil all of the requirements of the data scientist brief. It does require a certain amount of teamwork (not necessarily a universal trait among data scientists!) but it is often the only way to overcome the problem.

So, while it appears that there is a set of skills that a data scientist must have, it turns out that data scientists actually come in various different flavours. Depending on what their specialities are, different data scientists will be better suited to different parts of the job, so you wouldn’t expect one person to be able to do everything. In retrospect, this does seem to undermine the whole subject of this blog somewhat but then the analysis of my self-defeating nature is the topic of another blog post. Maybe I need to get a data scientist to look into that… Or maybe two…