It’s one of the highest-paid jobs in tech: Data Scientist. This year, Matthew Renze is introducing Data Visualization with R at Devoxx UK. Ahead of his talk, we asked him about Data Science and the art of data manipulation.


What is data science and why is it important?

Data science is an interdisciplinary field composed of computer programming, math and statistics, and domain expertise. Its goal is to transform raw data into actionable insight, i.e. to create knowledge that can be used to make rational decisions.

Data science has become increasingly important in recent years. The world is going through a second industrial revolution, specifically an information revolution. Essentially, we are transitioning from what is primarily an industrial economy into an information economy. This new economy is one that is largely driven by data.

Due to various technological and economic factors, the cost to create, store, and process data continues to decrease. As a result, the amount of data being generated each year is growing at an exponential rate. In addition, the value that we can derive from the data continues to increase as we develop more powerful machine learning algorithms.

So, there is currently a high demand for people with the skills necessary to transform data into actionable insight. Data scientists have these necessary skills. They are able to help traditional organizations transition into data-driven enterprises.

How much of data science is data manipulation?

I’ve spoken with several data scientists about this question. From their responses, and my own personal experience, I’d estimate that we spend roughly 80% of our time transforming and cleaning data to make them usable for analysis.

Most data sets I’ve worked with are relatively messy, disjointed, and in formats that are not initially ready for analysis. This means that we have to do quite a bit of data munging (i.e. transforming and cleaning the data) in order to prepare them for analysis.
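As a minimal sketch of what this kind of munging looks like in practice, here is a small Python example using only the standard library. The data set, field names, and cleaning rules are all invented for illustration, not taken from the interview:

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace,
# missing values, and dates in mixed formats.
raw = """name,age,signup_date
 Alice ,34,2017-03-01
BOB,,03/15/2017
carol,29,2017-04-02
"""

def clean_row(row):
    """Normalize one record: trim whitespace, fix casing, handle missing age."""
    name = row["name"].strip().title()
    age = int(row["age"]) if row["age"].strip() else None
    date = row["signup_date"].strip()
    # Normalize US-style dates (MM/DD/YYYY) to ISO 8601.
    if "/" in date:
        month, day, year = date.split("/")
        date = f"{year}-{month}-{day}"
    return {"name": name, "age": age, "signup_date": date}

reader = csv.DictReader(io.StringIO(raw))
cleaned = [clean_row(row) for row in reader]
```

Real-world munging is rarely this tidy, of course — each new data source tends to bring its own quirks, which is part of why this stage consumes so much of a data scientist’s time.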

The unfortunate reality is that data scientists, who supposedly have the sexiest job of the 21st century, spend the majority of their time performing “data-janitor” tasks. Only after this do they get to do the super-cool data science work.

What are the different tools and techniques used by data scientists?

Data scientists use a variety of tools and techniques to work with data. In addition, these tools and techniques are changing all the time. As far as popular tools go, the ones that are used most often in the industry are SQL, R, Python, and Excel. The techniques that are used most often are statistical techniques like numerical analysis, data visualization, hypothesis testing, and statistical modeling.
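To make one of these techniques concrete, here is a small sketch of a two-sample hypothesis test in Python using only the standard library. The data are invented A/B-test measurements, and the rough |t| > 2 rule of thumb stands in for a full p-value calculation:

```python
import math
import statistics

# Hypothetical A/B-test data: page-load times (seconds) for two
# variants of a page. Values are invented for illustration.
variant_a = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]
variant_b = [13.4, 13.1, 14.0, 13.6, 13.2, 13.8]

def welch_t(sample1, sample2):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    return (m1 - m2) / math.sqrt(v1 / len(sample1) + v2 / len(sample2))

t = welch_t(variant_a, variant_b)
# A |t| well above ~2 suggests the difference in means is unlikely to be
# chance alone; a rigorous test would compute a p-value from t and the
# degrees of freedom.
```

In day-to-day work, a statistics library (e.g. R’s built-in `t.test` or SciPy in Python) would handle this end to end, but the arithmetic above is what those tools are doing under the hood.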

Beyond this, we have more powerful tools and techniques to extract insight from very large data sets including Big Data tools (like Hadoop and Spark), data-mining tools, and machine learning tools.

I’m personally very interested in the latest breed of deep-learning algorithms, like deep neural networks. These algorithms can produce highly accurate predictions given very noisy data. In fact, all of the most impressive machine learning advances I’ve seen in the past few years have come from deep-learning algorithms. Deep-learning toolkits like TensorFlow, Torch, and the Microsoft Cognitive Toolkit are going to be all the rage in the next few years.

What knowledge of statistics is a prerequisite?

This is a tricky question and you’ll get different answers depending upon who you talk to. I, personally, feel that having a deep knowledge of statistics is necessary in order to perform statistically rigorous analyses and in order to avoid specific types of statistical bias. Essentially, it’s really easy to lie with data either intentionally or by accident. Having a deep understanding of statistics is how we protect ourselves from making these types of errors.
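One classic way honest-looking numbers mislead by accident is Simpson’s paradox, where a trend that holds in every subgroup reverses in the aggregate. The sketch below uses the textbook kidney-stone figures as an illustration; they are not from the interview:

```python
# group: (treatment_successes, treatment_total, control_successes, control_total)
groups = {
    "small_stones": (81, 87, 234, 270),
    "large_stones": (192, 263, 55, 80),
}

def rate(successes, total):
    return successes / total

# Within every subgroup, the treatment outperforms the control...
treatment_wins_each_group = all(
    rate(ts, tt) > rate(cs, ct) for ts, tt, cs, ct in groups.values()
)

# ...yet aggregated over all patients, the control looks better, because
# the treatment was given mostly to the harder (large-stone) cases.
overall_treatment = rate(sum(g[0] for g in groups.values()),
                         sum(g[1] for g in groups.values()))
overall_control = rate(sum(g[2] for g in groups.values()),
                       sum(g[3] for g in groups.values()))
```

An analyst who only looked at the aggregate rates would draw exactly the wrong conclusion, which is the kind of error a solid grounding in statistics helps you catch.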

On the other hand, many tools are embedding the necessary statistical knowledge into the tools themselves. This means you just need to know how to use the tool; you don’t need to know exactly how it works in order to get a correct answer.

In addition, because most businesses are primarily concerned with making money as their key measure of success, the business often doesn’t care exactly how the predictions are being made, as long as the predictions are generating a profit. This can be a bit dangerous, especially in heavily regulated industries, where transparency is necessary to justify business decisions for legal or regulatory reasons.

If you’re interested in learning more about data science, please check out my website. I have dozens of articles, presentations, and online courses to help you learn how to transform your data into actionable insight.


For more, see Matthew’s talk at Devoxx UK: