What Does a Data Scientist Really Do?

Hadoop? R? Machine learning algorithms? A data scientist must know a lot more than programming and platforms.

All too often the emphasis is placed on the word “data” within the title “data scientist.”

While there is good reason for this – after all, data is the currency of the data scientist – it’s more accurate to emphasize the word “scientist.” If you run a Google search for “what does a data scientist do?”, you’ll be presented with varying descriptions of what a data scientist does.

What’s the reason for so many descriptions?

Isn’t there a one-size-fits-all encapsulation of what data scientists do? Yes and no, which is why it’s far wiser to focus on the word “scientist.” Certainly, data scientists work with big data, but it’s not that simple. Not all data is relevant to every industry. It’s also difficult to separate the “do” of data science from the “need to know,” since the two are intricately connected elements of the profession. Add to this that the needs of each organization define the duties of its data scientists, and we encounter another definition roadblock.

So what does a data scientist do? We’ll focus on the specifics of the data science process as the primary definition of what data scientists “do.” Keep in mind that this is a summary, and some aspects of what a data scientist does will be dimensionally reduced for definition purposes.

Five things every real data scientist does

  1. Ask a specific question: This isn’t merely any random question; think of this as the hypothesis-generating step of the traditional scientific method. Many job descriptions for data scientists list a PhD as a requirement. This is partly because a PhD program teaches you to ask questions and devise ways to test them using specific parameters.
  1. Begin to analyze the data you have: At times an exploratory data analysis (EDA) comes first; at other times you ask the question and then perform an EDA. At that point, data scientists must decide whether the data they have is relevant to the question, whether it reveals a more important question, or whether they need more data. There may be situations where it’s not possible to answer the initial question.
  1. Hypothesis and model creation: Firming up the hypothesis and then deciding which statistical model best fits both the data and the question can be the most time-consuming aspect of a data scientist’s job. The data scientist may also need to build an entirely new model to answer the question(s).
  1. Model validation and performance measurement: After the model is tested and its outputs analyzed, it’s time to determine a) whether the model used was the appropriate one and/or b) whether the model needs to be retooled because the results do not sufficiently match the initial question(s).
  1. Reporting the results: This can occur at any stage of the process, unless we’re talking about the organization’s official internal and external reports. The principle that “not all data is relevant” also applies to reporting. If a data scientist is merely generating an internal report for themselves and others directly involved in the project, then this information might not be applicable to any other department at that time. As such, the who, what, where, when, why, and how of generating visual, qualitative, and quantitative reporting is yet another analytical decision squarely situated in the realm of the data scientist.
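
For readers who think in code, the five steps above can be sketched as a tiny end-to-end example. This is only an illustration under stated assumptions, not a prescribed workflow: the data is synthetic, and the question (does ad spend predict sales?), the function names, and the choice of a straight-line model are all invented for the example.

```python
import random
import statistics

# Assumption: synthetic data standing in for a real dataset.
# Step 1 (hypothetical question): does weekly ad spend predict weekly sales?
random.seed(0)
ad_spend = [random.uniform(1.0, 10.0) for _ in range(200)]
sales = [3.0 * x + 5.0 + random.gauss(0.0, 1.0) for x in ad_spend]


def explore(xs, ys):
    """Step 2: a very small exploratory data analysis (summary statistics)."""
    return {
        "n": len(xs),
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "stdev_x": statistics.stdev(xs),
        "stdev_y": statistics.stdev(ys),
    }


def fit_line(xs, ys):
    """Step 3: fit y = slope * x + intercept by ordinary least squares."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Step 4: coefficient of determination on the data passed in."""
    mean_y = statistics.mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def report(summary, slope, intercept, score):
    """Step 5: a plain-text report aimed at the project team."""
    return (
        f"Observations used for fitting: {summary['n']}\n"
        f"Fitted model: sales = {slope:.2f} * ad_spend + {intercept:.2f}\n"
        f"R^2 on held-out data: {score:.3f}"
    )


# Hold out the last 50 observations so validation uses data the model never saw.
train_x, test_x = ad_spend[:150], ad_spend[150:]
train_y, test_y = sales[:150], sales[150:]

summary = explore(train_x, train_y)
slope, intercept = fit_line(train_x, train_y)
score = r_squared(test_x, test_y, slope, intercept)
print(report(summary, slope, intercept, score))
```

In practice each step is far messier than this: the question may change after the exploratory analysis, and a low validation score would send the work back to step 3.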

Does this look at all familiar? It should. It’s similar to the systematic process followed throughout the world of science. Physicists, biologists, and chemists all work with data and pose questions about that data. The difference is minimal, except that the focus of data scientists is the “science of data,” whereas the others mentioned are centered in fields of study where gathering data is merely one part of their discipline. For data scientists, the data – arguably all data – is their expertise.