The hardest parts of data science - KDnuggets (2024)

The hardest part of data science is not building an accurate model or obtaining good, clean data, but defining feasible problems and coming up with reasonable ways of measuring solutions.

By Yanir Seroussi.

Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed.

The not-so-hard parts

Before discussing the hardest parts of data science, it’s worth quickly addressing the two main contenders: model fitting and data collection/cleaning.

Model fitting is seen by some as particularly hard, or as real data science. This belief is fueled in part by the success of Kaggle, which calls itself the home of data science. Most Kaggle competitions are focused on model fitting: participants are given a well-defined problem, a dataset, and a measure to optimise, and they compete to produce the most accurate model. Coupling Kaggle’s excellent marketing with their competition setup leads many people to believe that data science is all about fitting models. In reality, building reasonably accurate models is not that hard, because many model-building phases can easily be automated. Indeed, many companies offer model fitting as a service (e.g., Microsoft, Amazon, Google and others). Even Ben Hamner, CTO of Kaggle, has said that he is “surprised at the number of ‘black box machine learning in the cloud’ services emerging: model fitting is easy. Problem definition and data collection are not.”
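To make the automation point concrete, here is a minimal sketch of automated model selection, assuming a toy one-dimensional regression problem. The helper names (fit_constant, fit_linear, auto_fit) and the synthetic data are illustrative assumptions, not something from the original post or from any cloud service. Once the problem, the dataset, and the measure are fixed, choosing a model reduces to a loop:

```python
import random
from statistics import mean

def fit_constant(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(model, xs, ys):
    """Mean squared error: the (given) measure to optimise."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

def auto_fit(xs, ys, candidates, holdout_fraction=0.25, seed=0):
    """Fit every candidate on a training split and score it on a held-out split."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(len(idx) * holdout_fraction))
    val, train = idx[:n_val], idx[n_val:]
    tx, ty = [xs[i] for i in train], [ys[i] for i in train]
    vx, vy = [xs[i] for i in val], [ys[i] for i in val]
    scores = {name: mse(fit(tx, ty), vx, vy) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical toy data: a noisy straight line.
noise = random.Random(1)
xs = [float(i) for i in range(40)]
ys = [3.0 * x + 2.0 + noise.uniform(-1, 1) for x in xs]

best_name, scores = auto_fit(xs, ys, {"constant": fit_constant, "linear": fit_linear})
print(best_name, scores)
```

Nothing in this loop decides whether squared error on this dataset is the right thing to optimise. That decision is made before the loop starts, which is where the post argues the real difficulty lies.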


Data collection/cleaning is the essential part that everyone loves to hate. DJ Patil (US Chief Data Scientist) is quoted as saying that “the hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.” While I agree that collecting data and cleaning it can be a lot of work, I don’t think of this part as particularly hard. It’s definitely important and may require careful planning, but in many cases it just isn’t very challenging. In addition, the data is often already given, or is collected using previously developed methods.

Problem definition is hard

There are many reasons why problem definition can be hard. It is sometimes due to stakeholders who don’t know what they want, and expect data scientists to solve all their data problems (either real or imagined). This type of situation is summarised by the following Dilbert strip. It is best handled by cleverly managing stakeholder expectations, while steering them towards better-defined problems.

Here is the rest of the post: The hardest parts of data science


The main concepts discussed in the article by Yanir Seroussi are as follows.

  1. Defining Feasible Problems: Yanir Seroussi emphasizes that one of the most challenging aspects of data science is defining feasible problems. This involves understanding the business context, identifying the issues that data science can address, and formulating well-defined problems that align with organizational goals. It requires collaboration with stakeholders and the ability to navigate ambiguous situations where stakeholders may not have a clear understanding of their needs.

  2. Measuring Solutions: Once a problem is defined, the next hurdle is coming up with reasonable ways of measuring solutions. This involves selecting appropriate metrics and evaluation criteria to assess the performance of data science models. The article suggests that this aspect is more challenging than building accurate models. Choosing the right metrics requires a deep understanding of the business objectives and ensuring that the selected measures align with the desired outcomes.

  3. Model Fitting: While model fitting is often perceived as a challenging aspect of data science, Seroussi argues that it's not the hardest part. He mentions that the success of platforms like Kaggle, which focus on model fitting competitions, has contributed to the misconception that data science is primarily about building accurate models. However, he contends that building reasonably accurate models is not inherently difficult, and many aspects of model building can be automated.

  4. Data Collection/Cleaning: The article acknowledges the importance of data collection and cleaning but challenges the notion that it is the hardest part of data science. While it can be time-consuming, Seroussi argues that defining the problem and measuring solutions pose greater challenges. He cites DJ Patil's view that cleaning data is often 80% of the work, but disagrees that it is particularly hard.

  5. Stakeholder Management: The article highlights the role of stakeholders in making problem definition challenging. It mentions situations where stakeholders may not have a clear idea of what they want, putting the onus on data scientists to manage expectations effectively. This involves guiding stakeholders toward better-defined problems and finding ways to align data science efforts with organizational objectives.
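As an illustration of point 2 above, the following sketch shows how the choice of measure can reverse which model looks better. The labels, the two sets of predictions, and the error costs (a missed positive costing 50 times a false alarm) are made-up assumptions for the example and do not come from the article.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives that the model actually catches."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0

def expected_cost(y_true, y_pred, cost_fn=50.0, cost_fp=1.0):
    """Average cost per case, with assumed costs for the two error types."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (fn * cost_fn + fp * cost_fp) / len(y_true)

# Hypothetical imbalanced problem: 5 positives among 100 cases.
y_true = [1] * 5 + [0] * 95

# Model A never flags anything.
always_negative = [0] * 100

# Model B catches 4 of the 5 positives but raises 10 false alarms.
cautious = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

for name, pred in [("always_negative", always_negative), ("cautious", cautious)]:
    print(name, accuracy(y_true, pred), recall(y_true, pred), expected_cost(y_true, pred))
```

By accuracy, the model that never flags anything wins. By recall or by the assumed business cost, it loses clearly. Which measure is appropriate depends on what the stakeholders actually need, and the code cannot answer that.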

In conclusion, Yanir Seroussi's article provides a nuanced perspective on the challenges in data science, emphasizing the importance of problem definition and solution measurement over model fitting and data cleaning. It underscores the need for a holistic understanding of business goals and effective collaboration with stakeholders in order to tackle the most difficult aspects of data science.

Article information

Author: Dan Stracke
