Agile Data Science: A Step Towards an Effective Process

Managing data science projects — or more specifically engineering teams building data products — is in my experience an art form, and a mysterious one at that. Anyone who has brought a data product to life will undoubtedly have tales of the trials and tribulations of imposing a process on these types of teams. In some cases, there is no process, leaving individuals to operate without a contract with stakeholders beyond a vague agreement. Increasingly, Agile practices, honed over the years for software projects, are adapted and applied with varying degrees of success. Too often the result is frustrated individuals and teams that fail to perform to their full potential. I’ll review some ideas my team is experimenting with that we hope will make us more effective at delivering data products.

Why are we in this state?

The application of data science in products is still relatively new. Where teams have adopted a process, they have continued to use the Agile software development processes they are familiar with, and for good reason. While not perfect, these processes, with their focus on iteration, continuous delivery of customer value, and people, have delivered remarkable results in the face of high complexity and uncertainty. So why have they not been as successful when data science is involved? At a fundamental level, my sense is that data science brings with it a certain research orientation not fully accounted for in these processes. In practice, I’ve observed at least three specific obstacles in applying these processes to the development of data products: 1) indecomposable stories; 2) high dependency; and 3) culture.

Decomposing user stories into smaller ones with true user value is more difficult when data science is involved. User value is often the output of some black box, and the steps to producing the black box are not valuable to users. Thus, stories that involve data science often have a large amount of technical work that needs to be done to deliver the minimally decomposable unit of user value.

Regardless of how stories are decomposed, the dependency between these smaller units of work is typically higher than between conventional user stories. This makes sense, as the work is fundamentally research, where the result of one unit of work shapes your decision about what the next unit should be. Unfortunately, this makes it challenging, if not impossible, to effectively plan long sequences of work (e.g., sprints).

Finally, because there is a higher degree of research involved than in most traditional software development projects, the people and culture of these teams are often research-oriented. My experience driving these types of teams to a common goal is that there is a need for more, but not unchecked, freedom to explore. This balance between creative freedom and being goal-driven is one of the most challenging dynamics to get right and can have an enormous impact on a team’s performance.

What is the solution?

What process can help a team developing data products maximize its performance? Realistically, it’s going to take more experimentation to answer that, but I will describe some key aspects of the process my team has developed and is finding some success with.

The first step is to adopt a process. The goal of a process is to maximize the value a team creates in a sustained way over time. If a team is more than a few individuals, a process is needed to establish how collective decisions are made and how individuals can effectively work together on common goals.

Agile and Lean practices are the de facto standard these days and are in principle well aligned with research (e.g., both are iterative). After a brief stint with Scrum, we quickly pivoted to Kanban, although we have kept many of the roles and ceremonies (e.g., planning) of Scrum. Nothing surprising here; many teams have come to the same realization. Our experience is that Kanban helps address the dependency challenge and, to some degree, the culture challenge, providing more flexibility while still maintaining a focus on getting work done.

By far the most impactful innovation in our process, especially for me as a manager, has been how we define units of work. Instead of the typical user story (As an X, I want Y so that Z), we write a research question or hypothesis (e.g., is the decrease in pirates causing global warming?). At the epic level, we capture a more typical user story to ensure user value is always top of mind (e.g., like everyone, I want to stop global warming because it’s really bad). No story should be started before the team agrees there is a clear research question and value in answering it. The person working on a story should always be able to clearly articulate the question they are working on and why it is important.

We are still experimenting with this process, but it has been an effective method of decomposing large tasks that would otherwise be tackled in isolation without checks and balances. Instead, the exercise of defining and reviewing research questions engages the collective intelligence of the team and increases the probability that only the most valuable work is pursued. By using a Kanban process that enables questions to be developed and reviewed on the fly, you ensure that creativity and innovation are not stifled. However, you may encounter resistance to a level of oversight that some may not be used to. Hopefully, they see the value, to both the team and their own professional development, in being able to clearly articulate and justify their work.
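To make the rule "no story starts without an agreed research question" concrete, here is a minimal Python sketch of how such a story might be recorded. This is not our actual tooling: the names `Epic`, `ResearchStory`, and `ready_to_start` are hypothetical, and the value statement in the usage lines is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Epic:
    """Hypothetical epic: carries the conventional user story."""
    user_story: str


@dataclass
class ResearchStory:
    """Hypothetical unit of work framed as a research question."""
    epic: Epic
    question: str
    value: str = ""             # why answering the question matters
    team_agreed: bool = False   # set once the team has reviewed the question

    def ready_to_start(self) -> bool:
        # A story may start only when it has a clear question, a stated
        # value, and the team's agreement on both.
        has_question = bool(self.question.strip())
        has_value = bool(self.value.strip())
        return has_question and has_value and self.team_agreed


epic = Epic("Like everyone, I want to stop global warming because it's really bad.")
story = ResearchStory(epic, "Is the decrease in pirates causing global warming?")
print(story.ready_to_start())  # False: no stated value, not yet reviewed

# Illustrative value statement (an assumption, not from a real project).
story.value = "Tells us whether pirate counts are worth modelling at all."
story.team_agreed = True
print(story.ready_to_start())  # True
```

The point of the sketch is only that the question, its value, and the team's agreement are explicit fields rather than something held in one person's head.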

Once a story is in progress, the key is not to stray from the research question. This is particularly challenging at the start of a project, when you don’t yet have enough knowledge or experience to define good questions. The tendency will be to pivot to another avenue of work once the flaws of the question are realized. This tendency should be resisted in favor of drawing whatever conclusions are possible within the scope of the original question. Be open to the conclusion that the original question was ill-defined and a new one is needed.

The last and arguably most experimental aspect of our process relates to estimation. My experience is that teams find it hard to estimate the effort or complexity of a research story. Perhaps with more experience and a canon of past research questions, estimation might become possible; however, I’ve seen a lot of software teams invest significant effort in this with questionable value. Hold on, you say: estimation is part of Scrum, not Kanban. Correct. The reason it’s relevant in the context of Kanban is that it helps define the scope of a research question. Our approach to date has been to treat each research story, in Scrum terms, as a spike, and to time-box the effort to a week or less. Thus, some sense of effort is needed. The goal here is again to encourage individuals to leverage the team and to prevent the situation where individuals work in isolation for extended periods of time.
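The time box itself is simple enough to express in a few lines. The sketch below is a hypothetical illustration, not something we run: the names `MAX_TIME_BOX`, `time_box_deadline`, and `is_overdue` are assumptions, and the week is counted in calendar days for simplicity.

```python
from datetime import date, timedelta

MAX_TIME_BOX = timedelta(days=7)  # "a week or less", counted in calendar days


def time_box_deadline(started: date, time_box: timedelta) -> date:
    """Return the date by which a research story returns to the team."""
    if time_box <= timedelta(0):
        raise ValueError("time box must be positive")
    if time_box > MAX_TIME_BOX:
        # A larger box suggests the research question should be narrowed.
        raise ValueError("time box exceeds one week; narrow the question")
    return started + time_box


def is_overdue(started: date, time_box: timedelta, today: date) -> bool:
    """True when the story has run past its time box."""
    return today > time_box_deadline(started, time_box)


started = date(2024, 3, 4)
print(time_box_deadline(started, timedelta(days=5)))         # 2024-03-09
print(is_overdue(started, timedelta(days=5), date(2024, 3, 11)))  # True
```

Rejecting a time box longer than a week, rather than silently accepting it, mirrors the intent above: a story that needs more time is a signal to revisit the question with the team.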

My sense is that a lot of teams, managers, and executives are struggling with how to manage data science teams and effectively leverage the new capabilities they bring to an organization. If nothing else, I hope this article spurs discussion on this important topic. The article speaks to the daily operations of a team but not to higher-level processes like CRISP-DM or TRL that are employed in tandem to manage the life cycle of a project. Nor does it speak to how we manage research and software stories side by side. However, we are seeing some success building a culture around Kanban and defining good research questions. It’s helping us stay focused on delivering value while not stifling innovation. Whatever process you adopt, be sure the team is engaged and buys in, as it’s meant to benefit them.

Originally Published on Towards Data Science