Dantotsu, Radical Software Quality Improvement - A Summary of Fabrice Bernhard's Talk From Craft Conf 2022
As you might already know, I recently attended Craft Conf 2022, a software developer conference. You can find my report about it here. There were many outstanding talks, and I decided to highlight some of them by writing a summary of each.
As a test automation engineer, I was always going to attend all the quality assurance related talks. My first pick is Fabrice Bernhard's talk, Dantotsu Radical Software Quality Improvement, which raised my curiosity with the unusual word Dantotsu and, of course, with the promise of radically improving quality.
Fabrice Bernhard is the Co-Founder and Group CTO of Theodo, a fast-growing custom digital software consultancy. They adopted DevOps, Scrum, and lean, and with their aid they were able to grow from 10 to 600 people while still delivering great quality work.
The Dantotsu Book
In today's world, where software is taking over basically everything, scaling without sacrificing quality is quite a big challenge, as Fabrice explained. One year ago, a book by Sadao Nomura was published with the title "The Toyota Way of Dantotsu Radical Quality Improvement", which immediately caught Fabrice's attention. His talk was about the Dantotsu method and how his team adapted it for software development. Let me share his talk in my own words.
Software is eating the world
As Marc Andreessen summarized in 2011: software is eating the world. Starting around 20 years ago, startups like Netflix, Amazon, and Airbnb disrupted whole sectors using software, and this trend keeps growing, with many new players arriving and raising huge amounts of funding. This is of course good news for us software developers, as a portion of that money will land in our pockets. It stops being good news at the point where introducing bugs has very expensive consequences.
I think most of us have heard about the Ariane 5 catastrophe. Does it not sound familiar? What if I tell you that a rocket with a development cost of 7 billion dollars and a value of 500 million dollars exploded about 40 seconds after its launch in 1996 because of a numeric conversion error in a single line of code, in which a 64-bit floating-point value was converted to a 16-bit signed integer that could not hold it?
The code that caused the catastrophe was responsible for the horizontal alignment of the rocket. Just a few lines above the bug, the conversion for the vertical alignment was handled correctly, but for some reason the developers did not consider it important to protect the horizontal one in the same way.
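To make this kind of failure concrete, here is a minimal Python sketch of a guarded narrowing conversion into a 16-bit signed integer. It is only an illustration, not the actual flight code, and the function names `to_int16_checked` and `to_int16_saturating` are hypothetical.

```python
INT16_MIN = -(2**15)   # -32768
INT16_MAX = 2**15 - 1  # 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, failing loudly when out of range."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit into a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

Whether to raise an error or to clamp is a design decision. The point is that the out-of-range case gets handled deliberately instead of being left to chance.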
When we are talking about hundreds of millions of dollars, it is not surprising that the space industry has extreme quality processes. For example, adding 6000 lines of code for the GPS feature needed 2500 pages of specification, and the 400k lines of code for the whole system required 40000 pages of specification, which means one page of specification for every 10 lines of code.
Like it or not, software is everywhere nowadays, not just in your thermostat, coffee machine, and television but also in more mission-critical things like self-driving cars. Because of this, in many cases we need especially high-quality software, as we are talking not only about money but even about human lives.
Unfortunately, writing about 4 pages of documentation for every 10 lines of code, as in the GPS example (2500 pages for 6000 lines), is not really scalable, especially in our fast-paced agile world where you have to deliver value for your customer frequently.
On the other end of the scale there is the "move fast and break things" approach, with which you can of course scale, but you have to sacrifice quality to do so.
The software practices used for the shuttle produced one defect per 400k lines, while the industry average based on studies is around 10-20 defects per 1k lines of code, which means roughly 5000 times more defects.
As both approaches have their downsides, we need another approach, and this is where Dantotsu comes into the picture.
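The comparison can be checked with a few lines of Python. This is just my own arithmetic on the figures quoted above; the variable names are illustrative.

```python
from fractions import Fraction

# Figures quoted above, expressed as defects per 1,000 lines of code (KLOC).
shuttle_per_kloc = Fraction(1, 400)    # 1 defect per 400k lines
industry_low, industry_high = 10, 20   # industry average range per KLOC

ratio_low = industry_low / shuttle_per_kloc
ratio_high = industry_high / shuttle_per_kloc

print(f"{int(ratio_low)}x to {int(ratio_high)}x more defects")
```

The result is a range of 4000 to 8000 times, so the figure of about 5000 sits inside it as a round number.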
The Dantotsu Method
The Dantotsu method was developed by Sadao Nomura at Toyota. Toyota is well known for its high quality standards, and I think few software developers have not heard about lean thinking, which also originated at Toyota.
A Brief History of Dantotsu
I am pretty sure this part won't be as fun as in Fabrice's talk, so I am keeping it short. Sadao Nomura, an executive at Toyota, regularly visited each branch to understand its individual situation. As an outcome of each visit, he wrote an A3-sized paper about the problems and their countermeasures.
When he visited again months later, he realized to his surprise that nothing had changed. He gave them a similar A3-sized page and asked them to implement it this time, only to find a few months later that again nothing had changed.
As Sadao realized that it did not work this way, he decided to make a major change in his policy, and that led to the birth of the Dantotsu Quality Activities.
The Method
First, let's see the key points of the Dantotsu method.
- Visual quality management is highly emphasized. Daily and monthly defects on the vehicles are tracked and visualized on a wall, not on a computer. The ambitious goals are also tracked on the monthly view.
- Defects are classified by stages of outflow instead of priority. No defects should reach the customer, so the focus is on detecting them earlier in the flow. There are four categories of these defects, as you can see in the image below. It is very important to mention that they also classify defects by their source, determining whether a defect comes from production engineering, from the supplier, from welding, ...
- The team leader is highly involved in examining product defects, following an 8-step procedure. The procedure's steps are:
  - The team leader examines the defective part to identify the faulty process
  - The team leader checks other parts in stock for the same defect
  - The team leader investigates the defect's cause through interviews with workers
  - The team leader implements countermeasures
  - The team leader reports on defect analysis and countermeasures during a daily meeting
  - The team leader creates/improves standards and deploys them horizontally to handle similar processes/items
  - The team leader trains workers
  - The team leader checks through go and see whether workers are performing according to the standard
  As you can see, the team leader plays a significant part in all 8 steps. There is one more very important aspect of this procedure: the whole process is conducted within 24 hours. The first four steps happen during the first day and the other four happen during the next day.
- Systematic defect analysis. They collect defect details, determine causes, and then define the required countermeasures. You can see an example below. At this point, Fabrice mentioned that his team had had an ongoing debate for years. The testing enthusiasts said bugs could have been avoided by introducing more unit tests and better end-to-end tests, while the other half of the group argued for designing a better architecture and doing better domain-driven design. Fabrice was more on the prevention side than on the testing side, but thanks to the Dantotsu book he realized the whole debate was wrong. They should do both.
- Dantotsu invests in systems to leave no chance for defects to appear again. As soon as a defect appears two or three times they invest in systems that prevent the appearance of those bugs.
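The "two or three times" rule from the last point can be sketched in a few lines of Python. This is my own minimal illustration, not something shown in the talk: the `Defect` record, the cause labels, and the threshold of 2 are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    description: str
    cause: str  # hypothetical cause label assigned during defect analysis


def causes_needing_systemic_fix(defects: list[Defect], threshold: int = 2) -> list[str]:
    """Return causes seen at least `threshold` times, most frequent first."""
    counts = Counter(d.cause for d in defects)
    return [cause for cause, n in counts.most_common() if n >= threshold]


defects = [
    Defect("Wrong total on invoice", "missing input validation"),
    Defect("Crash on empty cart", "missing input validation"),
    Defect("Slow dashboard", "unindexed query"),
]
print(causes_needing_systemic_fix(defects))  # ['missing input validation']
```

A cause that shows up on this list is a candidate for a systemic countermeasure rather than another one-off fix.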
This was a short summary of the Dantotsu method. In order to get a better grasp on the method, it is highly recommended to read the book.
At this point you might think: it is great that they found a method that can radically improve quality in a factory, but how can we apply it in software development? Please keep on reading to find out, as Fabrice answered that in the second part of his talk by describing what his team did to implement the process.
Dantotsu and Software Development
The first thing his team did was to standardize the definition of a defect. In the end, they agreed that anything unexpected by the user is a defect.
Next, they standardized the way the group measures bugs. They tried different approaches, even tracking bugs on paper, as Sadao Nomura emphasized that you should not use a computer for that but do it on paper or on a wall. This approach might work in a factory, but in the era of remote work it is not feasible, so they built a tool with which they can track defects across different projects.
They have standardized their outflow classification by defining 5 groups:
- A defects: Defects detected by developers
- B defects: Defects detected by the team during code reviews or functional reviews
- C defects: Defects detected during an inspection by PO or QA
- D defects: Defects detected in production
- E defects: Defects that generated a customer complaint
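As a small sketch of how such a classification could be represented in a tracking tool, here is a Python enum with a tally function. This is my own illustration and not Theodo's actual tool; the names `OutflowStage` and `tally_by_stage` are hypothetical.

```python
from collections import Counter
from enum import Enum


class OutflowStage(Enum):
    A = "detected by the developer"
    B = "detected by the team during code or functional review"
    C = "detected during inspection by PO or QA"
    D = "detected in production"
    E = "generated a customer complaint"


def tally_by_stage(stages: list[OutflowStage]) -> dict[str, int]:
    """Count defects per stage, listing every stage even when its count is zero."""
    counts = Counter(stages)
    return {stage.name: counts[stage] for stage in OutflowStage}


logged = [OutflowStage.A, OutflowStage.A, OutflowStage.C, OutflowStage.D]
print(tally_by_stage(logged))  # {'A': 2, 'B': 0, 'C': 1, 'D': 1, 'E': 0}
```

Listing the empty stages too keeps the picture honest: zero E defects is a result worth seeing on the board.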
They also determined the source of the defects, as you can see in the diagram below.
After defining the basics, they went on to put the Dantotsu method into practice. The team leader role became their tech lead, whose responsibility is to analyze one bugfix a day.
The tech lead not only analyzes the bugfix: if it is not clear who should fix the bug, fixing it is also the tech lead's responsibility. After the fix, all the information about the bug is collected, even a screenshot of the code that fixed it. The tech lead then does a root cause analysis and defines the needed countermeasures, analyzing how the defect could have been detected earlier and how it could be prevented in the future. In terms of timing, the tech lead first fixes the bug, analyzes the defect, checks other parts of the codebase for similar defects, and comes up with suggested countermeasures. The next day, the tech lead shares the findings and the suggested countermeasures during the daily bugfix analysis session. During the next sprint, the team implements the countermeasures they agreed on.
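To picture what the tech lead brings to the daily session, here is a minimal Python sketch of an analysis record. It is only my assumption of what such a record could contain, based on the description above; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BugfixAnalysis:
    """One daily bugfix analysis; all field names are illustrative."""
    defect_summary: str
    fix_reference: str              # e.g. a commit hash or a screenshot of the fix
    root_cause: str = ""
    earlier_detection: str = ""     # how could it have been caught sooner?
    prevention: str = ""            # how could it be avoided next time?
    similar_defects_checked: bool = False
    countermeasures: list[str] = field(default_factory=list)

    def missing_items(self) -> list[str]:
        """List what is still missing before the daily analysis session."""
        checks = [
            ("root cause", bool(self.root_cause)),
            ("earlier detection", bool(self.earlier_detection)),
            ("prevention", bool(self.prevention)),
            ("similar-defect check", self.similar_defects_checked),
            ("countermeasures", bool(self.countermeasures)),
        ]
        return [name for name, done in checks if not done]


analysis = BugfixAnalysis("Crash on empty cart", "abc123")
analysis.root_cause = "cart total assumed at least one item"
print(analysis.missing_items())
```

An empty list from `missing_items()` would mean the analysis is ready to be shared with the team.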
To mention a few examples of potential countermeasures:
- Creating lint rules
- Adding tests
- Rewriting parts of the system
- Creating or improving training material
- Training team members on identified skill gaps
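To give an idea of what a lint rule countermeasure can look like, here is a minimal sketch using Python's standard `ast` module. The rule itself (flagging bare `except:` clauses) is an arbitrary example of mine and not one mentioned in the talk, and `find_bare_excepts` is a hypothetical name.

```python
import ast


def find_bare_excepts(source: str) -> list[int]:
    """Return the line numbers of bare `except:` clauses in `source`."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


sample = """\
try:
    risky()
except:
    pass
"""
print(find_bare_excepts(sample))  # [3]
```

In a real project you would more likely add such a rule to the linter the team already uses, so that the check runs on every commit.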
Fabrice mentioned they even experimented with A defects. They call a bug an A defect when the code does not work as expected at the first manual test. They even gamified it: if your code works on the first try, you get a "right the first time" award, and a Slack bot congratulates you every time. This not only promotes the celebration of quality but also encourages thinking before coding.
That was the summary of how you can use the Dantotsu method in your software development project. Before I close this article let me write a few words about the learnings Fabrice shared with us.
Learnings
- Try to connect efforts better with the big picture: Putting so much effort into the defect level might result in losing the big picture. To avoid this, the team has a very good visual representation of the architecture, onto which they can map the defects.
- On large ongoing projects the results will take months: They introduced the method on large ongoing projects, and the sad reality is that it might take several months to see the results. When they did root cause analysis, there were cases where the root cause dated back 9 months.
- They also have experience with a project that used the Dantotsu method right from the beginning. It is a small project with 5789 lines of code developed in 118 engineer-days. They found 2 defects in production (D defects) and 2 defects in inspection (C defects). Counting the production defects, that is about 0.3 defects per 1000 lines of code, or 0.02 defects per engineer-day, which is one order of magnitude better than what they saw in other projects.
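The rates for the small project can be checked with a few lines of Python. This is my own arithmetic on the numbers quoted above; the helper names are illustrative.

```python
lines_of_code = 5789
engineer_days = 118
production_defects = 2  # D defects
inspection_defects = 2  # C defects


def per_kloc(defects: int) -> float:
    """Defects per 1,000 lines of code."""
    return defects / (lines_of_code / 1000)


def per_engineer_day(defects: int) -> float:
    """Defects per engineer-day."""
    return defects / engineer_days


d_only = production_defects
c_and_d = production_defects + inspection_defects

print(f"D only:  {per_kloc(d_only):.2f} per KLOC, {per_engineer_day(d_only):.3f} per engineer-day")
print(f"C and D: {per_kloc(c_and_d):.2f} per KLOC, {per_engineer_day(c_and_d):.3f} per engineer-day")
```

Counting only the two production defects gives about 0.35 defects per 1000 lines and 0.017 per engineer-day, which is in line with the rounded figures above. Counting all four defects gives roughly double.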
To sum up, I found Fabrice's talk about how they implemented the Dantotsu method fascinating. I really like the idea and I will definitely share this article with my team. We work on a much bigger project, and I do not see much chance of having a dedicated person responsible for analyzing bugfixes and determining countermeasures daily, but having 1-2 sessions a week on a few cherry-picked bugs might be feasible. As Fabrice's project and their experiences are relatively fresh, I am curious about how the project evolves over time, and I am sure Fabrice will share their learnings with us.