At LeanKit, we have a very thorough process for quality assurance, but sometimes, issues with unforeseen or uncontrollable circumstances (such as DNS providers, hosting services, etc.) can result in a critical issue affecting our software, which can have a significant impact on our users. This is why we choose to stop the line when critical issues arise — to stop, assess, and resolve the issue, and learn how to prevent it from occurring again.
We take these kinds of issues very seriously, since a relentless focus on delivering customer value is at the core of everything we do. Read on to learn how we employ Lean concepts at the organizational level to tackle potential issues head-on, address them quickly, and use them as opportunities to make our product, and our people, stronger.
Tackle Potential Issues Head-On
Pull the Andon Cord
A well-known quality control mechanism from Toyota is the Andon cord. The cord is a way to alert others of issues on the production line; everyone has the authority to pull the cord. Pulling the cord immediately stops production and broadcasts a signal, alerting others that there is a critical issue that needs an immediate response.
In Lean manufacturing, a supervisor would then help the worker review the issue and determine next steps. At LeanKit, teams and leaders work together to resolve the issue and learn how to prevent it from occurring again.
Pulling the Andon cord is encouraged in Lean environments because it blocks defects from reaching customers, and creates an opportunity to improve the system to prevent future defects.
This is why we employ a virtual Andon cord at LeanKit. Anyone in the company who finds a possible issue in our product can pull our virtual Andon cord. While most of these pulls come from within Product Development, all of our departments use our own product to manage their work, so they have opportunities to call attention to possible issues as well.
When this happens, we make sure the entire company is made aware of the situation. Notifications are sent to everyone via Slack as well as an all-hands email. We’re careful to ensure that these notifications include only known information, so that we properly identify the true issue and find an appropriate, sustainable solution.
Swarm Around the Issue (Obeya)
Once the initial communication goes out to the entire company, we see an amazing thing happen: employees from different teams and departments start to offer their assistance. It’s quite common for, say, a Product Manager from an unaffected product segment to say, “I am available to help test if you need me.”
The Team Leads and Product Managers normally will gather in a room (for those who are local) and start a screen sharing session with their remote peers. This is what we typically call a “war room” scenario; in Lean manufacturing, it’s referred to as obeya. Everyone who’s needed to make critical decisions to fix the issue is immediately available.
We do this to speed communication and decision making, which is critical to solving the issue quickly. The people fixing the issue don’t have to wait for permission or more information before acting, since everyone they need to communicate with is readily available. It also ensures everyone is on the same page concerning the issue at hand, its underlying cause, the chosen resolution, and any established success criteria.
Go to the Gemba
A slight variation to the “war room” concept is that we don’t pull all the team members from their various rooms into a big conference room. Usually, they will gather in the room of the team affected and set up shop there. This reflects the spirit of the Lean concept gemba, which literally means “the real place”.
This is the idea that when trying to address an issue, you go to the place where the issue exists. In a manufacturing scenario this would usually mean the shop floor, but in a SaaS company, this translates to the team room where the work is being done. It’s important to note that our leadership is always present in the “war room”, since that is where they can see the situation most clearly and offer the best guidance. Learn more about the role of Lean leadership here.
Make the Work Visible
When the team is in place to address the issue, they’ll immediately post a card to our Product Development Roadmap Kanban board. This board has a specific lane at the very top for these types of issues. We strive to make the issue and its resulting work as visible as possible to the entire company, so everyone has an understanding of what went wrong and how it can be prevented in the future. It also enables customer-facing employees to communicate pertinent information as quickly and accurately as possible.
Limit Downstream Impact
Limit Other WIP
The card is tagged with the teams that are affected by the issue, and any other cards for those teams are blocked to show that this card is everybody’s top and only priority until it’s resolved. Blocking other cards also makes it clear to other teams what work is being affected by the issue.
This lane is considered an Expedite lane, meaning that the team should work on cards in this lane before resuming or starting any other work, in order to move the card through the board as quickly as possible. We also put a WIP (work in process) limit of 1 on the Expedite lane to keep the team focused on resolving the issue before tackling anything else. This prevents context switching and ensures that any issue affecting our customers is resolved as quickly as possible.
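The tagging, blocking, and WIP-limit rules described above can be sketched in a few lines of Python. This is only an illustrative model, not how LeanKit's product implements them: the `Card` and `Board` classes, their fields, and the `stop_the_line` method are all hypothetical names invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Card:
    # Hypothetical card: a title plus the teams it is tagged with.
    title: str
    teams: set[str]
    blocked: bool = False


@dataclass
class Board:
    # Hypothetical board: one Expedite lane plus every other card.
    expedite_wip_limit: int = 1
    expedite: list[Card] = field(default_factory=list)
    other_cards: list[Card] = field(default_factory=list)

    def stop_the_line(self, issue: Card) -> None:
        """Put the issue in the Expedite lane and block the affected teams' other cards."""
        if len(self.expedite) >= self.expedite_wip_limit:
            raise RuntimeError("Expedite lane is at its WIP limit")
        self.expedite.append(issue)
        for card in self.other_cards:
            # A card is blocked when it shares at least one team with the issue.
            if card.teams & issue.teams:
                card.blocked = True


board = Board(other_cards=[Card("Feature A", {"team-a"}), Card("Feature B", {"team-b"})])
board.stop_the_line(Card("Login outage", {"team-a"}))
print([card.blocked for card in board.other_cards])  # [True, False]
```

With the limit set to 1, a second call to `stop_the_line` raises an error until the first issue leaves the lane, which mirrors the intent of keeping the team on a single critical issue at a time.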
Stop the Line (Jidoka)
The next step is the hardest for any organization: stopping the line. In Lean, this concept is called jidoka. It follows the theory that to have the best quality for your product and the best opportunity for continuous improvement, you must stop all production when an issue occurs and fix the issue before resuming work.
Stopping the line might sound crazy to many, since no value can be delivered from any part of the organization if all activities stop — but this kind of thinking is short-sighted. Failing to resolve issues as they occur only results in an insurmountable pile of technical debt, which keeps your organization from being able to move forward.
The only way to handle issues in a continuous improvement environment is to see each issue as an opportunity for improvement, and a stepping stone for sustainable growth.
Idea in Practice
This is essential to quality, since work done by other teams can not only interfere directly with the efforts to fix the issue, but it may also create downstream work for the people on the team fixing the issue. For instance, let’s say team A is working on resolving a critical issue. Meanwhile, team B is implementing changes that require review or some other type of involvement from members of team A.
Hopefully, team A is focused entirely on fixing the critical issue. If team B continues to implement the work they’re doing, they run the risk of creating even more work for team A once the issue is resolved. Team A will not be up to speed on the decisions that have been made, so they’ll need to spend time ramping up, and might miss some of the information they need in order to implement the work properly, which could affect the success of the implementation.
Now imagine this happening with a few other teams while the line is stopped. You can get into a situation where work is queueing up, waiting on the team members who are addressing the issue. The longer this work sits, the more opportunity there is for it to become stale and have a greater risk of its own issues once it is implemented.
It’s the risk of these downstream issues that warrants stopping all this other work during a “stop the line” issue. Fixing issues later, downstream of any process, is almost always more expensive. We have to realize that the cost of these downstream issues can often outweigh the cost of the under-utilization and lost productivity that come with stopping the line.
This does not mean anyone not working on the issue simply sits idle. This is an opportunity for them to work on continuous improvement projects, as long as those projects will not create a negative downstream effect on those working on the issue. These idle team members can work on automation, testing, professional development, etc. We ask our teams to keep a backlog of these types of projects so they can easily pick them up when a stop the line occurs. An important facet of this type of work is that it needs to be easily paused once the line starts again as well.
Practice Continuous Improvement
Once the issue has been resolved, we make sure that the resolution is communicated to the entire company. However, we don’t start the line again right away. Fixing the immediate issue is only the first part. In an effort to strive for continuous improvement, or kaizen in Lean terminology, the teams involved hold a retrospective to review the issue and how effective our process was around resolving it.
During this retrospective, the team members strive to find the root cause of the issue using techniques such as the 5 Whys. Then, they discuss possible improvements to our product development process to prevent issues like this in the future. It’s very important that the team comes away with actionable items that can be implemented immediately, along with some idea of what success will look like — and a timeline for when we might start to see those benefits. The outcomes of the retrospective are recorded and reviewed with relevant teams and organizations.
Use Metrics Wisely
Each stop the line incident is monitored and various data is collected during the resolution process. We then use this data to understand how well we handle such incidents. How often do they occur? How long does it take us to resolve them? How often do we get stuck while resolving the issue? What areas of the application have the most issues?
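As a rough illustration of how those four questions could be answered from incident records, here is a minimal Python sketch. The `Incident` fields, the `summarize` function, and the sample data are all made up for this example and do not reflect the data LeanKit actually collects.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record of one stop the line incident.
    area: str            # part of the application affected
    opened: datetime     # when the Andon cord was pulled
    resolved: datetime   # when the fix was confirmed
    times_stuck: int = 0 # how often the team got stuck along the way


def summarize(incidents: list[Incident]) -> dict:
    """Answer: how often, how long to resolve, how often stuck, and where."""
    durations = [i.resolved - i.opened for i in incidents]
    mean = sum(durations, timedelta()) / len(incidents) if incidents else None
    return {
        "count": len(incidents),
        "mean_time_to_resolve": mean,
        "total_times_stuck": sum(i.times_stuck for i in incidents),
        "incidents_by_area": Counter(i.area for i in incidents).most_common(),
    }


# Made-up sample data, for illustration only.
sample = [
    Incident("login", datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 11), times_stuck=1),
    Incident("boards", datetime(2024, 2, 10, 14), datetime(2024, 2, 10, 15)),
    Incident("login", datetime(2024, 3, 5, 8), datetime(2024, 3, 5, 11), times_stuck=2),
]
print(summarize(sample))
```

A summary like this is meant to point at areas of the process worth improving, not to score the people who stopped the line.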
These metrics are periodically reviewed by Product Development leadership to determine if there are opportunities for improvement at higher levels of our process. It’s imperative that we use these metrics carefully (learn more about using metrics wisely here), because the last thing anyone wants is to discourage employees from stopping the line when it’s necessary. That is exactly what could happen if management were to react poorly to metrics about the frequency or duration of stop the line incidents.
Metrics about these rare, but significant, incidents should be seen as opportunities for improvement. We want to encourage all employees to pull the Andon cord and potentially stop the line if they see any major issues in our product, because this is how we practice continuous improvement. It’s everyone’s responsibility to create the environment where employees feel safe to do so.
LeanKit’s Core Values
LeanKit’s core values are continuous improvement, respect for people, and a relentless focus on delivering customer value. Stopping the line to resolve issues as they occur is an excellent example of living these values.
We show respect for our customers by immediately tackling any issue that might keep them from being able to use our product. Out of respect for our peers, when we have a stop the line issue, we halt any activities that would increase the workload of those affected. We allow them to put down everything they’re doing so that they can be completely focused on resolving the critical issue.
We also work to create an environment where anyone feels comfortable escalating an issue to this critical level. This is, of course, challenging — because no one wants to be the whistleblower. But it’s critical if we hope to continue on a path of healthy, sustainable growth.
We stop the line so that we can practice continuous improvement at the organizational level. With the resolution of each stop the line issue, we become smarter, our product becomes better, and we’re better able to prevent issues from occurring in the first place.
All of these efforts are rooted in our commitment to delivering customer value, which drives every decision we make.
To learn more about how Lean principles improve quality, productivity, and organizational health, check out these resources: