Does the culture at your facility seek to understand why something failed, or is it stuck in a mode where the only goal is to get back up and running as fast as possible?
Consider this scenario. Suppose you are working on your taxes on April 14th. To take full advantage of every deduction, you want to account for all of your charitable contributions. You log in to your bank account to get a record of these contributions and, to your horror, notice that $10,000 has been wired out of your account without your consent. What are you going to do? Are you going to continue doing your taxes, or are you going to stop, call the fraud department and get them busy looking into the problem? Most likely you would do the latter, and you would probably spend some time investigating it yourself. Granted, you might return to your taxes to get them done by April 15th and avoid penalties. However, once your taxes were complete, you would return to the issue of the $10,000. You would stay in contact with the bank to learn what happened and how to prevent it from happening again.
This is exactly what workers fail to do when failures are costing their companies money. They do not stop what they are doing to investigate these failures and determine the physical, human and systemic root causes. Why not? Why hasn’t anyone articulated to the organization the importance of this issue and the value of learning from its failures and preventing recurrence?
Suppose for a minute you did nothing about the lost $10,000. What do you think would happen a few months down the road? You guessed it. The criminal would come back and steal another $10,000. That is exactly what happens at facilities that don’t fully investigate production failures. When you do not eliminate the defects from your system by getting to the systemic causes, you allow a similar failure to occur later.
So, what should be done when a failure occurs?
First, you must preserve evidence. Evidence is key to any investigation. Without evidence, you do not have an investigation.
Second, you must study the evidence. If you are responsible for investigating a failure, it is imperative that you follow up promptly to study the evidence. It is not fair to ask operations or maintenance to preserve evidence if you do not study it promptly.
Third, you must do your best to understand the physical root cause before putting the equipment back in service. This is hard, as there is always pressure to get the equipment back up and running. It means the culture of the organization must be one where people look at failed equipment promptly. You must have a sense of urgency around analyzing the evidence, thinking about how the equipment could have failed and ruling each possibility in or out based on the evidence you see. Once you have a good idea of the physical root cause, do your best not to reintroduce the defect when putting the equipment back together. You also need a management philosophy whereby, if you respond promptly to a failure, the organization gives you the breathing room to dig into the issue and prevent recurrence. This usually amounts to a few extra hours, not days.
Fourth, you must convene a team to investigate the failure. Conducting a failure investigation with just one person is sloppy; it amounts to pencil whipping the investigation to satisfy a requirement rather than taking it seriously. You cannot properly investigate a failure by relying on the reliability engineer to do it alone. At a minimum, the team should include an operations representative, a maintenance representative and a reliability engineer.
Fifth, you need to use a process for conducting the investigation. Using a fault tree and 5 Whys is usually sufficient. Again, evidence drives the investigation. Asking “why” or “how” and then using evidence to either rule in or rule out possibilities is a practical way to conduct the investigation.
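One way to picture this process is as a small fault tree in which every possible cause must be ruled in or out with a piece of evidence, and the ruled-in branches form the chain of "whys." The sketch below is illustrative only: the class and function names, and the pump bearing example with its evidence notes, are hypothetical and not taken from any particular tool or real investigation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"            # no evidence reviewed yet
    RULED_IN = "ruled in"    # evidence supports this cause
    RULED_OUT = "ruled out"  # evidence contradicts this cause


@dataclass
class Cause:
    """One node in a fault tree: a possible answer to "why?" or "how?"."""
    description: str
    status: Status = Status.OPEN
    evidence: list[str] = field(default_factory=list)
    children: list["Cause"] = field(default_factory=list)

    def add(self, description: str) -> "Cause":
        """Attach a possible cause beneath this one and return it."""
        child = Cause(description)
        self.children.append(child)
        return child

    def rule(self, status: Status, evidence: str) -> None:
        """Rule this cause in or out; a decision without evidence is refused."""
        if not evidence.strip():
            raise ValueError("a cause cannot be ruled in or out without evidence")
        self.status = status
        self.evidence.append(evidence)


def why_chains(node: Cause, path=()):
    """Yield each chain of ruled-in causes from the top event downward."""
    path = path + (node.description,)
    ruled_in = [c for c in node.children if c.status is Status.RULED_IN]
    if not ruled_in:
        yield path
    for child in ruled_in:
        yield from why_chains(child, path)


# Hypothetical example: the failure, causes and evidence are made up.
failure = Cause("Pump bearing failed")
lube = failure.add("Bearing ran without adequate lubrication")
align = failure.add("Shaft was misaligned")
lube.rule(Status.RULED_IN, "Grease in the housing was hardened and discolored")
align.rule(Status.RULED_OUT, "Alignment readings at teardown were within tolerance")
skipped = lube.add("Scheduled regreasing was skipped")
skipped.rule(Status.RULED_IN, "No regreasing entry in the maintenance log")

chains = list(why_chains(failure))
for chain in chains:
    print(" -> why? ".join(chain))
```

In this sketch the chain stops at "Scheduled regreasing was skipped," which is where the team would keep asking why until it reaches the human and systemic causes.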
Sixth, you must identify the three types of root causes: physical, human and systemic. Investigations often stop at physical root causes. Why? Because it is easy to stop there. The physical root cause identifies the flaw that caused the particular failure. However, simply identifying this cause does not necessarily prevent future failures.
You must identify the human root cause: what someone did to introduce the flaw into the system. This is a hard one since no one wants to place blame on a coworker. That is why it is imperative that you not stop there. Most people do not show up for work to do a bad job. You must understand why this individual introduced a flaw into the equipment. Understanding this leads to the final and most important type of root cause.
You must identify the systemic root cause. This cause answers the question as to why an individual made the decision he or she made. Identifying this root cause and putting mitigating actions in place will not only prevent failures from occurring in the equipment being investigated, but it will also prevent future failures from occurring in other equipment. Identifying this root cause has far-reaching positive consequences.
Many companies have become serious about eliminating safety events, whether personal or process. They have done a great job of understanding these events and putting systems in place to prevent future ones. Because of this dedication, most industries are much safer.
It is time to bring this same dedication to reliability. It is time to start learning from production losses to prevent future failures. In doing so, companies can become even more profitable through increased reliability.
Just remember: If it were your money that was lost, how would you respond?