Maintenance Software Information

Please let me introduce myself. Most of you know me as Mac Smith, but you may also recognize me as Anthony M. Smith, author of the McGraw Hill book titled Reliability-Centered Maintenance. I've been in the RCM business for some 16 years, and over that time I have had the privilege of working with hundreds of craft personnel and maintenance professionals from a host of Fortune 500 companies, as well as NASA and DOD.

The purpose of this column is to share with you some of my experiences and views on RCM and maintenance issues, and to receive your comments or questions that may be of interest to the maintenance community. In future columns, I intend to comment on topics such as the 80/20 rule, classical vs. streamlined RCM, the confusion between failure mode and failure cause, hidden vs. evident failure, the difference between failure finding and run-to-failure, and selecting PM intervals. I'd like to hear from you on topics that should be considered for future columns.

Some Terminology Clarifications

In Step 5 of the RCM process, we examine failure modes, failure causes and failure effects in great detail in order to ascertain their link to loss of function and to define the priority with which each failure mode should be addressed for some form of PM action. Without a clear understanding of this terminology, the analysis not only becomes confusing, but can also lead to incorrect decisions. The purpose of this column is to clarify these terms.

Preventive Maintenance (PM) deals with scheduled activities that are generally aimed at avoiding unexpected failures in our equipment. For as long as I can recall (some 46 active years in the engineering profession), there have been varying degrees of confusion about what people mean when they use terminology that involves the word "failure", and this confusion frequently carries over into discussions about maintenance and RCM decisions.

Let's start with a very basic notion: what do we mean by "failure"? Since this is a very unpleasant word in some industry vocabularies, we often find several substitute words employed because they tend to sound less threatening or less severe (e.g., anomaly, defect, discrepancy, irregularity, etc.). Of course, some of these words have their own meaning quite apart from their use as a substitute for failure, and this creates its own avenue of confusion. But how often do you hear people discuss (even argue about) the definition of a failure? The spectrum of interpretations runs from a negligible glitch to an absolutely catastrophic occurrence. Might I suggest that the meaning is really quite simple: a failure is the inability of a piece of equipment, a system or a plant to meet its expected performance. This expectation is always spelled out in a Specification in our engineering world, and when properly written, leaves no doubt as to exactly where the limits of satisfactory performance reside. So, failure is the inability to meet Specifications. Simple enough, I believe, to avoid much of the initial confusion.
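For readers who track equipment performance in software, the definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SpecLimit class, the is_failure method and the pump flow numbers are hypothetical and do not come from any particular Specification or maintenance system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecLimit:
    """Acceptable performance band for one measured parameter (hypothetical)."""
    name: str
    lower: float
    upper: float

    def is_failure(self, measured: float) -> bool:
        # Failure = performance outside the limits set by the Specification.
        return not (self.lower <= measured <= self.upper)


# Hypothetical limits, for illustration only.
pump_flow = SpecLimit("discharge flow (gpm)", lower=450.0, upper=550.0)

print(pump_flow.is_failure(500.0))  # False: within the specified limits
print(pump_flow.is_failure(420.0))  # True: below the lower limit
```

The point of the sketch is that the question "is this a failure?" has a yes-or-no answer once the limits are written down, however large or small the shortfall may be.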

Next come several important and frequently used phrases that include the word "failure". I refer here to failure symptom, failure mode, failure cause and failure effect as the four phrases of primary interest in maintenance. The definitions of these terms are as follows:

  • Failure Symptom: This is a tell-tale indicator that alerts us (usually the operator) to the fact that a failure is about to occur. Our senses or instruments are the primary source of such indication. Failure Symptoms may or may not tell us exactly where the pending failure is located or how close to the full failure condition we might be. In many cases, there is no Failure Symptom (or warning) at all. Once the failure has occurred, any indication of its presence is no longer a symptom - we now observe its effect (see Failure Effect).
  • Failure Mode: This is a brief (2-5 word) description of what is wrong. It is extremely important for us to understand this simple definition because, in the maintenance world, the failure mode is what we try to prevent, or, failing that, what we have to physically fix. There are hundreds of simple words that we use to develop appropriate failure mode descriptions - e.g., jammed, worn, frayed, cracked, bent, nicked, leaks, clogged, sheared, scored, ruptured, eroded, shorted, split, open, torn and so forth. The main confusion here is clearly distinguishing between failure mode and failure cause - and understanding that failure mode is what we need to prevent or physically fix.
  • Failure Cause: This is a brief (2-5 word) description of why it went wrong. Failure Cause is often very difficult to fully diagnose or hypothesize. If we wish to attempt a "permanent" prevention of the failure mode, we usually need to understand its cause (thus the term, root cause failure analysis). Even though we may know the cause, we may not be able to totally prevent the failure mode - or it may cost too much to pursue such a path. As a simple illustration, a gate valve "jams closed" (failure mode), but why did this happen? Let's say that this valve sits in a very humid outside environment - so "humidity induced corrosion" is the failure cause. We could opt to replace the valve with a high-grade stainless steel model that would resist (perhaps stop) the corrosion - i.e., a design fix. Or, from a maintenance point of view, we could periodically lubricate and operate the valve to mitigate the corrosive effect. But there is nothing we can do to eliminate the natural humid environment. Thus, PM tasks cannot fix the cause - they can only address the mode. This is an important distinction to make, and many people do not clearly understand it.
  • Failure Effect: Finally, we briefly describe the consequence of the failure mode should it occur. To be complete, this is usually done at three levels of assembly - local, system and plant. In describing the effect in this fashion, we clearly see the buildup of the consequences, or conversely, can picture the lack of any serious impact due to, say, redundancy. With our jammed gate valve, the local effect at the valve is "stops all flow". At the system level, "no fluid passes on to the next step in the process", and finally, at the plant level, "product production ceases (downtime) until the valve can be restored to operation".
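For those who record this kind of analysis in software, the four terms above can be kept as separate fields of one record. The following is a minimal sketch in Python using the gate valve example; the class names, field names and the describe helper are my own hypothetical choices, not part of any RCM standard or software package.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FailureEffect:
    """Consequence of the failure mode at three levels of assembly."""
    local: str
    system: str
    plant: str


@dataclass(frozen=True)
class FailureRecord:
    """One analysis entry; all names here are hypothetical."""
    component: str
    mode: str                      # what is wrong (2-5 words)
    cause: str                     # why it went wrong (2-5 words)
    effect: FailureEffect          # what happens if the mode occurs
    symptom: Optional[str] = None  # None when there is no warning at all


def describe(record: FailureRecord) -> List[str]:
    """Return one line per term so mode and cause stay visibly separate."""
    return [
        f"Component: {record.component}",
        f"Symptom:   {record.symptom or 'none (no warning)'}",
        f"Mode:      {record.mode}",
        f"Cause:     {record.cause}",
        f"Effect (local):  {record.effect.local}",
        f"Effect (system): {record.effect.system}",
        f"Effect (plant):  {record.effect.plant}",
    ]


# The gate valve example from the text; no symptom is given there.
gate_valve = FailureRecord(
    component="gate valve",
    mode="jams closed",
    cause="humidity induced corrosion",
    effect=FailureEffect(
        local="stops all flow",
        system="no fluid passes on to the next step in the process",
        plant="product production ceases (downtime) until the valve "
              "can be restored to operation",
    ),
)

for line in describe(gate_valve):
    print(line)
```

Keeping mode and cause in separate fields makes it harder to write one where the other belongs, and the three effect fields show how the consequences build from the valve up to the plant.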

Thus, without a clear understanding of the above terminology, the analysis not only becomes confusing, but can also lead to decisions that are incorrect.

Please visit our website often for new columns on topics related to RCM and maintenance.