Reliability Centered Maintenance and its Role in Maintenance Management

Reliability centered maintenance (RCM) and variations thereof are used by several organizations world-wide to address a host of reliability issues in order to improve ‘overall equipment effectiveness’ (OEE) while controlling the ‘life-cycle cost’ (LCC) inherent in physical asset management. RCM techniques can be applied to any machinery system for which a preventive maintenance plan applying risk-based principles is desired.

Nowadays, RCM is defined through international standards. However, all modern-day RCM approaches can be traced back to the work done in the 1960s and 1970s, which culminated in the Nowlan and Heap report of 1978. The principles of modern maintenance as developed in the journey to RCM are not always known or understood, let alone applied. Moubray defined RCM as ‘a process used to determine what must be done to ensure that any physical asset continues to function in order to fulfill its intended functions in its present operating context’.

The importance of maintenance management and performance, and the need for effective and efficient maintenance methods, have promoted the development of RCM. RCM is defined as a ‘method to identify and select failure management policies to efficiently and effectively achieve the required safety, availability and economy of operation’ as per International Electrotechnical Commission (IEC) standard IEC 60300-3-11, 2009. RCM basically combines several well-known techniques and tools in a systematic approach to managing risks, as a basis for maintenance decisions. RCM was initially developed for the commercial aviation industry in the late 1960s, resulting in the publication of ATA-MSG-3. RCM is now a proven and accepted methodology used in a wide range of industries.

RCM is a process to establish the safe minimum levels of maintenance while ensuring that equipment continues to perform its design function within the current operating context. It achieves this by providing a means for determining optimal maintenance and operational strategies based on the probability and consequence of the analyzed failure modes.

RCM is a systematic approach to determine the maintenance requirements of plant and equipment in its operation. It uses preventive maintenance, predictive maintenance, real-time monitoring, run-to-failure, and proactive maintenance. These techniques are used in an integrated manner to increase the probability that a machine or component is going to function in the required manner over its design life cycle with a minimum of maintenance. The aim of RCM is to create a maintenance strategy which helps minimize the total operating costs while increasing the reliability of the system.

RCM is not a stand-alone process. It is to be an integral part of the operations and maintenance programmes. The introduction of the RCM process involves changes to established working processes. For the successful introduction of such changes, it is important that management demonstrate its commitment to the changes, possibly in the form of a policy statement and personal involvement and that measures are taken to establish the engagement of those who are to be involved or affected by the changes. RCM works best when employed as a bottom-up process, involving those working directly in the operation and maintenance of the plant and equipment.

RCM is based on nine principles. These are (i) accept failures, (ii) the majority of failures are not age related, (iii) some failures matter more than others, (iv) parts can wear out, which results in equipment break-down, (v) hidden failures are to be found, (vi) identical equipment does not mean identical maintenance, (vii) one cannot maintain one’s way to reliability, since if the equipment’s inherent reliability or performance is poor, doing more maintenance does not help and no amount of maintenance can raise the inherent reliability of a design, (viii) good maintenance programmes do not waste resources, and (ix) good maintenance programmes become better maintenance programmes.

RCM is a process of systematically analyzing an engineered system to understand (i) its functions, (ii) the failure modes of its equipment which support these functions, (iii) how then to choose an optimal course of maintenance to prevent the failure modes from occurring or to detect the failure mode before a failure occurs, (iv) how to determine spare holding requirements, and (v) how to periodically refine and modify existing maintenance over time. The objective of RCM is to achieve reliability for all of the operating modes of a system.

The goal of an RCM approach is to determine the most applicable cost-effective maintenance technique to minimize the risk of impact and failure and to create a hazard-free working environment while protecting and preserving capital investments and their capability. This goal is accomplished through an identification of failure modes and their consequences for each system. This allows system and equipment functionality to be maintained in the most economical manner.

RCM provides a decision process to identify applicable and effective preventive maintenance requirements, or management actions, for equipment in accordance with the safety, operational and economic consequences of identifiable failures, and the degradation mechanism responsible for those failures. The end result of working through the process is a judgement as to the necessity of performing a maintenance task, design change or other alternatives to effect improvements. The basic steps of an RCM programme are (i) initiation and planning, (ii) functional failure analysis, (iii) task selection, (iv) implementation, and (v) continuous improvement.

Specific RCM objectives are (i) to ensure realization of the inherent safety and reliability level of the equipment, (ii) to restore the equipment to these inherent levels when deterioration occurs, (iii) to get the information necessary for design improvement of those items where their inherent reliability proves to be inadequate, and (iv) to accomplish these goals at a minimum total cost, including maintenance costs, support costs, and economic consequences of operational failures.

An RCM analysis, when properly conducted, is to answer seven questions, which are (i) what are the system functions and associated performance standards, (ii) how can the system fail to fulfill these functions, (iii) what can cause a functional failure, (iv) what happens when a failure occurs, (v) what are the consequences when the failure occurs, (vi) what can be done to detect and prevent the failure, and (vii) what is to be done if a maintenance task cannot be found.

Typically, the tools and expertise which are used to perform RCM analyses are (i) FMEA (failure modes and effects analysis) / FMECA (failure modes, effects, and criticality analysis), which help answer the first five questions, (ii) the RCM decision flow diagram, which helps answer the last two questions, (iii) design, engineering and operational knowledge of the system, (iv) condition-monitoring techniques, and (v) risk-based decision making (e.g., the frequency and the consequence of a failure in terms of its impact on safety, the environment and commercial operations).
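
The risk-based ranking mentioned above can be sketched as follows. This is a minimal illustration only. The `FailureMode` class, the 1 to 5 consequence scale, the simple frequency-times-consequence score, and the sample values are all assumptions made for the example and are not taken from any standard.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA-style worksheet."""
    name: str
    failures_per_year: float  # assumed frequency estimate
    consequence: int          # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def risk(self) -> float:
        # Simple risk score: frequency multiplied by consequence.
        return self.failures_per_year * self.consequence


def rank_by_risk(modes):
    """Return the failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Illustrative values only, not measured data.
modes = [
    FailureMode("bearing wear", 0.5, 3),
    FailureMode("seal leak", 2.0, 1),
    FailureMode("winding failure", 0.1, 5),
]
ranked = rank_by_risk(modes)
print([m.name for m in ranked])
```

In practice, organizations frequently use a risk matrix with their own frequency and consequence classes rather than a single multiplied score.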

All tasks are based on safety in respect of personnel and environment, and on operational or economic concerns. However, it is to be noted that the criteria considered depend on the nature of the product and its application. As an example, a production process needs to be economically viable, and can be sensitive to strict environmental considerations, whereas an item of defence equipment is to be operationally successful, but can have less stringent safety, economic, and environmental criteria.

Maximum benefit can be achieved from an RCM analysis if it is conducted at the design stage, so that feedback from the analysis can influence design. However, RCM is also worthwhile during the operation and maintenance phase to improve existing maintenance tasks, make necessary modifications or other alternatives.

Successful application of RCM needs a good understanding of the equipment and structure, as well as the operational environment, operating context, and the associated systems, together with the possible failures and their consequences. Maximum benefit can be achieved through targeting of the analysis to where failures have serious safety, environmental, economic, or operational effects.

RCM improves maintenance effectiveness and provides a mechanism for managing maintenance with a high degree of control and awareness. Potential benefits can be (i) system dependability can be increased by using more appropriate maintenance activities, (ii) overall costs can be reduced by more efficient planned maintenance effort, (iii) a fully documented audit trail is produced, (iv) a process to review and revise the failure management policies in the future can be implemented with relatively minimum effort, (v) maintenance managers have a management tool which improves control and direction, and (vi) maintenance organization gets an improved understanding of its objectives and purpose and the reasons for which it is performing the scheduled maintenance tasks.

The maintenance programme is a list of all the maintenance tasks developed for a system for a given operating context and maintenance concept, including those arising from the RCM process. Maintenance programmes are normally composed of an initial programme and an ongoing, ‘dynamic’ programme. There are some principal factors which need to be considered in the development stage, that is before operation, and those which are used to update the programme, based on operational experience, once the product is in service.

The initial maintenance programme, which is frequently a collaborative effort between the supplier and the user, is defined prior to operation and can include tasks based on the RCM methodology. The on-going maintenance programme, which is a development of the initial programme, is initiated as soon as possible by the user once operation begins, and is based on actual degradation or failure data, changes in operating context, advances in technology, materials, maintenance techniques and tools. The on-going programme is maintained using RCM methodologies. The initial maintenance programme is updated to reflect changes made to the programme during operation.

An initial RCM programme can also be initiated when the product is already in service, in order to renew and improve on an existing maintenance programme which was based on experience or manufacturer’s recommendations but was developed without the benefit of a standard approach such as RCM.

Fig 1 shows the overall RCM process given in IEC standard IEC 60300-3-11, 2009. The process is divided into five steps. It can be seen from this figure that RCM provides a comprehensive programme which addresses not just the analysis process but also the preliminary and follow-on activities necessary to ensure that the RCM effort achieves the desired results. The RCM process can be applied to all types of systems.

Fig 1 Overview of the RCM process

The RCM process is formalized by documenting and implementing (i) the analyses and the decisions taken, (ii) progressive improvements based on operational and maintenance experience, and (iii) clear audit trails of maintenance actions taken and improvements made. Once these are documented and implemented, this process is an effective system to ensure reliable and safe operation of an engineered system. Such a maintenance management system is called an RCM system.

RCM integrates preventive maintenance, predictive testing and inspection, repair (also called reactive maintenance), and proactive maintenance to increase the probability that a machine or component is going to function in the required manner over its design life-cycle with a minimum amount of maintenance and downtime. These principal maintenance strategies, rather than being applied independently, are optimally integrated to take advantage of their respective strengths, and maximize facility and equipment reliability while minimizing life-cycle costs. The goal of this approach is to reduce the life-cycle cost of a facility to a minimum while continuing to allow the facility to function as intended with the required reliability and availability. The basic application of each strategy is shown in Fig 2.

Fig 2 Components of a reliability centered maintenance programme

RCM needs maintenance decisions to be supported by sound technical and economic justification. The RCM approach also considers the consequence of failure of a given component. For example, an identical make and model of exhaust fan can be used to support restroom operations or as part of a smoke / purge system. The consequence of failure and the maintenance approach of the two units are different, based on the system used.

RCM programmes can be implemented and conducted in several ways and use different kinds of information. One technique is based on rigorous FMEA / FMECA, complete with mathematically-calculated probabilities of failure based on a combination of design data, historical data, intuition, common-sense, experimental data, and modelling. This approach is broken into two categories namely (i) rigorous, and (ii) intuitive. The decision as to how the RCM programme is implemented is to be made by the end user based on (i) consequences of failure, (ii) probability of failure, (iii) historical data, and (iv) risk tolerance (objective criticality).

A rigorous RCM approach (also known as classical RCM) provides the majority of the knowledge and data concerning system functions, failure modes, and maintenance actions addressing functional failures. Rigorous RCM analysis is the method which produces the most complete documentation, compared to other RCM methods. A formal rigorous RCM analysis of each system, sub-system, and component is normally performed on new, unique, high-cost systems and structures. This approach is rarely needed for most facilities and collateral equipment items since their construction and failure modes are well understood.

Rigorous RCM is based primarily on the FMEA / FMECA and includes probabilities of failure and system reliability calculations but with little or no analysis of historical performance data. Rigorous RCM is labour intensive and frequently postpones the implementation of obvious predictive testing and inspection tasks. A rigorous RCM approach has been used extensively in those industries where functional failures have the potential to result in large losses of life, national security implications, or extreme environmental impact. The analysis is used to determine appropriate maintenance tasks or possible redesign requirements in order to address each of the identified failure modes and their consequences.

A rigorous RCM approach is to be limited to three situations namely (i) the consequences of failure result in catastrophic risk in terms of environment, health, safety, or complete economic failure of the unit, (ii) the resultant reliability and associated maintenance cost is still unacceptable after performing and implementing a streamlined-type FMEA /FMECA, and (iii) the system or equipment is new to the organization and insufficient corporate maintenance and operational knowledge exists on its function and functional failures.

Candidates for rigorous RCM analysis include, but are not limited to, wind tunnel drive motors, super-computer facilities, and facility support systems where single points of failure exist. In addition, a rigorous RCM analysis can be needed for those systems and components where the intuitive RCM approach has been utilized and the resultant reliability is still unacceptable in terms of security, safety, environmental, life-cycle cost, or mission impact.

An intuitive RCM approach (also known as streamlined or abbreviated RCM) is typically more appropriate for facility systems because of the high analysis cost of the rigorous approach, the relative low impact of failure of most facilities systems, the type of systems and components maintained, and the quantity of redundant systems in place. The intuitive RCM approach uses the same principles as the rigorous RCM approach, but recognizes that not all failure modes are to be analyzed.

An intuitive RCM approach identifies and implements obvious, condition-based maintenance (CBM) tasks with minimal analysis. Low value maintenance tasks are discarded or eliminated based on historical data and ‘maintenance and operations’ (M&O) personnel input. The intent is to minimize the initial analysis time in order to help offset the cost of the FMEA / FMECA and condition monitoring capabilities development. Errors can be introduced into the RCM process by reliance on historical records and personnel knowledge, creating a possibility of not detecting hidden, low-probability failures. The intuitive process requires that at least one individual has a thorough understanding of the different predictive testing and inspection technologies.

An intuitive RCM approach is to be applied in situations where (i) the function of the system / equipment is well understood, and (ii) functional failure of the system or equipment does not result in loss of life, catastrophic impact on the environment, or economic failure of the organizational unit. An intuitive RCM approach is desired for facilities and collateral equipment with the understanding that a more rigorous approach can be warranted in certain situations.
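
The choice between the rigorous and the intuitive approach described above can be summarized in a small decision function. This is only a sketch of the three situations listed for rigorous RCM. The function name and the boolean inputs are hypothetical, and a real decision also weighs historical data and the risk tolerance of the organization.

```python
def select_rcm_approach(catastrophic_consequence: bool,
                        unacceptable_after_streamlined: bool,
                        new_with_insufficient_knowledge: bool) -> str:
    """Pick an RCM approach from three yes/no screening answers.

    Any one of the three situations is taken to justify the rigorous
    (classical) approach; otherwise the intuitive (streamlined)
    approach is returned.
    """
    if (catastrophic_consequence
            or unacceptable_after_streamlined
            or new_with_insufficient_knowledge):
        return "rigorous"
    return "intuitive"


# A well-understood system with no catastrophic failure consequence.
print(select_rcm_approach(False, False, False))
# A system whose failure can cause loss of life.
print(select_rcm_approach(True, False, False))
```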

RCM principles – RCM focuses on several principles. The first is function-oriented, i.e., RCM seeks to preserve system or equipment function, not just operability for operability’s sake. Redundancy of function through redundant equipment improves functional reliability but increases life-cycle cost in terms of procurement and operating costs.

The second is system-focused, i.e., RCM is more concerned with maintaining system function than individual component function.

The third is reliability-centered, i.e., RCM treats failure statistics in an actuarial manner. The relationship between operating age and the failures experienced is important. RCM is not overly concerned with simple failure rate. It seeks to know the conditional probability of failure at specific ages (the probability that failure occurs in each given operating age bracket).
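
The difference between a simple failure rate and the conditional probability of failure per age bracket can be illustrated with a short calculation. The sketch below assumes that, for each age bracket, the number of failures is known for a fixed starting population and that failed units are not replaced. The function name and the counts are hypothetical.

```python
def conditional_failure_probability(failures_per_bracket, initial_population):
    """Conditional probability of failure for each operating age bracket.

    For every bracket, the failures in that bracket are divided by the
    number of units which survived to the start of the bracket.
    """
    survivors = initial_population
    probabilities = []
    for failed in failures_per_bracket:
        if survivors <= 0:
            probabilities.append(None)  # no units left to fail
        else:
            probabilities.append(failed / survivors)
        survivors -= failed
    return probabilities


# Illustrative counts only: 100 units, failures in three age brackets.
probs = conditional_failure_probability([10, 9, 27], 100)
print(probs)
```

In this made-up data set the first two brackets show the same conditional probability (10 of 100, then 9 of the 90 survivors), while the third bracket (27 of 81 survivors) shows a clear rise, which is the kind of age relationship the actuarial view is meant to reveal.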

The fourth is acknowledgement of design limitations, i.e., the objective of RCM is to maintain the inherent reliability of the equipment design, recognizing that changes in inherent reliability are the province of design rather than maintenance. Maintenance can only achieve and maintain the level of reliability for equipment which is provided for by design. RCM recognizes that maintenance feedback can improve on the original design. RCM recognizes that a difference frequently exists between the perceived design life and the intrinsic or actual design life, and addresses this through the ‘age exploration’ (AE) process.

The fifth is safety, security, and economics. Safety and security are to be ensured at any cost while life-cycle cost-effectiveness is a tertiary criterion.

The sixth is failure as any unsatisfactory condition. Failure can be either a loss of function (operation ceases) or a loss of acceptable quality (operation continues).

The seventh is the logic tree to screen maintenance tasks. This provides a consistent approach to the maintenance of all equipment.

The eighth is tasks are to be applicable. Tasks are to address the failure mode and consider the failure mode characteristics.

The ninth is tasks are to be effective. Tasks are to reduce the probability of failure and be cost-effective.

The three types of maintenance tasks are (i) time-directed tasks (preventive maintenance), (ii) condition-directed tasks (predictive testing and inspection directed), and (iii) failure-finding tasks (one of several aspects of proactive maintenance). Time-directed tasks are scheduled when appropriate. Condition-directed tasks are performed when conditions indicate they are needed. Failure-finding tasks detect hidden functions which have failed without giving evidence of failure and are normally time directed. Run-to-failure (RTF) is a conscious decision and is acceptable for some equipment.

Living system – RCM gathers data from the results achieved and feeds this data back to improve design and future maintenance. This feedback is an important part of the proactive maintenance element of the RCM programme. It is to be noted that the maintenance analysis process, as shown in Fig 3, has only four possible outcomes, namely (i) perform interval- (time or cycle) based actions, (ii) perform condition-based (predictive testing and inspection directed) actions, (iii) perform no action and choose to repair following failure, and (iv) determine that no maintenance action is going to reduce the probability of failure and that failure is not the chosen outcome (redesign or redundancy). Regardless of the technique used to determine the maintenance approach, the approach is to be reassessed and validated. Fig 3 depicts an iterative RCM process which can be used for a majority of facilities and collateral equipment.
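
The four possible outcomes listed above can be expressed as a small screening function. This is a simplified sketch and not the logic tree of any standard. The names are hypothetical, and the order of preference (condition-based first, then interval-based) is an assumption made for the example.

```python
from enum import Enum


class Outcome(Enum):
    INTERVAL_BASED = "perform interval-based actions"
    CONDITION_BASED = "perform condition-based actions"
    RUN_TO_FAILURE = "perform no action and repair following failure"
    REDESIGN = "redesign or add redundancy"


def select_outcome(condition_task_works: bool,
                   interval_task_works: bool,
                   failure_acceptable: bool) -> Outcome:
    """Map three screening answers onto one of the four outcomes.

    A task 'works' here when it is both applicable and effective.
    """
    if condition_task_works:
        return Outcome.CONDITION_BASED
    if interval_task_works:
        return Outcome.INTERVAL_BASED
    if failure_acceptable:
        return Outcome.RUN_TO_FAILURE
    # No task reduces the probability of failure and failure is not acceptable.
    return Outcome.REDESIGN


print(select_outcome(False, False, False).value)
```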

RCM analysis – RCM analysis carefully considers these questions (i) what does the system or equipment do and what are its functions, (ii) what functional failures are likely to occur, (iii) what are the likely consequences of these functional failures, and (iv) what can be done to reduce the probability of the failure, identify the onset of failure, or reduce the consequences of the failure. Fig 3 shows the RCM approach and the interactive streamlined process.

Fig 3 RCM analysis considerations

Benefits of RCM – Benefits of RCM include safety, security, cost, reliability, scheduling, and efficiency. The safety policy of an organization is normally to avoid loss of life, personal injury, illness, property loss, property damage, and environmental harm, and to ensure safe and healthy conditions for persons working at or visiting its facilities. The RCM approach supports the analysis, monitoring, early and decisive action, and thorough documentation which are characteristic of the organizational safety policy. Also, an RCM approach provides improved reliability of physical barriers (such as potential barriers and motor / hydraulic gates) and emergency power supplies (such as generators and UPS systems) by adding predictive testing and inspection tasks.

Because of the initial investment needed for getting the technological tools, training, and equipment condition baselines, a new RCM programme typically results in an increase in maintenance costs. This increase is relatively short-lived, averaging two to three years. The cost of repair decreases as failures are prevented and preventive maintenance tasks are replaced by condition monitoring. The net effect is a reduction of both repair and total maintenance costs. Frequently energy savings are also realized from the use of predictive testing and inspection techniques. Fig 4 shows the effect on maintenance and repair costs.

Fig 4 Effect on maintenance and repair costs

Fig 4 shows the bow wave effect experienced during the RCM implementation life-cycle. Initially, maintenance and operational costs increase for two reasons, namely (i) RCM implementation costs include procurement of test equipment, software, training of existing employees, and in some cases the recruitment of new personnel and use of consultants, and (ii) as more sophisticated testing techniques are used, more faults are detected, resulting in more repairs until all of the potential failures have been addressed and / or mitigated. The specific cost of maintenance per unit of product decreases as the practice changes from run-to-failure to preventive maintenance, to predictive maintenance, and finally to an RCM-based practice.

RCM places high emphasis on improving equipment reliability through the feedback of maintenance experience and equipment condition data to facility planners, designers, maintenance managers, operators, and manufacturers. This information is instrumental for continually upgrading the equipment specifications for increased reliability. The increased reliability which comes from RCM leads to fewer equipment failures, higher availability for organizational support, and lower maintenance costs.

The ability of a preventive maintenance programme to forecast maintenance provides time for planning, getting replacement parts, and arranging environmental and operating conditions before the maintenance is done. Predictive testing and inspection reduce the unnecessary maintenance performed by a time-scheduled maintenance programme which is driven by the minimum ‘safe’ intervals between maintenance tasks. A principal advantage of RCM is that it gets the maximum use from equipment. With RCM, equipment replacement is based on actual equipment condition rather than a pre-determined, generic length of life.

Safety is the primary concern of RCM. The secondary concern of RCM is cost-effectiveness. Cost-effectiveness takes into consideration the priority or organization’s criticality and then matches a level of cost appropriate to that priority. The flexibility of the RCM approach to maintenance ensures that the proper type of maintenance is performed on equipment when it is needed. Existing maintenance which is not cost-effective is identified and not performed.

Impact of RCM on the facility life-cycle – The life-cycle of a facility is frequently divided into two broad stages namely, (i) acquisition (planning, design, construction, and commissioning), and (ii) operations. RCM affects all phases of the acquisition and operations stages to some degree, as shown in Fig 5. Decisions made early in the acquisition cycle affect the life-cycle cost of a facility. Even though expenditures for plant and equipment can occur later during the acquisition process, their cost is committed at an early stage.

Fig 5 Life cycle cost commitment

The planning phase (including conceptual design) of the facility life-cycle fixes two-thirds of the facility’s overall life-cycle costs. The subsequent design phases determine an additional 29 % of the life-cycle cost, leaving only around 5 % of the life-cycle cost which can be impacted by the later phases. The decision to include a facility in an RCM programme is best made during the planning phase. An early decision in favour of RCM allows for a more substantial, beneficial impact on the life-cycle cost of a facility. Delaying a decision to include a facility in an RCM programme decreases the overall beneficial impact on the life-cycle cost of a facility. RCM is capable of introducing considerable savings during the operation and maintenance phase of the facility’s life. Savings of 30 % to 50 % in the annual maintenance budget are frequently achieved through the introduction of a balanced RCM programme.
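
The arithmetic behind these shares can be checked with a few lines. The sketch below uses only the percentages quoted above. The function names and the budget figure are hypothetical.

```python
PLANNING_SHARE = 2 / 3   # life-cycle cost fixed during planning
DESIGN_SHARE = 0.29      # additional share determined during design


def remaining_share() -> float:
    """Share of life-cycle cost still open to influence after design."""
    return 1.0 - (PLANNING_SHARE + DESIGN_SHARE)


def annual_savings_range(annual_budget: float):
    """Range of savings for the quoted 30 % to 50 % reduction."""
    return (0.30 * annual_budget, 0.50 * annual_budget)


print(f"remaining share: {remaining_share():.1%}")
# Hypothetical annual maintenance budget of 1,000,000 currency units.
print(annual_savings_range(1_000_000))
```

The remaining share works out to a little over 4 %, which is consistent with the figure of around 5 % given in the text.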

RCM and failure analysis – Failure is the cessation of proper function or performance. RCM examines failure at several levels such as the system level, sub-system level, component level, and the parts level.

The objective of effective maintenance in the organization is to provide the needed system performance at the minimum cost. This means that the maintenance approach is to be based upon a clear understanding of failure at each of the system levels. System components can be degraded or even failed and still not cause a system failure. For example, a failed parking lamp on an automobile has little effect on the overall performance of the automobile as a transportation system.

Fig 6 depicts the entire life-cycle of a system from the moment of design intent (minimum required performance + design safety margin) through degradation, to functional failure, and subsequent restoration. In Fig 6, failure occurs the moment system performance drops below the point of minimum needed performance. The role of the operation and maintenance personnel is to recognize the margin to failure, estimate the time of failure, and pre-plan needed repairs in order to minimize the ‘Mean Time to Repair’ (MTTR) and associated downtime in order to achieve the maximum overall equipment effectiveness (OEE) within budgetary constraints.

Fig 6 System performance curve
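
The link between MTTR, downtime, and OEE mentioned above can be sketched with two commonly used definitions, namely availability as MTBF / (MTBF + MTTR) and OEE as the product of availability, performance, and quality. These definitions and the ‘mean time between failures’ (MTBF) input are not taken from this article, and the numbers are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Inherent availability from mean time between failures and repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def oee(availability_ratio: float,
        performance_ratio: float,
        quality_ratio: float) -> float:
    """Overall equipment effectiveness as the product of its three factors."""
    return availability_ratio * performance_ratio * quality_ratio


# Illustrative values: halving the repair time raises availability.
before = availability(900.0, 100.0)
after = availability(900.0, 50.0)
print(round(before, 3), round(after, 3))
print(round(oee(before, 0.95, 0.99), 4))
```

Pre-planned repairs act on the MTTR term, which is why they raise availability even when the number of failures is unchanged.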

Predictive testing and inspection measure the base-line system and component performance and the quantity of degradation. They forecast impending failure in a timely manner so that repairs can be performed prior to a catastrophe. As shown in Fig 7, the point of initial degradation and the point of initial detection rarely coincide. It is necessary that the interval between tasks be initially established conservatively. At least the maintenance tasks are to be performed between the point of initial degradation and point of initial detection.

Fig 7 Modified P-F curve

Preventive maintenance task intervals are to be set based on location, application, and operating environment. Maintenance schedules are to be modified for remote locations where parts and service are not readily available. For the predictive testing and inspection and the preventive maintenance tasks, the difference between the actual system performance and the minimum needed performance is to be considered. For example, a fan operating at an overall vibration level of 5 millimetres per second (mm/sec) is closer to the point of functional failure than an identical unit operating at an overall vibration level of 2.5 mm/sec and is to be monitored more frequently.
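
The fan example can be put into numbers with a simple margin calculation. The sketch below assumes a functional failure limit of 10 mm/sec purely for illustration. The limit and the function name are hypothetical and are not taken from any vibration standard.

```python
def margin_to_failure(level_mm_s: float, failure_limit_mm_s: float) -> float:
    """Fraction of the failure limit still available before functional failure."""
    return (failure_limit_mm_s - level_mm_s) / failure_limit_mm_s


ASSUMED_LIMIT_MM_S = 10.0  # hypothetical functional failure limit

fan_a = margin_to_failure(5.0, ASSUMED_LIMIT_MM_S)
fan_b = margin_to_failure(2.5, ASSUMED_LIMIT_MM_S)
print(fan_a, fan_b)
```

The fan with the smaller margin (the one operating at 5 mm/sec) is the one to be monitored more frequently.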

Fig 7 and Fig 8 provide a conceptual degradation detection graph which shows the baseline state as well as the onset of degradation from initial detection, to alert status, and finally to removal from service since failure is imminent. Although the actual moment of failure for the majority of the systems and components is not known, the fact that failure is imminent is known.

Fig 8 suggests there is a steady, non-linear progress from baseline condition until potential failure and recommended removal. Analysis of the data is needed to observe any changes in slope of the plotted data. As the failure point is approached, the resistance of the object to failure frequently decreases in an exponential manner. In this situation, catastrophic failure occurs almost without warning. Once the alert limit has been exceeded, the monitoring interval is to be reduced to between one-third and one-quarter of the prior interval. For example, if vibration data was being collected quarterly, the new interval is to be between three and four weeks. As the vibration levels continue to rise, the monitoring interval is to continue to be reduced. Conversely, if the readings stabilize, the monitoring interval can be increased.

Fig 8 Conceptual degradation detection graph
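
The interval reduction rule described above can be sketched as follows. The function name, the use of days as the unit, and the sample readings are assumptions made for the example. Only the one-third to one-quarter rule comes from the text.

```python
def next_interval_range(prior_days: float, reading: float, alert_limit: float):
    """Return the (shortest, longest) next monitoring interval in days.

    Once the reading exceeds the alert limit, the interval is cut to
    between one-quarter and one-third of the prior interval. Otherwise
    the prior interval is kept unchanged.
    """
    if reading > alert_limit:
        return (prior_days / 4, prior_days / 3)
    return (prior_days, prior_days)


# Quarterly monitoring (about 91 days) with a reading above the alert limit.
shortest, longest = next_interval_range(91, reading=6.0, alert_limit=4.5)
print(round(shortest / 7, 1), "to", round(longest / 7, 1), "weeks")
```

For a quarterly interval this gives roughly three to four weeks, in line with the example in the text.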

The concepts presented in Fig 6, Fig 7, and Fig 8 are to be fully understood and adhered to if the full benefits of RCM and predictive testing and inspection are to be realized. Condition monitoring activities are required to occur frequently in order to forecast the potential failure. As shown in Fig 8, prediction error bounds do exist and are frequently referred to as a probability of detection and probability of false alarm.
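
The probability of detection and the probability of false alarm can be estimated from inspection records using their usual definitions. In the sketch below the function name and the counts are hypothetical, and the definitions are the standard ones rather than anything specific to this article.

```python
def detection_rates(detected_faults: int, missed_faults: int,
                    false_alarms: int, correct_all_clear: int):
    """Return (probability of detection, probability of false alarm).

    Detection is measured against all real faults, and false alarm
    against all inspections of healthy equipment.
    """
    pod = detected_faults / (detected_faults + missed_faults)
    pfa = false_alarms / (false_alarms + correct_all_clear)
    return pod, pfa


# Illustrative counts only.
pod, pfa = detection_rates(detected_faults=18, missed_faults=2,
                           false_alarms=5, correct_all_clear=95)
print(pod, pfa)
```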

System and system boundary – A system is any user-defined group of components, equipment, or facilities which support an operational function. These operational functions are defined by objective criticality or by environmental, health, safety, regulatory, quality, or other organizational requirements. The majority of systems can be divided into unique sub-systems along user-defined boundaries. The boundaries are selected as a method of dividing a system into sub-systems when its complexity makes an analysis by other means difficult.

A system boundary or interface definition contains a description of the inputs and outputs across each boundary as well as the power requirements and instrumentation and control. The facility envelope is the physical barrier created by a building, enclosure, or other structure (e.g., a cooling tower or tank) plus 1.5 metres (m).

Function and functional failure – The function defines the performance expectation and its several elements can include physical properties, operation performance (including output tolerances), and time requirements (such as continuous operation or limited required availability). A system performance curve similar to Fig 7 exists for each operating parameter. Functional failures describe the different ways in which a system or sub-system can fail to meet the functional requirements designed into the equipment.

A system or sub-system which is operating in a degraded state, but which still meets all of the stated functional requirements, has not experienced a functional failure. It is important to determine all the functions of an item which are significant in a given operational context. By clearly defining the non-performance of each function, the functional failure becomes clearly defined. For example, it is not enough to define the function of a pump as moving water. The function of the pump is required to be defined in terms of flow rate, discharge and suction pressure, efficiency, etc.

Failure modes – Failure modes are equipment-specific failures and component-specific failures which result in the functional failure of the system or sub-system. As an example, a machinery train composed of a motor and pump can fail catastrophically because of the complete failure of the windings, bearings, shaft, impeller, controller, or seals. A functional failure also occurs if the pump performance degrades such that there is insufficient discharge pressure or flow to meet the operational requirements. These operational requirements are to be considered when developing maintenance tasks. Dominant failure modes are those failure modes responsible for a considerable proportion of all the failures of the item. They are the most common modes of failure. Not all failure modes or causes warrant preventive or predictive maintenance since the likelihood of occurrence can be remote or the effect inconsequential.

Reliability – Reliability [R(t)] is the probability that an item survives a given operating period, under specified operating conditions, without failure. The conditional probability of failure [P(t/t1)] measures the probability that an item entering a given age interval fails during that interval. The item shows wear-out characteristics if the conditional probability of failure increases with age. If the conditional probability of failure is constant with age, the resulting failure distribution is exponential and applies to the majority of facilities equipment.
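The difference between a constant and an increasing conditional probability of failure can be illustrated numerically. The sketch below assumes a Weibull reliability function, where a shape parameter of 1 reduces to the exponential case. The scale of 10,000 hours, the 1,000-hour age intervals, and the function names are hypothetical values chosen only for illustration.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)**shape); shape == 1 is the exponential case."""
    return math.exp(-((t / scale) ** shape))


def conditional_failure_probability(t1, t2, shape, scale):
    """Probability of failing in (t1, t2], given survival to age t1."""
    r1 = weibull_reliability(t1, shape, scale)
    r2 = weibull_reliability(t2, shape, scale)
    return (r1 - r2) / r1


# Hypothetical parameters: 10,000-hour scale, 1,000-hour age intervals.
SCALE = 10_000.0
for label, shape in (("constant (exponential)", 1.0), ("wear-out", 3.0)):
    probs = [
        conditional_failure_probability(age, age + 1_000.0, shape, SCALE)
        for age in (0.0, 5_000.0, 10_000.0)
    ]
    print(label, [round(p, 4) for p in probs])
```

With shape 1 the three values are identical, which is the constant case described above. With shape 3 they rise with age, which is the wear-out case.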

All physical and chemical root causes of failure have time varying hazard rates h(t) and frequently need time-based tasks, normally inspection and / or measurement. The conditional probability of failure reflects the overall adverse effect of age on reliability. It is not a measure of the change in an individual equipment item. Failure frequency, or failure rate, plays a relatively minor role in maintenance programmes since it is too simplistic to gauge much. Failure frequency is useful in making cost decisions and determining maintenance intervals, but does not tell which maintenance tasks are appropriate or the consequences of failure. A maintenance solution is to be evaluated in terms of the safety or economic consequences it is intended to prevent. A maintenance task is to be applicable (i.e., prevent failures or ameliorate failure consequences) in order to be effective.

Failure characteristics – Conditional probability of failure (Pcond) curves fall into six classic failure rate patterns, as shown (Pcond vs. Time) in Fig 9. The percentage of equipment conforming to each of the six wear patterns as determined in three separate studies is also shown in the figure.

Fig 9 Conditional probability of six classic failure rate patterns

The six classic failure rate patterns are described below.

Pattern A or Type E – This pattern follows the bathtub curve, i.e., infant mortality followed by a constant or gradually increasing failure probability and then a pronounced wear-out region. An age limit can be desirable, provided a large number of units survive to the age where wear-out begins. This pattern is typical of an overhauled reciprocating engine.

Pattern B or Type A – In this pattern there is constant or gradually increasing failure probability, followed by a pronounced wear-out region. An age limit can be desirable. This pattern is typical of a reciprocating engine and a pump impeller.

Pattern C or Type F – In this pattern, there is gradually increasing failure probability, but no identifiable wear-out age. Age limit is normally not applicable. This pattern is typical of a gas turbine.

Pattern D or Type C – In this pattern, there is low failure probability when the item is new or just overhauled, followed by a quick increase to a relatively constant level. This pattern is normally found in complex equipment under high stress with test runs after manufacture or restoration, such as hydraulic systems.

Pattern E or Type D – In this pattern, there is relatively constant probability of failure at all ages. Examples of this pattern are roller bearings and ball bearings.

Pattern F or Type B – In this pattern, there is infant mortality, followed by a constant or slowly increasing failure probability. Typical examples of this pattern are electronic equipment and components.

Patterns A and B are typical of single-piece and simple items such as tires, compressor blades, brake pads, and structural members. The majority of complex items have conditional probability curves similar to patterns C, D, E, and F. The basic difference between the failure patterns of complex and simple items has important implications for maintenance. Single-piece and simple items frequently demonstrate a direct relationship between reliability and age. This is particularly true where factors such as metal fatigue or mechanical wear are present or where the items are designed as consumables (short or predictable life spans). In these cases, an age limit based on operating time or stress cycles can be effective in improving the overall reliability of the complex item of which they are a part.

Complex items frequently demonstrate some infant mortality, after which their failure probability increases gradually or remains constant, and a marked wear-out age is not common. In several cases, scheduled overhaul increases the overall failure rate by introducing a high infant mortality rate into an otherwise stable system. These failure characteristics were first noted in the book ‘Reliability Centered Maintenance’ by F. Stanley Nowlan and Howard F. Heap. Follow-on studies in Sweden, and by the US Navy, produced similar results. In these three studies, random failures account for between 77 % and 92 % of the total failures, and age-related failure characteristics account for the remaining 8 % to 23 %.

Preventing failure – Every equipment item has a characteristic which can be called resistance to or margin to failure. Fig 10 shows this concept graphically. The figure shows that the failures can be prevented or item life extended by (i) decreasing the quantity of stress applied to the item (the life of the item is extended for the period f0 – f1 by the stress reduction shown), (ii) increasing or restoring the item’s resistance to failure (the life of the item is extended for the period f1 – f2 by the resistance increase shown), and (iii) decreasing the rate of degradation of the item’s resistance to or margin to failure (the life of the item is extended for the period f2 – f3 by the decreased rate of resistance degradation shown).

Fig 10 Preventing failure

Stress is dependent on use and can be highly variable. A review of the failures of a large number of nominally identical simple items discloses that the majority have about the same age at failure, subject to statistical variation, and that these failures have occurred for the same reason. If preventive maintenance for some simple item is being considered and a way to measure its resistance to failure can be found, then this measurement information can be used to help select a preventive task.

Adding excess material which wears away or is consumed can increase resistance to failure. Excess strength can be provided to compensate for loss from corrosion or fatigue. The most common method of restoring resistance is by replacing the item. The resistance to failure of a simple item decreases with use or time (age), but a complex unit consists of hundreds of interacting simple items (parts) and has a considerable number of failure modes. In the complex case, the mechanisms of failure are the same, but they are operating on several simple component parts simultaneously and interactively. Failures no longer occur for the same reason at the same age. For these complex units, it is unlikely that one can design a maintenance task unless there are few dominant or critical failure modes.

FMEA is applied to each system, sub-system, and component identified in the boundary definition. For every function identified, there can be multiple failure modes. The FMEA addresses each system function, all possible failures, and the dominant failure modes associated with each failure. The FMEA then examines the consequences of failure to determine what effect failure has on the organization or operation, on the system, and on the machine. Even though there are multiple failure modes, frequently the effects of failure are the same or very similar in nature.

From a system function perspective, the outcome of any component failure can result in the system function being degraded. Similar systems and machines frequently have the same failure modes, but the system use determines the failure consequences. For example, the failure modes of a ball bearing are the same regardless of the machine, but the dominant failure mode, cause of failure, and effects of failure change from machine to machine.

There are two new terms identified in FMEA. These are criticality and probability of failure occurrence. Criticality assessment provides the means for quantifying how important a system function is relative to the identified objective. Tab 1 provides a method for ranking system criticality. This system, adapted from the automotive industry, provides 10 categories of criticality / severity. It is not the only method available. The categories can be expanded or contracted to produce an organization-specific listing.

Tab 1 Criticality / severity categories
Ranking | Severity | Description
1 | None | No reason to expect failure to have any effect on safety, health, environment or objective.
2 | Very low | Minor disruption to facility function. Repair to failure can be accomplished during trouble call.
3 | Low | Minor disruption to facility function. Repair to failure may be longer than trouble call but does not delay objective.
4 | Low to moderate | Moderate disruption to facility function. Some portion of the objective can need to be reworked or process delayed.
5 | Moderate | Moderate disruption to facility function. 100 % of objective can need to be reworked or process delayed.
6 | Moderate to high | Moderate disruption to facility function. Some portion of objective is lost. Moderate delay in restoring function.
7 | High | High disruption to facility function. Some portion of the objective is lost. Considerable delay in restoring function.
8 | Very high | High disruption to facility function. All of the objective is lost. Considerable delay in restoring function.
9 | Hazard | Potential safety, health, or environmental issue. Failure occurs with warning.
10 | Hazard | Potential safety, health, or environmental issue. Failure occurs without warning.

Probability of occurrence – The probability of occurrence (of failure) is based on work in the automotive industry. Tab 2 provides one possible method of quantifying the probability of failure. Historical data provides a powerful tool in establishing the ranking. If historical data is not available, a ranking can be estimated based on experience with similar systems in the facility area. The statistical column can be based on operating hours, days, cycles, or any other unit which provides a consistent measurement approach. The statistical bases can be adjusted to account for local conditions.

Tab 2 Probability of occurrence categories
Ranking | Probability of failure | Description
1 | 1/5,000 | Remote probability of occurrence, unreasonable to expect failure to occur.
2 | 1/5,000 | Low failure rate. Similar to past design which has, in the past, had low failure rates for given volume / loads.
3 | 1/2,000 | Low failure rate. Similar to past design which has, in the past, had low failure rates for given volume / loads.
4 | 1/1,000 | Occasional failure rate. Similar to past design which has, in the past, had occasional failure rates for given volume / loads.
5 | 1/500 | Moderate failure rate. Similar to past design which has, in the past, had moderate failure rates for given volume / loads.
6 | 1/200 | Moderate to high failure rate. Similar to past design which has, in the past, had moderate failure rates for given volume / loads.
7 | 1/100 | High failure rate. Similar to past design which has, in the past, had high failure rates which have caused problems.
8 | 1/50 | High failure rate. Similar to past design which has, in the past, had high failure rates which have caused problems.
9 | 1/20 | Very high failure rate. Almost certain to cause problems.
10 | 1/10+ | Very high failure rate. Almost certain to cause problems.
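Where historical data is available, the ranking in Tab 2 can be looked up from an observed failure rate. The sketch below is one possible reading of the table, not a prescribed procedure. The function name is hypothetical, and since Tab 2 lists 1/5,000 for both rank 1 and rank 2, the code assumes that rank 1 means 'rarer than 1 in 5,000'.

```python
# Thresholds copied from Tab 2 (failures per unit of exposure, such as
# operating hours or cycles). ASSUMPTION: rank 1 is any rate below 1/5,000,
# because the table gives the same figure for ranks 1 and 2.
TAB2_THRESHOLDS = [
    (2, 1 / 5000), (3, 1 / 2000), (4, 1 / 1000), (5, 1 / 500),
    (6, 1 / 200), (7, 1 / 100), (8, 1 / 50), (9, 1 / 20), (10, 1 / 10),
]


def occurrence_rank(failures, exposure):
    """Return the highest Tab 2 rank whose rate the observed rate reaches."""
    if exposure <= 0 or failures < 0:
        raise ValueError("exposure must be positive and failures non-negative")
    rate = failures / exposure
    rank = 1
    for candidate, threshold in TAB2_THRESHOLDS:
        if rate >= threshold:
            rank = candidate
    return rank


# Hypothetical history: 3 failures in 2,000 operating hours.
print(occurrence_rank(3, 2000))
```

The unit of exposure only needs to be consistent across the items being ranked, as the text notes for the statistical column.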

Cause of failure – After the function and failure modes are determined, it is necessary to investigate the cause of failure. Without an understanding of the causes of potential failure modes, it is not possible to select applicable and effective maintenance tasks.

A combination of one or more equipment failures and / or human errors causes a loss of system function. The factors which normally influence equipment failure are (i) design error, (ii) faulty material, (iii) improper fabrication and construction, (iv) improper operation, (v) inadequate maintenance, and (vi) maintenance errors. It is to be noted that maintenance does not influence several of these factors. Maintenance is, hence, merely one of the several approaches to improving equipment reliability and, in turn, system reliability.

RCM analyses focus on reducing failures resulting from inadequate maintenance. In addition, RCM aids in identifying premature equipment failures introduced by maintenance errors. In these cases, RCM analyses can recommend improvements for specific maintenance activities, such as improving maintenance procedures, improving employee performance, or adding quality assurance / quality control tasks to verify correct performance of critical maintenance tasks. Besides improving maintenance, RCM analyses can recommend design changes and / or operational improvements when equipment reliability cannot be ensured through maintenance.

For effectively improving equipment reliability through maintenance, design changes, or operational improvement, people are to have an understanding of potential equipment failure mechanisms, their causes, and associated system impacts. Equipment failure is defined as a state or condition in which a component no longer satisfies some aspect of its design intent (i.e., a functional failure has occurred because of the equipment failure). RCM focuses on managing equipment failures which result in functional failures. An effective failure management strategy is to be based on an understanding of the failure mechanism. Equipment shows several different failure modes (i.e., ways in which the equipment fails). Also, the failure mechanism can be different for the different failure modes, and the failure mechanisms can vary during the life of the equipment. To help understand this relationship, Tab 3 shows typical hardware-related equipment failure mechanisms.

Tab 3 Dominant physical failure mechanisms for hardware
Type of failure | Failure mechanism
Mechanical loading failure | Ductile fracture
 | Brittle fracture
 | Mechanical fatigue
 | Stress corrosion cracking
Temperature-related failure | Creep
 | Metallurgical transformation
 | Thermal fatigue

Equipment failure rate and patterns – Depending on the dominant system failure mechanisms, system operation, system operating environment, and system maintenance, specific equipment failure modes show a variety of failure rates and patterns. Statistically, failure rate is expressed in terms of operating time (or another pertinent operating parameter) elapsed before an item of equipment fails. Because of the variable nature of failure time, normally a failure density distribution is used to provide the probability of an item failing after a given operating time.

Depending on the equipment failure mode, a variety of distributions (described below) are used to statistically model the probability of failure. Failure density distributions measure the probability of failure within a given interval (e.g., between time zero and 8,000 hours of operation).
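As a minimal sketch of how a failure density gives the probability of failure within an interval, the code below integrates an exponential (constant failure rate) density between time zero and 8,000 hours and compares the result with the closed form 1 − exp(−λt). The failure rate of one failure per 20,000 operating hours and the function names are hypothetical.

```python
import math


def exponential_pdf(t, rate):
    """Failure density f(t) = rate * exp(-rate * t) for t >= 0."""
    return rate * math.exp(-rate * t)


def probability_of_failure(pdf, start, end, steps=10_000):
    """Integrate a failure density over [start, end] (trapezoid rule)."""
    width = (end - start) / steps
    total = 0.5 * (pdf(start) + pdf(end))
    for i in range(1, steps):
        total += pdf(start + i * width)
    return total * width


# Hypothetical failure rate: one failure per 20,000 operating hours.
RATE = 1.0 / 20_000.0
numeric = probability_of_failure(
    lambda t: exponential_pdf(t, RATE), 0.0, 8_000.0
)
closed_form = 1.0 - math.exp(-RATE * 8_000.0)
print(f"numeric={numeric:.6f} closed_form={closed_form:.6f}")
```

The same integration routine works for any other density function passed in as `pdf`, which is why it is kept separate from the exponential model.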

Exponential distribution – It is a lifetime statistical distribution which assumes a constant failure rate for the product being modelled.

Failure distribution – It is a mathematical model which describes the probability of failures occurring over time. It is also known as the probability density function. This function is integrated to get the probability that the failure time takes a value in a given time interval. This function is the basis for other important reliability functions, including the reliability function, the failure rate function, and the mean life.

Generalized gamma distribution – While not as frequently used for modelling life data as other life distributions, the generalized gamma distribution does have the ability to mimic the attributes of other distributions such as the Weibull or lognormal, based on the values of the distribution’s parameters. While the generalized gamma distribution is not frequently used to model life data by itself, its ability to behave like other more commonly-used life distributions is sometimes used to determine which of those life distributions is to be used to model a particular set of data.

Lognormal distribution – It is a lifetime statistical distribution which is frequently used to model products in which physical fatigue is the prominent contributor to the primary failure mode.

Mixed Weibull distribution – It is a variation of the Weibull distribution used to model data with distinct sub-populations which can represent different failure characteristics over the lifetime of a product. Each sub-population has separate Weibull parameters calculated, and the results are combined in a mixed Weibull distribution to represent all of the sub-populations in one function.

Normal distribution – It is a common lifetime statistical distribution which was developed by the mathematician C. F. Gauss. The distribution is a continuous, bell-shaped distribution which is symmetric about its mean and can take on values from negative infinity to positive infinity.

Probability density function – It is a mathematical model which describes the probability of events occurring over time. This function is integrated to get the probability that the event time takes a value in a given time interval. In life data analysis, the event in question is a failure, and the probability density function is the basis for other important reliability functions, including the reliability function, the failure rate function, and the mean life.

Weibull distribution – It is a statistical distribution frequently used in life data analysis and a common failure distribution used to model equipment failures. Developed by the Swedish mathematician Waloddi Weibull, this distribution is widely used because of its versatility and the fact that the Weibull probability density function can assume different shapes based on the parameter values.

The Weibull distribution uses three parameters in order to ascertain the best fitting probability density function given the data provided. These parameters are the shape parameter, the scale parameter, and the location parameter. When referring to a two-parameter Weibull distribution, the location parameter is omitted. The location parameter is utilized when the data, plotted on a Weibull probability plot, does not fall on a straight line but falls on either a concave up or concave down curve.

The Weibull distribution is used when equipment shows a constant failure rate for a portion of its life followed by an increasing failure rate because of wear-out. In addition, Weibull analysis is used when there are only a small number of failure data. A Weibull graph can be used to determine if the failure is because of (i) infant mortality or wear-in, (ii) random causes, (iii) early wear-out, or (iv) wear-out. This information is helpful in determining an appropriate maintenance strategy. The Weibull graph can also be used to correlate the probability of failure with operating time. These data can be helpful in establishing task intervals for certain types of maintenance tasks (e.g., rebuilding tasks).
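The four categories read from a Weibull graph are normally tied to the fitted shape parameter. The sketch below shows one way this can be coded. The boundary of 4.0 between early wear-out and wear-out and the tolerance around 1.0 are rule-of-thumb assumptions which are not taken from this article, and the function names are hypothetical.

```python
def weibull_hazard(t, shape, scale):
    """Failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def classify_shape(shape, tolerance=0.05, early_wear_out_limit=4.0):
    """Map a Weibull shape parameter to a failure category.

    ASSUMPTION: the tolerance and the early wear-out limit are
    rule-of-thumb values; adjust them to the organization's practice.
    """
    if shape <= 0:
        raise ValueError("shape parameter must be positive")
    if abs(shape - 1.0) <= tolerance:
        return "random"
    if shape < 1.0:
        return "infant mortality or wear-in"
    if shape < early_wear_out_limit:
        return "early wear-out"
    return "wear-out"


# Hypothetical fitted shape parameters for four failure modes.
for shape in (0.5, 1.0, 2.5, 5.0):
    print(shape, classify_shape(shape))
```

A shape below 1 gives a falling failure rate, a shape of 1 gives a constant failure rate, and a shape above 1 gives a rising failure rate, which can be checked with `weibull_hazard` at two different ages.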

Another common statistical measure associated with these distributions is mean time to failure (MTTF). MTTF is the average life to failure for the equipment failure mode, i.e., the mean of the failure distribution. For a symmetric distribution such as the normal distribution, it is also the point at which the areas under the failure distribution curve are equal above and below the point. Determining the MTTF, hence, depends on the type of failure distribution used to model the failure mode.

MTTF data are helpful in determining when to perform certain types of maintenance tasks. For example, if the appropriate maintenance strategy is to rebuild an equipment item, the MTTF data can be used to help set the rebuilding task interval. If the MTTF is represented by a normal distribution and the interval is set at the MTTF, then one can assume that there is a 50 % chance of the item failing before it is rebuilt. If the interval is set less than the MTTF, then the probability of the item failing before being rebuilt is less than 50 %. If the interval is more than the MTTF, then the probability is more than 50 %. The increase or decrease in probability as the interval is moved before or after the MTTF depends on the standard deviation of the distribution.
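The effect of the rebuild interval on the probability of failure before rebuild can be sketched with the normal cumulative distribution function. The MTTF of 10,000 hours and the standard deviation of 2,000 hours are hypothetical figures, and the function names are illustrative only.

```python
import math


def normal_cdf(x, mean, std_dev):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))


def failure_probability_before_rebuild(interval, mttf, std_dev):
    """Chance that the item fails before a rebuild at the given interval,
    assuming normally distributed times to failure."""
    return normal_cdf(interval, mttf, std_dev)


# Hypothetical figures: MTTF 10,000 h, standard deviation 2,000 h.
MTTF, STD_DEV = 10_000.0, 2_000.0
for interval in (8_000.0, 10_000.0, 12_000.0):
    p = failure_probability_before_rebuild(interval, MTTF, STD_DEV)
    print(f"rebuild at {interval:,.0f} h: {p:.1%} chance of earlier failure")
```

An interval set at the MTTF gives 50 %, and a smaller standard deviation makes the probability fall away more sharply as the interval is shortened.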

A more useful measurement, derived from the failure distribution, is the conditional failure rate or lambda. The conditional failure rate is the probability that a failure occurs during the next instant of time, given that the failure has not already occurred before that time. The conditional failure rate, hence, provides additional information about the survival life and is used to show failure patterns. Fig 9 shows the six classic conditional failure rate patterns.


Understanding that equipment failure modes can show different failure patterns has important implications when determining appropriate maintenance strategies. For example, rebuilding or replacing equipment items which do not have distinctive wear-out regions (e.g., patterns C through F in Fig 9) is of little benefit and can actually increase failures as a result of infant mortality and / or human errors during maintenance tasks. For the majority of the equipment failure modes, the specific failure patterns are not known and, fortunately, are not needed to make maintenance decisions.

However, certain failure characteristic information is needed to make maintenance decisions. These characteristics are (i) wear-in failure, which is also known as ‘burn in’ or ‘infant mortality’ failure and which is dominated by ‘weak’ members related to issues such as manufacturing defects and installation / maintenance / start-up errors, (ii) random failure, which is dominated by chance failures caused by sudden stresses, extreme conditions, random human errors, etc. (i.e., failure is not predictable by time), and (iii) wear-out failure, which is dominated by end-of-useful-life issues for equipment. Equipment failure characteristics are shown in Fig 11.

Fig 11 Equipment life periods

By simply identifying which of the three equipment failure characteristics is representative of the equipment failure mode, one gains insight into the proper maintenance strategy. For example, if an equipment failure mode shows a wear-out pattern, rebuilding or replacing the equipment item can be an appropriate strategy. However, if an equipment failure mode is characterized by wear-in failure, replacing or rebuilding the equipment item is not advisable. Finally, a basic understanding of failure rate helps in determining whether maintenance or equipment redesign is necessary. For example, equipment failure modes which show high failure rates (i.e., fail frequently) are normally best addressed by redesign rather than applying more frequent maintenance.

Failure management strategy – Understanding failure rates and failure characteristics allows the determination of an appropriate strategy for managing the failure mode (RCM refers to this as the failure management strategy). Developing and using this understanding is fundamental to RCM and critical to improving equipment reliability. It is no longer considered to be true that the more an item is overhauled, the less likely it is to fail. Unless there is a dominant age-related failure mode, age limits do little or nothing to improve the reliability of complex items. Sometimes, scheduled overhauls can actually increase overall failure rates by introducing infant mortality and / or human errors into otherwise stable systems.

In RCM, the failure management strategy can consist of (i) appropriate proactive maintenance tasks, (ii) equipment redesigns or modifications, or (iii) other operational improvements. The purpose of the proactive maintenance tasks in the failure management strategy is to (i) prevent failures before they occur, or (ii) detect the onset of failures in sufficient time so that the failure can be managed before it occurs.

Equipment redesigns, modifications and operational improvements (RCM refers to these as one-time changes) are attempts to improve equipment whose failure rates are too high or for which proactive maintenance is not effective / efficient. The key issues in determining whether a specific failure management strategy is effective are (i) is the failure management strategy technically feasible, (ii) is an acceptable level of risk achieved when the failure management strategy is implemented, and (iii) is the failure management strategy cost-effective.

The risk-based decision tools and the RCM analysis process provide a more detailed discussion on determining effectiveness of the failure management strategy. In addition to proactive maintenance tasks and one-time changes, servicing tasks and routine inspections can be critical to the failure management strategy. These activities help ensure the equipment failure rate and failure characteristics are as anticipated. For example, the failure rate and failure pattern for a bearing drastically changes if it is not properly lubricated. These proactive maintenance tasks, run-to-failure, one-time changes, and servicing and routine inspections are further described below.

Proactive maintenance tasks – Proactive maintenance tasks are divided into four categories. The first is the planned-maintenance tasks. A planned-maintenance task (also called preventative maintenance) is performed on a specified interval, regardless of the equipment’s condition. The purpose of this type of task is to prevent functional failure before it occurs. Frequently, this type of task is applied when no condition-monitoring task is identified or justified, and the failure mode is characterized by a wear-out region.

RCM further divides planned maintenance into the two sub categories namely (i) restoration task which is a scheduled task that restores the capability of an item at or before a specified interval (age limit) to a level which provides a tolerable probability of survival to the end of another specified interval, and (ii) discard task which is a scheduled task involving discarding an item at or before a specified age limit regardless of its condition at the time. The terms ‘restoration’ and ‘discard’ can be applied to the same task.

The second is condition-monitoring tasks. A condition-monitoring task is a scheduled task used to detect the onset of a failure so that action can be taken to prevent the functional failure. A potential failure is an identifiable condition which indicates that a functional failure is either about to occur or in the process of occurring. Condition-monitoring tasks are only to be chosen when a detectable potential failure condition exists before failure. When choosing maintenance tasks, condition-monitoring tasks are to be considered first, unless a detectable potential failure condition cannot be identified. Condition-monitoring tasks are also referred to as ‘predictive maintenance’.

The third is the combination of tasks. Where neither condition-monitoring nor planned-maintenance tasks on their own seem capable of reducing the risk of the functional failure of the equipment, it is necessary to select a combination of both maintenance tasks. Normally, this approach is used when the condition-monitoring or planned-maintenance task is insufficient to achieve an acceptable risk by itself.

The fourth is the failure-finding tasks. A failure-finding task is a scheduled task used to detect hidden failures when no condition-monitoring or planned-maintenance task is applicable. It is a scheduled function check to determine whether an item performs its needed function if called upon. The majority of these items are standby or protective equipment.

Run-to-failure – Run-to-failure is a failure management strategy which allows an equipment item to run until failure occurs and then a repair is done. This maintenance strategy is acceptable only if the risk of a failure is acceptable without any proactive maintenance tasks.

One-time changes – One-time changes are used to reduce the failure rate or manage failures in which appropriate proactive maintenance tasks are not identified or cannot effectively and efficiently manage the risk. The basic purpose of a one-time change is to alter the failure rate or failure pattern through (i) equipment redesigns or modification, and / or (ii) operational improvements.

One-time changes most effectively address equipment failure modes which result from (i) faulty design and / or material, (ii) improper fabrication and / or construction, (iii) misoperation, and (iv) maintenance errors. These failure mechanisms frequently result in a wear-in failure characteristic and, hence, need a one-time change. When no maintenance strategy can be found which is both applicable and effective in detecting or preventing failure, a one-time change is to be considered. For failure modes which have the highest risk, a one-time change is mandatory. The two types of one-time change are (i) equipment redesign or modifications, and (ii) operational improvements.

Equipment redesign or modifications consist of redesign or modifications which entail physical changes to the equipment or system. Operational improvements can be modifications to the operation of the equipment and / or modifications to the way in which maintenance is performed on the equipment. Operational improvements normally entail changing the operating context, changing operating procedures, providing additional training to the operating personnel or maintenance personnel, or any combination thereof.
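The selection among proactive tasks, run-to-failure, and one-time changes described above can be summarized as a simple decision function. This is only a sketch. The class, field names, and the exact order of the checks are assumptions made for illustration, and an actual RCM analysis also weighs technical feasibility, risk, and cost-effectiveness for each option.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """Hypothetical summary of what is known about one failure mode."""
    detectable_potential_failure: bool   # warning condition exists before failure
    has_wear_out_region: bool            # age-related failure characteristic
    hidden_failure: bool                 # e.g., standby or protective equipment
    risk_acceptable_without_task: bool   # run-to-failure risk is tolerable
    highest_risk: bool = False           # one-time change is mandatory
    single_task_sufficient: bool = True  # one task alone reaches acceptable risk


def select_strategy(mode):
    """Return a failure management strategy (assumed ordering of checks)."""
    if mode.highest_risk:
        return "one-time change"
    if (mode.detectable_potential_failure and mode.has_wear_out_region
            and not mode.single_task_sufficient):
        return "combination of tasks"
    if mode.detectable_potential_failure:
        # Condition monitoring is considered first among maintenance tasks.
        return "condition-monitoring task"
    if mode.has_wear_out_region:
        return "planned-maintenance task"
    if mode.hidden_failure:
        return "failure-finding task"
    if mode.risk_acceptable_without_task:
        return "run-to-failure"
    return "one-time change"


# Hypothetical example: a bearing with a measurable vibration trend.
bearing = FailureMode(True, False, False, False)
print(select_strategy(bearing))
```

The final fall-through reflects the statement that a one-time change is to be considered when no applicable and effective maintenance task can be found.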

Servicing and routine inspection – Servicing and routine inspection are simple tasks intended (i) to ensure that the failure rate and failure pattern remain as predicted by performing routine servicing (e.g., lubrication), and (ii) to spot accidental damage and / or problems resulting from ignorance or negligence. They provide the opportunity to ensure that the general standards of maintenance are satisfactory. These tasks are not based on any explicit potential failure condition. Servicing and routine inspection can also be applied to items which have relatively insignificant failure consequences, yet are not to be ignored (minor leaks, drips, etc.).
