General system failure rate. Reliability and survivability of onboard computing systems (BCVS)

Reliability and survivability of onboard computing systems (BCVS).

Reliability is the property of products to perform the required functions, maintaining their performance indicators within the specified limits for the required period of time.

Survivability is the ability of a computing system to perform its main functions despite damage received and failed hardware elements.

More stringent requirements are imposed on the reliability and survivability of onboard computers and BCVS than on those of general-purpose and personal computers. If the onboard computer fails, the system stops operating and the assigned tasks are not performed, which can lead to irreparable consequences, including human casualties.

Re-solving the problem after the onboard computer or BCVS has been restored is often impossible. For example, if the BCVS of an anti-aircraft missile system malfunctions, the defended object will be destroyed. Even if the system is restored quickly, neither the destruction nor the lost lives can be undone. A failure in avionics can lead to a plane crash or to a spontaneous missile launch. In this case, too, restoring the BCVS will not correct the consequences of the error.

Ensuring high reliability and survivability of the BCVS is complicated by the operating conditions on board: large fluctuations in temperature, humidity and mechanical loads, and high dust content. Restrictions are also imposed on the dimensions and weight of the equipment. This applies mainly to aviation, but it is also of great importance for BCVS in other areas.

Thus, the problem of reliability and survivability of onboard computers and BCVS has a number of peculiarities that stem from the uniqueness of their structure and the nature of the functions they perform.

Providing high reliability and survivability in a complex system can be very costly, complex and time-consuming. However, production difficulties and operational problems that arise when the required level of reliability is not ensured and maintained can cause even greater difficulties.

For example, with a decrease in the reliability of a missile system by 10%, to ensure the same degree of target destruction, an increase of at least 10% in the actual number of combat missiles will be required. These missiles require additional launch pads, test equipment, launch equipment, maintenance personnel and ancillary equipment, which is expensive and time consuming.

The more complex the structure of a computing system, the more difficult it is to ensure reliability and survivability. It should be noted that most of the failures that occurred during launches of guided missiles and artificial satellites in the United States were not caused by a malfunction of some exotic device whose design advanced the state of the art. On the contrary, many failures were caused by malfunctions of functional and structural elements of a previously approved design. Sometimes the elements were manufactured incorrectly, and in other cases there were errors in the work of programmers or maintenance personnel. No detail is too insignificant to be a possible cause of failure. High achievable reliability is largely the result of deep and close attention to detail.

The problem of increasing the reliability and fault tolerance is inherent not only to the BCVS, but also to commercial equipment. For example, in a Google cluster, on average, 1 computer fails per day (that is, about 3% of computers fail over a year). Of course, due to data and code redundancy, these failures are invisible to users, but for the programmer they are a big problem.

The case when a computing system or its part is out of order and further work is impossible without repair is called a failure.

Reliability theory distinguishes three characteristic types of failures that can be inherent in equipment and appear without any human influence.

1. Break-in failures. These failures occur during the early period of operation and in most cases are caused by shortcomings in production technology and by manufacturing defects in the elements of computing systems. They can be eliminated by rejection, running-in and technological testing of the finished product.

2. Wear-out or gradual failures. These are failures arising from the wear of individual parts of the equipment. They are characterized by a gradual change in the parameters of the product or its elements. In the beginning, these failures can manifest themselves as intermittent faults. However, as wear increases, intermittent faults turn into serious hardware failures. These failures are a sign of BCVS aging. They can be partially eliminated by proper operation, good preventive maintenance and timely replacement of worn-out equipment elements.

3. Sudden or catastrophic failures. These failures cannot be eliminated by hardware debugging, proper operation or preventive maintenance. Sudden failures occur by chance and cannot be predicted individually; however, they obey certain laws of probability. Over a sufficiently long period of time, the frequency of sudden failures becomes approximately constant. This happens in any hardware. Examples of sudden failures are open circuits and short circuits. Such a failure usually causes the output to be permanently stuck at either 0 or 1. When sudden failures occur, the elements in which they occurred must be replaced. For this, the computing system must be maintainable and allow for quick servicing in the field.

Intermittent failures, or transient faults, can be distinguished as a separate group. A transient fault is a short-term disruption of the normal operation of the onboard computer, in which one or more of its elements give a random result when performing one or several adjacent operations. After a transient fault, the computing system can function normally for a long time.

Transient faults can be caused by electromagnetic interference, mechanical influences, and so on. They often do not lead to the failure of the whole system, but only change the course of the software because one or several commands are executed incorrectly, which can still lead to catastrophic consequences. The difference between transient faults and failures is that, when the consequences of a transient fault are detected, it is not the hardware that must be restored but the information distorted by the fault.

When discussing faults, it is necessary to mention the so-called schroedinbugs. A schroedinbug is an error with which the computer system functions normally for a long time, but under certain conditions, for example when non-standard operating parameters are set, a failure occurs. Analysis of this failure then reveals that the software of the computing system contains a fundamental error because of which it should not, in principle, have functioned at all.

A schroedinbug can be formed by a complex combination of paired errors, when an error in one place is compensated by an error of the opposite effect in another place. Under a certain set of circumstances, the balance of errors is destroyed, which paralyzes operation.

Thus, the BCVS is characterized by one more property that determines its reliability: error-free, or correct, functioning. Consequently, the reliability of the BCVS is a combination of failure-free operation, correctness of functioning, survivability and maintainability.

The following are used as reliability parameters:

1. Failure rate, λ

2. Mean time between failures, T

3. Probability of failure-free operation for a given time, P

4. Probability of failure, Q

Failure rate

Failure rate is the frequency with which failures occur. If the equipment consists of several elements, then its failure rate is equal to the sum of the failure rates of all elements, the failures of which lead to equipment malfunction.

The failure rate curve versus operating time is shown in the figure below.

At the start of operation (at time t = 0), a large number of elements are put into operation. This collection of elements may initially have a high failure rate because of defective samples. Since the defective elements fail one after another, the failure rate decreases relatively quickly during the running-in period and becomes approximately constant by the start of normal operation (T norm), when the defective elements have already failed and have been replaced with operable ones.

The set of elements that have passed the running-in period has the lowest failure rate, which remains approximately constant until elements begin to fail because of wear (T wear). From this point on, the failure rate begins to increase.

Mean time between failures

Mean time between failures is the ratio of the total hours worked to the total number of failures. During the period of normal operation, when the failure rate is approximately constant, the mean time between failures is the inverse of the failure rate:

T = 1 / λ

Probability of uptime.

The probability of uptime is the expected fraction of devices that will function without failure for a given period of time t:

P(t) = e^(-λ·t) = e^(-t / T)

This formula is valid for all devices that have been run in but are not yet affected by wear. Consequently, the time t cannot exceed the period of normal operation of the devices.
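The relations described above can be illustrated with a short Python sketch. It assumes a constant failure rate during normal operation and the standard exponential law P(t) = exp(-λt); the function names and the sample value of λ are hypothetical and are not taken from the text.

```python
import math


def mtbf_from_rate(failure_rate_per_hour: float) -> float:
    """Mean time between failures for a constant failure rate: T = 1 / lambda."""
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / failure_rate_per_hour


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of failure-free operation: P(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate_per_hour * hours)


lam = 1e-4  # assumed example value, failures per hour
print(mtbf_from_rate(lam))       # mean time between failures, hours
print(reliability(lam, 1000.0))  # probability of surviving 1000 hours
```

The sketch is only meaningful inside the period of normal operation, where the failure rate can be treated as constant.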

A graph showing the probability of failure-free operation versus normal operating time is shown below:

Probability of failure.

The probability of failure is the complement of the probability of failure-free operation: Q = 1 - P.

Nominal failure rate.

The elements of the equipment are designed to withstand certain rated loads: voltage, current, temperature, vibration, humidity, and so on. When the equipment is exposed to such loads during operation, a definite failure rate is observed. This is called the nominal failure rate.

When the total workload or some particular loads or environmental hazards increase beyond the nominal levels, the failure rate rises rather sharply compared to its nominal value. Conversely, the failure rate decreases when the load falls below the nominal level.

For example, if an element is designed to operate at a nominal temperature of 60 degrees, then lowering the temperature by means of a forced cooling system makes it possible to reduce the failure rate. However, if the decrease in temperature entails too large an increase in the number of elements and the weight of the equipment, it may be more advantageous to select elements with a higher nominal operating temperature and use them at a temperature below the nominal. In this case, the equipment can become cheaper and lighter (which is crucial on board an aircraft) than when a forced cooling system is used.

Methods for determining the reliability of BCVS.

When new products are designed and built, the value of the failure rate cannot be determined by mechanical, electrical, chemical or other measurements. Failure rates can be determined by collecting statistical data from reliability testing of the product itself or of similar products.

The probability of failure-free operation during any moment of the test time is expressed by the formula:

The failure rate is determined by the formula:

When measuring the failure rate, it is necessary to maintain a constant number of elements participating in the test by replacing the failed elements with new ones.

Thus, in order to obtain data on the quantitative characteristics of the reliability of the equipment, it is necessary to make a special sample of the equipment for reliability tests. Reliability tests should be carried out under conditions corresponding to the actual operating conditions of the equipment for external influences, the frequency of switching on and changing the power parameters.


At the stage of preliminary and approximate calculations of electrical devices, the main indicators of reliability are calculated.

The main quality indicators of reliability are:

Failure rate

Mean time to failure.

The failure rate λ(t) is the number n(t) of device elements that failed per unit of time, divided by the average total number N(t) of elements operable during the time Δt [9]:

λ(t) = n(t) / (N(t) · Δt),

where Δt is a given period of time.

For example, 1000 elements of the device worked for 500 hours. During this time, 2 elements failed. Hence,

λ(t) = n(t) / (N(t) · Δt) = 2 / (1000 · 500) = 4 · 10^-6 1/h, that is, 4 elements out of a million can be expected to fail in 1 hour.
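The worked example above can be reproduced with a minimal Python sketch. The function name is hypothetical; the numbers are the ones used in the example.

```python
def estimate_failure_rate(failed: int, elements: int, hours: float) -> float:
    """Estimate lambda = n / (N * dt), in failures per element-hour."""
    if elements <= 0 or hours <= 0:
        raise ValueError("elements and hours must be positive")
    return failed / (elements * hours)


# 2 failures among 1000 elements observed for 500 hours
rate = estimate_failure_rate(failed=2, elements=1000, hours=500)
print(f"{rate:.1e} 1/h")  # 4.0e-06 1/h
```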

The failure rates λ(t) of elements are reference data; Appendix D gives the failure rates λ(t) for elements commonly used in circuits.

An electrical device consists of a large number of component elements; therefore, the operational failure rate λ(t) of the entire device is determined as the sum of the failure rates of all elements, according to the formula [11]

λ(t) = k · ∑ n i · λ i (t), with the sum taken over i = 1, ..., m,

where k is a correction factor that takes into account the relative change in the average failure rate of elements, depending on the purpose of the device;

m is the total number of groups of elements;

n i is the number of elements in the i-th group with the same failure rate λ i (t).
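A minimal sketch of this summation over groups of identical elements follows. The function name and the sample groups are assumptions made for illustration; the correction factor k is left at 1.0 because the text gives no value for it.

```python
def device_failure_rate(groups, k=1.0):
    """Return k * sum(n_i * lambda_i) over (count, rate) pairs."""
    return k * sum(count * rate for count, rate in groups)


# Hypothetical groups: (number of elements n_i, failure rate lambda_i in 1/h)
groups = [(8, 0.012e-6), (3, 0.030e-6), (1, 0.023e-6)]
total = device_failure_rate(groups, k=1.0)
print(f"{total:.3e} 1/h")
```

Grouping elements by type keeps the input short: each group contributes its count multiplied by the per-element rate.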

Probability of uptime P (t) represents the probability that within a specified time period t, device failure will not occur. This indicator is determined by the ratio of the number of devices that have worked reliably up to the point in time t to the total number of devices that are operational at the initial moment.



For example, a probability of uptime P(t) = 0.9 for a specified time period t = 500 hours means that one device out of ten (10 - 9 = 1) is expected to fail, and 9 out of 10 devices will operate without failure.

A probability of uptime P(t) = 0.8 for t = 1000 hours means that 20 devices out of a hundred are expected to fail, and 80 devices out of 100 will operate without failure.

A probability of uptime P(t) = 0.975 for t = 2500 hours means that 1000 - 975 = 25 devices out of a thousand are expected to fail, and 975 devices will operate without failure.

Quantitatively, the reliability of a device is estimated as the probability P(t) that the device will perform its functions without failure during the time from 0 to t. The probability of failure-free operation P(t) (its calculated value should not be less than 0.85) is determined by the expression

P(t) = e^(-λ·t) = e^(-t / T 0),

where t is the operating time of the system, h (t is selected from the series: 1000, 2000, 4000, 8000, 10000 hours);

λ is the failure rate of the device, 1 / h;

T 0 - MTBF, h.

Reliability calculation consists in finding the total failure rate λ of the device and the MTBF:

T 0 = 1 / λ. (10.2)

The recovery time of a device in case of failure includes the time to find a faulty item, the time to replace or repair it, and the time to test the device's operability.

The average recovery time T in of electrical devices can be selected from the series 1, 2, 4, 6, 8, 10, 12, 18, 24, 36, 48 hours. Smaller values correspond to highly maintainable devices. The average recovery time T in can be reduced by using built-in monitoring or self-diagnostics, a modular design of components, and easily accessible mounting.

The value of the availability factor K a is determined by the formula

K a = T 0 / (T 0 + T in),

where T 0 is the MTBF, h;

T in is the average recovery time, h.
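As a rough illustration, the availability factor can be computed as follows. The sketch assumes the standard steady-state relation T0 / (T0 + Tin), since the formula itself is not legible in the source; the function name and the sample values are hypothetical.

```python
def availability(mtbf_hours: float, recovery_hours: float) -> float:
    """Availability factor, assumed here to be T0 / (T0 + Tin)."""
    if mtbf_hours <= 0 or recovery_hours < 0:
        raise ValueError("MTBF must be positive and recovery time non-negative")
    return mtbf_hours / (mtbf_hours + recovery_hours)


# Assumed example: MTBF of 1000 h, average recovery time of 24 h
print(availability(1000.0, 24.0))
```

A shorter recovery time moves the factor toward 1, which is why built-in diagnostics and modular design improve availability even when the failure rate itself is unchanged.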

The reliability of the elements largely depends on their electrical and thermal operating conditions. To increase reliability, the elements must be used in derated (light) modes, which are determined by the load factors.

The load factor is the ratio of the calculated parameter of the element in the operating mode to its maximum allowable value. The load factors of different elements can vary greatly.

When calculating the reliability of a device, all elements of the system are divided into groups of elements of the same type and the same load factors K n.

The failure rate of the i-th element is determined by the formula

λ i = λ 0i · K n i, (10.3)

where K n i is the load factor, calculated from the operating mode charts or set on the assumption that the element operates in normal modes (the values of the load factors of the elements are given in Appendix D);

λ 0i is the basic failure rate of the i-th element, given in Appendix D.

Often, to calculate the reliability, the data on the failure rate λ 0і of analogs of elements are used.

Consider an example of calculating the reliability of a device consisting of a purchased imported BT-85W complex and a power supply developed on the basis of a mass-produced one.

The failure rate of imported products is determined as the reciprocal of the operating time (sometimes the warranty service period of the product is taken), based on operation for a certain number of hours per day.

The warranty period of the purchased imported product is 5 years, and the product will work 14.24 hours a day:

T = 14.24 h × 365 days × 5 years ≈ 25981 h (MTBF).

λ = 1 / T = 1 / 25981 ≈ 38.4897 · 10^-6 1/h (failure rate).

The initial data and calculations, performed in Excel, are given in Tables 10.1 and 10.2. An example of a calculation is given in Table 10.1.

Table 10.1 - Calculation of system reliability

Name and type of element or analogue | Load factor K n i | λ i, 10^-6 1/h | λ i · K n i, 10^-6 1/h | Number n i | n i · λ i, 10^-6 1/h
Complex BT-85W | 1.00 | 38.4897 | 38.4897 | | 38.4897
Capacitor K53 | 0.60 | 0.0200 | 0.0120 | | 0.0960
Socket (plug) SNP268 | 0.60 | 0.0500 | 0.0300 | | 0.0900
Chip TRS | 0.50 | 0.0460 | 0.0230 | | 0.0230
OMLT resistor | 0.60 | 0.0200 | 0.0120 | | 0.0120
Fuse link VP1-1 | 0.30 | 0.1040 | 0.0312 | | 0.0312
Zener diode 12V | 0.50 | 0.4050 | 0.2500 | | 0.4050
Indicator 3L341G | 0.20 | 0.3375 | 0.0675 | | 0.0675
Push button switch | 0.30 | 0.0100 | 0.0030 | | 0.0030
Photodiode | 0.50 | 0.0172 | 0.0086 | | 0.0086
Welded connection | 0.40 | 0.0001 | 0.0004 | | 0.0004
Wire, m | 0.20 | 0.0100 | 0.0020 | 0.2 | 0.0004
Solder connection | 0.50 | 0.0030 | 0.0015 | | 0.0045
Whole device | | | | | Σ = 39.2313

The overall failure rate of the device is the total from Table 10.1:

λ = 39.2313 · 10^-6 1/h.

Then the MTBF, according to expression (10.2), is equal to

T 0 = 1 / λ = 1 / (39.2313 · 10^-6) ≈ 25490 h.

To determine the probability of no-failure operation for a certain period of time, we will build a dependence graph:

Table 10.2 - Calculation of the probability of failure-free operation

t (hour)
P (t) 0,97 0,9 0,8 0,55 0,74 0,65 0,52 0,4 0,34

The graph of the dependence of the probability of no-failure operation on the operating time is shown in Figure 10.1.

Figure 10.1 - Probability of failure-free operation versus operating time

For a device, the probability of failure-free operation is usually set between 0.82 and 0.95. From the graph in Figure 10.1, for the developed device the given probability of failure-free operation P(t) = 0.82 corresponds to an operating time of t = 5000 hours.
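The figures of this example can be cross-checked with a short Python sketch. It takes the last column of Table 10.1 as input and assumes the exponential law P(t) = exp(-λt); the variable and function names are hypothetical.

```python
import math

# Last column of Table 10.1: n_i * lambda_i, in units of 1e-6 1/h
contributions = [38.4897, 0.0960, 0.0900, 0.0230, 0.0120, 0.0312, 0.4050,
                 0.0675, 0.0030, 0.0086, 0.0004, 0.0004, 0.0045]

total_rate = sum(contributions) * 1e-6  # device failure rate, 1/h
mtbf = 1.0 / total_rate                 # T0 = 1 / lambda, hours


def reliability(hours: float) -> float:
    """Probability of failure-free operation, assuming P(t) = exp(-lambda * t)."""
    return math.exp(-total_rate * hours)


print(f"lambda    = {total_rate * 1e6:.4f}e-6 1/h")
print(f"MTBF      = {mtbf:.0f} h")
print(f"P(5000 h) = {reliability(5000):.3f}")
```

Under these assumptions the sketch gives an MTBF of roughly 25 490 h and P(5000 h) of about 0.82, in line with the value read from the graph.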

The calculation is performed for the case when the failure of any element leads to the failure of the entire system. Such a connection of elements is called a logically series, or basic, connection. Reliability can be increased by redundancy.

For example, suppose the element technology ensures an average failure rate of elementary parts λ i = 1 · 10^-5 1/h. When N = 1 · 10^4 elementary parts are used in a device, the total failure rate is λ o = N · λ i = 10^-1 1/h. Then the mean time between failures of the device is T o = 1 / λ o = 10 h. If the same device is instead built from only 4 components with the same failure rate λ i, the total failure rate drops to 4 · λ i, so the mean time between failures increases by a factor of N / 4 = 2500 and reaches 25 000 hours, that is about 34 months or roughly 3 years.

The formulas make it possible to calculate the reliability of a device if the initial data are known - the composition of the device, the mode and conditions of its operation, the failure rate of its elements.

A distinction is made between probabilistic (mathematical) and statistical indicators of reliability. The mathematical indicators of reliability are derived from the theoretical distribution functions of the probability of failures. Statistical indicators of reliability are determined empirically when testing objects, on the basis of statistical data on equipment operation.

Reliability is a function of many factors, most of which are random. Hence, it is clear that a large number of criteria are needed to assess the reliability of an object.

Reliability criterion is a feature by which the reliability of an object is assessed.

The criteria and characteristics of reliability are probabilistic in nature, since the factors affecting the object are random in nature and require a statistical assessment.

The quantitative characteristics of reliability can be:
the probability of failure-free operation;
the mean time of failure-free operation;
the failure frequency;
the failure rate;
various safety factors.

1. Probability of uptime

Serves as one of the main indicators when calculating reliability.
The probability of failure-free operation of an object is the probability that it will maintain its parameters within specified limits for a certain period of time under certain operating conditions.

In what follows, we assume that the object operates continuously, that the duration of operation is expressed in units of time t, and that operation started at the moment t = 0.
We denote by P(t) the probability of failure-free operation of the object over the time interval from 0 to t. This probability, considered as a function of the upper bound of the time interval, is also called the reliability function.
Probabilistic estimate: P(t) = 1 - Q(t), where Q(t) is the probability of failure.

It is obvious from the graph that:
1. P (t) is a non-increasing function of time;
2. 0 ≤ P (t) ≤ 1;
3. P (0) = 1; P (∞) = 0.

In practice, sometimes a more convenient characteristic is the probability of malfunctioning of the object or the probability of failure:
Q (t) = 1 - P (t).
Statistical characteristic of the probability of failures: Q * (t) = n (t) / N

2. Failure frequency

The failure frequency is the ratio of the number of objects that failed per unit time to their total number at the start of the test, provided that the failed objects are not repaired or replaced with new ones, i.e.

a*(t) = n(t) / (N · Δt),
where a*(t) is the failure frequency;
n(t) is the number of failed objects in the time interval from t - Δt/2 to t + Δt/2;
Δt is the time interval;
N is the number of objects participating in the test.

The failure frequency is the probability density of the operating time of the product before failure. Probabilistic definition of the failure frequency: a(t) = -P'(t) or a(t) = Q'(t).

Thus, there is an unambiguous relationship between the failure frequency, the probability of failure-free operation and the probability of failure for any law of failure time distribution: Q(t) = ∫ a(τ) dτ (from 0 to t).

Failure is interpreted in the theory of reliability as a random event. The theory is based on the statistical interpretation of probability. Elements and the systems formed from them are considered as mass objects belonging to one general population and operating in statistically homogeneous conditions. When we speak of an object, we in essence mean an object taken at random from the general population, a representative sample from this population, and often the entire general population.

For mass objects, a statistical estimate of the probability of failure-free operation P(t) can be obtained by processing the results of reliability tests of sufficiently large samples. The way in which the estimate is calculated depends on the test plan.

Let a sample of N objects be tested without replacements or restorations until the last object fails. Let us denote the times to failure of the individual objects by t 1, ..., t N. Then the statistical estimate is:

P*(t) = 1 - (1/N) ∑ η(t - t k), with the sum taken over k = 1, ..., N,

where η is the Heaviside unit step function.

For the probability of failure-free operation on a certain interval, it is convenient to use the estimate P*(t) = [N - n(t)] / N,
where n(t) is the number of objects that have failed by time t.
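The statistical estimate of the probability of failure-free operation can be sketched in Python as follows. The function name and the failure times are hypothetical, and an object whose failure time equals t is counted as failed, which is an assumed convention for the step function at zero.

```python
def empirical_reliability(failure_times, t):
    """P*(t) = 1 - (1/N) * (number of objects with failure time t_k <= t)."""
    n_total = len(failure_times)
    if n_total == 0:
        raise ValueError("need at least one observed failure time")
    failed = sum(1 for t_k in failure_times if t_k <= t)
    return 1.0 - failed / n_total


times = [120.0, 340.0, 560.0, 910.0]  # hypothetical times to failure, hours
print(empirical_reliability(times, 400.0))  # 0.5
```

Counting the failed objects directly is equivalent to summing the unit step function over all failure times.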

The failure frequency determined under the condition that failed products are replaced with serviceable ones is sometimes called the average failure frequency and is denoted by ω(t).

3. Failure rate

The failure rate λ(t) is the ratio of the number of objects that failed per unit time to the average number of objects operating in the given period of time, provided that the failed objects are not restored and are not replaced with serviceable ones: λ(t) = n(t) / (N av · Δt),
where N av = (N i + N i+1) / 2 is the average number of objects that worked properly in the time interval Δt;
N i is the number of objects that worked properly at the beginning of the interval Δt;
N i+1 is the number of objects that worked properly at the end of the interval Δt.
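A minimal sketch of this interval estimate follows. It assumes that every object that stops working during the interval counts as a failure, so that n(t) equals the difference between the two counts; the function name and the sample numbers are hypothetical.

```python
def interval_failure_rate(working_at_start: int, working_at_end: int,
                          interval_hours: float) -> float:
    """lambda = n / (N_av * dt), where N_av is the mean of the two counts."""
    failed = working_at_start - working_at_end
    average_working = (working_at_start + working_at_end) / 2
    return failed / (average_working * interval_hours)


# Assumed example: 1000 objects at the start, 990 still working after 100 h
print(interval_failure_rate(1000, 990, 100.0))
```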

Life tests and observations on large samples of objects show that in most cases the failure rate changes non-monotonically over time.

From the curve of the failure rate versus time, it can be seen that the entire period of operation of the object can be conditionally divided into 3 periods.
Period I: running-in.

Break-in failures are, as a rule, the result of defective elements in the object whose reliability is significantly lower than the required level. As the number of elements in a product grows, even the most stringent control cannot completely exclude the possibility that elements with hidden defects enter the assembly. In addition, errors during assembly and installation, as well as insufficient familiarity of the service personnel with the object, can lead to failures during this period.

Such failures are random in their physical nature and differ from the sudden failures of the normal period of operation in that they can occur not only at increased but also at insignificant loads ("burning out" of defective elements).
The decrease in the failure rate of the object as a whole, with a constant value of this parameter for each element separately, is explained precisely by the "burning out" of the weak links and their replacement with more reliable ones. The steeper the curve in this region, the better: fewer defective elements remain in the product after a short time.

To improve the reliability of the object, taking into account the possibility of break-in failures, it is necessary to:
reject elements more stringently;
test the object in modes close to operational ones and use in assembly only elements that have passed the tests;
improve the quality of assembly and installation.

The average running-in time is determined during tests. For especially important cases, it is necessary to increase the running-in period several times compared to the average.

Period II: normal operation
This period is characterized by the fact that break-in failures have already ended and wear-related failures have not yet begun. Only sudden failures of normal elements occur in this period, and the MTBF of such elements is very high.

The failure rate stays constant at this stage because a failed element is replaced with an identical one, with the same probability of failure, and not with a better one, as happened at the running-in stage.

Rejection and preliminary running-in of the elements that replace the failed ones is even more important at this stage.
The designer has the greatest scope for solving this problem. Often, a change in the design or a derating of the operating modes of only one or two elements sharply increases the reliability of the entire object. The second way is to improve the quality of production and even the cleanliness of production and operation.

Period III: wear
The period of normal operation ends when wear failures begin to occur. The third period in the life of the product begins, the period of wear.

The likelihood of failures due to wear increases as the end of the service life approaches.

From a probabilistic point of view, the failure of the system in a given time interval Δt = t 2 - t 1 is characterized by the probability of failure:

∫ a(t) dt (from t 1 to t 2) = Q(t 2) - Q(t 1)

The failure rate is the conditional probability, per unit time, that a failure will occur in the time interval Δt, provided that it has not occurred before:
λ(t) = [Q(t + Δt) - Q(t)] / [Δt · P(t)].
In the limit Δt → 0:
λ(t) = lim [Q(t + Δt) - Q(t)] / [Δt · P(t)] = [dQ(t)/dt] / P(t) = Q'(t) / P(t) = -P'(t) / P(t).
Since a(t) = -P'(t), it follows that λ(t) = a(t) / P(t).
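One consequence of the last relation is worth spelling out. The following is the standard textbook step rather than something taken from the source: integrating the relation between λ(t) and P(t) with P(0) = 1 gives the general reliability function, and a constant failure rate reduces it to the exponential law.

```latex
\lambda(t) = -\frac{P'(t)}{P(t)} = -\frac{d}{dt}\ln P(t)
\quad\Longrightarrow\quad
P(t) = \exp\!\left(-\int_0^t \lambda(\tau)\,d\tau\right), \qquad P(0) = 1 .

% Special case: constant failure rate (period of normal operation)
\lambda(t) = \lambda = \mathrm{const}: \qquad
P(t) = e^{-\lambda t}, \qquad
a(t) = -P'(t) = \lambda e^{-\lambda t}, \qquad
\frac{a(t)}{P(t)} = \lambda .
```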

These expressions establish the relationship between the probability of failure-free operation, the failure frequency and the failure rate. If a(t) is a non-increasing function, then the following relation is true:
ω (t) ≥ λ (t) ≥ a (t).

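4. Mean time to failure

The mean time to failure is the mathematical expectation of the time of failure-free operation.

Probabilistic definition: the mean time to failure is equal to the area under the curve of the probability of failure-free operation, T = ∫ P(t) dt (from 0 to ∞).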

Statistical definition: T* = ∑θ i / N 0,
where θ i is the operating time of the i-th object to failure;
N 0 is the initial number of objects.

Obviously, the parameter T* cannot fully and satisfactorily characterize the reliability of long-life systems, since it characterizes reliability only until the first failure. Therefore, the reliability of long-life systems is characterized by the average time between two adjacent failures, or MTBF, t av:
t av = ∑θ i / n = 1 / ω(t),
where n is the number of failures during time t;
θ i is the operating time of the object between the (i-1)-th and the i-th failures.

MTBF is the average value of the time between adjacent failures, provided the failed element is restored.

When considering the laws of distribution of failures, it was found that the failure rates of elements can be either constant or change depending on the time of operation. For long-term systems, which include all transportation systems, preventive maintenance is provided, which practically eliminates the impact of wear failures, so only sudden failures occur.

This greatly simplifies the reliability calculation. However, complex systems are made up of many elements connected in different ways. When the system is in operation, some of its elements work continuously, others only at certain intervals, and still others perform only short turn-on or connection operations. Consequently, during a given period of time, only some of the elements have the same operating time as the operating time of the system, while others work for a shorter time.

In this case, to calculate the operating time of a given system, only the time during which the element is turned on is considered; such an approach is possible if it is assumed that during the periods when the elements are not included in the operation of the system, their failure rate is equal to zero.

From the point of view of reliability, the most common scheme is the series connection of elements. In this case, the calculation uses the reliability product rule:

R(t) = R(t 1) · R(t 2) · ... · R(t n),

where R(t i) is the reliability of the i-th element, which is switched on for t i hours out of the total system operating time of t hours.

For calculations, the so-called duty factor (employment rate) is used, equal to t i / t, that is, the ratio of the operating time of the element to the operating time of the system. The practical meaning of this coefficient is that, for an element with a known failure rate λ i, the failure rate in the system, taking into account the operating time, will be equal to λ i · t i / t.

The same approach can be used in relation to individual nodes of the system.
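The duty-factor correction and the product rule for a series connection can be sketched together in Python. The sketch assumes that each element follows the exponential law and has a zero failure rate while switched off, as stated above; the function names and sample values are hypothetical.

```python
import math


def effective_failure_rate(element_rate: float, element_hours: float,
                           system_hours: float) -> float:
    """Scale an element's failure rate by its duty factor t_i / t."""
    duty_factor = element_hours / system_hours
    return duty_factor * element_rate


def series_reliability(elements, system_hours: float) -> float:
    """Reliability of a series system over system_hours.

    elements: iterable of (failure rate in 1/h, operating hours t_i) pairs.
    The product of exp(-lambda_i * t_i) equals exp(-sum of effective rates * t).
    """
    total = sum(effective_failure_rate(rate, hours, system_hours)
                for rate, hours in elements)
    return math.exp(-total * system_hours)


# Assumed example: one element runs the full 100 h, another only 50 h
parts = [(1e-5, 100.0), (2e-5, 50.0)]
print(series_reliability(parts, 100.0))
```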

Another factor to consider when analyzing system reliability is the level of workload with which elements operate in the system, as it largely determines the magnitude of the expected failure rate.

The failure rate of elements changes significantly even with small changes in the workload acting on them.

In this case, the main difficulty in the calculation is caused by a variety of factors that determine both the concept of element strength and the concept of load.

The strength of an element combines its resistance to mechanical stress, vibration, pressure, acceleration, etc. The category of strength also includes resistance to thermal stress, electrical strength, moisture resistance, corrosion resistance and a number of other properties. Therefore, strength cannot be expressed in some numerical value and there are no strength units that take into account all these factors. The manifestations of the load are also manifold. Therefore, to assess the strength and load, statistical methods are used, with the help of which the observed effect of element failure in time is determined under the action of a number of loads or under the action of a predominant load.

The elements are designed to withstand the rated loads. When operating the elements under conditions of rated loads, a certain regularity of the intensity of their sudden failures is observed. This rate is called the nominal sudden failure rate of the elements, and it is the initial value for determining the actual rate of sudden failures of the real element (taking into account the operating time and workload).

For a real element or system, three main environmental influences are currently considered: mechanical, thermal and workloads.

The effect of mechanical stresses is taken into account by a coefficient whose value is determined by where the equipment is installed. It can be taken equal to:

for laboratories and comfortable premises - 1;

for stationary ground installations - 10;

for railway rolling stock - 30.

The nominal sudden failure rate selected from Table 3 should be multiplied by this coefficient, depending on where the device is installed in operation.

Curves in Fig. 7 illustrate the general nature of the change in the intensity of sudden failures of electrical and electronic components depending on the heating temperature and the magnitude of the workload.

The intensity of sudden failures with an increase in the workload, as can be seen from the curves above, increases according to the logarithmic law. These curves also show how you can reduce the rate of sudden failures of elements even to a value below the nominal value. A significant reduction in the rate of sudden failures is achieved if the elements are operated at loads below nominal values.


Fig. 16

Fig. 7 can be used for approximate (educational) reliability calculations for any electrical and electronic elements. Here the nominal mode corresponds to a temperature of 80 °C and 100% of the working load.

If the calculated parameters of an element differ from the nominal values, the curves in Fig. 7 can be used to determine the factor by which the failure rate of the element in question is multiplied for the selected parameters.

High reliability can be incorporated in the design of elements and systems. To do this, it is necessary to strive to reduce the temperature of the elements during operation and to use elements with increased nominal parameters, which is tantamount to a decrease in working loads.

The increase in the cost of manufacturing a product in any case pays off by reducing operating costs.


Failure rates of electrical circuit elements as a function of load can also be determined from empirical formulas. In particular, as a function of operating voltage and temperature:

$$\lambda_2 = \lambda_1 \left(\frac{U_2}{U_1}\right)^{p} K^{\,t_2 - t_1},$$

where $\lambda_1$ is the tabulated failure rate at rated voltage $U_1$ and temperature $t_1$;

$\lambda_2$ is the failure rate at operating voltage $U_2$ and temperature $t_2$.

It is assumed that the mechanical stress remains at the same level. Depending on the kind and type of element, the value of $p$ varies from 4 to 10, and the value of $K$ lies within 1.02 to 1.15.
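The effect of derating can be illustrated with a short sketch. It assumes the empirical dependence has the form λ2 = λ1 (U2/U1)^p K^(t2 - t1), with p and K in the ranges quoted above. The function name and all numeric inputs are hypothetical.

```python
def adjusted_failure_rate(rate_nominal, u_nominal, u_operating,
                          t_nominal, t_operating, p, k):
    """Assumed empirical form: rate2 = rate1 * (U2/U1)**p * K**(t2 - t1)."""
    voltage_term = (u_operating / u_nominal) ** p
    temperature_term = k ** (t_operating - t_nominal)
    return rate_nominal * voltage_term * temperature_term

# Hypothetical element run at 80% of rated voltage and 20 degrees cooler.
print(adjusted_failure_rate(1e-6, 100.0, 80.0, 80.0, 60.0, p=5, k=1.05))
```

Because the voltage ratio enters with an exponent between 4 and 10, even a modest reduction in operating voltage lowers the estimated failure rate considerably.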

When determining the real failure rate of elements, it is necessary to understand well the expected load levels at which the elements will operate and to calculate the electrical and thermal parameters with transient modes taken into account. Correctly identifying the loads acting on individual elements significantly increases the accuracy of the reliability calculation.

When calculating reliability with wear-out failures taken into account, the operating conditions must also be considered. The durability values $M$ given in Table 3 likewise refer to nominal load and laboratory conditions. All elements operating under other conditions have a durability that differs from the nominal by a factor $K$. The value of $K$ can be taken equal to:

for laboratories - 1.0;

for ground installations - 0.3;

for railway rolling stock - 0.17.

Small variations in the coefficient $K$ are possible for equipment for various purposes.

To determine the expected durability $M$, the average (nominal) durability taken from the table must be multiplied by the factor $K$.
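The two installation corrections, the multiplier for the sudden failure rate quoted earlier and the durability factor above, can be applied together as in the sketch below. The coefficient values are those given in the text. Treating "ground installations" and "stationary ground installations" as the same category is an assumption, and the nominal inputs are hypothetical.

```python
# Installation coefficients quoted in the text.
RATE_FACTOR = {"laboratory": 1, "ground": 10, "railway_rolling_stock": 30}
DURABILITY_FACTOR = {"laboratory": 1.0, "ground": 0.3, "railway_rolling_stock": 0.17}

def in_service_values(nominal_rate, nominal_durability, location):
    """Return (sudden failure rate, durability) corrected for the installation site."""
    rate = nominal_rate * RATE_FACTOR[location]
    durability = nominal_durability * DURABILITY_FACTOR[location]
    return rate, durability

# Hypothetical element: 1e-7 failures per hour and 10000 h nominal durability.
print(in_service_values(1e-7, 10000.0, "railway_rolling_stock"))
```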

In the absence of materials necessary to determine the failure rate depending on the load levels, the coefficient method for calculating the failure rate can be used.

The essence of the coefficient method is that the reliability criteria of equipment are calculated using coefficients that relate the failure rates of elements of various types to the failure rate of an element whose reliability characteristics are reliably known.

It is assumed that the exponential law of reliability holds and that the failure rates of elements of all types vary with the operating conditions to the same extent. The latter assumption means that under different operating conditions the ratio

$$\frac{\lambda_i}{\lambda_0} = K_i = \text{const}$$

holds, where $\lambda_0$ is the failure rate of an element whose quantitative characteristics are known;

$K_i$ is the reliability factor of the $i$-th element. The element with failure rate $\lambda_0$ is called the main element of the system calculation. When the coefficients $K_i$ are calculated, the main element of the calculation is taken to be a wire-wound resistor. In this case, calculating the reliability of the system does not require knowing the failure rates of elements of all types. It is enough to know the reliability factors $K_i$, the number of elements in the circuit and the failure rate $\lambda_0$ of the main element of the calculation. Since $K_i$ has a scatter of values, the reliability is checked for both $K_{\min}$ and $K_{\max}$. The values of $K_i$, determined from an analysis of failure-rate data for equipment for various purposes, are given in Table 5.

Table 5

The failure rate of the main element of the calculation (in this case, a resistor) should be determined as the weighted average of the failure rates of the resistors used in the designed system, i.e.

$$\lambda_0 = \frac{\sum_{i=1}^{m} \lambda_{Ri} N_{Ri}}{\sum_{i=1}^{m} N_{Ri}},$$

where $\lambda_{Ri}$ and $N_{Ri}$ are the failure rate and the number of resistors of the $i$-th type and rating;

$m$ is the number of types and ratings of resistors.

It is desirable to plot the resulting dependence of system reliability on operating time both for $K_{\min}$ and for $K_{\max}$.
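The coefficient method can be sketched as follows. The sketch assumes a series structure, so that the system failure rate is λ0 multiplied by the sum of N_i K_i over all element types, and it assumes the exponential law. All counts, rates and K values are hypothetical; real ranges of K_i would come from Table 5.

```python
import math

def base_failure_rate(resistor_groups):
    """Weighted-average failure rate of the main element (resistor).

    resistor_groups: list of (failure_rate, count) per resistor type and rating.
    """
    total_count = sum(count for _, count in resistor_groups)
    return sum(rate * count for rate, count in resistor_groups) / total_count

def system_failure_rate(base_rate, element_groups):
    """lambda_system = lambda_0 * sum(N_i * K_i), assuming a series structure.

    element_groups: list of (count, reliability_factor) per element type.
    """
    return base_rate * sum(count * k for count, k in element_groups)

# Hypothetical data, not values from Table 5.
resistors = [(1e-7, 40), (3e-7, 10)]
lam0 = base_failure_rate(resistors)
groups_min = [(50, 1.0), (20, 0.5), (5, 10.0)]   # (N_i, lower bound of K_i)
groups_max = [(50, 1.0), (20, 1.5), (5, 30.0)]   # (N_i, upper bound of K_i)
for label, groups in (("K_min", groups_min), ("K_max", groups_max)):
    lam = system_failure_rate(lam0, groups)
    # Reliability at 1000 hours for each bound.
    print(label, lam, math.exp(-lam * 1000.0))
```

Running the calculation for both bounds gives an interval for the system reliability rather than a single figure, which reflects the scatter of the coefficients.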

Having information about the reliability of individual elements included in the system, it is possible to give a general assessment of the reliability of the system and determine the blocks and assemblies that require further refinement. For this, the system under study is divided into nodes according to a constructive or semantic criterion (a structural diagram is drawn up). Reliability is determined for each selected unit (units with lower reliability require revision and improvement in the first place).

When comparing the reliability of nodes, and even more so of different versions of systems, it should be remembered that the absolute value of reliability does not reflect the behavior of the system in operation or its efficiency. The same value of system reliability can be determined in one case by major elements, whose repair and replacement require considerable time and large material costs (for an electric locomotive, removal from train service), and in another case by small elements that the operating personnel replace without taking the machine out of service. Therefore, for a comparative analysis of the designed systems, it is recommended to compare the reliability of elements that are similar in their significance and in the consequences of their failures.

For approximate reliability calculations, data from operating experience with similar systems can be used, which to some extent accounts for the operating conditions. The calculation can then be carried out in two ways: by the average level of reliability of equipment of the same type, or by a conversion factor to real operating conditions.

The calculation by the average level of reliability is based on the assumption that the designed equipment and the operating sample are equivalent. This assumption is acceptable when the elements are the same, the systems are similar and the ratio of elements in the system is the same.

The essence of the method is that

$$\frac{T_d}{T_s} = \frac{N_s}{N_d},$$

where $N_s$ and $T_s$ are the number of elements and the MTBF of the sample equipment;

$N_d$ and $T_d$ are the same for the designed equipment. From this ratio, it is easy to determine the MTBF of the designed equipment:

$$T_d = T_s \frac{N_s}{N_d}.$$

The advantage of the method is its simplicity. Its disadvantage is that, as a rule, no sample of operating equipment suitable for comparison with the designed device is available.
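A minimal sketch of the average-level method follows. It assumes that MTBF is inversely proportional to the number of elements, which follows from the exponential law when both systems have the same average failure rate per element. The function name and the numbers are hypothetical.

```python
def mtbf_by_similarity(sample_mtbf, sample_elements, designed_elements):
    """T_designed = T_sample * N_sample / N_designed (assumed proportionality)."""
    if designed_elements <= 0:
        raise ValueError("designed_elements must be positive")
    return sample_mtbf * sample_elements / designed_elements

# Hypothetical sample: 2000 elements with an MTBF of 500 h; the design has 2500 elements.
print(mtbf_by_similarity(500.0, 2000, 2500))
```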

The calculation by the second method is based on determining a conversion factor that takes into account the operating conditions of similar equipment. To determine it, a similar system operated under the specified conditions is selected; other requirements need not be met. For the selected operating system, the reliability indicators are calculated using the data in Table 3, and the same indicators are determined separately from the operating data.

The conversion factor is defined as the ratio

$$K_e = \frac{T_{oe}}{T_{oz}},$$

where $T_{oe}$ is the MTBF according to operating data;

$T_{oz}$ is the MTBF obtained by calculation.

For the designed equipment, the reliability indicators are calculated using the same tabular data as for the operated system. The results obtained are then multiplied by $K_e$.

The coefficient $K_e$ accounts for the real operating conditions: preventive repairs and their quality, replacement of parts between repairs, the qualifications of the maintenance personnel, the condition of the depot equipment, and so on, which other calculation methods cannot foresee. The value of $K_e$ may be greater than one.
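The conversion-factor method can be sketched in a few lines. The function names and numbers are hypothetical; the only relations used are the ratio of operational to calculated MTBF for the reference system and the multiplication of the calculated result for the designed equipment by that ratio.

```python
def conversion_factor(mtbf_operational, mtbf_calculated):
    """Ratio of MTBF observed in operation to MTBF obtained by calculation."""
    return mtbf_operational / mtbf_calculated

def corrected_mtbf(calculated_mtbf_designed, factor):
    """Scale the calculated MTBF of the designed equipment to real conditions."""
    return calculated_mtbf_designed * factor

# Hypothetical reference system: 600 h observed against 400 h calculated.
k_e = conversion_factor(600.0, 400.0)
print(k_e, corrected_mtbf(300.0, k_e))
```

A factor above one, as in this made-up case, means the reference system performs better in service than the tabular data predict.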

Any of the considered calculation methods can be performed for a given reliability, that is, by the opposite method - from the reliability of the system and the MTBF to the choice of indicators of the constituent elements.

Failure rate is the conditional probability density of failure of a non-recoverable object, determined for the moment in time under consideration, given that no failure occurred before that moment.

Thus, statistically, the failure rate is equal to the number of failures that occurred per unit of time, divided by the number of objects that had not failed by the given moment.
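The statistical estimate can be sketched as follows. The sketch counts survivors at the start of each interval, which is one common convention and an assumption here; the test data are hypothetical.

```python
def empirical_failure_rate(failures_in_interval, survivors_at_start, interval_hours):
    """Estimate lambda(t) as failures / (survivors * interval length)."""
    if survivors_at_start <= 0 or interval_hours <= 0:
        raise ValueError("survivors and interval length must be positive")
    return failures_in_interval / (survivors_at_start * interval_hours)

# Hypothetical test of 1000 items; failures counted in consecutive 100-hour intervals.
failures = [30, 12, 10, 11, 25]
survivors = 1000
for n in failures:
    print(empirical_failure_rate(n, survivors, 100.0))
    survivors -= n  # items that failed are not replaced
```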

A typical change in failure rate over time is shown in Fig. 5.

The experience of operating complex systems shows that the change in the failure rate $\lambda(t)$ of most objects is described by a U-shaped curve.

Time can be conditionally divided into three characteristic sections: 1. The running-in period. 2. Period of normal operation. 3. The aging period of the object.

Fig. 5. Typical change in failure rate

The run-in period has an increased failure rate due to break-in failures, which are caused by defects in production, installation and commissioning. Sometimes the end of this period is associated with the warranty service of the object, during which the manufacturer eliminates failures. During normal operation, the failure rate remains practically constant; failures are random and appear suddenly, primarily because of random load changes, non-compliance with operating conditions, unfavorable external factors, and so on. It is this period that corresponds to the main operating time of the object.

The increase in failure rate belongs to the aging period of the object and is caused by a growing number of failures due to wear, aging and other causes associated with long-term operation. The probability that an element which has survived to time $t$ fails in some subsequent time interval depends on the values of $\lambda(u)$ only within that interval, so the failure rate is a local indicator of the reliability of an element over a given time interval.
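The local character of the failure rate can be illustrated numerically. The sketch relies on the standard relation that the probability of surviving an interval, given survival to its start, equals the exponential of minus the integral of the failure rate over that interval. The function name, the step count and the constant rate used in the example are assumptions.

```python
import math

def interval_survival(rate_fn, t_start, t_end, steps=1000):
    """P(no failure in [t_start, t_end] | alive at t_start) = exp(-integral of rate)."""
    h = (t_end - t_start) / steps
    # Trapezoidal rule for the integral of the failure rate over the interval.
    integral = 0.5 * (rate_fn(t_start) + rate_fn(t_end))
    integral += sum(rate_fn(t_start + i * h) for i in range(1, steps))
    integral *= h
    return math.exp(-integral)

# Hypothetical constant rate of 1e-4 per hour over the interval 500..1500 h.
print(interval_survival(lambda u: 1e-4, 500.0, 1500.0))
```

Only the values of the rate inside the interval enter the calculation; what the rate was before the start of the interval does not affect the result.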

Topic 1.3. Reliability of recoverable systems

Modern automation systems are complex recoverable systems. When some of their elements fail, such systems are repaired during operation and continue to work. The ability of a system to be restored during operation is "laid down" at the design stage and ensured during manufacture, and the repair and restoration operations are specified in the normative and technical documentation.

Carrying out repair and restoration measures is essentially another way of increasing the reliability of the system.

1.3.1. Reliability indicators of restored systems

From the quantitative point of view, such systems, in addition to the previously considered reliability indicators, are also characterized by complex reliability indicators.

A complex indicator of reliability is a reliability indicator that characterizes several properties that make up the reliability of an object.

The complex reliability indicators that are most widely used to characterize the reliability of recoverable systems are:

Availability ratio;

Operational readiness ratio;

Technical utilization rate.

Availability ratio is the probability that the object is in a working state at an arbitrary moment in time, excluding planned breaks during which the object is not intended to be used.

Thus, the availability ratio simultaneously characterizes two different properties of an object: reliability and maintainability.

The availability ratio is an important parameter, but it is not universal.

Operational readiness ratio is the probability that the object is in an operable state at an arbitrary moment in time, excluding planned breaks during which the object is not intended to be used, and that, starting from this moment, it works without failure for a given time interval.

The coefficient characterizes the reliability of objects that may be needed at an arbitrary moment in time, after which a certain period of failure-free operation is required. Until that moment, the equipment may be on standby or in use for other functions.

Technical utilization rate is the ratio of the mathematical expectation of the total time the object spends in a working state over a certain period of operation to the sum of the mathematical expectations of the total time spent in a working state, in downtime due to maintenance, and in repair over the same period.
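The three complex indicators can be estimated as in the sketch below. It uses the commonly quoted steady-state forms, which are assumptions here because the text gives only verbal definitions: availability as MTBF divided by the sum of MTBF and mean repair time, operational readiness as availability multiplied by the probability of failure-free operation over the required interval (taken as exponential), and technical utilization as the share of up time in the total of up time, maintenance and repair. All names and numbers are hypothetical.

```python
import math

def availability(mtbf, mean_repair_time):
    """Steady-state availability ratio: MTBF / (MTBF + mean repair time)."""
    return mtbf / (mtbf + mean_repair_time)

def operational_readiness(mtbf, mean_repair_time, mission_hours):
    """Availability times the probability of failure-free work for mission_hours.

    Assumes an exponential time to failure with mean equal to mtbf.
    """
    return availability(mtbf, mean_repair_time) * math.exp(-mission_hours / mtbf)

def technical_utilization(up_hours, maintenance_hours, repair_hours):
    """Share of the operating period spent in a working state."""
    return up_hours / (up_hours + maintenance_hours + repair_hours)

# Hypothetical object: MTBF 900 h, mean repair 100 h, 50 h mission.
print(availability(900.0, 100.0))
print(operational_readiness(900.0, 100.0, 50.0))
print(technical_utilization(8000.0, 400.0, 360.0))
```

The operational readiness value is never larger than the availability ratio, since it adds the requirement of failure-free operation after the moment of demand.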