What Is Fault Management?

Fault management is one of the most basic functions of network management. All users want a computer network that works reliably. When a component in the network fails, the network manager must find the fault quickly and eliminate it promptly. In practice, a fault can rarely be isolated quickly, because the causes of network failures are often complex, especially when the failure arises from the interaction of several interconnected networks. In such cases, the network should generally be repaired first and the cause of the failure analyzed afterward. Analyzing the cause helps prevent similar failures from recurring, which is important for the reliable operation of the network. [1]

The goal of fault management is to restore normal service operation as quickly as possible and to minimize the negative impact of component failures on the business, so that the service level objectives and service quality agreed in advance with business customers are met.
In practice, IT service level goals and service quality requirements should be derived from business-based strategies. Many service providers instead set service level goals based on their own resource allocation and delivery capabilities. As a result, the services do not meet business needs, and conflicts between the business and IT increase sharply. The value of a service therefore needs to be defined from the customer's perspective. These quality requirements can cover any element related to the service. Service level agreement (Service Level Agreement,
Fault management comprises fault detection and normalization, fault presentation, fault isolation, fault repair, and fault storage and query.
(1) Fault detection and normalization: find faults through fault detection, normalize the fault information, and save it to the fault
At present, many organizations are actively researching fault management architectures and developing standards to guide the design and development of fault management systems. For example, the Peking University Convention on Oceans (
Which tool to develop or choose depends on the needs of network management and the specific network environment.
1. Simple tools
The simplest tools can indicate that a failure exists but not what caused it. For example, a simple tool can send ICMP Echo messages to every host and device on a computer network to test IP network-layer connectivity. If the network does not use TCP/IP, a program can repeatedly attempt to connect to each host and device to perform the same test. The tool can flag every failed connection and so provide a basis for further investigation.
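The connection-based variant of this test can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: it assumes the targets accept TCP connections on a known port, and the function names `is_reachable` and `sweep` are hypothetical. Sending real ICMP Echo messages usually requires raw sockets and elevated privileges, so a TCP connection attempt is used here as a stand-in.

```python
import socket


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def sweep(targets, timeout=1.0):
    """Return the (host, port) targets that failed the test, in input order."""
    return [(host, port) for host, port in targets
            if not is_reachable(host, port, timeout)]
```

Calling `sweep` with a list of `(host, port)` pairs returns only the entries that could not be reached. As the text notes, such a list says that something failed, not why.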
2. Complex tools
If the hosts and devices on your network are sophisticated enough to report network events, you should develop a more capable tool that takes advantage of this ability. Such a tool notifies you promptly when it detects a failure, either by recording network events as they arrive or by querying devices. Critical network events can also help isolate the cause of a failure.
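One way to picture such a tool is as an event log that notifies an operator when a sufficiently severe event is recorded and that can be queried later. The sketch below is purely illustrative: the `Event` and `EventMonitor` names, the three severity levels, and the in-memory list are all assumptions, and a real tool would receive events from the devices themselves.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical severity scale; higher numbers are more severe.
SEVERITIES = {"info": 0, "warning": 1, "critical": 2}


@dataclass
class Event:
    device: str
    severity: str
    message: str


class EventMonitor:
    """Records reported events and notifies on severe ones."""

    def __init__(self, notify: Callable[[Event], None], notify_at: str = "critical"):
        self._events: List[Event] = []
        self._notify = notify
        self._threshold = SEVERITIES[notify_at]

    def record(self, device: str, severity: str, message: str) -> Event:
        """Store an event; call the notifier if it meets the threshold."""
        event = Event(device, severity, message)
        self._events.append(event)
        if SEVERITIES[severity] >= self._threshold:
            self._notify(event)
        return event

    def query(self, device: Optional[str] = None, min_severity: str = "info") -> List[Event]:
        """Return stored events, optionally filtered by device and severity."""
        floor = SEVERITIES[min_severity]
        return [e for e in self._events
                if (device is None or e.device == device)
                and SEVERITIES[e.severity] >= floor]
```

Here the notifier is just a callable, for example `alerts.append` on a list. Querying by device after an alert shows the events that preceded the failure, which is how recorded events can help narrow down its cause.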
3. Advanced tools
Advanced management tools use network management protocols to examine each device along the path, up to the last device before host B (we assume that both machines can communicate with every device on the path but cannot communicate with each other). Suppose the tool finds no failures on these devices, yet users still cannot send email over the network. At this point, the tool performs a new series of tests between the two machines. Although time-consuming, this approach can detect many possible failures. [5]
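The two-stage procedure described above, checking each device on the path first and then falling back to tests between the two machines, can be sketched as follows. This is a hedged illustration only: `is_healthy` is a hypothetical callback standing in for a query made through a network management protocol, and `end_to_end_tests` is a hypothetical mapping of test names to callables. No real protocol is invoked.

```python
from typing import Callable, Dict, Optional, Sequence


def first_faulty_device(path: Sequence[str],
                        is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Walk the path in order and return the first device that fails its check."""
    for device in path:
        if not is_healthy(device):
            return device
    return None


def diagnose(path: Sequence[str],
             is_healthy: Callable[[str], bool],
             end_to_end_tests: Dict[str, Callable[[], bool]]) -> dict:
    """Check devices on the path; if all pass, run the end-to-end tests."""
    faulty = first_faulty_device(path, is_healthy)
    if faulty is not None:
        return {"stage": "path", "faulty_device": faulty, "failed_tests": []}
    # No device fault found: run the slower tests between the two hosts.
    failed = [name for name, test in end_to_end_tests.items() if not test()]
    return {"stage": "end_to_end", "faulty_device": None, "failed_tests": failed}
```

The end-to-end tests run only when every device check passes, which matches the ordering in the text: the cheaper per-device checks come first, and the time-consuming tests between the two machines are a fallback.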
