Intelligent Distributed Fault and Performance Management for Communication Networks
December 31, 2002
This dissertation is devoted to the design of an intelligent, distributed fault and performance management system for communication networks. The architecture is based on a distributed agent paradigm, with belief networks as the framework for knowledge representation and evidence propagation. The dissertation consists of four major parts.
In the first part, we use mobile code technology to implement a distributed, extensible framework that supports adaptive, dynamic network monitoring and control. Our work focuses on three aspects: (1) the design of the standard infrastructure, or Virtual Machine, on which agents can be created, deployed, managed, and initiated to run; (2) the collection API through which our delegated agents collect data from network elements; and (3) the callback mechanism through which the functionality of the delegated agents, or even of the native software, can be extended. We propose three system designs based on these ideas.
In the second part, we propose a distributed framework for intelligent fault management. The managed network is divided into several domains, and each domain has an intelligent agent attached to it that is responsible for that domain's fault management tasks. Belief networks are embedded in each agent as probabilistic fault models, on which the evidence propagation and decision-making processes are based.
In the third part, we address the problem of parameter learning for belief networks with fixed structure. Based on the idea of Expectation-Maximization (EM), we derive a uniform learning algorithm under incomplete observations. We further study the rate of convergence by deriving the Jacobian matrices of our algorithm, and we provide a guideline for choosing the step size. Our simulation results show that the learned values are relatively close to the true values. The algorithm is suitable for both batch and on-line modes.
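To make the parameter-learning idea concrete, the following is a minimal sketch of textbook EM on a hypothetical two-node belief network, Fault -> Alarm, in which the fault variable is sometimes unobserved. It is not the uniform algorithm derived in the dissertation (in particular, it has no step-size parameter); the network, the function names, and the data layout are all assumptions made for illustration.

```python
import math

# Hypothetical two-node belief network: F (fault) -> A (alarm), both binary.
# Parameters: p_f = P(F=1), p_a[f] = P(A=1 | F=f).
# Each record is (f, a); f is None when the fault state was not observed.

def posterior_fault(p_f, p_a, a):
    """P(F=1 | A=a) by Bayes' rule."""
    like1 = p_a[1] if a == 1 else 1.0 - p_a[1]
    like0 = p_a[0] if a == 1 else 1.0 - p_a[0]
    num = p_f * like1
    return num / (num + (1.0 - p_f) * like0)

def em_step(p_f, p_a, data):
    """One EM iteration. Assumes both fault states keep a nonzero expected count."""
    # E-step: expected value of F per record (observed values are kept as-is).
    w = [float(f) if f is not None else posterior_fault(p_f, p_a, a)
         for f, a in data]
    # M-step: re-estimate parameters from expected counts.
    n1 = sum(w)
    n0 = len(data) - n1
    new_p_f = n1 / len(data)
    new_p_a = {
        1: sum(wi * a for wi, (_, a) in zip(w, data)) / n1,
        0: sum((1.0 - wi) * a for wi, (_, a) in zip(w, data)) / n0,
    }
    return new_p_f, new_p_a

def log_likelihood(p_f, p_a, data):
    """Log-likelihood of the observed part of the data."""
    def joint(fv, a):
        prior = p_f if fv == 1 else 1.0 - p_f
        like = p_a[fv] if a == 1 else 1.0 - p_a[fv]
        return prior * like
    total = 0.0
    for f, a in data:
        if f is None:
            total += math.log(joint(0, a) + joint(1, a))  # marginalize F
        else:
            total += math.log(joint(f, a))
    return total
```

Starting from any interior initial guess and calling `em_step` repeatedly never decreases `log_likelihood`, which is the standard EM guarantee. With fully observed data, a single step reduces to ordinary frequency counting.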
Finally, when using belief networks as the fault models, we identify two fundamental questions: When can I say that I have the right diagnosis and stop? If the right diagnosis has not yet been obtained, which test should I choose next? We tackle the first question with the notion of right diagnosis via intervention, and we address the second with a dynamic decision-theoretic strategy. Simulation shows that our strategy works well for diagnosis. The framework is general, scalable, flexible, and robust.
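The two questions can be illustrated with a generic sketch. It assumes a single-fault model with a posterior over candidate faults, uses a simple posterior threshold as a stand-in for the stopping rule, and picks the next test greedily by expected entropy reduction per unit cost. These are common heuristics chosen here for illustration, not the intervention-based criterion or the dynamic strategy developed in the dissertation; all names, the threshold, and the numbers are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a distribution given as {hypothesis: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update(prior, likelihood, outcome):
    """Bayes update; likelihood[h] = P(test is positive | fault h)."""
    post = {h: p * (likelihood[h] if outcome else 1.0 - likelihood[h])
            for h, p in prior.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

def expected_entropy(prior, likelihood):
    """Expected posterior entropy after running one test."""
    p_pos = sum(p * likelihood[h] for h, p in prior.items())
    total = 0.0
    for outcome, p_out in ((True, p_pos), (False, 1.0 - p_pos)):
        if p_out > 0:  # skip impossible outcomes
            total += p_out * entropy(update(prior, likelihood, outcome))
    return total

def next_action(prior, tests, costs, threshold=0.9):
    """Return ("stop", fault) or ("test", name) under the assumed heuristics."""
    best = max(prior, key=prior.get)
    if prior[best] >= threshold:  # assumed stand-in stopping rule
        return ("stop", best)
    h0 = entropy(prior)

    def gain_per_cost(name):
        return (h0 - expected_entropy(prior, tests[name])) / costs[name]

    return ("test", max(tests, key=gain_per_cost))
```

A caller would alternate `next_action` and `update`: run the suggested test, fold its outcome into the posterior, and repeat until the action is `"stop"`.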