Irina Rish, Mark Brodie, and Sheng Ma
In this paper, we address the problem of efficient diagnosis in real-time systems capable of on-line information gathering, such as sending "probes" (i.e., test transactions, such as "traceroute" or "ping") in order to identify network faults and evaluate performance of distributed computer systems. We use a Bayesian network to model probabilistic relations between the problems (faults, performance degradation) and symptoms (probe outcomes). Due to intractability of exact probabilistic inference in large systems, we investigated approximation techniques, such as a local-inference scheme called mini-buckets(Dechter and Rish 1997). Our empirical study demonstrates advantages of local approximations for large diagnostic problems: the approximation is very efficient and "degrades gracefully" with noise; also, the approximation error gets smaller on networks with higher confidence (probability) of the exact diagnosis. Since the accuracy of diagnosis depends on how much information the probes can provide about the system states, the second part of our work is focused on the probe selection task. Small probe sets are desirable in order to minimize the costs imposed by probing, such as additional network load and data management requirements. Our results show that, although finding the optimal collection of probes is expensive for large networks, efficient approximation algorithms can be used to find a nearly-optimal set.