Thursday, January 27, 2011

Black-Box Problem Diagnosis in Parallel File Systems

Abstract
We focus on automatically diagnosing different performance problems in parallel file systems by identifying, gathering and analyzing OS-level, black-box performance metrics on every node in the cluster. Our peer-comparison diagnosis approach compares the statistical attributes of these metrics across I/O servers, to identify the faulty node. We develop a root-cause analysis procedure that further analyzes the affected metrics to pinpoint the faulty resource (storage or network), and demonstrate that this approach works commonly across stripe-based parallel file systems. We demonstrate our approach for realistic storage and network problems injected into three different file-system benchmarks (dd, IOzone, and PostMark), in both PVFS and Lustre clusters.

Link to the paper:
http://www.usenix.org/events/fast10/tech/full_papers/kasick.pdf

Presented by Rishi Baldawa
Link to the slides:
http://www.cse.buffalo.edu/faculty/tkosar/cse726/slides/03-baldawa.pdf

Review #1 by Hiraksh Bhagat
In this paper, the authors have developed an algorithm for automatically diagnosing performance problems in parallel file systems by comparing metrics gathered at every node. It uses black-box performance metrics for peer comparison to do two things: (i) determine whether a fault exists in the system and (ii) analyze the metrics to pinpoint the faulty resource. The authors' main goals are application transparency, minimal false alarms, minimal instrumentation overhead, and coverage of specific problems. The paper states clearly what it does not aim to achieve: code-level debugging, handling pathological workloads, and diagnosis of non-peers. The paper demonstrates the authors' approach for realistic storage problems injected into different file-system benchmarks in PVFS and Lustre clusters.

The paper aptly describes why it uses black-box metrics in peer comparison. It makes various assumptions, such as that all peer servers have identical software configurations, are synchronized, and operate in a homogeneous environment. The problems involving storage and network resources are separated into two classes, viz. hog faults and busy or loss faults. Considering a small file system, the paper makes a variety of observations that rest on many assumptions, not all of which hold entirely. Based on these observations, the authors developed the diagnosis algorithm, which works in two phases. The first phase finds the faulty server by comparing probability density functions (PDFs) of various OS-level metrics across servers. It gives two approaches for this, viz. a histogram-based approach and a time-based approach. Threshold selection is implemented on training data using machine learning algorithms. Phase 2 observes peer divergence in storage and network resources by calculating throughput and latency...
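
To make the histogram-based peer comparison in phase 1 concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the distance measure (symmetric Kullback-Leibler divergence), the bin count, the smoothing constant, the majority rule, and the names `histogram`, `symmetric_kl`, and `flag_anomalous` are all choices made here for the example. The threshold is passed in as a plain number; the review says it is selected from training data.

```python
import math
from typing import Dict, List


def histogram(samples: List[float], bins: int, lo: float, hi: float) -> List[float]:
    """Normalized histogram (an empirical PDF) over a shared range [lo, hi]."""
    counts = [1e-6] * bins  # small smoothing so the log below never sees zero
    width = (hi - lo) / bins
    for x in samples:
        idx = min(bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def symmetric_kl(p: List[float], q: List[float]) -> float:
    """Symmetric KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def flag_anomalous(metric_by_server: Dict[str, List[float]],
                   threshold: float, bins: int = 10) -> List[str]:
    """Flag servers whose metric PDF diverges from more than half of their peers."""
    values = [x for v in metric_by_server.values() for x in v]
    lo, hi = min(values), max(values)
    if hi == lo:
        hi = lo + 1.0  # avoid a zero-width range when all samples are equal
    pdfs = {s: histogram(v, bins, lo, hi) for s, v in metric_by_server.items()}
    flagged = []
    for server, pdf in pdfs.items():
        peers = [t for t in pdfs if t != server]
        divergent = sum(1 for t in peers if symmetric_kl(pdf, pdfs[t]) > threshold)
        if divergent > len(peers) / 2:
            flagged.append(server)
    return flagged
```

Calling `flag_anomalous` once per metric (for example, a disk latency metric and a network throughput metric, both hypothetical inputs here) would show which server diverges and on which metric, which is the information phase 2 needs to separate storage faults from network faults.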



Link to the full review:
http://www.cse.buffalo.edu/faculty/tkosar/cse726/reviews/03-review1-bhagat.pdf

Review #2 by Deepak Agrawal
The paper discusses OS-level, black-box performance metrics gathered on every node in parallel file systems such as PVFS and Lustre to identify different performance problems, with the aim of finding the faulty node and then using root-cause analysis to find the faulty resource.

The goals of the black-box approach are to be application-transparent, so that applications do not require modification; to minimize false alarms; and to keep instrumentation overhead minimal, so that the analysis does not adversely impact performance. The paper makes certain assumptions: the I/O servers are synchronized, a majority of them exhibit fault-free behavior, and the clients and servers comprise homogeneous hardware and workloads.


The storage and network problems the paper focuses on are disk hogs, disk busy, network hogs, and packet loss (network busy). The paper lists certain empirical observations of the PVFS and Lustre file systems, concluding that the approach might apply to parallel file systems in general...

Link to the full review:
http://www.cse.buffalo.edu/faculty/tkosar/cse726/reviews/03-review2-agrawal.pdf

4 comments:

  1. When a packet-loss (network busy) fault occurs, why is the congestion window halved?

  2. The current approach of Black-Box Problem Diagnosis in Parallel File Systems involves manual work. Is there any tool that can automate this process? Or has any research been done in this regard?

  3. It is said that large write caches can initially mask performance problems under write-intensive workloads, and thus the problems might take a while to manifest. In contrast, performance problems in read-intensive workloads manifest rather quickly. Why is that so?

  4. The paper says it is assumed that server nodes are similar, with identical software configuration including data striping parameters, with an equal number of storage targets (of same size) per server, the same amount of memory, and the same class and speed-rating of network interfaces. Is this a realistic or common assumption for real-life environments? Is there any research that you know of that focuses on heterogeneous environments?
