Tuesday, January 25, 2011

PVFS: A Parallel File System for Linux Clusters

Abstract
As Linux clusters have matured as platforms for low-cost, high-performance parallel computing, software packages to provide many key services have emerged, especially in areas such as message passing and networking. One area devoid of support, however, has been parallel file systems, which are critical for high-performance I/O on such clusters. We have developed a parallel file system for Linux clusters, called the Parallel Virtual File System (PVFS). PVFS is intended both as a high-performance parallel file system that anyone can download and use and as a tool for pursuing further research in parallel I/O and parallel file systems for Linux clusters.

In this paper, we describe the design and implementation of PVFS and present performance results on the Chiba City cluster at Argonne. We provide performance results for a workload of concurrent reads and writes for various numbers of compute nodes, I/O nodes, and I/O request sizes. We also present performance results for MPI-IO on PVFS, both for a concurrent read/write workload and for the BTIO benchmark. We compare the I/O performance when using a Myrinet network versus a fast-ethernet network for I/O-related communication in PVFS. We obtained read and write bandwidths as high as 700 Mbytes/sec with Myrinet and 225 Mbytes/sec with fast ethernet.


Link to the full paper: 
http://www.cct.lsu.edu/~kosar/csc7700-fall06/papers/Carns00.pdf


Presented by Shashank Kota Sathish
Link to the slides:
Review #1:
In PVFS, Carns et al. develop a virtual parallel file system layered on top of local file systems (hence the name PVFS) for Linux clusters, providing dynamic distribution of I/O and metadata. The authors' main goals were high-bandwidth concurrent reads and writes from multiple nodes to a single file, support for multiple APIs, support for common UNIX commands on the distributed file system, use of those APIs without constant recompilation, robustness, scalability, and ease of installation and use. The paper provides a relatively cheap distributed file system that can be applied to any Linux-based cluster without significant hardware requirements and can be used for applications such as scientific research, media streaming, and complex computations.

PVFS allows users to store and retrieve data using common UNIX commands (such as ls, cp, and rm), where data is striped in round-robin fashion and stored on multiple independent machines with different network connections. Data is stored in this distributed fashion to reduce single-file bottlenecks and increase the aggregate bandwidth of the system...
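The round-robin striping described above can be sketched as a small mapping function. This is our illustration, not PVFS source code: the metadata field names (base starting I/O node, pcount I/O nodes, ssize stripe size) follow the paper's description, but the function itself and its assumption that each node stores its every-pcount-th stripe contiguously are ours.

```python
def locate(offset, ssize, pcount, base=0):
    """Map a byte offset in a striped file to (I/O node index, offset on that node).

    ssize  -- stripe size in bytes
    pcount -- number of I/O nodes holding the file
    base   -- index of the first I/O node (from the file's metadata)
    """
    stripe = offset // ssize                      # which stripe the byte falls in
    node = base + (stripe % pcount)               # stripes assigned round-robin
    # assume each node stores its stripes back to back in a local file
    local = (stripe // pcount) * ssize + (offset % ssize)
    return node, local
```

For example, with 64 KB stripes across 4 I/O nodes, byte 0 lands at the start of node 0, byte 65536 at the start of node 1, and byte 262144 wraps back to node 0 at local offset 65536.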

Link to the full review:

http://www.cse.buffalo.edu/faculty/tkosar/cse726/reviews/02-review1-baldawa.pdf


Review #2 by Sughosh Kadkol
PVFS is proposed as an open-source solution, available for download and use in research on parallel file systems and parallel I/O. The paper discusses the motivation, techniques, and experimental results in developing an alternative to the parallel file systems dominated by commercial parallel machines. PVFS is designed to provide high-bandwidth concurrent I/O, support multiple API sets along with basic UNIX interoperability, and be robust and scalable, with relative ease of installation and use. The tool described should provide a simple and cost-effective solution for data-intensive research projects.

Platform-specific commercial clusters, and the lack of suitability of distributed file systems for large parallel scientific applications, presented the need for a robust and scalable parallel file system. To allow simple operation, PVFS wraps the standard UNIX I/O calls, with logic that routes each request either to the kernel or to PVFS as appropriate. The MPI-IO API is also supported on top of PVFS, handling parallel I/O operations against PVFS's data storage scheme...
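The routing logic of the wrapper described above can be sketched in user space. This is a hedged analogue, not the actual kernel-module or library code: the mount-point path and the dispatch function are hypothetical, and it only illustrates the decision the wrapper makes, not how the trapped call is carried out.

```python
PVFS_MOUNT = "/pvfs"  # hypothetical PVFS mount point for this sketch

def dispatch(path):
    """Decide which I/O path a trapped call on `path` would take."""
    if path == PVFS_MOUNT or path.startswith(PVFS_MOUNT + "/"):
        return "pvfs"    # hand the request to the PVFS client side
    return "kernel"      # fall through to the ordinary UNIX system call
```

A call on /pvfs/data/file would thus be served by PVFS, while /home/user/file would take the normal kernel path, which is how ordinary UNIX commands keep working alongside PVFS files.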

7 comments:

  1. The paper discusses the technique of trapping I/O calls by using dynamic linking and the concept of system-call wrappers. This idea is new to me and I would request more clarity.

  2. Can you comment on the fault tolerance of PVFS?

    It seems to me that pcount I/O daemons store different portions of the file. So if a node goes down, that data cannot be recovered, since there was no duplication to begin with.

  3. How does PVFS handle failed nodes in terms of recovery? Also since PVFS takes advantage of file striping, how does it prevent the corruption of full data sequences, assuming segments of data were stored on failed nodes?

  4. When a write operation takes place, how does the PVFS manager communicate the file size, given that the manager only provides the base I/O node number, the number of I/O nodes, and the stripe size?

  5. Does PVFS handle fragmentation? If so, are there different techniques to handle local fragmentation and distributed fragmentation? If not, what is the reason for the lack of support?

  6. If the last stripe size of the file is known, can the size of the file be computed from (pcount * ssize)? But there is no field that stores the last stripe size. Please clarify.

  7. I'm aware it's too late to ask so many questions, but I'm hoping some of them are answered in the post-presentation discussion.

    Is it feasible to check the performance of the system on WiMAX or current high-speed fiber-optic networks (and 100-Gigabit Ethernet)? Would this provide some innovative results?

    Why is Linux leading the pack? Does it have more to do with its free and open nature than with its other technical features?

    How was the issue of deadlocks, and concurrency control in general, dealt with during concurrent read/write operations? Are the directories stored on one disk or in a distributed manner? If distributed, will that cause bottlenecks in the case of frequent ls/cp calls?

    What was the maximum limit of scalability, if any? How was the installation and user experience?

    Were PIOUS and PPFS ever released commercially or for general use?

    What if the manager daemon crashes? Isn't this a design flaw?

    If PVFS has kernel modules, then doesn't it become platform dependent?

    Without recompilation and relinking, if any changes are made to the executables, are they recompiled and relinked, or do the changes go unnoticed, leading to a major security/system flaw?

    Why were only 60 nodes available? That's less than a quarter of the cluster. Would there be any significant performance improvement or scalability issues if all were in use?
