Tuesday, January 25, 2011

GPFS: A Shared-Disk File System for Large Computing Clusters

Abstract

GPFS is IBM’s parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community over the last several years, particularly distributed locking and recovery technology. To date it has been a matter of conjecture how well these ideas scale. We have had the opportunity to test those limits in the context of a product that runs on the largest systems in existence. While in many cases existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.

Link to the full paper:

http://www.almaden.ibm.com/StorageSystems/projects/gpfs/Fast02.pdf

Presented by Prudhvi Reddy Avula
Link to the slides:

http://www.cse.buffalo.edu/faculty/tkosar/cse726/slides/01-avula.pdf


Review #1 by Venkata Sudheerkumar Mupparaju 
This paper, “GPFS: A Shared-Disk File System for Large Computing Clusters,” describes the overall architecture of GPFS (General Parallel File System), IBM's parallel shared-disk file system for cluster computers. The paper describes its approach to achieving parallelism and data consistency in a cluster environment, details some of the features that contribute to its performance and scalability, describes the design for fault tolerance, and presents data on its performance.

GPFS achieves its extreme scalability through its shared-disk architecture. A SAN provides shared disks, but a SAN by itself does not provide a shared file system. If several computers with access to a shared disk try to use that disk with a regular file system, the disk's logical structure will be damaged very quickly. Inconsistencies in disk-space allocation and in file data make it impossible to use shared disks with regular file systems as shared file systems. Cluster file systems are designed to solve these problems. GPFS is one such parallel file system for cluster computers; it provides, as closely as possible, the behavior of a general-purpose POSIX file system running on a single machine...

Link to the full review: 

http://www.cse.buffalo.edu/faculty/tkosar/cse726/reviews/01-review1-mupparaju.pdf


Review #2 by Pramod Kundapur Nayak

With the growing demand for higher computing power, cluster computing has become a trend. With fault tolerance, near-boundless computing power, and large storage capacity being prime requirements of a reliable system, cluster computing has been of keen interest among researchers. This paper focuses on the storage aspect of cluster computing by introducing GPFS (General Parallel File System), a file system package from IBM that provides functionality similar to a standard POSIX file system.

To summarize:
• GPFS appears to work like a traditional POSIX file system but provides parallel access to files.
• Enhanced performance is achieved through data striping at the block level across all disks in the file system.
• Supports up to 4096 disks of up to 1 TB each, providing a total of 4 petabytes of storage per file system.
• Both file data and metadata on any disk are accessible from any node through disk I/O calls. Further, GPFS facilitates parallel flow of both data and metadata between nodes and disks.
• Highly reliable, with fault-tolerance and replication mechanisms.

This paper highlights GPFS’s answers to the performance, scalability, concurrency, and fault-tolerance issues of large file systems and provides a bird’s-eye view of GPFS...

Link to the full review:

http://www.cse.buffalo.edu/faculty/tkosar/cse726/reviews/01-review2-nayak.pdf

8 comments:

  1. Extensible hashing provides single-level (balanced) access to the required block; the structure grows by performing 2 operations:
    1. Doubling the size of the directory (hash table)
    2. Creating new buckets when no more room is available

    The concepts of global depth and local depth are crucial to deciding which of the above 2 operations to perform:
    global depth - the number of hash bits used by the directory to index buckets
    local depth - the number of hash bits that were used to place keys into a particular bucket

    If a full bucket's local depth is equal to the global depth, then there is only one directory pointer to the bucket and no other directory entries can map to it, so the directory must be doubled (incrementing the global depth).

    If a bucket is full but its local depth is less than the global depth, then more than one directory entry points to the bucket, and the bucket can simply be split (incrementing its local depth) — see the sketch after the references below.

    reference
    www.cc.gatech.edu/classes/AY2002/cs6421_spring/.../april3_p2.ppt
    http://en.wikipedia.org/wiki/Extendible_hashing
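
    Below is a minimal, illustrative Python sketch of this split-or-double decision (a generic extendible-hashing toy with a made-up bucket capacity, not GPFS's directory code):

    # Toy extendible hashing: the directory is indexed by the low `global_depth`
    # bits of the hash; each bucket remembers the `local_depth` bits used to fill it.
    class Bucket:
        def __init__(self, local_depth, capacity=4):
            self.local_depth = local_depth
            self.capacity = capacity
            self.items = {}

    class ExtendibleHash:
        def __init__(self):
            self.global_depth = 1
            self.directory = [Bucket(1), Bucket(1)]

        def _index(self, key):
            # Directory slot = low `global_depth` bits of the hash.
            return hash(key) & ((1 << self.global_depth) - 1)

        def insert(self, key, value):
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            if bucket.local_depth == self.global_depth:
                # Only one directory pattern maps to this bucket: double the directory.
                self.directory += self.directory
                self.global_depth += 1
            # Now local_depth < global_depth: split the bucket, then retry.
            self._split(bucket)
            self.insert(key, value)

        def _split(self, bucket):
            bucket.local_depth += 1
            sibling = Bucket(bucket.local_depth, bucket.capacity)
            # Directory entries whose new distinguishing bit is 1 move to the sibling.
            for i, b in enumerate(self.directory):
                if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                    self.directory[i] = sibling
            old_items, bucket.items = bucket.items, {}
            for k, v in old_items.items():
                self.directory[self._index(k)].items[k] = v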

  2. In the case of parallel data access, the paper says that "File blocks are assigned to nodes in a round-robin fashion, so that each data block will be read or written only by one particular node. GPFS forwards read and write operations originating from other nodes to the node responsible for a particular data block". So in this case, if all the read and write operations are forwarded to the node responsible for a particular data block, wouldn't that node become a bottleneck in this system?
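
    As a rough illustration of the round-robin assignment quoted above, here is a small Python sketch (not GPFS code; the block size and function names are made up). Because consecutive blocks map to different nodes, a large request is spread over several nodes rather than funneled through a single one:

    BLOCK_SIZE = 256 * 1024          # illustrative block size, not a GPFS constant

    def responsible_node(block_index: int, num_nodes: int) -> int:
        """Round-robin assignment: block i is handled by node (i mod N)."""
        return block_index % num_nodes

    def route_request(offset: int, length: int, num_nodes: int):
        """Yield (node, block_index) pairs the request would be forwarded to."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block in range(first, last + 1):
            yield responsible_node(block, num_nodes), block

    # Example: a 1 MiB write at offset 0 on an 8-node cluster touches blocks
    # 0..3, which land on four different nodes rather than on a single one.
    print(list(route_request(0, 1024 * 1024, num_nodes=8)))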

  3. Adding to the question above, what happens when a record is deleted? Is the directory structure collapsed, or does it remain as it is? If it collapses, does it collapse by half, again rearranging the hash values and bucket contents?

  4. Is there a specific period after which updated metadata in a node's log is written to the shared disk? I ask because during a node failure there may still be a small window of time in which a failed node's updated log metadata has not yet been written to disk, making a full recovery difficult.
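
    As background, here is a generic write-ahead-logging sketch in Python (this is not GPFS's actual log format or API; the paths and JSON encoding are made up). The usual ordering guarantee is that the log record describing a metadata update is forced to disk before the update itself, so a crash can lose at most an operation that had not yet completed and been acknowledged:

    import json, os

    def log_then_update(log_path: str, metadata_path: str, update: dict) -> None:
        # 1. Append the intended update to the node's log and force it to disk.
        with open(log_path, "a") as log:
            log.write(json.dumps(update) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the update to the metadata itself; if the node
        #    fails between these steps, recovery replays the logged record.
        with open(metadata_path, "a") as meta:
            meta.write(json.dumps(update) + "\n")
            meta.flush()
            os.fsync(meta.fileno())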

  5. I think there is a trade-off between the round-robin mechanism and the token mechanism when the data blocks are very small, because with very small data blocks the token mechanism increases message traffic.

  6. It is said that with byte-range tokens multiple nodes can access the same file, provided each node acquires a token for its particular byte range. How, then, does a token conflict arise even if the individual write operations do not overlap?
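
    One plausible explanation (hedged; it rests on whole blocks being the unit of caching and I/O) is that token ranges are effectively rounded out to block boundaries, so two writes that fall in the same block conflict even though their byte ranges do not overlap. A small Python sketch of such a conflict test (illustrative block size and names, not GPFS internals):

    BLOCK_SIZE = 256 * 1024  # illustrative value

    def rounded(start: int, end: int):
        """Round a byte range [start, end) outward to block boundaries."""
        return (start - start % BLOCK_SIZE,
                ((end + BLOCK_SIZE - 1) // BLOCK_SIZE) * BLOCK_SIZE)

    def conflicts(held, requested) -> bool:
        """Two write tokens conflict if their block-rounded ranges overlap."""
        a_start, a_end = rounded(*held)
        b_start, b_end = rounded(*requested)
        return a_start < b_end and b_start < a_end

    # Bytes [0, 100) and [100, 200) do not overlap, but both fall in block 0,
    # so the rounded token ranges overlap and a revoke/split would be needed.
    print(conflicts((0, 100), (100, 200)))                      # True
    print(conflicts((0, 100), (BLOCK_SIZE, BLOCK_SIZE + 100)))  # False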

  7. Metadata updates will definitely be an issue when handling multiple parallel writes to the same file via byte range tokens. How does the metadata server handle multiple requests concurrently?
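
    The paper describes a per-file "metanode" that is the only node updating the file's inode; the other writers ship their changes to it, and it merges non-conflicting updates such as file size and mtime. A simplified Python sketch of that kind of merging (the data structure and function are illustrative, not GPFS code):

    from dataclasses import dataclass

    @dataclass
    class InodeUpdate:
        file_size: int   # highest offset this node has written
        mtime: float     # this node's last modification time

    def merge_updates(current_size, current_mtime, updates):
        """Keep the largest reported file size and the latest mtime."""
        size = max([current_size] + [u.file_size for u in updates])
        mtime = max([current_mtime] + [u.mtime for u in updates])
        return size, mtime

    # Three nodes wrote different regions in parallel; the merged inode shows
    # the largest resulting size and the most recent modification time.
    print(merge_updates(0, 0.0, [InodeUpdate(4096, 10.0),
                                 InodeUpdate(8192, 9.5),
                                 InodeUpdate(2048, 11.0)]))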

  8. What is byte-range locking?

    Since RAID-5 is implemented, doesn't it cause slow simultaneous reads and slow writes?

    Is it possible to use a mix of storage nodes, a symmetric cluster, or a SAN based on file needs, or to have all three in the same hybrid network at different levels?

    A single manager can lead to issues if the manager crashes. How is this dealt with? I too would like to know about the recovery techniques and security aspects of GPFS.

    Can distributed locking and centralized management lead to higher I/O traffic? Will this cause any problems?

    How many of the current supercomputers use GPFS?
