Thursday, February 3, 2011

Availability in Globally Distributed Storage Systems

Link to the paper:
http://www.usenix.org/events/osdi10/tech/full_papers/Ford.pdf

Presenter: Deepak Agrawal
Reviewers: Saurabh Baisane and Saakshi Verma

6 comments:

  1. In section 2.1, the authors state "15 minutes is long enough to exclude the majority of benign transient events while not too long to exclude
    significant cluster-wide phenomena". Which significant cluster-wide phenomena are not being excluded? What are the benign transient events that are excluded?

  2. In section 2.2, the authors state "For replication, R = n refers to n identical chunks in a stripe, so the data may be recovered from any one chunk". Does this imply that all chunks in a stripe are identical? Or does it imply that all non-identical chunks (within a stripe) are replicated, so that subsets of non-identical chunks may reconstruct the original data?
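A minimal sketch of the R = n replication model the quoted sentence describes (my own illustration, not code from the paper): every chunk in the stripe is a full copy, so any one readable chunk recovers the data.

```python
# Hypothetical sketch of R = n replication (section 2.2 of the paper):
# a stripe holds n identical chunks, so the data survives as long as
# at least one chunk is still readable.

def write_stripe(data, n):
    """Replicate the data into n identical chunks (one stripe)."""
    return [data] * n

def read_stripe(chunks, available):
    """Recover the data from any one available chunk."""
    for chunk, ok in zip(chunks, available):
        if ok:
            return chunk
    raise IOError("stripe unavailable: no chunk readable")

stripe = write_stripe(b"block-0", 3)               # R = 3
print(read_stripe(stripe, [False, False, True]))   # any single chunk suffices
```

Under this reading, "identical" applies within one stripe; different stripes hold different data, which answers the first half of the question.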

  3. In section 3.1 the authors state "For the purposes of figure 2 we do not exclude events that last less than 15 minutes, but we still end the unavailability period when the system reconstructs all the data previously stored on that node". Does this imply that all node unavailability events last less than a day? Is there a standard node-unavailability duration? If that duration is exceeded, does the node have to be replaced? How does this affect performance?

  4. For recovery, does the concept of RAID levels fit in here? And which RAID levels, if any, are used in distributed systems?
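The closest RAID analogue to the paper's Reed-Solomon coding is single-parity RAID-5. A toy sketch of that idea (my own illustration, not from the paper): one XOR parity chunk lets a stripe survive the loss of any single chunk.

```python
# Hypothetical RAID-5-style sketch: XOR parity over the data chunks.
# Reed-Solomon, as used in the paper, generalizes this to tolerate
# multiple losses; XOR parity tolerates exactly one.
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_chunks(data)

# Lose data[1]; XOR of the survivors (remaining data + parity)
# reconstructs it, because a ^ c ^ (a ^ b ^ c) == b.
recovered = xor_chunks([data[0], data[2], parity])
assert recovered == b"bbbb"
```

The practical difference is placement: RAID spreads parity across disks in one machine, while the systems in the paper spread chunks across machines and racks, which is what lets them ride out whole-node failures.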

  5. The paper seems to make, perhaps inadvertently, a case for distributed replication over RAID. Would you agree?

  6. @Lavone_R This is not a full answer, but it might help. In the presentation of the paper, the authors say that some failures last much longer and, in those cases, dominate the transient ones in total unavailability, even though such long failures are infrequent compared to the events that dominate the counts. The paper says that short unavailability events are the most frequent, but they tend to have a minor impact on cluster-level availability and data loss because distributed storage systems add
    enough redundancy to allow data to be served from other sources. So the authors focus on the longer unavailability intervals. However, is it possible that a large number of node failures that are short in duration could have the same effect as infrequent long failures?
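A back-of-the-envelope check on that question (assumed numbers, not figures from the paper): if short outages are independent, a stripe with R replicas is only unavailable when all R replicas are down at once, so per-node unavailability p contributes only about p**R.

```python
# Hypothetical arithmetic sketch: stripe unavailability under R-way
# replication with INDEPENDENT node outages. The independence assumption
# is the crux -- correlated failures (e.g., a rack outage) break it,
# which is exactly the scenario the paper focuses on.

def stripe_unavailability(p, r):
    """Probability that all r independent replicas are down at once."""
    return p ** r

# Frequent short events with small per-node unavailability:
short = stripe_unavailability(0.001, 3)   # on the order of 1e-9

# One long, fully correlated outage takes the whole stripe down
# directly, so its unavailability is just p itself:
long_correlated = 0.001                   # 1e-3, a million times worse here
```

So many independent short failures do not accumulate to the effect of infrequent long (or correlated) ones under this model; only when the short failures are correlated across replicas can they approach it, which is consistent with the paper's emphasis on correlated failure bursts.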
