Thursday, February 3, 2011

Availability in Globally Distributed Storage Systems

Link to the paper:
http://www.usenix.org/events/osdi10/tech/full_papers/Ford.pdf

Presenter: Deepak Agrawal
Reviewers: Saurabh Baisane and Saakshi Verma

6 comments:

  1. In section 2.1, the authors state "15 minutes is long enough to exclude the majority of benign transient events while not too long to exclude
    significant cluster-wide phenomena". Which significant cluster-wide phenomena are not being excluded? What are the benign transient events that are excluded?

  2. In section 2.2, the authors state "For replication, R = n refers to n identical chunks in a stripe, so the data may be recovered from any one chunk". Does this imply that all chunks in a stripe are identical? Or does it imply that all non-identical chunks (within a stripe) are replicated, so that subsets of non-identical chunks may reconstruct the original data?
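A minimal sketch of the R = n replication model the quoted sentence describes (my own illustration, not code from the paper): every chunk in the stripe is a full copy, so any one readable chunk recovers the data.

```python
# Hypothetical sketch of R = n replication (section 2.2 of the paper):
# a stripe holds n identical chunks, so the data survives as long as
# at least one chunk is still readable.

def write_stripe(data, n):
    """Replicate the data into n identical chunks (one stripe)."""
    return [data] * n

def read_stripe(chunks, available):
    """Recover the data from any one available chunk."""
    for chunk, ok in zip(chunks, available):
        if ok:
            return chunk
    raise IOError("stripe unavailable: no chunk readable")

stripe = write_stripe(b"block-0", 3)               # R = 3
print(read_stripe(stripe, [False, False, True]))   # any single chunk suffices
```

Under this reading, "identical" applies within one stripe; different stripes hold different data, which answers the first half of the question.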

  3. In section 3.1 the authors state "For the purposes of figure 2 we do not exclude events that last less than 15 minutes, but we still end the unavailability period when the system reconstructs all the data previously stored on that node". Does this imply that all node unavailability events last less than a day? Is there a standard node-unavailability duration? If that duration is exceeded, does the node have to be replaced? How does this affect performance?

  4. For recovery, does the concept of RAID levels fit in here? And which RAID levels, if any, are used in distributed systems?
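The closest RAID analogue to the paper's Reed-Solomon coding is single-parity RAID-5. A toy sketch of that idea (my own illustration, not from the paper): one XOR parity chunk lets a stripe survive the loss of any single chunk.

```python
# Hypothetical RAID-5-style sketch: XOR parity over the data chunks.
# Reed-Solomon, as used in the paper, generalizes this to tolerate
# multiple losses; XOR parity tolerates exactly one.
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_chunks(data)

# Lose data[1]; XOR of the survivors (remaining data + parity)
# reconstructs it, because a ^ c ^ (a ^ b ^ c) == b.
recovered = xor_chunks([data[0], data[2], parity])
assert recovered == b"bbbb"
```

The practical difference is placement: RAID spreads parity across disks in one machine, while the systems in the paper spread chunks across machines and racks, which is what lets them ride out whole-node failures.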

  5. The paper seems to make, perhaps inadvertently, a case for distributed replication over RAID. Would you agree?

  6. @Lavone_R This is not a full answer, but it might help. In the presentation of the paper, the authors say that some failures last much longer and, in those cases, dominate the transient ones in total unavailability, even though such long failures are infrequent compared to the events that dominate the counts. The paper says that short unavailability events are the most frequent, but they tend to have a minor impact on cluster-level availability and data loss because distributed storage systems add
    enough redundancy to allow data to be served from other sources. So the authors focus on the longer unavailability intervals. However, is it possible that a large number of node failures that are short in duration could have the same effect as infrequent long failures?
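A back-of-the-envelope check on that question (assumed numbers, not figures from the paper): if short outages are independent, a stripe with R replicas is only unavailable when all R replicas are down at once, so per-node unavailability p contributes only about p**R.

```python
# Hypothetical arithmetic sketch: stripe unavailability under R-way
# replication with INDEPENDENT node outages. The independence assumption
# is the crux -- correlated failures (e.g., a rack outage) break it,
# which is exactly the scenario the paper focuses on.

def stripe_unavailability(p, r):
    """Probability that all r independent replicas are down at once."""
    return p ** r

# Frequent short events with small per-node unavailability:
short = stripe_unavailability(0.001, 3)   # on the order of 1e-9

# One long, fully correlated outage takes the whole stripe down
# directly, so its unavailability is just p itself:
long_correlated = 0.001                   # 1e-3, a million times worse here
```

So many independent short failures do not accumulate to the effect of infrequent long (or correlated) ones under this model; only when the short failures are correlated across replicas can they approach it, which is consistent with the paper's emphasis on correlated failure bursts.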
