Tuesday, March 1, 2011

Quincy: Fair Scheduling for Distributed Computing Clusters

Link to the paper: 
http://www.sigops.org/sosp/sosp09/papers/isard-sosp09.pdf


Presenter:  Saurabh Baisane
Reviewers: Fahim Patel and Kavyashree Prasad 

7 comments:

  1. Is there a specific reason behind using Dryad?

    It is assumed that all tasks are independent of each other. How realistic is this assumption?

    The scheduler can kill tasks and resubmit them to the queue. Instead of restarting the whole task, can't it save its last state and resume from there next time?

  2. How is data locality taken care of in the graph construction?

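Regarding the data-locality question above: the paper encodes locality directly in the flow graph. Each worker task gets edges to the computers and racks that hold its input data, with edge costs that grow with the amount of data the task would have to read remotely, plus a fallback edge to a cluster-wide node, so the min-cost flow solver naturally prefers local placements. Below is a minimal sketch of that cost construction, assuming a simple bytes-to-be-transferred cost model; the function, node names, and data structures are illustrative, not Quincy's exact formulation.

    def build_locality_edges(task_inputs, computer_of_block, rack_of_computer):
        """Return (destination_node, cost) edges for one worker task.

        task_inputs:       {block_id: size_in_bytes} of the task's input blocks
        computer_of_block: {block_id: computer_id} where each block is stored
        rack_of_computer:  {computer_id: rack_id}
        """
        total = sum(task_inputs.values())

        # Bytes of this task's input already stored on each computer / each rack.
        on_computer, on_rack = {}, {}
        for block, size in task_inputs.items():
            comp = computer_of_block[block]
            on_computer[comp] = on_computer.get(comp, 0) + size
            rack = rack_of_computer[comp]
            on_rack[rack] = on_rack.get(rack, 0) + size

        # Cost = bytes that would have to be read remotely, so more-local
        # placements get cheaper edges and the flow solver prefers them.
        edges = []
        for comp, local_bytes in on_computer.items():
            edges.append((("computer", comp), total - local_bytes))
        for rack, local_bytes in on_rack.items():
            edges.append((("rack", rack), total - local_bytes))
        # Fallback edge to the cluster-wide node: assume everything is remote.
        edges.append((("cluster", "X"), total))
        return edges
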
  3. The scheduler may decide to kill a worker task before it completes in order to give other jobs access to its resources. Such a killed task restarts from the beginning. Also, if the computer running a job fails, the job will be re-executed from the start.

    Why can't state be maintained for each job so that the overhead of re-running such jobs is eliminated?

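On the checkpointing idea raised in comments 1 and 3: the paper's scheduler simply re-runs a killed worker task from the beginning. Here is a minimal sketch of what periodic state saving could look like if a worker task did checkpoint its progress; the file name, record-at-a-time model, and checkpoint interval are assumptions for illustration and are not part of Dryad or Quincy.

    import json
    import os

    CHECKPOINT = "worker_task.ckpt"   # hypothetical checkpoint file

    def process(record):
        pass   # placeholder for the task's real per-record computation

    def run_worker(records):
        # Resume from the last saved position if a checkpoint exists,
        # instead of reprocessing everything from the start.
        start = 0
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                start = json.load(f)["next_record"]

        for i in range(start, len(records)):
            process(records[i])
            if i % 1000 == 0:                  # save progress every 1000 records
                with open(CHECKPOINT, "w") as f:
                    json.dump({"next_record": i + 1}, f)
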
  4. It is mentioned in the paper that if computations are not placed close to their input data, the network can become a bottleneck, and that reducing network traffic simplifies capacity planning. Can you elaborate on this?

  5. @anudipa
    I guess it's because of its fine-grained resource-sharing strategy.

    --
    prudhvi

  6. In the paper's cluster architecture, it is mentioned that there is only a single centralized scheduling service.
    But doesn't that make it a single point of failure? What happens if it goes down?

    --
    prudhvi

  7. The author describes a computational model wherein a "root task" manages the workflow and, as I understand it, assigns the individual "worker tasks" that run on any computer. As the worker tasks finish their work, they inform the root task.

    So, my question is: does the root task have to busy-wait until a worker task responds back to it? What happens if a worker task, which is running on a separate node in the cluster, takes longer than usual because of, say, a sequential job, or drops out of the cluster due to connectivity problems?

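On the busy-waiting question in the last comment: the root task does not have to spin; it can block on completion events and re-dispatch any work whose worker has been silent for too long (for example, a lost node or a connectivity problem). Below is a minimal sketch of that pattern, assuming an in-process event queue and a caller-supplied dispatch function; it illustrates the general idea and is not Dryad's actual job-manager implementation.

    import queue
    import time

    def root_task(pending_work, dispatch, timeout_s=300):
        events = queue.Queue()   # workers (or their node daemons) push (work_id, ok) here
        inflight = {}            # work_id -> time the work item was last dispatched

        for work_id in list(pending_work):
            dispatch(work_id, events)         # start a worker; it reports back via `events`
            inflight[work_id] = time.time()

        while inflight:
            try:
                work_id, ok = events.get(timeout=5)   # block on events instead of spinning
                if ok:
                    inflight.pop(work_id, None)       # worker finished successfully
                else:
                    dispatch(work_id, events)         # worker reported failure: re-run it
                    inflight[work_id] = time.time()
            except queue.Empty:
                pass
            # Re-dispatch anything that has been silent too long.
            now = time.time()
            for work_id, started in list(inflight.items()):
                if now - started > timeout_s:
                    dispatch(work_id, events)
                    inflight[work_id] = now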