The authors mention three pipeline design challenges that they overcame in implementing HOP: (1) how to allow fault-tolerance mechanisms to coexist with pipelining (since MapReduce fault tolerance is predicated on the materialization of intermediate state); (2) how to preserve batch-oriented optimizations while meeting the communication demands implicit in pipelines; and (3) how to intelligently co-schedule producers and consumers. What challenges do you think still need to be addressed that were not mentioned in the paper? How would you address them?
In the paper, the authors say that their job-progress metric can lead to incorrect conclusions, since the simple fraction metric assumes that every hourly sample contributes uniformly to the result. How would you improve this metric? Can you think of a way to do the sampling more dynamically for the page-view traffic in the given example?
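One idea would be to weight each hourly sample by how many records it actually contains, instead of treating all samples as equal. Here is a minimal sketch in Java of that weighting (hypothetical code, not anything from HOP itself; the per-hour record counts would have to be known up front or estimated from input split sizes):

    public class WeightedProgress {
        // recordsPerHour[i] = number of page-view records in hour i (assumed
        // known up front, or estimated from input split sizes).
        static double progress(long[] recordsPerHour, int hoursProcessed) {
            long total = 0, done = 0;
            for (int i = 0; i < recordsPerHour.length; i++) {
                total += recordsPerHour[i];
                if (i < hoursProcessed) done += recordsPerHour[i];
            }
            return total == 0 ? 0.0 : (double) done / total;
        }

        public static void main(String[] args) {
            long[] records = {100, 5000, 200, 150}; // skewed traffic: hour 1 dominates
            // The uniform metric reports 2/4 = 50% after two hours processed;
            // weighting by record volume reports 5100/5450, roughly 94%.
            System.out.println(progress(records, 2));
        }
    }

A more dynamic variant could update the per-hour counts on the fly as splits are read, so that bursty hours adjust the estimate while the job is still running.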
At the end of Section 3.1.3, the paper talks about how aggressive use of the combiner function can be controlled. Does this mean that for every job/task the combiner is invoked a predetermined number of times, and the future course of action is decided based on that history?
It is mentioned that, in the modified version of Hadoop, the reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job, sidestepping the need for expensive fault-tolerant storage of a temporary file in HDFS. Unfortunately, the computation of the reduce function from the previous job and the map function of the next job cannot be overlapped. What functionality do they wish to achieve when they talk about overlapping?
@anudipa: I don't think the combiner function is invoked a predetermined number of times. The authors say it is invoked whenever the buffer grows to a threshold size, and if the combiner is effective at reducing data volumes, then more spill files are accumulated before its next invocation.
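To make that policy concrete, here is a rough Java sketch of my reading of it (illustrative only, not HOP's actual code): the combiner runs when the buffer hits a threshold, and a good reduction ratio lets more spill files pile up before the next invocation, amortizing the combiner's cost.

    import java.util.ArrayList;
    import java.util.List;

    public class AdaptiveCombinerPolicy {
        static final long BUFFER_THRESHOLD = 64L * 1024 * 1024; // 64 MB, illustrative
        static final double EFFECTIVE_RATIO = 0.5;              // "halves the data"

        long bufferedBytes = 0;
        int spillsPerCombine = 1;                 // spills to accumulate per combine
        List<Long> pendingSpills = new ArrayList<>();

        void onMapOutput(long bytes) {
            bufferedBytes += bytes;
            if (bufferedBytes >= BUFFER_THRESHOLD) {
                pendingSpills.add(bufferedBytes); // spill the buffer to disk
                bufferedBytes = 0;
                if (pendingSpills.size() >= spillsPerCombine) combinePendingSpills();
            }
        }

        void combinePendingSpills() {
            long in = pendingSpills.stream().mapToLong(Long::longValue).sum();
            long out = runCombiner(in);
            // Effective combiner: accumulate more spills before running it again.
            spillsPerCombine = ((double) out / in < EFFECTIVE_RATIO)
                    ? spillsPerCombine * 2 : 1;
            pendingSpills.clear();
        }

        long runCombiner(long inputBytes) {
            return inputBytes / 3;                // stand-in for real aggregation
        }

        public static void main(String[] args) {
            AdaptiveCombinerPolicy policy = new AdaptiveCombinerPolicy();
            for (int i = 0; i < 1000; i++) policy.onMapOutput(1L * 1024 * 1024);
            System.out.println("spills per combine grew to " + policy.spillsPerCombine);
        }
    }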
In Section 3.3 (Fault Tolerance), the authors state that if a reduce task fails, a new reduce task is created and all the input data from the map tasks needs to be sent to it again. Now, in the case of pipelining, where the map task does not discard its output but retains it on the local disk, what happens if the local disk does not have sufficient space available to store this data? I know this data would normally be small enough to fit on the local disk, but assuming it could grow larger with recursive map tasks, does the map task store the data on another disk on the network, or does the JobTracker rerun the map task from the most recent checkpoint?
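For what it's worth, my mental model of the replay path is the following toy illustration (not Hadoop's actual API): the map task pipelines each record to the live reducer but also retains a copy on local disk, so a replacement reducer can re-pull everything without the map being re-executed.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MapOutputStore {
        interface Reducer { void accept(String record); }

        // Retained output per map task, standing in for spill files on local disk.
        private final Map<Integer, List<String>> retained = new HashMap<>();

        // Map side: send each record to the live reducer AND keep a copy.
        void emit(int mapId, String record, Reducer live) {
            retained.computeIfAbsent(mapId, k -> new ArrayList<>()).add(record);
            live.accept(record);
        }

        // On reduce failure: replay all retained output to the fresh reducer.
        void replayTo(Reducer fresh) {
            for (List<String> records : retained.values()) records.forEach(fresh::accept);
        }

        public static void main(String[] args) {
            MapOutputStore store = new MapOutputStore();
            store.emit(0, "url=/home", r -> {});   // original reducer, later fails
            store.emit(1, "url=/about", r -> {});
            store.replayTo(r -> System.out.println("replayed: " + r));
        }
    }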
How would you compare HOP's pipelining performance to Microsoft's Dryad with regard to processing and implementing query aggregations?