Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

Tuesday, March 8, 2011

8 comments:

Saakshi, March 9, 2011 at 12:35 AM
In the PageRank computation over web pages [Sec. 4.3], how is the number of iterations decided?
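For context, iterative PageRank implementations generally stop in one of two ways: after a fixed number of rounds, or once the ranks stop changing by more than some tolerance. The sketch below shows both rules in plain Python. It is not the paper's implementation; the function name `pagerank` and the parameters `damping`, `max_iters`, and `tol` are assumptions made for illustration, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, max_iters=50, tol=1e-6):
    """Power-iteration PageRank over a dict {page: [outgoing pages]}.

    Stops after max_iters rounds, or earlier once the total change in
    rank between two rounds drops below tol (both values are assumed).
    Returns the ranks and the number of rounds actually run.
    """
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    iterations_run = 0
    for _ in range(max_iters):
        iterations_run += 1
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta < tol:  # convergence rule
            break
    return ranks, iterations_run
```

Calling `pagerank(links, max_iters=3, tol=0.0)` gives the fixed-count behaviour (exactly three rounds), while the defaults stop as soon as the ranks settle.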
Unknown, March 9, 2011 at 2:25 AM
I understand that aggregation can be done using an iterator, which iterates over the sets of data and performs the aggregation. Another method described is an accumulator, which is initialized once and then used many times to accumulate the data for aggregation.

Since the accumulator keeps accumulating data, is the amount of data to be aggregated calculated before the accumulation takes place? If not, how are memory overflows handled if the amount of data is larger than the storage space available to the accumulator?
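A minimal Python sketch of the two styles may help frame the question. The names `average_iterator` and `AverageAccumulator` are hypothetical and are not DryadLINQ's API; they only illustrate the difference between handing a function the whole group as a stream and folding records into a piece of state one at a time.

```python
from typing import Iterable


def average_iterator(values: Iterable[float]) -> float:
    """Iterator style: the function receives the whole group as a stream."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count


class AverageAccumulator:
    """Accumulator style: initialize once, then fold in one record at a time."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def accumulate(self, value: float) -> None:
        # The state stays two numbers however many records arrive.
        self.total += value
        self.count += 1

    def finalize(self) -> float:
        return self.total / self.count
```

For an aggregate like an average, the accumulator's state is two numbers, so its size does not depend on how many records are fed in and the input size need not be known in advance. Aggregates whose state grows with the input (collecting every distinct value, say) would not have that property.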
Fahim, March 9, 2011 at 5:05 AM
The paper identifies tasks that are candidates for a modified MapReduce with partial aggregation by classifying their aggregation functions as decomposable or associative-decomposable. However, it assumes that the partial-aggregation step is performed at the local site (avoiding network traffic) and that the Reduce phase is performed at a remote site. If the Reduce step runs at the local site, then the proposed MapReduce will perform no better than (and in fact might be slower than) the original MapReduce.
jyothsna, March 9, 2011 at 1:28 PM
In the case of partial sort, a bounded number of chunks of input records are read into memory, with each chunk occupying bounded storage. Since it uses bounded storage, it can be pipelined with upstream computations. Why are upstream computations significant when storage is bounded?
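One way to see the connection is a small Python sketch of chunked sort-and-aggregate. This is an illustration under assumptions, not the paper's code: `partial_sort_aggregate` and `chunk_size` are hypothetical names, records are assumed to be `(key, value)` pairs, and the aggregate is assumed to be a sum.

```python
from itertools import groupby, islice
from operator import itemgetter


def partial_sort_aggregate(records, chunk_size):
    """Sort and aggregate one bounded chunk at a time.

    Yields (key, partial_sum) pairs per chunk, so the same key can appear
    once per chunk; a later stage is expected to merge those partials.
    """
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))  # at most chunk_size records in memory
        if not chunk:
            return
        chunk.sort(key=itemgetter(0))
        for key, group in groupby(chunk, key=itemgetter(0)):
            yield key, sum(value for _, value in group)
```

Because only `chunk_size` records are held at once, the generator can emit its first results after pulling a single chunk from the upstream producer, while that producer is still running. A full sort would have to consume the entire upstream output before emitting anything, which is what prevents it from being pipelined.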
Sudheer, March 9, 2011 at 10:26 PM
In the iterative PageRank mentioned, what will be the map phase and the reduce phase?

What kinds of applications work better on Hadoop than on Dryad? (Is Hadoop always the worst performer, as the authors mention?)

What are the short-lived processes that the authors refer to (which hinder Hadoop's performance)?
el_idioto, March 9, 2011 at 10:37 PM
The paper concludes: "Many .NET library functions are also defined in the iterator style. Now that we have implemented both within DryadLINQ we are curious to discover which will be more popular among users of the system."

I'd like to know how popular .NET library functions are for similar operations, especially since most architectures are Linux-based.

Also, if Hadoop is the worst-performing system, why is it being used by a variety of companies? The same goes for MapReduce: why is Google still sticking with it if it's nearly as bad as Hadoop?
Pramod Nayak, March 10, 2011 at 12:56 AM
@Fahim,

My interpretation:
Assume a huge data set. If you can identify a subsequence and aggregate that data during the map phase, you reduce the total data output from the map phase and, consequently, the amount of data sent across the network for the Reduce phase, which is done remotely.

To find such sets of data, intelligent partitioning is done.

The paper explains this concept of locality-based aggregation to reduce network traffic during the map phase as the formation of an "aggregation tree".
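As a rough illustration of the point about network traffic, here is a word-count sketch in Python. It assumes counting is the aggregate, and the names `map_with_partial_aggregation` and `reduce_counts` are hypothetical, not taken from the paper or from any MapReduce library.

```python
from collections import Counter


def map_with_partial_aggregation(partition):
    """Map plus local combine: one (word, count) pair per distinct word."""
    return list(Counter(partition).items())


def reduce_counts(partials):
    """Remote reduce: merge the partial counts from every partition."""
    totals = Counter()
    for word, count in partials:
        totals[word] += count
    return dict(totals)


partitions = [["a", "b", "a", "a"], ["b", "b", "a"]]

# Without local aggregation, every input record crosses the network.
records_without = sum(len(p) for p in partitions)

# With local aggregation, only one pair per distinct word per partition does.
combined = [map_with_partial_aggregation(p) for p in partitions]
records_with = sum(len(c) for c in combined)

result = reduce_counts(pair for c in combined for pair in c)
```

Here seven raw records shrink to four partial pairs before the shuffle, and the final counts are unchanged because addition is associative and commutative.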
Pramod Nayak, March 10, 2011 at 1:12 AM
Optimizations like partitioning the data and forming partial aggregations using an aggregation tree look like preprocessing the data in the Map phase. Is there any overhead incurred because of these optimizations?