Tuesday, March 29, 2011

Nectar: Automatic Management of Data and Computation in Datacenters

9 comments:

  1. In Section 5, the authors compare Comet and Nectar and say that in Comet it is difficult to identify identical code segments coming from different programs. Can you explain how this is dealt with in Nectar?

  2. While discussing the advantages of a Nectar-managed datacenter, namely its ease of content management, why is keeping track of the requisite file-path information called a source of bugs?

  3. Is there a system similar to Nectar implemented for Hadoop?

  4. @prudhvireddy This is handled by the static program analyzer in Nectar. The working of the static program analyzer is given in Section 3.1, Caching Computations, under Cache and Programs.
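
    To make that concrete, here is a minimal Python sketch of the idea (cache, program_store, and run_or_reuse are illustrative names, not Nectar's actual API): the analyzer reduces a program plus its input to a fingerprint, so the same sub-computation maps to the same cache key even when it appears inside different programs.

        import hashlib

        cache = {}          # fingerprint -> cached result (illustrative)
        program_store = {}  # fingerprint -> program text, the "recipe"

        def fingerprint(program_text, input_id):
            # Nectar fingerprints a program together with its input; a
            # cryptographic hash stands in for its fingerprinting scheme here.
            return hashlib.sha256((program_text + "|" + input_id).encode()).hexdigest()

        def run_or_reuse(program_text, input_id, execute):
            key = fingerprint(program_text, input_id)
            if key in cache:
                # Same sub-computation, possibly from a different program.
                return cache[key]
            result = execute(program_text, input_id)
            cache[key] = result
            program_store[key] = program_text   # keep the recipe for regeneration
            return result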

  5. In Section 2.2 the authors state that "actual datasets are stored in the distributed storage system and the datacenter-wide services manipulate the actual datasets by maintaining pointers to them." What happens if a pointer becomes corrupt? How do services detect this, and how is the issue resolved?
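
    The paper doesn't spell out the failure handling, but since a pointer is essentially a fingerprint plus a location, one plausible check is to verify the fingerprint of whatever the pointer resolves to and fall back to recomputation on a mismatch. A Python sketch, assuming hypothetical storage and recompute helpers:

        import hashlib

        def resolve(pointer, storage, recompute):
            # pointer = (fingerprint, path); storage maps path -> bytes (illustrative)
            fp, path = pointer
            data = storage.get(path)
            if data is not None and hashlib.sha256(data).hexdigest() == fp:
                return data        # pointer is intact
            # Dangling or corrupt pointer: regenerate the dataset from its
            # stored program, which is possible because the recipe is kept
            # in the program store.
            return recompute(fp)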

  6. Also in Section 2.2 the authors state: "programs of all successful computations are uploaded to a dedicated program store in the cluster. Thus, the service has the necessary information about cached results, meaning that it has a recipe to recreate any derived dataset in the datacenter". My question is: how big can a program store in a cluster be? Will there ever be a case where the programs of successful computations are bigger than the store?

  7. The idea of caching derived datasets and replacing data with the computation that produced it can help in managing the data better. But the garbage collection described in Section 3.2, based on a cost-to-benefit analysis, does not account for the importance of the data in terms of the retrieval time a particular application requires. For example, in medical applications, a large amount of data, which may be the result of a long computation, may be stored until an experiment reaches a successful checkpoint. At such a point the data from the last checkpoint may be needed immediately, and recomputing it (it was large and unused for a long time, hence deleted) may take so long that the delay would not be acceptable. One possible fix is sketched below.
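
    A Python sketch of that fix; the pinned and max_recreate fields are hypothetical per-entry policy attributes, not part of Nectar, and the ratio is only a stand-in for the paper's formula:

        from dataclasses import dataclass

        @dataclass
        class Entry:
            size: int               # bytes of the derived dataset
            age: float              # seconds since last access
            uses: int               # number of times the result was reused
            recreate_cost: float    # machine time to rerun the producing program
            pinned: bool = False    # hypothetical: application pinned this entry
            max_recreate: float = float("inf")  # hypothetical retrieval-time bound

        def cb_ratio(e):
            # Cost-to-benefit in the spirit of Section 3.2: large, stale,
            # rarely reused, cheap-to-recreate entries score worst (highest).
            return (e.size * e.age) / (max(e.uses, 1) * e.recreate_cost)

        def pick_victims(entries, space_needed):
            victims, freed = [], 0
            for e in sorted(entries, key=cb_ratio, reverse=True):
                if e.pinned or e.recreate_cost > e.max_recreate:
                    continue        # retention policy overrides the ratio
                victims.append(e)
                freed += e.size
                if freed >= space_needed:
                    break
            return victims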

  8. If a dataset has not been accessed for a long time, it is removed from the datacenter; if it is required in the future, the respective computation is rerun and the dataset is regenerated.

    However, is there any time period defined by Nectar for which the computation (program) itself is stored?

    For example, if a dataset is not accessed for years, is the corresponding computation also removed, or is it that once a computation program is stored for a dataset it is never deleted by Nectar?
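
    For reference, the lifecycle being asked about might look like the following Python sketch (all names are illustrative); the open question is exactly whether the program-store entry itself ever expires, which the paper does not appear to answer:

        def access(key, data_store, program_store, run):
            data = data_store.get(key)
            if data is not None:
                return data                 # dataset still cached
            program = program_store.get(key)
            if program is None:
                # Would only happen if the program itself had been expired;
                # the paper describes no time limit for the program store.
                raise KeyError("no data and no recipe for " + key)
            data = run(program)             # rerun the stored computation
            data_store[key] = data          # re-cache the regenerated dataset
            return data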

  9. Section 3.2 discusses garbage collection in detail.

    The formula used for calculating the cost-to-benefit ratio (in the authors' words) doesn't give freshly cached entries a chance to demonstrate their usefulness, so couldn't we add a constant or include a new factor to give them a fighting chance? (A sketch of this follows below.)

    Also, why does the size of a derived dataset count as a cost and not as a benefit?

    P.S. It is said that 7143 hours of computation per day were saved (Section 4.1), but I couldn't find the computation required for caching itself and for the operations needed to support caching.
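
    To make the "add a constant" suggestion concrete, here is a Python sketch; the ratio is only a stand-in for the paper's formula, and GRACE and BONUS are hypothetical tuning constants:

        GRACE = 3600.0   # hypothetical grace window (seconds) for fresh entries
        BONUS = 5        # hypothetical head-start on the use count

        def cb_ratio(size, age, uses, recreate_cost):
            # A freshly cached entry has had no time to accumulate uses, so
            # a plain size * age / (uses * recreate_cost) ranking can evict
            # it before it demonstrates any benefit; crediting it with a few
            # phantom uses during a grace window gives it a fighting chance.
            if age < GRACE:
                uses += BONUS
            return (size * age) / (max(uses, 1) * recreate_cost)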
