In Section 5, the authors compare Comet and Nectar and say that in Comet it is difficult to identify identical code segments across different programs. Can you explain how Nectar deals with this?
While discussing the advantages of a Nectar-managed datacenter, in particular the ease of content management, why is keeping track of the requisite file path information called a source of bugs?
@prudhvireddy In Nectar this is handled by the static program analyzer. How the analyzer works is described in Section 3.1 (Caching Computation), under Cache and Programs.
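A rough way to picture the idea (my own sketch, not code from the paper): if each sub-computation is identified by a fingerprint of its program text combined with the fingerprints of its inputs, then two different programs that contain the same code segment over the same data end up with the same cache key. The names `fingerprint` and `cache_key` are hypothetical, and SHA-256 is only a stand-in for whatever fingerprinting scheme the real system uses.

```python
import hashlib
from typing import Sequence


def fingerprint(data: bytes) -> str:
    """Content fingerprint; SHA-256 is an assumed stand-in here."""
    return hashlib.sha256(data).hexdigest()


def cache_key(program_text: str, input_fingerprints: Sequence[str]) -> str:
    """Combine the program fingerprint with the input fingerprints.

    Hypothetical helper: the key depends only on what is computed and on
    which data, not on which job submitted it.
    """
    h = hashlib.sha256()
    h.update(fingerprint(program_text.encode("utf-8")).encode("ascii"))
    for fp in input_fingerprints:
        h.update(fp.encode("ascii"))
    return h.hexdigest()


# Two jobs that share the same sub-expression over the same input.
shared_segment = "logs.Where(x => x.Status == 500)"
input_fp = fingerprint(b"contents of the input dataset")

key_from_job_a = cache_key(shared_segment, [input_fp])
key_from_job_b = cache_key(shared_segment, [input_fp])
key_other_code = cache_key("logs.Where(x => x.Status == 404)", [input_fp])
```

Here both jobs produce the same key for the shared segment, so the second job could reuse the first job's cached result, while a different segment gets a different key.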
In Section 2.2 the authors state "that actual datasets are stored in the distributed storage system and the datacenter-wide services manipulate the actual datasets by maintaining pointers to them." What happens if a pointer becomes corrupted? How do services detect this, and how is the issue resolved?
Also in Section 2.2 the authors state "programs of all successful computations are uploaded to a dedicated program store in the cluster. Thus, the service has the necessary information about cached results, meaning that it has a recipe to recreate any derived dataset in the datacenter". My question is: how big can a program store in a cluster be? Could the programs of successful computations ever outgrow the store?
Caching derived datasets and replacing data with the computation that produced it can help manage data better. However, the garbage collection described in Section 3.2, which is based on a cost-to-benefit analysis, does not account for how quickly a particular application needs its data back. For example, in medical applications, a large dataset produced by a long computation may need to be kept until an experiment reaches a successful checkpoint. At that point the data from the last checkpoint may be needed immediately. If it was deleted because it was large and had not been used for a long time, recomputing it could take an unacceptably long time.
If a dataset has not been accessed for a long time, it is removed from the datacenter; if it is needed later, the corresponding computation is rerun to regenerate it.
However, does Nectar define a time period for which the computation (program) itself is stored?
E.g., if a dataset is not accessed for years, is the corresponding computation also removed, or is a program never deleted by Nectar once it has been stored for a dataset?
Section 3.2 discusses garbage collection in detail.
The formula used for calculating the CB ratio (in the authors' words) doesn't give freshly created cache entries a chance to demonstrate their usefulness. Can't we add a constant or a new factor to give them a fighting chance?
Also, why is the size of a derived dataset counted as a cost rather than a benefit?
PS: Section 4.1 says 7143 hours of computation per day were saved, but I couldn't find the computation required for caching itself and for the operations needed to support caching.
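To make the CB ratio questions concrete, here is a small sketch based on my reading of Section 3.2, where the ratio is (S × ΔT) / (N × M): S is the size of the derived dataset, ΔT the time since it was last used, N the number of times it was used, and M the machine time needed to produce it. Size sits in the numerator because a larger entry takes up more storage for as long as it is kept, while N and M measure what the cache saves. The `lease_seconds` grace period, the `threshold` value, the `max(N, 1)` guard, and all names below are my assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    name: str
    size_bytes: float       # S: storage the derived dataset occupies
    idle_seconds: float     # delta T: time since the entry was last used
    use_count: int          # N: how many times the entry has been reused
    machine_seconds: float  # M: machine time it would take to recompute
    age_seconds: float      # time since creation (used only for the lease)


def cb_ratio(e: CacheEntry) -> float:
    """Cost-to-benefit ratio; a higher value marks a better eviction candidate."""
    # Assumption: treat never-reused entries as N = 1 to avoid dividing by zero.
    return (e.size_bytes * e.idle_seconds) / (max(e.use_count, 1) * e.machine_seconds)


def eviction_candidates(entries: List[CacheEntry], threshold: float,
                        lease_seconds: float) -> List[str]:
    """Names of entries past their lease whose ratio exceeds the threshold."""
    return [
        e.name
        for e in entries
        if e.age_seconds >= lease_seconds and cb_ratio(e) > threshold
    ]


entries = [
    CacheEntry("big_old", 1e9, 1e6, 1, 100.0, 2e6),
    CacheEntry("small_hot", 1e6, 10.0, 50, 1000.0, 5e5),
    CacheEntry("fresh_large", 1e9, 3600.0, 0, 100.0, 3600.0),
]
candidates = eviction_candidates(entries, threshold=1e9, lease_seconds=86400.0)
```

With these made-up numbers, the fresh but large entry has a high ratio yet is skipped because it is still inside the lease, which is one way to give new entries the fighting chance asked about above.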
Is there a system similar to Nectar implemented for Hadoop?