In conjunction with CCGrid 2015 - 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 4-7, 2015, Shenzhen, Guangdong, China
Programs may generate different results depending on the computing system where they are executed. Consequently, distributed applications are often restricted to a few computing sites so that their results are not compromised by numerical variations. We propose algorithms to increase the use of infrastructures while controlling the risk created by numerical variations. Our algorithms rely on a classification of the computing sites that is updated regularly depending on (i) the number of observed numerical differences, and (ii) the number of tasks to execute. We simulate our algorithms on a 40-site infrastructure using SimGrid, in three different configurations. Results show that for one of the configurations, our algorithm speeds up the execution by a factor of about 4 in 10% of the cases. For the remaining cases and the other two configurations, we show that sites cannot be aggregated unless a high risk of numerical variations is tolerated. We conclude that site classifications are a promising approach to handle numerical variability among computing sites. Results could be further improved by integrating more a priori information in the classifications.
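As an illustration only, the idea of a site classification driven by observed numerical differences can be sketched as follows. This is not the algorithm evaluated in the paper: the class name `SiteClassifier`, the `risk_tolerance` threshold, the greedy grouping rule, and the choice to treat unobserved site pairs as fully risky are all assumptions made for this sketch. It covers only criterion (i); the number of tasks to execute, criterion (ii), is not modeled.

```python
from collections import defaultdict


class SiteClassifier:
    """Hypothetical bookkeeping of pairwise numerical differences between sites."""

    def __init__(self, sites, risk_tolerance):
        self.sites = list(sites)
        # Assumed knob: highest tolerated fraction of differing results.
        self.risk_tolerance = risk_tolerance
        # Sorted (site_a, site_b) pair -> [comparisons, differences].
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, site_a, site_b, differs):
        """Store one comparison of the same task executed on two sites."""
        key = tuple(sorted((site_a, site_b)))
        self.stats[key][0] += 1
        self.stats[key][1] += int(differs)

    def difference_rate(self, site_a, site_b):
        """Observed fraction of differing results for a pair of sites."""
        key = tuple(sorted((site_a, site_b)))
        compared, differed = self.stats.get(key, (0, 0))
        # Assumption: a pair that was never compared is treated as risky.
        return differed / compared if compared else 1.0

    def classify(self):
        """Greedily group sites whose pairwise difference rates are tolerable."""
        classes = []
        for site in self.sites:
            for group in classes:
                if all(self.difference_rate(site, member) <= self.risk_tolerance
                       for member in group):
                    group.append(site)
                    break
            else:
                # No existing class is compatible: open a new one.
                classes.append([site])
        return classes


clf = SiteClassifier(["A", "B", "C"], risk_tolerance=0.1)
for _ in range(10):
    clf.record("A", "B", differs=False)
clf.record("A", "C", differs=True)
clf.record("B", "C", differs=True)
classes = clf.classify()
```

In this toy run, sites A and B always agree and end up in the same class, while C is isolated. Calling `classify()` again after further `record()` calls corresponds to the regular update of the classification, and raising `risk_tolerance` lets more sites be aggregated at the cost of a higher risk of numerical variations.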