Workshop on Clusters, Clouds and Grids for Life Sciences

In conjunction with CCGrid 2019 - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 14–17, 2019, Larnaca, Cyprus

Abstracts: May 14, 9:00h-12:30h at the Golden Bay Beach Hotel

Welcome and Introduction by the Workshop Chairs

Towards a generic provenance model for Life Science designed as a European Open Science Cloud service

Isabelle Perseil

Dr. Isabelle Perseil heads the Computational Science Coordination and e-infrastructures at INSERM (http://cvscience.aviesan.fr/cv/857/isabelle-perseil). She manages a small group of experts that provides best practices in software engineering, data management, big data, machine learning, HPC, grids, cloud computing and parallel computing to 300 research units (1,200 research teams). The Computational Science Coordination works with 13 regional administrations and 23 regional mesocenters in France to pool computational resources (grids and HPC) and to train engineers and researchers in HPC (OpenMP, MPI and now ORWL) and big data (MapReduce, Hadoop, Spark, Flink, Storm). Isabelle Perseil taught UML at Ecole Centrale de Paris for 15 years and supervises theses in the domain of parallel computing. She was recently elected to the Technical Advisory Board of the RDA. She is involved in many European research projects, among them EOSC-Life, an INFRAEOSC-04 European project headed by ELIXIR, in which she co-leads WP6 (FAIRification and provenance services).

Exploiting stream parallelism of MRI reconstruction using GrPPI over multiple back-ends

Javier Garcia Blas, David Del Río Astorga, Jose Daniel Garcia and Jesus Carretero.

In recent years, on-line processing of data streams has become established as a major computing paradigm. This is mainly due to two reasons: first, ever more data is generated in near real-time and must be processed as it arrives; second, efficient parallel applications are needed to keep up with these rates. These demands pose a tough challenge for traditional data-analysis techniques, which have been forced to evolve towards a streaming perspective. In this work we present an experimental study of a stream-aware, multi-staged application implemented using GrPPI, a generic and reusable parallel pattern interface for C++ applications. We demonstrate the benefits of using this interface in terms of programmability, performance, and scalability.
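GrPPI itself is a C++ library, so the following is only a minimal, language-agnostic sketch in Python of the multi-staged streaming structure the abstract describes (produce a stream of frames, apply a per-frame kernel in parallel, collect the results). The stage names and the placeholder "reconstruct" step are illustrative assumptions, not the authors' actual MRI reconstruction kernels or the GrPPI API.

from concurrent.futures import ThreadPoolExecutor
import queue
import threading

def producer(out_q, n_frames=8):
    """Stage 1: emit a stream of raw 'frames' (here: plain integers)."""
    for i in range(n_frames):
        out_q.put(i)
    out_q.put(None)  # end-of-stream marker

def reconstruct(frame):
    """Stage 2: placeholder for the per-frame reconstruction kernel."""
    return frame * frame

def pipeline():
    q = queue.Queue(maxsize=4)  # bounded queue provides back-pressure
    threading.Thread(target=producer, args=(q,), daemon=True).start()
    with ThreadPoolExecutor(max_workers=4) as pool:  # parallel ("farm") stage
        futures = []
        while True:
            frame = q.get()
            if frame is None:
                break
            futures.append(pool.submit(reconstruct, frame))
        results = [f.result() for f in futures]  # Stage 3: consumer
    return results

if __name__ == "__main__":
    print(pipeline())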

Reproducibility and Performance of Deep Learning Applications for Cancer Detection in Pathological Images

Christoph Jansen, Bruno Schilling, Klaus Strohmenger, Michael Witt, Jonas Annuscheit and Dagmar Krefting.

Convolutional Neural Networks (CNNs) have been shown to be successful at automatic cancer detection in pathological images. However, such data-driven experiments are difficult to reproduce, because the CNN code may require CUDA-enabled Nvidia GPUs for acceleration and the training is performed on large datasets that are often stored on a researcher's local computer, inaccessible to others. We introduce the RED file format for reproducible experiment description, where executable programs are packaged as containerized applications and referenced as Docker container images. Data inputs and outputs are described in terms of network resources using standard transmission and authentication protocols instead of local file paths. Following the FAIR guiding principles, the RED format is based on and compatible with the established Common Workflow Language command-line tool specification. RED files can be interpreted by the accompanying Curious Containers (CC) software. Arbitrarily large datasets are mounted inside running Docker containers via FUSE network filesystems like SSHFS. This approach relies heavily on network bandwidth for CNN training time. We benchmarked SSHFS against a local SSD and NFS in terms of filesystem read speeds and of a CNN training scenario, both as sequential and parallel workloads in a compute cluster with two 10 Gb/s interfaces. In our CNN training scenario, network file access via SSHFS increases the execution time by a factor of 1.8 compared to a local SSD. We are convinced that RED and CC can greatly improve the reproducibility of deep learning workloads and data-driven experiments. This is particularly important in clinical scenarios, where the result of an analysis may contribute to a patient's treatment.
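To illustrate the pattern the abstract describes, the sketch below expresses a RED-style experiment description as a Python dict: the executable is referenced as a Docker image, and data inputs/outputs are referenced as network resources with authentication instead of local file paths. All field names ("container", "inputs", "connector", etc.) and values are hypothetical placeholders; the actual RED schema defined by the authors may differ.

import json

experiment = {
    # executable packaged as a containerized application
    "container": {"engine": "docker", "image": "example.org/cnn-train:1.0"},
    # inputs are network resources, mounted at runtime (e.g. via SSHFS/FUSE)
    "inputs": {
        "training_data": {
            "connector": "sshfs",
            "url": "ssh://storage.example.org/datasets/pathology",
            "auth": {"username": "alice", "privateKeyFile": "~/.ssh/id_rsa"},
        }
    },
    # outputs are likewise pushed to a network location, not a local path
    "outputs": {
        "model": {
            "connector": "ssh",
            "url": "ssh://storage.example.org/results/model.h5",
        }
    },
}

print(json.dumps(experiment, indent=2))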

Enabling Large Scale Data Production for OpenDose with GATE on the EGI Infrastructure

Maxime Chauvin, Gilles Mathieu, Sorina Camarasu-Pop, Axel Bonnet, Manuel Bardiès and Isabelle Perseil

The OpenDose collaboration has been established to generate an open and traceable reference database of dosimetric data for nuclear medicine, using a variety of Monte Carlo codes. The amount of data to generate requires running tens of thousands of simulations per anthropomorphic model, for a total computation time estimated at millions of CPU hours. To tackle this challenge, a project has been initiated to enable large-scale data production with the Monte Carlo code GATE. Within this project, CRCT, Inserm CISI and CREATIS worked on developing solutions to run GATE simulations on the EGI grid infrastructure using existing tools such as VIP and GateLab. These developments include a new GATE grid application deployed on VIP, modifications to the existing GateLab application, and the development of client code that uses a REST API to drive both. The developed tools are now in production and have already allowed running 30% of the GATE simulations for the first two models (adult male and adult female). On-going and future work includes improvements to both the code and the submission strategies, definition and implementation of a long-term storage strategy, extension to other models, and generalisation of the tools to the other Monte Carlo codes used within the OpenDose collaboration.
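The following is a hypothetical sketch of the kind of client code the abstract mentions: submitting GATE simulations and polling their status through a REST API. The base URL, endpoint paths, header, and parameter names below are illustrative assumptions only; they are not the actual VIP or GateLab API.

import requests

API_URL = "https://vip.example.org/rest"  # placeholder base URL
API_KEY = "my-secret-key"                 # placeholder credential

def submit_gate_simulation(macro_file: str, model: str) -> str:
    """Submit one GATE simulation and return its execution identifier."""
    resp = requests.post(
        f"{API_URL}/executions",
        headers={"apikey": API_KEY},
        json={
            "pipelineIdentifier": "GATE",  # hypothetical pipeline name
            "inputValues": {"macro": macro_file, "model": model},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["identifier"]

def poll_status(execution_id: str) -> str:
    """Return the current status of a previously submitted execution."""
    resp = requests.get(
        f"{API_URL}/executions/{execution_id}",
        headers={"apikey": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["status"]

if __name__ == "__main__":
    exec_id = submit_gate_simulation("adult_male_organ_042.mac", "AM")
    print("submitted:", exec_id, "status:", poll_status(exec_id))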

On distributed collaboration for biomedical analyses

Fatima-Zahra Boujdad, Alban Gaignard, Mario Südholt, Wilmer Garzón Alfonso, Luis Daniel Benavides Navarro and Richard Redon

Cooperation between research groups is nowadays common for the development and execution of biomedical analyses. Multiple partners contribute data in this context, data that is often centralized for processing on some cluster-based or supercomputer-based infrastructure. In contrast, truly distributed collaboration that involves processing data from several partners at different sites is rare. However, such distributed analyses are often highly desirable, in particular for scalability, security and privacy reasons. In this article, we motivate the need for truly distributed biomedical analyses in the context of several on-going projects, including the I-CAN project that involves 34 French hospitals and affiliated research groups. We present a set of distributed architectures for such analyses that we have derived from discussions with four different medical research groups and a study of related work. These architectures allow scalability, security/privacy and reproducibility concerns to be taken into account. Finally, we illustrate how such architectures can be implemented with specific tools from computer science and bioinformatics.

Towards a Science Gateway for Bioinformatics: Experiences in the Brazilian System of High Performance Computing

Kary Ocaña, Marcelo Galheigo, Carla Osthoff, Luiz Gadelha, Antônio Tadeu A. Gomes, Daniel de Oliveira, Fabio Porto and Ana Tereza Vasconcelos

Science gateways bring the possibility of reproducible science, as they integrate reusable techniques, data and workflow management systems, security mechanisms, and high performance computing (HPC). We introduce BioinfoPortal, a science gateway that executes bioinformatics applications using HPC and data management resources provided by the Brazilian National HPC System (SINAPAD). BioinfoPortal follows the Software as a Service (SaaS) model and is freely available for academic use. Overall, this paper investigates some of the HPC features of the BioinfoPortal gateway and analyzes their impact. We analyze the scalability of RAxML on HPC clusters, as well as general features of application executions (dataset, software parameters, machine capacity and efficiency), using machine learning techniques to predict the effective allocation and usage of computational resources for the gateway. The machine-learning strategies identify the best machine setup in a heterogeneous environment for application executions that achieve at least 75% efficiency.
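A minimal sketch, on synthetic data, of the kind of machine-learning step the abstract describes: learning from past executions which machine setup reaches the 75% efficiency threshold, then choosing a setup for a new run. The feature names, the random data, and the regressor choice are illustrative assumptions, not the authors' SINAPAD execution logs or their actual model.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Features per past execution: [n_sequences, n_cores, cores_per_node]
X = rng.uniform([100, 8, 8], [5000, 256, 32], size=(200, 3))
# Synthetic "parallel efficiency" target in [0, 1] (made up for illustration)
y = np.clip(1.2 - X[:, 1] / 300 + rng.normal(0, 0.05, 200), 0, 1)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Candidate setups for a new run with 2000 sequences on 24-core nodes
candidates = np.array([[2000, c, 24] for c in (24, 48, 96, 192)])
pred = model.predict(candidates)

# Keep only setups predicted to reach the 75% efficiency threshold,
# then pick the one using the most cores among the efficient ones.
efficient = [(c, p) for c, p in zip(candidates, pred) if p >= 0.75]
best = max(efficient, key=lambda cp: cp[0][1]) if efficient else None
print("predicted efficiencies:",
      dict(zip(candidates[:, 1].astype(int), pred.round(2))))
print("chosen setup:", best)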

Towards data intensive aware programming models for Exascale systems

Javier Garcia Blas

Extreme Data is an incarnation of the Big Data concept distinguished by the massive amounts of data that must be queried, communicated and analyzed in (near) real-time by using a very large number of memory/storage elements and Exascale computing systems. Immediate examples are scientific data produced at a rate of hundreds of gigabits per second that must be stored, filtered and analyzed, the millions of images per day that must be mined (analyzed) in parallel, and the billion social data posts queried in real-time against an in-memory database. Traditional disks or commercial storage cannot handle the extreme scale of such application data today. Addressing the need to improve current concepts and technologies, ASPIDE's activities focus on data-intensive applications running on systems composed of up to millions of computing elements (Exascale systems). Practical results will include the methodology and software prototypes that will be designed and used to implement Exascale applications. The ASPIDE project will contribute the definition of new programming paradigms, APIs, runtime tools and methodologies for expressing data-intensive tasks on Exascale systems, which can pave the way for the exploitation of massive parallelism over a simplified model of the system architecture, promoting high performance and efficiency, and offering powerful operations and mechanisms for processing extreme data sources at high speed and/or in real-time.

Big Data Analytics Exploration of Green Space and Mental Health in Melbourne

Richard Sinnott and Ying Hu

Numerous researchers have shown that urban green space, e.g. parks and gardens, is positively associated with health and general well-being. However, these works are typically based on surveys that have many limitations related to sample size and questionnaire design. Social media offers the possibility to systematically assess how human emotion is affected by access to green space at a far larger scale that is more representative of society. In this paper, we use Twitter data to explore the relationship between green space and human emotion (sentiment). We consider the relationship between Twitter sentiment and green space in the suburbs of Melbourne and consider the impact of socio-economic and related demographic factors. We develop a linear model to explore the extent to which access to green space affects the sentiment of tweeters.
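The sketch below, on synthetic data, shows the general shape of the linear model the abstract describes: suburb-level mean tweet sentiment regressed on green-space share while controlling for a socio-economic covariate. The variable names, coefficients and data are made up for illustration and are not the Melbourne study's data or results.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_suburbs = 120

green_share = rng.uniform(0.02, 0.45, n_suburbs)   # fraction of suburb area that is green space
seifa_index = rng.normal(1000, 60, n_suburbs)       # hypothetical socio-economic score
sentiment = (0.4 * green_share + 0.0004 * seifa_index
             + rng.normal(0, 0.05, n_suburbs))      # synthetic mean tweet sentiment

# Ordinary least squares: sentiment ~ const + green_share + seifa_index
X = sm.add_constant(np.column_stack([green_share, seifa_index]))
model = sm.OLS(sentiment, X).fit()
print(model.summary(xname=["const", "green_share", "seifa_index"]))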

Discussion: Lessons Learned and Future Perspectives

Moderation: Workshop Chairs (University of Applied Sciences Berlin)