Workshop on

Clusters, Clouds and Grids for Life Sciences

CCGrid Life 2022

The 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 16–19, 2022, Taormina, Italy

Abstracts: May 16, the University of Messina

Welcome and Introduction by the Workshop Chairs

A Secure Workflow for Shared HPC Systems

Hendrik Nolte, Simon Hernan Sarmiento Sabater, Tim Ehlers and Julian Kunkel

Driven by the progress of data- and compute-intensive methods in various scientific domains, there is an increasing demand from researchers working with highly sensitive data for access to the computational resources needed to adapt those methods to their respective fields. Satisfying these computing needs cost-effectively by integrating reliable security measures into existing High Performance Computing (HPC) clusters remains an open problem. The fundamental difficulty in working securely with sensitive data is that HPC systems are shared systems that are typically tuned for maximum performance, not for high security. For instance, no additional virtualization techniques are commonly employed, so users typically have access to the host operating system. Since new vulnerabilities are continuously being discovered, relying solely on traditional Unix permissions is not secure enough. In this paper, we discuss a generic and secure workflow that can be implemented on typical HPC systems, allowing users to transfer, store, and analyze sensitive data. In our experiments, we observe an advantage in the asynchronous execution of I/O requests, reaching 80% of the ideal performance.
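The asynchronous I/O execution highlighted above can be illustrated with a minimal sketch: transfer requests are queued and serviced by a background worker, so the submitting process does not block on each transfer. All names and paths below are hypothetical; this is not the authors' implementation.

```python
import asyncio

# Minimal sketch of asynchronous I/O request handling in the spirit
# of the abstract. All names and paths are hypothetical.

async def io_worker(queue: asyncio.Queue) -> None:
    """Service queued transfer requests in the background."""
    while True:
        src, dst = await queue.get()
        # Placeholder for the actual secure transfer (e.g. staging
        # data into a protected scratch area).
        await asyncio.sleep(0.1)  # simulate the transfer
        print(f"transferred {src} -> {dst}")
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(io_worker(queue))
    # The caller enqueues requests and continues immediately,
    # instead of blocking on each transfer.
    for i in range(3):
        await queue.put((f"/ingest/file{i}", f"/secure/file{i}"))
    await queue.join()  # wait until all requests are serviced
    worker.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```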

The EMPAIA Platform: Vendor-neutral integration of AI applications into digital pathology infrastructures

Christoph Jansen, Klaus Strohmenger, Daniel Romberg, Tobias Küster, Nick Weiss, Björn Lindequist, Michael Franz, André Homeyer and Norman Zerbe

Automated image analysis and artificial intelligence (AI) are a growing market in digital pathology. While various proprietary pathology systems exist, there are no fully vendor-agnostic integration approaches for AI apps. This makes it difficult for vendors of AI solutions to integrate their products into the multitude of non-standard software systems in pathology. The EMPAIA Consortium (EcosysteM for Pathology Diagnostics with AI Assistance) develops an open and decentralized platform allowing AI-based apps of different vendors to be integrated with existing lab IT infrastructures. This is intended to lower the barriers to entry for AI vendors and provide pathologists with access to advanced AI tools. The EMPAIA platform is based on web technologies that can be deployed both on-premises and in the cloud. There are open-source reference implementations for core platform services that can be integrated with or replaced by proprietary alternatives as long as they conform to open API specifications. Apps can be obtained through a central marketplace so pathologists can use them in their daily workflow. In this paper, we provide an overview of the EMPAIA platform architecture. We identify critical use cases and requirements for AI-based software platforms in pathology and explain how these are fulfilled by the EMPAIA platform. Finally, we evaluate the efficiency of routing image data through the platform.
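As a rough illustration of how an AI app might pull image data through a platform's HTTP API, the sketch below requests a single slide tile from a hypothetical endpoint. The base URL, paths, and fields are invented for this sketch and do not reproduce the actual EMPAIA API specification.

```python
import requests

# Hypothetical illustration of an AI app requesting a slide tile
# through a platform HTTP API. Endpoint paths and fields are
# invented for this sketch and are NOT the actual EMPAIA API.

BASE_URL = "https://platform.example.org/v1"  # hypothetical
SLIDE_ID = "slide-123"                        # hypothetical

def fetch_tile(level: int, x: int, y: int) -> bytes:
    """Fetch one image tile at the given pyramid level and position."""
    resp = requests.get(
        f"{BASE_URL}/slides/{SLIDE_ID}/tiles/{level}/{x}/{y}",
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # raw tile bytes, e.g. JPEG

if __name__ == "__main__":
    tile = fetch_tile(level=0, x=0, y=0)
    print(f"received {len(tile)} bytes")
```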

On the building of self-adaptable systems to efficiently manage medical data

Genaro Sanchez-Gallegos, Dante Sánchez-Gallegos, Jose Luis Gonzalez-Compean and Jesus Carretero

The systems that meet non-functional requirements (NFRs) are key for e-health services to cope with events such as service outages and violations of confidentiality. However, traditional NFR systems suffer from efficiency issues which, when they arise at runtime, can affect e-health decision-making processes. This paper presents a dynamic parallel pattern construction model for designing and creating efficient, self-adaptable NFR systems. These patterns involve two design phases: in the first, designers build NFR systems by creating pipelines that include as many applications as required to meet the NFRs established by organizations. In the second phase, the pipelines are automatically added to a dynamic pattern in the form of workers to improve the performance of the NFR systems. The dynamic patterns include services that recursively and automatically perform performance diagnosis and modify the parallel version of the NFR system to cope, at runtime, with changes in the incoming workload. A prototype was implemented to create self-adaptable NFR systems, which were used in a case study to manage spirometry studies, tomography images, and electrocardiograms. The evaluation showed the effectiveness of this dynamic pattern model for creating self-adaptable systems that process multiple types of health data. It also revealed that the self-adaptable NFR systems built with dynamic patterns yielded significant performance gains in a direct comparison with NFR application pipelines built with the traditional Jenkins framework.
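The second design phase, in which pipeline replicas are added as workers in response to the incoming workload, can be sketched as a simple feedback loop. The thresholds and names below are illustrative assumptions, not the authors' implementation.

```python
import queue
import threading
import time

# Minimal sketch of a self-adapting worker pool: pipeline replicas
# are added when the backlog grows, mimicking the dynamic-pattern
# idea at a very small scale. Thresholds and names are illustrative.

tasks: "queue.Queue[int]" = queue.Queue()

def pipeline_worker() -> None:
    """One replica of the NFR pipeline processing queued studies."""
    while True:
        tasks.get()
        time.sleep(0.05)  # placeholder for the real pipeline
        tasks.task_done()

def monitor(max_workers: int = 8, backlog_per_worker: int = 10) -> None:
    """Feedback loop: spawn another replica when the backlog grows."""
    workers = 1
    threading.Thread(target=pipeline_worker, daemon=True).start()
    while True:
        if tasks.qsize() > workers * backlog_per_worker and workers < max_workers:
            threading.Thread(target=pipeline_worker, daemon=True).start()
            workers += 1
            print(f"scaled up to {workers} workers")
        time.sleep(0.5)

if __name__ == "__main__":
    threading.Thread(target=monitor, daemon=True).start()
    for i in range(200):  # simulated burst of incoming studies
        tasks.put(i)
    tasks.join()
```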

SparkFlow: Towards High-Performance Data Analytics for Spark-based Genome Analysis

Rosa Filgueira, Feras Awaysheh, Adam Carter, Darren White and Omer Rana

The recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load-balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-the-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (provisioning on demand). Our results demonstrate an inevitable trade-off between the targeted applications and the processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while keeping deployment complexity low. The paper's findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-Performance Data Analytics (HPDA) genome analysis realm.
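To give a flavour of the data-level parallelism described above, the sketch below partitions a set of per-sample inputs across a Spark cluster and applies a placeholder processing step to each element. It assumes a working PySpark installation; the input names and the processing step are illustrative and do not reproduce SparkFlow itself.

```python
from pyspark.sql import SparkSession

# Sketch of data-level parallelism in the spirit of the abstract:
# per-sample inputs are distributed across executors and processed
# independently. The processing step is a placeholder, not GATK.

spark = SparkSession.builder.appName("ngs-sketch").getOrCreate()
sc = spark.sparkContext

samples = [f"sample_{i}.bam" for i in range(16)]  # hypothetical inputs

def process(path: str) -> str:
    # Placeholder for a real step such as BaseRecalibrator/ApplyBQSR,
    # which would typically be invoked inside a Singularity container.
    return f"processed {path}"

results = sc.parallelize(samples, numSlices=8).map(process).collect()
for line in results:
    print(line)

spark.stop()
```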

Multi-bed stitching tool for 3D computed tomography accelerated by GPU devices

Javier Garcia Blas, Pablo Brox, Manuel Desco and Monica Abella

In computed tomography (CT) systems it is common to make several consecutive tomographic acquisitions at different bed positions, which are subsequently combined to enlarge the field of view in the longitudinal direction. For this combination, geometric calibration of the bed motion is necessary to avoid double edges in the stitched area. This calibration is performed periodically using a specific manual procedure. This work presents a novel correlation-based automatic bed-stitching tool for CT. Our approach exploits the massive parallelism offered by GPUs and features an optimized memory model that allows large volumes to be stitched in near-real time. Evaluation on rodent studies demonstrates not only that the implementation is able to stitch tomographic studies in reduced time, but also that it reduces the memory footprint.
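The core of correlation-based stitching can be illustrated in a few lines: given the overlapping slabs of two bed positions, the longitudinal shift is taken as the lag that maximizes their cross-correlation. The NumPy sketch below shows the 1D idea on mean axial intensity profiles; it is a conceptual illustration under simplifying assumptions, not the GPU implementation described in the paper.

```python
import numpy as np

# Conceptual sketch: estimate the longitudinal shift between two
# overlapping bed positions from the cross-correlation of their
# mean axial intensity profiles. Not the paper's GPU implementation.

def axial_profile(volume: np.ndarray) -> np.ndarray:
    """Mean intensity per slice along the bed (z) axis."""
    return volume.mean(axis=(1, 2))

def estimate_shift(vol_a: np.ndarray, vol_b: np.ndarray) -> int:
    """Lag maximizing the cross-correlation of the two profiles."""
    a = axial_profile(vol_a)
    b = axial_profile(vol_b)
    a, b = a - a.mean(), b - b.mean()
    corr = np.correlate(a, b, mode="full")
    return int(corr.argmax() - (len(b) - 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.random((64, 32, 32))
    shifted = np.roll(base, 5, axis=0)  # simulate a 5-slice offset
    print("estimated shift:", estimate_shift(shifted, base))  # expect 5
```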

Evaluating the spread of the Omicron COVID-19 variant in Spain

Miguel Guzman Merino, Maria Cristina Marinescu and David E. Singh

This work analyzes the propagation of Omicron, a highly transmissible COVID-19 variant, across Spain by means of EpiGraph, an agent-based parallel simulator that reproduces COVID-19 propagation over wide areas. We consider a population of 19,574,086 individuals from the 63 most populated cities of Spain over the time interval between May 15, 2021 and March 6, 2022. The main variants existing at the start of the simulation were Alpha and Delta, accounting for 4% and 96% of the existing infections, respectively. Then, during the second half of November 2021, the Omicron variant appeared in Spain. Due to the higher transmissibility of this new variant (about 2 times that of Delta), it quickly spread through all the cities and became the dominant strain in the country. In this work we analyze the propagation of this variant under multiple conditions. First, we define a baseline scenario that reproduces the actual conditions of COVID-19 propagation in Spain for this period. Then, we consider alternative scenarios with different initial locations for the Omicron outbreak. Finally, for each of these scenarios, we evaluate different transportation intensities, i.e., different levels of movement of individuals between the cities. The main conclusion of this work is that, independently of the initial location of the Omicron variant and the existing transportation conditions, Omicron spreads through the whole country in a short time interval.
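The effect of a transmission advantage like Omicron's (roughly twice Delta's) can be seen even in a toy two-variant model. The sketch below, which is emphatically not EpiGraph, iterates a discrete SIR-style system in which the newly seeded variant's transmission rate is double the resident variant's; all parameter values are illustrative.

```python
import numpy as np

# Toy two-variant SIR-style model (NOT EpiGraph): a new variant with
# twice the transmission rate is seeded into a population where a
# resident variant dominates, and quickly takes over.

N = 1_000_000                    # population size (illustrative)
beta = np.array([0.25, 0.50])    # transmission rates: resident, new
gamma = 0.2                      # recovery rate, shared

I = np.array([10_000.0, 10.0])   # resident widespread, new just seeded
S = N - I.sum()                  # susceptible individuals

for day in range(120):
    new_inf = beta * I * S / N   # new infections per variant
    S -= new_inf.sum()
    I += new_inf - gamma * I
    if day % 20 == 0:
        share = I[1] / I.sum()
        print(f"day {day:3d}: new-variant share of infections {share:.1%}")
```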

Portability for GPU-accelerated molecular docking applications for cloud and HPC: can portable compiler directives provide performance across all platforms?

Mathialakan Thavappiragasam, Wael Elwasif and Ada Sedova

High-throughput structure-based screening of drug-like molecules has become a common tool in biomedical research. Recently, acceleration with graphics processing units (GPUs) has provided a large performance boost for molecular docking programs. Both cloud and high-performance computing (HPC) resources have been used for large screens with molecular docking programs; while NVIDIA GPUs have dominated cloud and HPC resources, new vendors such as AMD and Intel are now entering the field, creating the problem of software portability across different GPUs. Ideally, software productivity could be maximized with portable programming models that are able to maintain high performance across architectures. While in many cases compiler directives have been used as an easy way to offload parallel regions of a CPU-based program to a GPU accelerator, they may also be an attractive programming model for providing portability across different GPU vendors, in which case the porting process may proceed in the reverse direction: from low-level, architecture-specific code to higher-level directive-based abstractions. MiniMDock is a new mini-application (miniapp) designed to capture the essential computational kernels found in molecular docking calculations, such as those used in pharmaceutical drug discovery efforts, in order to test different solutions for porting across GPU architectures. Here we extend MiniMDock to GPU offloading with OpenMP directives and compare its performance to that of kernels using CUDA and HIP on both NVIDIA and AMD GPUs, as well as across different compilers, exploring performance bottlenecks. We document this reverse-porting process from highly optimized device code to a higher-level version using directives, compare code structures, and describe barriers that were overcome in this effort.

Workshop Closing: Lessons Learned and Future Perspectives

Workshop Chairs