In conjunction with CCGrid 2014 - 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 26-29, 2014, Chicago, IL, USA
Detection and identification of biological targets such as DNA, proteins, and diseased human cells is crucial for early disease diagnosis and prognosis. The key to differentiating healthy cells from diseased cells lies in their biophysical properties, which differ significantly. Micro- and nanosystems such as solid-state micropores and nanopores can measure these properties of human cells and DNA and translate them into electrical spikes. Nonetheless, such approaches result in large data streams that are often plagued by inherent noise and baseline wander. Moreover, existing detection approaches are tedious, time-consuming, and error-prone, and no error-resilient software exists that can analyze large datasets instantly. The ability to effectively process large datasets and detect biological targets in them depends on automated and accelerated data processing strategies using state-of-the-art distributed systems. To this end, we propose a distributed detection framework that collects the raw data stream on a server node, which then splits the data into segments and distributes them across the worker nodes. Each worker node reduces noise in its assigned data segment using moving-average filtering and detects electrical spikes by comparing the samples against a threshold based on the statistics of the data (mean and standard deviation). Our proposed framework detects cancer cells with an accuracy of 63% in a mixture of cancer cells, red blood cells (RBCs), and white blood cells (WBCs), and achieves a speedup of 6X over a single-node machine, processing 10 gigabytes of raw data on an 8-node cluster in less than a minute.
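
The per-segment processing described above (split, moving-average filtering, statistical thresholding) can be illustrated with a minimal single-machine sketch. It is not the framework's actual implementation. The function names, the trailing window of 3 samples, the multiplier `k = 3.0`, and the use of the absolute deviation from the mean are all assumptions made for illustration, since the abstract does not specify them. The segments are processed sequentially here as a stand-in for the worker nodes.

```python
from statistics import mean, pstdev


def split_segments(data, n_workers):
    """Split data into n_workers nearly equal contiguous segments (hypothetical helper)."""
    size, extra = divmod(len(data), n_workers)
    segments, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        segments.append(data[start:end])
        start = end
    return segments


def moving_average(segment, window=3):
    """Trailing moving average; the first window-1 outputs average over fewer samples."""
    smoothed, running = [], 0.0
    for i, x in enumerate(segment):
        running += x
        if i >= window:
            running -= segment[i - window]  # drop the sample that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def detect_spikes(segment, k=3.0):
    """Return indices deviating from the segment mean by more than k standard deviations.

    The multiplier k and the two-sided test are assumptions, not values from the paper.
    """
    mu, sigma = mean(segment), pstdev(segment)
    return [i for i, x in enumerate(segment) if abs(x - mu) > k * sigma]


def process_segment(segment, window=3, k=3.0):
    """Work done by one (simulated) worker: denoise, then threshold."""
    return detect_spikes(moving_average(segment, window), k)


def detect_all(data, n_workers=8, window=3, k=3.0):
    """Sequential stand-in for the server distributing segments to workers."""
    return [process_segment(seg, window, k) for seg in split_segments(data, n_workers)]
```

Because each segment is filtered and thresholded independently in this sketch, a spike that straddles a segment boundary may be smoothed or thresholded differently than it would be in the unsplit stream. How the framework handles such boundaries is not stated in the abstract.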