CAMPAIGN: Expanding the Universe for Clustering Algorithms
Speedups produced by a C++ library of Clustering Algorithms for Massively Parallel Architectures Including GPU Nodes
GenBank, a repository for storing biological sequences, currently contains some 124 billion base pairs and is doubling in size every 18 months.1 Though not a huge number compared to the billion trillion stars in the universe, a human being would clearly have difficulty making sense of it. Clustering algorithms can aid us in this effort, partitioning the data into groups. However, these algorithms can take a long time to run. A natural solution to speed up the process is to implement them on parallel architectures, but typically these do not work well with large, high-dimensional data sets. CAMPAIGN, a C++ library of Clustering Algorithms for Massively Parallel Architectures Including GPU Nodes, was created to fill that gap.
“What would normally take an hour on a CPU, you can now do on a GPU in a second,” says Kai Kohlhoff, PhD, a Simbios postdoctoral fellow who developed CAMPAIGN. “That’s really exciting! Think about what kind of science you can do now.”
For CAMPAIGN, Kohlhoff took five popular clustering algorithms—K-means, K-medoids, K-centers, hierarchical clustering, and self-organizing maps (neural networks)—and implemented them to run on both standard CPUs as well as on GPUs, or graphics processing units. The software library also includes several distance metrics, such as Euclidean and Manhattan, which can be mixed and matched with the clustering algorithms.
Kohlhoff modified the algorithms so that calculations could be run in parallel, taking advantage of the GPU architecture. CAMPAIGN’s K-means algorithm can handle millions of data points with tens of thousands of dimensions (the coordinate system axes that describe the data set) and large numbers of clusters. “For a gene data set, you have a dimensionality with thousands of dimensions, but previous K-means implementations for GPUs reach at most a dimensionality of around 50 before performance starts suffering,” says Kohlhoff. “That’s a big limitation and that’s what nobody had addressed before.” By solving that problem, CAMPAIGN expands the research problems in which clustering can play a role.
Kohlhoff achieved the best results with K-means, which ran up to 2800 times faster on the GPU than on a CPU using MATLAB, a popular numerical software package, for his test case. The performance varied with the algorithm and the size of the data set. For instance, the speedup for hierarchical clustering when compared with fast C++ CPU code was five-fold, which was nevertheless noteworthy: “It is the first time, as far as I know, that you have full hierarchical clustering on a GPU, and not just the initial pair-wise distance computations,” Kohlhoff says.
These faster algorithms can allow researchers to be more experimental and innovative. “Imagine if it takes you an hour to cluster your data,” says Kohlhoff, “Now you can just say, ‘Maybe this will work.’ Then, try it out in an instance.”
Kohlhoff envisions researchers using CAMPAIGN to develop novel clustering protocols, as well as to improve their clustering results. With computational time being less of an issue, researchers can explore clustering using different sets of parameters or run the algorithms with additional iterations. They could even try combining several different clustering algorithms that traditionally do not run very fast.
Clustering plays a key role in the initial steps of the Pande lab’s Markov State Model software tool, MSMBuilder (see Biomedical Computation Review, Fall 2009), so they have started using CAMPAIGN to improve their models. “Clustering is often the rate-limiting step,” says Vijay Pande, PhD, associate professor of chemistry at Stanford University. “The ability to speed clustering would have a fundamental impact in our ability to build better models.”
CAMPAIGN currently runs on NVIDIA GPU cards using NVIDIA’s programming language, CUDA. Coding algorithms in CUDA can have a big payoff in terms of performance, but it is not an easy task. That’s why the availability of libraries like CAMPAIGN is significant. “I have come to appreciate how much work it is to program in CUDA, so I understand how much value there is in sharing these kinds of libraries,” says Kohlhoff.
1GenBank Release Notes, February 15, 2011
CAMPAIGN is open-source and is available for download from http://simtk.org/home/campaign.