Clustering Without Limits

Affinity propagation clusters lots of different kinds of data better and faster than other methods

Starting in preschool we all learn how to get organized. Typically, we start with pre-determined categories (dolls, trains, blocks); pre-set ideas about what belongs in each category (Barbie: doll; Thomas the Tank Engine: train) and a fixed number of bins to put things in.

 

If asked to cluster facial images, a standard clustering method (k-means clustering) would take up to a million years on a single computer to achieve the accuracy achieved by affinity propagation after five minutes.But what if you started with none of those initial limitations? Could you still group the toys? It turns out that, in a computer, such sorting is not only possible, but extremely efficient. Using a novel algorithm called affinity propagation, researchers at the University of Toronto found that they can not only cluster lots of different kinds of data appropriately, but do it better and faster than other methods. The work was published in the February 16 issue of Science.

 

“Almost all existing techniques work on a hypothesis refinement basis: they start off with a set of assumed groups and iteratively refine them,” says Brendan Frey, PhD, associate professor of electrical and computer engineering at the University of Toronto, co-author of the paper. “To our knowledge, ours is the first algorithm to consider all possible groupings at once.”

 

The task sounds mind-boggling: There are a huge number of possible groupings. But affinity propagation handles that problem by sending messages between data points—pairwise—so as to maximize the net similarity in each group. “Each message encapsulates or summarizes a whole distribution of possible groupings for one of the data points,” says Delbert Dueck, a PhD candidate in Frey’s lab. “No one has done that before.”

 

Frey and Dueck use affinity propagation to cluster data around “exemplars”— data points that best represent their compatriots. In this graphic, after starting with an equal chance of serving as an exemplar, candidates for that job have already emerged (red dots). Each data point sends messages to each candidate exemplar conveying how well it represents the blue point compared to other candidate exemplars. And candidate exemplars send messages conveying their availability to serve as an exemplar for particular data points.Affinity propagation is based on an algorithm called belief propagation, which has been around in various incarnations for many years. But, say the authors, it’s an approach that has never been applied to clustering. “Certainly not to generic clustering of any type of data,” says Dueck. Indeed the algorithm is so generic that Frey and Dueck used it to analyze gene expression data, facial images, and airline routes, while other researchers have found applications in basketball statistics, the stock market and computer vision. And many tasks in computational biology require a computer to organize the data before using it to make predictions.

 

“Part of the attraction of the algorithm is that, although it was complicated to derive, it’s quite simple to implement and to get an intuitive feel for it,” says Frey. There are basically only two equations to it. “Sometimes we’ll give a talk and get emails from people who’ve implemented it the day after,” he says.

 

When the researchers looked at how well the algorithm performed compared to other clustering methods they found it remarkably efficient. “A problem our algorithm could solve in about five minutes on one computer would take other methods up to one million years to solve on that same computer,” says Frey.

 

Tim Hughes, PhD, of the Center for Cellular and Biomolecular Research at the University of Toronto, is considering using affinity propagation in his research. “It seems like it would do best when things really do form independent groups, and when the data are fairly sparse, so most of the correlation matrix can be dropped in early cycles,” he says. “I think it will work well with exon-profiling data or genome-tiling data, where there is also a constraint that the groups have to correspond to regions near each other on the chromosome.”



All submitted comments are reviewed, so it may be a few days before your comment appears on the site.

Post new comment

The content of this field is kept private and will not be shown publicly.
CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
Image CAPTCHA
Enter the characters shown in the image.