# How can I identify groups of people from interaction data?

This post is based on a question I answered on Stackoverflow. Even though mine is the accepted answer, I am not very happy with it, and I have this feeling that there is a better way. I can envision someone who understands graph theory and SQL well being able to put together a few magical joins to solve this problem ;-).

So, you have data on a bunch of people interacting through various ways. For our purposes, we care only about the person initiating the interaction and the recipient. We want to find all sets of people such that everyone in the group interacted with everyone else in the group, with the additional constraint that there be at least three people in a group.

The interactions naturally form a directed graph. When I first read the question (more information was provided by the OP in iterations along the way), I thought the groups would be just the strongly connected components of the graph. However, being in a strongly connected component is a weaker condition than being in a set of people all of whom interacted with each other: Say there are three individuals, A, B, and C. Suppose we have A → B, B → A, B → C, and C → B, where X → Y means X initiated an interaction with Y. Then, there is a path from A to C and a path from C to A, but A and C never interacted and this set of people does not satisfy our definition of a group.

The only thing I was able to come up with was to reduce the set of possible groups by first looking at the strongly connected components of the graph of interactions, and then checking to see if they satisfied our definition of a group. If a such component did not satisfy the requirements, I then considered all subsets of side N - 1 where N is the number of people in the component. If I could not find any groups in any of those sets, then my solution went down to subsets of size N - 2 etc …

This method seems to work in the test case the OP provided and I am pretty sure it is correct, but, as I said at the outset, I strongly suspect there is a better way.

Here is the code, including an embedded test case provided with the OP:

#!/usr/bin/env perl

use strict;
use warnings;

use Graph;
use Algorithm::ChooseSubsets;

use constant MIN_SIZE => 3;

my \$interactions = Graph->new(
directed => 1,
);

while (my \$interaction = <DATA>) {
last unless \$interaction =~ /\S/;
my (\$from, \$to) = split ' ', \$interaction, 3;

}

my @groups = map {
is_group(\$interactions, \$_) ? \$_
: check_subsets(\$interactions, \$_)
} grep @\$_ >= MIN_SIZE, \$interactions->strongly_connected_components;

print "Groups: \n";
print "[ @\$_ ]\n" for @groups;

sub check_subsets {
my (\$graph, \$candidate) = @_;

my @groups;
for my \$size (reverse MIN_SIZE .. (@\$candidate - 1)) {
my \$subsets = Algorithm::ChooseSubsets->new(
set => \$candidate,
size => \$size,
);

my \$groups_found;
while (my \$subset = \$subsets->next) {
if (is_group(\$interactions, \$subset)) {
++\$groups_found;
push @groups, \$subset;
}
}
last if \$groups_found;
}

return @groups;
}

sub is_group {
my (\$graph, \$candidate) = @_;

for my \$member (@\$candidate) {
for my \$other (@\$candidate) {
next if \$member eq \$other;
return unless \$graph->has_edge(\$member, \$other);
return unless \$graph->has_edge(\$other, \$member);
}
}

return 1;
}

__DATA__
a   c   Dec  2 06:40:23 IST 2000    comment
f   g   Dec  2 06:40:23 IST 2009    like
c   a   Dec  2 06:40:23 IST 2009    like
g   h   Dec  2 06:40:23 IST 2008    like
a   d   Dec  2 06:40:23 IST 2008    like
r   t   Dec  2 06:40:23 IST 2007    share
d   a   Dec  2 06:40:23 IST 2007    share
t   u   Dec  2 06:40:23 IST 2006    follow
a   e   Dec  2 06:40:23 IST 2006    follow
k   l   Dec  2 06:40:23 IST 2009    like
e   a   Dec  2 06:40:23 IST 2009    like
j   k   Dec  2 06:40:23 IST 2003    like
c   d   Dec  2 06:40:23 IST 2003    like
l   j   Dec  2 06:40:23 IST 2002    like
d   c   Dec  2 06:40:23 IST 2002    like
m   n   Dec  2 06:40:23 IST 2005    like
c   e   Dec  2 06:40:23 IST 2005    like
m   l   Dec  2 06:40:23 IST 2011    like
e   c   Dec  2 06:40:23 IST 2011    like
h   j   Dec  2 06:40:23 IST 2010    like
d   e   Dec  2 06:40:23 IST 2010    like
o   p   Dec  2 06:40:23 IST 2009    like
e   d   Dec  2 06:40:23 IST 2009    like
p   q   Dec  2 06:40:23 IST 2000    comment
q   p   Dec  2 06:40:23 IST 2009    like
a   p   Dec  2 06:40:23 IST 2008    like
p   a   Dec  2 06:40:23 IST 2007    share
l   p   Dec  2 06:40:23 IST 2003    like
j   l   Dec  2 06:40:23 IST 2002    like
t   r   Dec  2 06:40:23 IST 2000    comment
r   h   Dec  2 06:40:23 IST 2009    like
j   f   Dec  2 06:40:23 IST 2008    like
g   d   Dec  2 06:40:23 IST 2007    share
w   q   Dec  2 06:40:23 IST 2003    like
o   y   Dec  2 06:40:23 IST 2002    like
x   y   Dec  2 06:40:23 IST 2000    comment
y   x   Dec  2 06:40:23 IST 2009    like
x   z   Dec  2 06:40:23 IST 2008    like
z   x   Dec  2 06:40:23 IST 2007    share
y   z   Dec  2 06:40:23 IST 2003    like
z   y   Dec  2 06:40:23 IST 2002    like

The output should be:

Groups:
[ x z y ]
[ e c a d ]