Doing $k$-means with intermediate points
The other day tom7 posed a little problem that seemed like a rather
natural mild generalization of $k$-means, which turned out to in fact have a
nice $k$-meansy sort of solution. Here's how it goes:
Suppose you have some data, and you know that it's structured as clusters, but the clusters
themselves have some structure. For example, maybe you know that your data has clusters
near three points $a$, $b$, and $c$,
but it also has clusters near the pairwise midpoints of those points,
$(a+b)/2, (b+c)/2,$ and $(a+c)/2$. That is, suppose you want to find
the best $a, b, c$ such that the sum of squared distances over all
data points to their nearest 'center' (i.e. one of $a, b, c, (a+b)/2,
(b+c)/2,$ and $(a+c)/2$) is minimized.
In general, suppose we're given $n$, and are supposed to choose points $P_1, \ldots, P_n \in \R^d$.
We're also given $m$, the total number of 'centers' that we bucket datapoints into,
and for every $i \in 1\ldots n$ and $j \in 1\ldots m$, we get a $\lambda_{ij}$ which
is the coefficient in the linear sum that says how much influence the chosen point $P_i$
has on the $j^{th}$ center. For the example above, $n$ is $3$ and $m$ is $6$, and
the $\lambda$s go like
\[
\begin{array}{c|ccc}
 & \lambda_{1j} & \lambda_{2j} & \lambda_{3j} \\
\hline
j=1 & 1 & 0 & 0 \\
j=2 & 0 & 1 & 0 \\
j=3 & 0 & 0 & 1 \\
j=4 & 1/2 & 1/2 & 0 \\
j=5 & 0 & 1/2 & 1/2 \\
j=6 & 1/2 & 0 & 1/2 \\
\end{array}
\]
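To make the setup concrete, here's a minimal numpy sketch of that table (variable names are my own, nothing from the original demo); each center is just the corresponding linear combination of the chosen points:

```python
import numpy as np

# lam[i, j] = how much chosen point P_i contributes to center j (the table above)
lam = np.array([
    [1,   0,   0,   1/2, 0,   1/2],   # lambda_{1j}
    [0,   1,   0,   1/2, 1/2, 0  ],   # lambda_{2j}
    [0,   0,   1,   0,   1/2, 1/2],   # lambda_{3j}
])  # shape (n, m) = (3, 6)

# Given chosen points P of shape (n, d), center j is sum_i lam[i, j] * P_i,
# so all m centers at once are lam.T @ P, of shape (m, d).
P = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # an example a, b, c in R^2
centers = lam.T @ P  # rows: a, b, c, (a+b)/2, (b+c)/2, (a+c)/2
```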
If we're doing a $k$-means-like algorithm, then we'll alternate between:
1. Assigning data points to the center that they're nearest to, and
2. Finding the best place to move our $P_i$ to, assuming the assignment of data points to centers from the previous step.
Since (1) is exactly the same as what $k$-means does, what I'm going to describe is deriving what
exactly we need to do in step (2).
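For reference, step (1) can be done exactly as in ordinary $k$-means; here's a small sketch, assuming data `X` of shape `(N, d)` and the `lam`/`P` layout from the snippet above:

```python
def assign_to_centers(X, P, lam):
    """Step (1): assign each data point to its nearest center."""
    centers = lam.T @ P                                            # (m, d)
    # squared distance from every data point to every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, m)
    return d2.argmin(axis=1)                                       # (N,)
```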
Suppose for each center $j \in 1\ldots m$, we have $p_j$ data points assigned to it,
and that they're named $x_{jk}$ for $k \in 1\ldots p_j$. The error we want to minimize
is
\[ E = \sum_{j=1}^m\sum_{k = 1}^{p_j} \left|\left|x_{jk}-\sum_{i=1}^n P_i\lambda_{ij}\right|\right|^2 \]
\[ = \sum_{j=1}^m\sum_{k = 1}^{p_j} \sum_{\ell = 1}^d \left(x_{jk\ell}-\sum_{i=1}^n P_{i\ell}\lambda_{ij}\right)^2 \]
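The same objective as a short code sketch, reusing the assignment array produced by the snippet above:

```python
def error(X, P, lam, assign):
    """Sum of squared distances from each data point to its assigned center."""
    centers = lam.T @ P            # (m, d)
    return ((X - centers[assign]) ** 2).sum()
```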
To minimize, we set to zero all
the partial derivatives with respect to things we have any control over, namely
all the coordinates of the points $P_i$.
\[ \forall \bar i, \bar\ell: \qquad 0 = {\partial E \over \partial P_{\bar i\bar\ell}} \]
\[ = {\partial \over \partial P_{\bar i\bar\ell}} \sum_{j=1}^m\sum_{k = 1}^{p_j} \sum_{\ell = 1}^d \left(x_{jk\ell}-\sum_{i=1}^n P_{i\ell}\lambda_{ij}\right)^2 \]
\[ = \sum_{j=1}^m\sum_{k = 1}^{p_j} \sum_{\ell = 1}^d 2 \left(x_{jk\ell}-\sum_{i=1}^n P_{i\ell}\lambda_{ij}\right) {\partial \over \partial P_{\bar i\bar\ell}} \left(x_{jk\ell}-\sum_{i=1}^n P_{i\ell}\lambda_{ij}\right) \]
\[ = -2 \sum_{j=1}^m\sum_{k = 1}^{p_j} \left(x_{jk\bar\ell}-\sum_{i=1}^n P_{i\bar\ell}\lambda_{ij}\right) \lambda_{\bar i j} \]
hence
\[ 0 = \sum_{j=1}^m\sum_{k = 1}^{p_j} \left(\lambda_{\bar i j}x_{jk\bar\ell}-\lambda_{\bar i j}\sum_{i=1}^n P_{i\bar\ell}\lambda_{ij}\right) \]
\[ \sum_{j=1}^m\sum_{k = 1}^{p_j} \lambda_{\bar i j}\sum_{i=1}^n P_{i\bar\ell}\lambda_{ij} = \sum_{j=1}^m\sum_{k = 1}^{p_j} \lambda_{\bar i j}x_{jk\bar\ell} \]
\[ \sum_{j=1}^m{p_j} \lambda_{\bar i j}\sum_{i=1}^n P_{i\bar\ell}\lambda_{ij} = \sum_{j=1}^m\sum_{k = 1}^{p_j} \lambda_{\bar i j}x_{jk\bar\ell} \]
\[ \sum_{i=1}^n P_{i\bar\ell}\sum_{j=1}^m{p_j} \lambda_{\bar i j}\lambda_{ij} = \sum_{j=1}^m\sum_{k = 1}^{p_j} \lambda_{\bar i j}x_{jk\bar\ell} \]
And if, for each fixed coordinate $\bar\ell$, we set $M_{\bar i i} = \sum_{j=1}^m{p_j} \lambda_{\bar i j}\lambda_{ij}$
and $w_{\bar i} = \sum_{j=1}^m\sum_{k = 1}^{p_j} \lambda_{\bar i j}x_{jk\bar\ell}$
and $v_{i} = P_{i\bar\ell}$, then we're really demanding
that
\[ M v = w\]
so (assuming $M$ is invertible) we can just solve for $v$ by inverting $M$:
\[ v = M^{-1} w\]
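Concretely, then, step (2) builds $M$ and $w$ from the current assignment and solves an $n \times n$ linear system. Since $M$ doesn't depend on the coordinate $\bar\ell$, the same matrix can be reused for all $d$ coordinates at once. Here's a sketch under the same assumptions as above (it uses `lstsq` rather than a literal inverse, since $M$ can be singular, e.g. if some $P_i$ only influences centers with no data assigned):

```python
def update_points(X, lam, assign):
    """Step (2): choose new P minimizing E for a fixed assignment.

    Solves M @ P = W, where
      M[i', i]   = sum_j p_j * lam[i', j] * lam[i, j]
      W[i', ell] = sum_j sum_{x assigned to j} lam[i', j] * x_ell,
    i.e. the system above, for all coordinates ell simultaneously.
    """
    n, m = lam.shape
    d = X.shape[1]
    p = np.bincount(assign, minlength=m)   # p_j: number of points assigned to center j
    M = (lam * p) @ lam.T                  # (n, n)
    S = np.zeros((m, d))
    np.add.at(S, assign, X)                # S[j] = sum of the data points assigned to j
    W = lam @ S                            # (n, d)
    # lstsq tolerates a singular M; np.linalg.solve(M, W) works when M is invertible
    return np.linalg.lstsq(M, W, rcond=None)[0]
```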
Hastily cobbled-together demo:
Empirically, it does seem to fail to converge to the 'real' solution
every once in a while.
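For completeness, here's roughly how the pieces above fit together into the alternation (my own toy sketch, not the demo); the occasional failure to land on the 'real' solution is presumably the usual $k$-means story of getting stuck in a local minimum, for which random restarts are the standard remedy:

```python
def fit(X, lam, iters=100, seed=0):
    """Alternate steps (1) and (2), starting from randomly chosen data points as P."""
    rng = np.random.default_rng(seed)
    n = lam.shape[0]
    P = X[rng.choice(len(X), size=n, replace=False)]  # random initialization
    for _ in range(iters):
        assign = assign_to_centers(X, P, lam)
        P = update_points(X, lam, assign)
    return P
```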