# Attention: More Musings

As a reminder, the attention model is a binary classifier which takes matrix valued inputs $X in mathbb{R}^{d times k}$ with $d$ features and $k$ columns, weights (“attends”) to some columns more than others via parameter $v in mathbb{R}^d$, and then predicts with parameter $u in mathbb{R}^d$, [

begin{aligned}

hat y &= mathrm{sgn ;} left( u^top X z right), \

z &= frac{exp left( v^top X_i right)}{sum_k exp left (v^top X_k right) }.

end{aligned}

] I changed the notation slightly from my last post ($w rightarrow u$), the reasons for which will be clear shortly. In the previous post the comparison model was an unconstrained linear predictor on all columns, [

begin{aligned}

hat y &= mathrm{sgn ;} left( w^top mathrm{vec,} (X) right),

end{aligned}

] with $w in mathbb{R}^{d k}$. But this is not a good comparison model because the attention model in nonlinear in ways this model cannot achieve: apples and oranges, really.

This is easier to see with linear attention and a regression task. A linear attention model weights each column according to $(v^top X_i)$, e.g., $(v^top X_i)$ is close to zero for “background” or “irrelevant” stuff and is appreciably nonzero for “foreground” or “relevant” stuff. In that case, [

begin{aligned}

hat y &= u^top X (v^top X)^top = mathrm{tr} left( X X^top v u^top right),

end{aligned}

] (using properties of the trace) which looks like a rank-1 assumption on a full model, [

begin{aligned}

hat y &= mathrm{tr} left( X X^top W right) = sum_{ijk} X_{ik} W_{ij} X_{jk} \

%&= sum_i left( X X^top W right)_{ii} = sum_{ij} left( X X^top right) _{ij} W_{ji} \

%&= sum_{ijk} X_{ik} X_{jk} W_{ji} = sum_{ijk} X_{ik} X_{jk} W_{ij}

end{aligned}

] where $W in mathbb{R}^{d times d}$ and w.l.o.g. symmetric. (Now hopefully the notation change makes sense: the letters $U$ and $V$ are often used for the left and right singular spaces of the SVD.)

The symmetry of $W$ confuses me, because it suggests $u$ and $v$ are the same (but then the prediction is nonnegative?), so clearly more thinking is required. However this gives a bit of insight, and perhaps this leads to some known results about sample complexity.