I wonder what might be a good way to efficiently poll 21 texels.
The answer is that the efficient way is the one that does not poll 21 texels. Sorry to state the obvious, but mobile devices may not have the bus width needed to support such kernels. You need to optimize by reducing the size of the texture bound to the sampler, so that the texture cache covers a larger kernel radius.
Also, you could forget about your disk kernel and use a two-pass algorithm: one pass with a purely vertical kernel and another with a purely horizontal one. This way you go from "2D" to "1D", so to speak, which drastically reduces the number of samples and improves cache performance thanks to linear access. Note that the result is no longer an exact disk, since a disk kernel is not separable.
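To make the saving concrete, here is a minimal CPU-side sketch in Python (not shader code). It assumes the 21-texel disk fits in a 5x5 footprint (radius 2) and swaps in a plain box kernel, which is separable; the function names and the clamp-to-edge addressing are my own choices for illustration, not something from the question.

```python
def clamp(i, n):
    """Clamp index i to [0, n - 1], mimicking clamp-to-edge addressing."""
    return max(0, min(n - 1, i))


def blur_2d(img, radius):
    """Single pass: every output pixel reads (2r+1)^2 texels."""
    h, w = len(img), len(img[0])
    taps = (2 * radius + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += img[clamp(y + dy, h)][clamp(x + dx, w)]
            out[y][x] = total / taps
    return out


def blur_1d(img, radius, horizontal):
    """One 1D pass: every output pixel reads only 2r+1 texels."""
    h, w = len(img), len(img[0])
    taps = 2 * radius + 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for d in range(-radius, radius + 1):
                if horizontal:
                    total += img[y][clamp(x + d, w)]
                else:
                    total += img[clamp(y + d, h)][x]
            out[y][x] = total / taps
    return out


def blur_separable(img, radius):
    """Horizontal pass followed by a vertical pass on its result."""
    return blur_1d(blur_1d(img, radius, True), radius, False)


def taps_per_pixel(radius):
    """Texel reads per output pixel: (single 2D pass, two 1D passes)."""
    side = 2 * radius + 1
    return side * side, 2 * side
```

With radius 2, `taps_per_pixel(2)` gives 25 reads for the single pass against 10 for the two passes combined, and for a box kernel both versions produce the same image up to floating-point rounding. On a GPU the two passes also cost an intermediate render target, so whether it wins in practice is something to measure.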
Vertical fetches should not hurt cache performance, thanks to the Z-order layout in which textures should be arranged in GPU memory. See http://en.wikipedia.org/wiki/Z-order_curve
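As a rough illustration of why Z-order helps vertical access, here is a small Python sketch of the Morton index from that Wikipedia page. The function names and the 1024-texel row width are assumptions for the example; real GPU layouts are vendor-specific and usually tiled variants, so take this as the idea rather than the actual hardware addressing.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit sits between each of them."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_index(x, y):
    """Z-order index: bits of x in even positions, bits of y in odd ones."""
    return part1by1(x) | (part1by1(y) << 1)


def linear_index(x, y, width):
    """Plain row-major index, for comparison."""
    return y * width + x


# Distance in memory between a texel and the one directly below it.
WIDTH = 1024  # hypothetical texture width
row_major_step = linear_index(0, 1, WIDTH) - linear_index(0, 0, WIDTH)
z_order_step = morton_index(0, 1) - morton_index(0, 0)
```

Here `row_major_step` is 1024 while `z_order_step` is 2, so the vertical neighbour is far more likely to sit in the same cache line. The step is not always this small (it jumps when crossing a power-of-two boundary), but on average 2D neighbours stay close in memory.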