I have noticed that Automatic Offload (AO) will send ZGEMM to the Xeon Phi, but not ZGEMM3M. This is a bit annoying, because AO can give a significant boost to ZGEMM performance, yet ZGEMM3M is normally faster than ZGEMM.
For example, on my 8-core machine, ZGEMM3M takes 46 s and ZGEMM takes 62 s for an 8192 x 8192 matrix.
With AO to the Phi, ZGEMM takes 12.5 s, but ZGEMM3M is not accelerated by AO at all.
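For context on why ZGEMM3M normally beats ZGEMM without offload: as far as I understand it, the "3M" name refers to forming a complex product from three real matrix products instead of the usual four. Below is a minimal pure-Python sketch of that idea, not MKL's actual implementation. The helper names (`matmul`, `matadd`, `complex_matmul_3m`) are my own, made up for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matadd(a, b, sign=1):
    """Elementwise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complex_matmul_3m(a, b):
    """Complex matrix product using three real matrix products (3M scheme)."""
    # Split both operands into real and imaginary parts.
    ar = [[z.real for z in row] for row in a]
    ai = [[z.imag for z in row] for row in a]
    br = [[z.real for z in row] for row in b]
    bi = [[z.imag for z in row] for row in b]

    # The three real products.
    t1 = matmul(ar, br)
    t2 = matmul(ai, bi)
    t3 = matmul(matadd(ar, ai), matadd(br, bi))

    # Re(C) = t1 - t2, Im(C) = t3 - t1 - t2.
    re = matadd(t1, t2, -1)
    im = matadd(matadd(t3, t1, -1), t2, -1)
    return [[complex(x, y) for x, y in zip(rr, ri)] for rr, ri in zip(re, im)]


# Small check against the direct complex product.
A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
B = [[2 - 1j, 0 + 1j], [1 + 1j, 4 + 0j]]
C_3m = complex_matmul_3m(A, B)
C_direct = matmul(A, B)
```

The two results agree exactly here because the entries are small integers. With general floating-point data, the 3M form can round slightly differently from the direct product.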