In Google's tests, a Haswell Xeon E5-2699 v3 processor with 18 cores running at 2.3 GHz using 64-bit floating point math units was able to handle 1.3 Tera Operations Per Second (TOPS) and delivered 51 GB/sec of memory bandwidth; the Haswell chip consumed 145 watts and its system (which had 256 GB of memory) consumed 455 watts when it was busy.
The TPU, by comparison, used 8-bit integer math and, with access to 256 GB of host memory plus 32 GB of its own memory, delivered 34 GB/sec of memory bandwidth on the card and processed 92 TOPS, a factor of 71X more throughput on inferences, in a 384 watt thermal envelope for the server that hosted the TPU.
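For anyone who wants to check the 71X figure, here is a quick back-of-the-envelope sketch using only the numbers quoted above (92 / 1.3 comes out to roughly 71). The per-watt line is my own naive division of the throughput figures by the server power figures, not something the article states, and it is not a like-for-like comparison: one side is 8-bit integer math and the other is 64-bit floating point, and the quote does not say how much hardware each server-level power figure covers.

```python
# Figures as quoted above; treated as reported, not independently measured.
cpu_tops = 1.3            # Haswell Xeon E5-2699 v3, 64-bit floating point
tpu_tops = 92.0           # TPU, 8-bit integer
cpu_server_watts = 455.0  # Haswell server when busy
tpu_server_watts = 384.0  # server hosting the TPU

# Raw throughput ratio: the "71X" claim.
throughput_ratio = tpu_tops / cpu_tops

# Naive operations-per-watt at server level (an assumption-laden estimate).
cpu_tops_per_watt = cpu_tops / cpu_server_watts
tpu_tops_per_watt = tpu_tops / tpu_server_watts
per_watt_ratio = tpu_tops_per_watt / cpu_tops_per_watt

print(f"throughput ratio: {throughput_ratio:.1f}x")
print(f"naive per-watt ratio: {per_watt_ratio:.1f}x")
```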
Despite its multitude of matrix multiplication units, the TPU does not have any stored program; it simply executes instructions sent from the host. The DRAM on the TPU is operated as one unit in parallel because of the need to fetch so many weights to feed to the matrix multiplication unit.
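To make the "8-bit integer math" part concrete, here is a minimal sketch of what an 8-bit matrix-vector multiply for inference can look like. This is only an illustration and not Google's implementation: the function names, the shared power-of-two scale, and the tiny 2x2 weight matrix are all my own assumptions. The idea is that weights and inputs are stored as small integers, multiplied and accumulated in wider integers, and scaled back to real values at the end.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers using a shared scale (assumed scheme)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_matvec(weights_q, inputs_q):
    """Multiply an int8 weight matrix by an int8 vector, accumulating in wide integers."""
    return [sum(w * x for w, x in zip(row, inputs_q)) for row in weights_q]


# Hypothetical weights and inputs, chosen so the quantization is exact.
weights = [[0.5, -0.25], [1.0, 0.75]]
inputs = [0.5, -1.0]
scale = 1 / 64  # assumed scale for both weights and inputs

weights_q = [quantize(row, scale) for row in weights]
inputs_q = quantize(inputs, scale)

# Integer-only multiply-accumulate, then one rescale back to real values.
acc = int8_matvec(weights_q, inputs_q)
outputs = [a * scale * scale for a in acc]

# Floating point reference for comparison.
reference = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
print(outputs, reference)
```

With real weights the quantized result only approximates the floating point one; the values here were picked so the two match exactly.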
These seem to execute code similarly to a GPU, but are a lot more efficient at specific tasks, having been designed with weak/narrow AI in mind.
This is quite the amazing "processor" they have there, thanks for sharing this!