In addition to @KevinReid's answer, which is spot on, you may want to look into compressing that transform data as much as possible.
If you can get away with 10 bits for each of your x, y, z position components, for example, then 3x10 = 30 bits fit into a single 32-bit value passed to the GPU. Compared with a 32-bit float per component (96 bits), that is 3x less bandwidth. If the three floats also get padded to 128 bits (16 bytes) for alignment, or you additionally send a w-coordinate, it is 4x less.
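A minimal CPU-side sketch of that packing, assuming all positions fall within a known range `[lo, hi]` that you pick per scene (or per object). The helper names are hypothetical. Each component is quantized to 10 bits (1024 steps, so the worst-case error is half of `(hi - lo) / 1023`) and the three values are stored in bits 0-9, 10-19 and 20-29 of one unsigned 32-bit integer. The top 2 bits are left unused. `unpack_position` mirrors the shifts and masks the shader would apply.

```python
import struct

BITS = 10
MAX_Q = (1 << BITS) - 1  # 1023, the largest 10-bit value


def quantize(v: float, lo: float, hi: float) -> int:
    """Map v in [lo, hi] to an integer in [0, 1023] (assumes hi > lo)."""
    t = (v - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)  # clamp values outside the assumed range
    return round(t * MAX_Q)


def dequantize(q: int, lo: float, hi: float) -> float:
    """Map a 10-bit integer back to a float in [lo, hi]."""
    return lo + (q / MAX_Q) * (hi - lo)


def pack_position(x: float, y: float, z: float, lo: float, hi: float) -> int:
    """Pack x, y, z into one u32: bits 0-9 = x, 10-19 = y, 20-29 = z."""
    qx, qy, qz = (quantize(c, lo, hi) for c in (x, y, z))
    return qx | (qy << 10) | (qz << 20)


def unpack_position(packed: int, lo: float, hi: float) -> tuple:
    """Inverse of pack_position: shift, mask with 0x3FF, dequantize."""
    return tuple(dequantize((packed >> s) & MAX_Q, lo, hi) for s in (0, 10, 20))


# Compare the byte sizes of the packed value and three f32s
pos = (1.5, -3.25, 7.0)
packed = pack_position(*pos, lo=-10.0, hi=10.0)
as_u32 = struct.pack("<I", packed)   # 4 bytes
as_f32 = struct.pack("<3f", *pos)    # 12 bytes
decoded = unpack_position(packed, lo=-10.0, hi=10.0)
```

Whether 10 bits is enough depends on the size of your world relative to the precision you need. A common trick is to store positions relative to a per-object or per-tile origin so the range `[lo, hi]` stays small.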