This saves one register and one instruction per transform. As a result, add16 and add16intra become stack-less.