Changing details as following:
1. Remove the local variable 'out_m' in 'CLIP_SH' and store the result in
source vector.
2. Refine the implementation of macro 'CLIP_SH_0_255' and 'CLIP_SW_0_255'.
Performance of VP8 decoding has speed up about 1.1%(from 7.03x to 7.11x).
Performance of H264 decoding has speed up about 0.5%(from 4.35x to 4.37x).
Performance of Theora decoding has speed up about 0.7%(from 5.79x to 5.83x).
3. Remove redundant macro 'CLIP_SH/Wn_0_255_MAX_SATU' and use 'CLIP_SH/Wn_0_255'
instead, because there are no difference in the effect of this two macros.
Reviewed-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>