There are two implementations here:
- a generic scalable one processing two columns at a time,
- a specialised one processing one (fixed-size) row at a time.

Unsurprisingly, the generic one works out better with smaller widths. With larger widths, the gains from filling vectors are outweighed by the extra cost of strided loads and stores. In other words, memory accesses become the bottleneck.

T-Head C908:
h264_weight2_8_c:         54.5
h264_weight2_8_rvv_i32:   13.7
h264_weight4_8_c:        101.7
h264_weight4_8_rvv_i32:   27.5
h264_weight8_8_c:        197.0
h264_weight8_8_rvv_i32:   75.5
h264_weight16_8_c:       385.0
h264_weight16_8_rvv_i32:  74.2

SpacemiT X60:
h264_weight2_8_c:         48.5
h264_weight2_8_rvv_i32:    8.2
h264_weight4_8_c:         90.7
h264_weight4_8_rvv_i32:   16.5
h264_weight8_8_c:        175.0
h264_weight8_8_rvv_i32:   37.7
h264_weight16_8_c:       342.2
h264_weight16_8_rvv_i32:  66.0
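For reference, the operation these routines vectorise is H.264 explicit weighted prediction. A minimal scalar sketch follows, assuming 8-bit samples weighted in place; the function and parameter names are illustrative, not FFmpeg's actual C reference implementation.

#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* width is 2, 4, 8 or 16; the block is weighted in place. */
static void weight_h264_pixels(uint8_t *block, ptrdiff_t stride, int width,
                               int height, int log2_denom, int weight, int offset)
{
    /* rounding term from the spec: 1 << (logWD - 1), or 0 when logWD == 0 */
    int round = log2_denom ? 1 << (log2_denom - 1) : 0;

    for (int y = 0; y < height; y++, block += stride)
        for (int x = 0; x < width; x++)
            block[x] = clip_u8(((block[x] * weight + round) >> log2_denom) + offset);
}

In the generic variant the vector spans rows within a column pair (hence the strided loads and stores), while the specialised variant fills a vector from one contiguous row.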
parent 85706f5136
commit 3606e592ea
2 changed files with 90 additions and 0 deletions