The driver is somewhat bitrotten (not updated for years) but is still
usable for decoding with this change. To support it, this adds a new
driver quirk to indicate no support at all for surface attributes.
Based on a patch by wm4 <nfxjfg@googlemail.com>.
The early check for inconsistent in-source vs out-of-source build
cannot generate a config.log otherwise.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
D3D9Ex uses different driver paths. This helps with "headless"
configurations when no user logs in. Plain D3D9 device creation will
fail if no user is logged in, while it works with D3D9Ex.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
This is an extended version of the AVFrame.opaque field, which can be
used to attach arbitrary user information to an AVFrame.
The usefulness of the opaque field is rather limited, because it can
store only up to 32 bits of information (or 64 bit on 64 bit systems).
It's not possible to set this field to a memory allocation, because
there is no way to deallocate it correctly.
The opaque_ref field circumvents this by letting the user set an
AVBuffer, which makes the user data refcounted.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Previously we first calculated hev, and then negated it.
Since we were able to schedule the negation in the middle
of another calculation, we don't see any gain in all cases.
Before: Cortex A7 A8 A9 A53 A53/AArch64
vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7
vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0
After:
vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7
vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7
vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7
vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
Before: Cortex A53
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2
vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3
Signed-off-by: Martin Storsjö <martin@martin.st>
Fold the field lengths into the macro.
This makes the macro invocations much more readable, when the
lines are shorter.
This also makes it easier to use only half the registers within
the macro.
Signed-off-by: Martin Storsjö <martin@martin.st>
The ld1r is a leftover from the arm version, where this trick is
beneficial on some cores.
Use a single-lane load where we don't need the semantics of ld1r.
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 14740 bytes to 24292 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Before:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4
vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6
vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3
vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4
vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3
vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4
vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3
vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3
vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7
After:
vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7
vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8
vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0
vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0
vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3
vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3
vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1
vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4
vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4
vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9
vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5
vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6
vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from
19496 to 14740 bytes.
This gives a small slowdown of a couple of tens of cycles, but makes
it more feasible to add more optimized versions of these transforms.
Before:
vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2
vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7
After:
vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0
vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1
vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9
vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8
Signed-off-by: Martin Storsjö <martin@martin.st>
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.
This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.
Before: Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub4_add_neon: 2063.4 1516.0 1719.5 1245.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3279.3 2454.5 2525.2 1982.3
vp9_inv_dct_dct_32x32_sub4_add_neon: 10750.0 7955.4 8525.6 6754.2
vp9_inv_dct_dct_32x32_sub32_add_neon: 18574.0 17108.4 14216.7 12010.2
After:
vp9_inv_dct_dct_16x16_sub4_add_neon: 2060.8 1608.5 1735.7 1262.0
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.2 2443.5 2546.1 1999.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10682.0 8043.8 8581.3 6810.1
vp9_inv_dct_dct_32x32_sub32_add_neon: 18522.4 17277.4 14286.7 12087.9
Signed-off-by: Martin Storsjö <martin@martin.st>
Fixes all sorts of configuration problems introducec by dad7a9c7c0
on non-Linux or non-vanilla configs. Also removes a line made redundant
in that commit.
This avoids having to count the number of frames sent to the codec
and the number of output packets received; instead just wait until
the encoder returns a buffer with the EOS flag set.
Signed-off-by: Martin Storsjö <martin@martin.st>
This allows distinguishing between the internal variable name for
external libraries and the pkg-config package name. Having both
names available avoids special-casing outside the helper function
when the two identifiers do not match.