This fixes two issues preventing suncc from building this code.
The undocumented 'a' operand modifier, causing gcc to omit a $ in
front of immediate operands (as required in addresses), is not
supported by suncc. Luckily, the also undocumented 'c' modifer
has the same effect and is supported.
On some asm statements with a large number of operands, suncc for no
obvious reason fails to correctly substitute some of the operands.
Fortunately, some of the operands in these statements are plain
numbers which can be inserted directly into the code block instead
of passed as operands.
With these changes, the code builds correctly with both gcc and
suncc.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register.
There is a surprisingly large performance improvement over the c version (more
so than the generated assembly seems to suggest) just in get_cabac, I measured
roughly 40% faster for get_cabac on a K8. However, overall the difference is
not that big, I measured roughly 5% on a test clip on a K8 and a Core2.
Hopefully it still compiles on x86 32bit...
Now that only one table is used, there's some chance even darwin as compiles
this (apparently the label arithmetic used previously doesn't work if it
involves symbols defined in a different file, thanks to Ronald S. Bultje for
helping me with this).
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.
The assembly uses the new table but still won't work for PIC case.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register. get_cabac() gets about 40% faster, for
an overall speedup of about 5%.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.
The assembly uses the new table but still won't work for PIC case.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register.
There is a surprisingly large performance improvement over the c version (more
so than the generated assembly seems to suggest) just in get_cabac, I measured
roughly 40% faster for get_cabac on a K8. However, overall the difference is
not that big, I measured roughly 5% on a test clip on a K8 and a Core2.
Hopefully it still compiles on x86 32bit...
v2: incorporated feedback from Loren Merritt to avoid rip-relative movs
for every table, and got rid of unnecessary @GOTPCREL.
v3: apply similar fixes to the the decode_significance functions, and use
same macro arguments for non-pic case.
v4: prettify inline asm arguments, add a non-fast-cmov version (as I expect
the c code to be faster otherwise since both cmov and sbb suck hard on a
Prescott, even can't construct the mask with a 64bit shift as that's just as
terrible - it's quite difficult to find usable instructions on that chip...).
This is tested to work but not on a P4, in theory it _should_ be fast there.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Author: Mans Rullgard <mans@mansr.com>
Date: Sun Dec 11 21:41:59 2011 +0000
x86: cabac: replace explicit memory references with "m" operands
This replaces the explicit offset(reg) memory references with
"m" operands for the same locations. As a result, one fewer
register operand is needed for these inline asm statements.
This change appears to have broken compilation on darwin, and subsequent
fixes by martin (which did not fix compilation) removed the register
advantage, thus this change seems not a good idea to keep.
See: http://fate.ffmpeg.org/log.cgi?time=20120103122446&log=compile&slot=i386-darwin-llvm-gcc-4.2.1
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
The change in 599b4c6ef didn't turn out to work properly on
i386 on OS X, where it broke building with PIC enabled.
Signed-off-by: Martin Storsjö <martin@martin.st>
(cherry picked from commit f1dba9e498)
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
The change in 599b4c6ef didn't turn out to work properly on
i386 on OS X, where it broke building with PIC enabled.
Signed-off-by: Martin Storsjö <martin@martin.st>
This replaces the explicit offset(reg) memory references with
"m" operands for the same locations. As a result, one fewer
register operand is needed for these inline asm statements.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Inspection of compiled code shows gcc handles these fine on its own.
Benchmarking also shows no measurable speed difference.
Removing the remaining cases in get_cabac_bypass_sign_x86() does
cause more substantial changes to the compiled code with uncertain
impact.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Some operands need to be accessed in byte mode, which restricts the
available registers in 32-bit mode. Using the 'q' constraint selects
a suitable register.
Signed-off-by: Mans Rullgard <mans@mansr.com>