Merge pull request #1374 from mesonbuild/simd

Add support for SIMD detection
pull/2084/head
Jussi Pakkanen 7 years ago committed by GitHub
commit e89b6cdd10
  1. 12
      docs/markdown/Module-reference.md
  2. 7
      docs/markdown/Release-notes-for-0.42.0.md
  3. 70
      docs/markdown/Simd-module.md
  4. 3
      docs/sitemap.txt
  5. 17
      mesonbuild/compilers/c.py
  6. 52
      mesonbuild/compilers/compilers.py
  7. 4
      mesonbuild/compilers/cpp.py
  8. 6
      mesonbuild/environment.py
  9. 15
      mesonbuild/interpreter.py
  10. 72
      mesonbuild/modules/unstable_simd.py
  11. 15
      run_unittests.py
  12. 8
      test cases/common/155 simd/fallback.c
  13. 43
      test cases/common/155 simd/meson.build
  14. 43
      test cases/common/155 simd/simd_avx.c
  15. 42
      test cases/common/155 simd/simd_avx2.c
  16. 63
      test cases/common/155 simd/simd_mmx.c
  17. 20
      test cases/common/155 simd/simd_neon.c
  18. 29
      test cases/common/155 simd/simd_sse.c
  19. 37
      test cases/common/155 simd/simd_sse2.c
  20. 38
      test cases/common/155 simd/simd_sse3.c
  21. 40
      test cases/common/155 simd/simd_sse41.c
  22. 43
      test cases/common/155 simd/simd_sse42.c
  23. 48
      test cases/common/155 simd/simd_ssse3.c
  24. 93
      test cases/common/155 simd/simdchecker.c
  25. 67
      test cases/common/155 simd/simdfuncs.h

@ -1,4 +1,6 @@
Meson has a selection of modules to make common requirements easy to use. Modules can be thought of like the standard library of a programming language. Currently Meson provides the following modules. Meson has a selection of modules to make common requirements easy to use.
Modules can be thought of like the standard library of a programming language.
Currently Meson provides the following modules.
* [Gnome](Gnome-module.md) * [Gnome](Gnome-module.md)
* [i18n](i18n-module.md) * [i18n](i18n-module.md)
@ -8,3 +10,11 @@ Meson has a selection of modules to make common requirements easy to use. Module
* [Python3](Python-3-module.md) * [Python3](Python-3-module.md)
* [RPM](RPM-module.md) * [RPM](RPM-module.md)
* [Windows](Windows-module.md) * [Windows](Windows-module.md)
In addition there are unstable modules. These are meant for testing new
functionality but note that they do *not* provide a stable API. It can
change in arbitrary ways between releases. The modules might also be removed
without warning in future releases.
* [SIMD](Simd-module.md)

@ -58,3 +58,10 @@ Rust's [linkage reference][rust-linkage].
Both the address- and undefined behavior sanitizers can now be used Both the address- and undefined behavior sanitizers can now be used
simultaneously by passing `-Db_sanitize=address,undefined` to Meson. simultaneously by passing `-Db_sanitize=address,undefined` to Meson.
## Unstable SIMD module
A new experimental module to compile code with many different SIMD
instruction sets and selecting the best one at runtime. This module
is unstable, meaning its API is subject to change in later releases.
It might also be removed altogether.

@ -0,0 +1,70 @@
# Unstable SIMD module
This module provides helper functionality to build code with SIMD instructions.
Available since 0.42.0.
**Note**: this module is unstable. It is only provided as a technology preview.
Its API may change in arbitrary ways between releases or it might be removed
from Meson altogether.
## Usage
This module is designed for the use case where you have an algorithm with one
or more SIMD implementation and you choose which one to use at runtime.
The module provides one method, `check`, which is used like this:
rval = simd.check('mysimds',
mmx : 'simd_mmx.c',
sse : 'simd_sse.c',
sse2 : 'simd_sse2.c',
sse3 : 'simd_sse3.c',
ssse3 : 'simd_ssse3.c',
sse41 : 'simd_sse41.c',
sse42 : 'simd_sse42.c',
avx : 'simd_avx.c',
avx2 : 'simd_avx2.c',
neon : 'simd_neon.c',
compiler : cc)
Here the individual files contain the accelerated versions of the functions
in question. The `compiler` keyword argument takes the compiler you are
going to use to compile them. The function returns an array with two values.
The first value is a bunch of libraries that contain the compiled code. Any
SIMD code that the compiler can't compile (for example, Neon instructions on
an x86 machine) are ignored. You should pass this value to the desired target
using `link_with`. The second value is a `configuration_data` object that
contains true for all the values that were supported. For example if the
compiler did support sse2 instructions, then the object would have `HAVE_SSE2`
set to 1.
Generating code to detect the proper instruction set at runtime is
straightforward. First you create a header with the configuration object and
then a chooser function that looks like this:
void (*fptr)(type_of_function_here) = NULL;
#if HAVE_NEON
if(fptr == NULL && neon_available()) {
fptr = neon_accelerated_function;
}
#endif
#if HAVE_AVX2
if(fptr == NULL && avx2_available()) {
fptr = avx_accelerated_function;
}
#endif
...
if(fptr == NULL) {
fptr = default_function;
}
Each source file provides two functions, the `xxx_available` function to query
whether the CPU currently in use supports the instruction set and
`xxx_accelerated_function` that is the corresponding accelerated
implementation.
At the end of this function the function pointer points to the fastest
available implementation and can be invoked to do the computation.

@ -27,14 +27,15 @@ index.md
Build-options.md Build-options.md
Subprojects.md Subprojects.md
Modules.md Modules.md
Gnome-module.md
i18n-module.md i18n-module.md
Pkgconfig-module.md Pkgconfig-module.md
Python-3-module.md Python-3-module.md
Qt4-module.md Qt4-module.md
Qt5-module.md Qt5-module.md
RPM-module.md RPM-module.md
Simd-module.md
Windows-module.md Windows-module.md
Gnome-module.md
Java.md Java.md
Vala.md Vala.md
IDE-integration.md IDE-integration.md

@ -25,6 +25,8 @@ from .compilers import (
msvc_buildtype_args, msvc_buildtype_args,
msvc_buildtype_linker_args, msvc_buildtype_linker_args,
msvc_winlibs, msvc_winlibs,
vs32_instruction_set_args,
vs64_instruction_set_args,
ClangCompiler, ClangCompiler,
Compiler, Compiler,
CompilerArgs, CompilerArgs,
@ -810,7 +812,7 @@ class VisualStudioCCompiler(CCompiler):
std_warn_args = ['/W3'] std_warn_args = ['/W3']
std_opt_args = ['/O2'] std_opt_args = ['/O2']
def __init__(self, exelist, version, is_cross, exe_wrap): def __init__(self, exelist, version, is_cross, exe_wrap, is_64):
CCompiler.__init__(self, exelist, version, is_cross, exe_wrap) CCompiler.__init__(self, exelist, version, is_cross, exe_wrap)
self.id = 'msvc' self.id = 'msvc'
# /showIncludes is needed for build dependency tracking in Ninja # /showIncludes is needed for build dependency tracking in Ninja
@ -820,6 +822,7 @@ class VisualStudioCCompiler(CCompiler):
'2': ['/W3'], '2': ['/W3'],
'3': ['/W4']} '3': ['/W4']}
self.base_options = ['b_pch'] # FIXME add lto, pgo and the like self.base_options = ['b_pch'] # FIXME add lto, pgo and the like
self.is_64 = is_64
# Override CCompiler.get_always_args # Override CCompiler.get_always_args
def get_always_args(self): def get_always_args(self):
@ -1005,3 +1008,15 @@ class VisualStudioCCompiler(CCompiler):
if not isinstance(args, list): if not isinstance(args, list):
args = [args] args = [args]
return ['/WHOLEARCHIVE:' + x for x in args] return ['/WHOLEARCHIVE:' + x for x in args]
def get_instruction_set_args(self, instruction_set):
if self.is_64:
return vs64_instruction_set_args.get(instruction_set, None)
if self.version.split('.')[0] == '16' and instruction_set == 'avx':
# VS documentation says that this exists and should work, but
# it does not. The headers do not contain AVX intrinsics
# and the can not be called.
return None
return vs32_instruction_set_args.get(instruction_set, None)

@ -228,6 +228,43 @@ base_options = {'b_pch': coredata.UserBooleanOption('b_pch', 'Use precompiled he
True), True),
} }
gnulike_instruction_set_args = {'mmx': ['-mmmx'],
'sse': ['-msse'],
'sse2': ['-msse2'],
'sse3': ['-msse3'],
'ssse3': ['-mssse3'],
'sse41': ['-msse4.1'],
'sse42': ['-msse4.2'],
'avx': ['-mavx'],
'avx2': ['-mavx2'],
'neon': ['-mfpu=neon'],
}
vs32_instruction_set_args = {'mmx': ['/arch:SSE'], # There does not seem to be a flag just for MMX
'sse': ['/arch:SSE'],
'sse2': ['/arch:SSE2'],
'sse3': ['/arch:AVX'], # VS leaped from SSE2 directly to AVX.
'sse41': ['/arch:AVX'],
'sse42': ['/arch:AVX'],
'avx': ['/arch:AVX'],
'avx2': ['/arch:AVX2'],
'neon': None,
}
# The 64 bit compiler defaults to /arch:avx.
vs64_instruction_set_args = {'mmx': ['/arch:AVX'],
'sse': ['/arch:AVX'],
'sse2': ['/arch:AVX'],
'sse3': ['/arch:AVX'],
'ssse3': ['/arch:AVX'],
'sse41': ['/arch:AVX'],
'sse42': ['/arch:AVX'],
'avx': ['/arch:AVX'],
'avx2': ['/arch:AVX2'],
'neon': None,
}
def sanitizer_compile_args(value): def sanitizer_compile_args(value):
if value == 'none': if value == 'none':
return [] return []
@ -755,6 +792,12 @@ class Compiler:
return [] return []
raise EnvironmentException('Language %s does not support linking whole archives.' % self.get_display_language()) raise EnvironmentException('Language %s does not support linking whole archives.' % self.get_display_language())
# Compiler arguments needed to enable the given instruction set.
# May be [] meaning nothing needed or None meaning the given set
# is not supported.
def get_instruction_set_args(self, instruction_set):
return None
def build_unix_rpath_args(self, build_dir, from_dir, rpath_paths, install_rpath): def build_unix_rpath_args(self, build_dir, from_dir, rpath_paths, install_rpath):
if not rpath_paths and not install_rpath: if not rpath_paths and not install_rpath:
return [] return []
@ -933,6 +976,10 @@ class GnuCompiler:
return ['-mwindows'] return ['-mwindows']
return [] return []
def get_instruction_set_args(self, instruction_set):
return gnulike_instruction_set_args.get(instruction_set, None)
class ClangCompiler: class ClangCompiler:
def __init__(self, clang_type): def __init__(self, clang_type):
self.id = 'clang' self.id = 'clang'
@ -983,7 +1030,7 @@ class ClangCompiler:
def has_multi_arguments(self, args, env): def has_multi_arguments(self, args, env):
return super().has_multi_arguments( return super().has_multi_arguments(
['-Werror=unknown-warning-option'] + args, ['-Werror=unknown-warning-option', '-Werror=unused-command-line-argument'] + args,
env) env)
def has_function(self, funcname, prefix, env, extra_args=None, dependencies=None): def has_function(self, funcname, prefix, env, extra_args=None, dependencies=None):
@ -1010,6 +1057,9 @@ class ClangCompiler:
return result return result
return ['-Wl,--whole-archive'] + args + ['-Wl,--no-whole-archive'] return ['-Wl,--whole-archive'] + args + ['-Wl,--no-whole-archive']
def get_instruction_set_args(self, instruction_set):
return gnulike_instruction_set_args.get(instruction_set, None)
# Tested on linux for ICC 14.0.3, 15.0.6, 16.0.4, 17.0.1 # Tested on linux for ICC 14.0.3, 15.0.6, 16.0.4, 17.0.1
class IntelCompiler: class IntelCompiler:

@ -173,10 +173,10 @@ class IntelCPPCompiler(IntelCompiler, CPPCompiler):
class VisualStudioCPPCompiler(VisualStudioCCompiler, CPPCompiler): class VisualStudioCPPCompiler(VisualStudioCCompiler, CPPCompiler):
def __init__(self, exelist, version, is_cross, exe_wrap): def __init__(self, exelist, version, is_cross, exe_wrap, is_64):
self.language = 'cpp' self.language = 'cpp'
CPPCompiler.__init__(self, exelist, version, is_cross, exe_wrap) CPPCompiler.__init__(self, exelist, version, is_cross, exe_wrap)
VisualStudioCCompiler.__init__(self, exelist, version, is_cross, exe_wrap) VisualStudioCCompiler.__init__(self, exelist, version, is_cross, exe_wrap, is_64)
self.base_options = ['b_pch'] # FIXME add lto, pgo and the like self.base_options = ['b_pch'] # FIXME add lto, pgo and the like
def get_options(self): def get_options(self):

@ -534,8 +534,12 @@ class Environment:
# Visual Studio prints version number to stderr but # Visual Studio prints version number to stderr but
# everything else to stdout. Why? Lord only knows. # everything else to stdout. Why? Lord only knows.
version = search_version(err) version = search_version(err)
if not err or not err.split('\n')[0]:
m = 'Failed to detect MSVC compiler arch: stderr was\n{!r}'
raise EnvironmentException(m.format(err))
is_64 = err.split('\n')[0].endswith(' x64')
cls = VisualStudioCCompiler if lang == 'c' else VisualStudioCPPCompiler cls = VisualStudioCCompiler if lang == 'c' else VisualStudioCPPCompiler
return cls(compiler, version, is_cross, exe_wrap) return cls(compiler, version, is_cross, exe_wrap, is_64)
if '(ICC)' in out: if '(ICC)' in out:
# TODO: add microsoft add check OSX # TODO: add microsoft add check OSX
inteltype = ICC_STANDARD inteltype = ICC_STANDARD

@ -161,6 +161,7 @@ class ConfigurationDataHolder(MutableInterpreterObject):
'set_quoted': self.set_quoted_method, 'set_quoted': self.set_quoted_method,
'has': self.has_method, 'has': self.has_method,
'get': self.get_method, 'get': self.get_method,
'merge_from': self.merge_from_method,
}) })
def is_used(self): def is_used(self):
@ -221,6 +222,16 @@ class ConfigurationDataHolder(MutableInterpreterObject):
def keys(self): def keys(self):
return self.held_object.values.keys() return self.held_object.values.keys()
def merge_from_method(self, args, kwargs):
if len(args) != 1:
raise InterpreterException('Merge_from takes one positional argument.')
from_object = args[0]
if not isinstance(from_object, ConfigurationDataHolder):
raise InterpreterException('Merge_from argument must be a configuration data object.')
from_object = from_object.held_object
for k, v in from_object.values.items():
self.held_object.values[k] = v
# Interpreter objects can not be pickled so we must have # Interpreter objects can not be pickled so we must have
# these wrappers. # these wrappers.
@ -1479,6 +1490,10 @@ class Interpreter(InterpreterBase):
if len(args) != 1: if len(args) != 1:
raise InvalidCode('Import takes one argument.') raise InvalidCode('Import takes one argument.')
modname = args[0] modname = args[0]
if modname.startswith('unstable-'):
plainname = modname.split('-', 1)[1]
mlog.warning('Module %s has no backwards or forwards compatibility and might not exist in future releases.' % modname)
modname = 'unstable_' + plainname
if modname not in self.environment.coredata.modules: if modname not in self.environment.coredata.modules:
try: try:
module = importlib.import_module('mesonbuild.modules.' + modname) module = importlib.import_module('mesonbuild.modules.' + modname)

@ -0,0 +1,72 @@
# Copyright 2017 The Meson development team
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .. import mesonlib, compilers, mlog
from . import ExtensionModule
class SimdModule(ExtensionModule):
def __init__(self):
super().__init__()
self.snippets.add('check')
# FIXME add Altivec and AVX512.
self.isets = ('mmx',
'sse',
'sse2',
'sse3',
'ssse3',
'sse41',
'sse42',
'avx',
'avx2',
'neon',
)
def check(self, interpreter, state, args, kwargs):
result = []
if len(args) != 1:
raise mesonlib.MesonException('Check requires one argument, a name prefix for checks.')
prefix = args[0]
if not isinstance(prefix, str):
raise mesonlib.MesonException('Argument must be a string.')
if 'compiler' not in kwargs:
raise mesonlib.MesonException('Must specify compiler keyword')
compiler = kwargs['compiler'].compiler
if not isinstance(compiler, compilers.compilers.Compiler):
raise mesonlib.MesonException('Compiler argument must be a compiler object.')
cdata = interpreter.func_configuration_data(None, [], {})
conf = cdata.held_object
for iset in self.isets:
if iset not in kwargs:
continue
iset_fname = kwargs[iset] # Migth also be an array or Files. static_library will validate.
args = compiler.get_instruction_set_args(iset)
if args is None:
mlog.log('Compiler supports %s:' % iset, mlog.red('NO'))
continue
if len(args) > 0:
if not compiler.has_multi_arguments(args, state.environment):
mlog.log('Compiler supports %s:' % iset, mlog.red('NO'))
continue
mlog.log('Compiler supports %s:' % iset, mlog.green('YES'))
conf.values['HAVE_' + iset.upper()] = ('1', 'Compiler supports %s.' % iset)
libname = prefix + '_' + iset
lib_kwargs = {'sources': iset_fname,
compiler.get_language() + '_args': args}
result.append(interpreter.func_static_lib(None, [libname], lib_kwargs))
return [result, cdata]
def initialize():
return SimdModule()

@ -948,6 +948,7 @@ class AllPlatformTests(BasePlatformTests):
# Detect with evar and do sanity checks on that # Detect with evar and do sanity checks on that
if evar in os.environ: if evar in os.environ:
ecc = getattr(env, 'detect_{}_compiler'.format(lang))(False) ecc = getattr(env, 'detect_{}_compiler'.format(lang))(False)
self.assertTrue(ecc.version)
elinker = env.detect_static_linker(ecc) elinker = env.detect_static_linker(ecc)
# Pop it so we don't use it for the next detection # Pop it so we don't use it for the next detection
evalue = os.environ.pop(evar) evalue = os.environ.pop(evar)
@ -971,6 +972,7 @@ class AllPlatformTests(BasePlatformTests):
self.assertEqual(ecc.get_exelist(), shlex.split(evalue)) self.assertEqual(ecc.get_exelist(), shlex.split(evalue))
# Do auto-detection of compiler based on platform, PATH, etc. # Do auto-detection of compiler based on platform, PATH, etc.
cc = getattr(env, 'detect_{}_compiler'.format(lang))(False) cc = getattr(env, 'detect_{}_compiler'.format(lang))(False)
self.assertTrue(cc.version)
linker = env.detect_static_linker(cc) linker = env.detect_static_linker(cc)
# Check compiler type # Check compiler type
if isinstance(cc, gnu): if isinstance(cc, gnu):
@ -1004,11 +1006,18 @@ class AllPlatformTests(BasePlatformTests):
self.assertTrue(is_windows()) self.assertTrue(is_windows())
self.assertIsInstance(linker, lib) self.assertIsInstance(linker, lib)
self.assertEqual(cc.id, 'msvc') self.assertEqual(cc.id, 'msvc')
self.assertTrue(hasattr(cc, 'is_64'))
# If we're in the appveyor CI, we know what the compiler will be
if 'arch' in os.environ:
if os.environ['arch'] == 'x64':
self.assertTrue(cc.is_64)
else:
self.assertFalse(cc.is_64)
# Set evar ourselves to a wrapper script that just calls the same # Set evar ourselves to a wrapper script that just calls the same
# exelist + some argument. This is meant to test that setting # exelist + some argument. This is meant to test that setting
# something like `ccache gcc -pipe` or `distcc ccache gcc` works. # something like `ccache gcc -pipe` or `distcc ccache gcc` works.
wrapper = os.path.join(testdir, 'compiler wrapper.py') wrapper = os.path.join(testdir, 'compiler wrapper.py')
wrappercc = [sys.executable, wrapper] + cc.get_exelist() + cc.get_always_args() wrappercc = [sys.executable, wrapper] + cc.get_exelist() + ['-DSOME_ARG']
wrappercc_s = '' wrappercc_s = ''
for w in wrappercc: for w in wrappercc:
wrappercc_s += shlex.quote(w) + ' ' wrappercc_s += shlex.quote(w) + ' '
@ -1027,6 +1036,10 @@ class AllPlatformTests(BasePlatformTests):
# Ensure that the exelist is correct # Ensure that the exelist is correct
self.assertEqual(wcc.get_exelist(), wrappercc) self.assertEqual(wcc.get_exelist(), wrappercc)
self.assertEqual(wlinker.get_exelist(), wrapperlinker) self.assertEqual(wlinker.get_exelist(), wrapperlinker)
# Ensure that the version detection worked correctly
self.assertEqual(cc.version, wcc.version)
if hasattr(cc, 'is_64'):
self.assertEqual(cc.is_64, wcc.is_64)
def test_always_prefer_c_compiler_for_asm(self): def test_always_prefer_c_compiler_for_asm(self):
testdir = os.path.join(self.common_test_dir, '141 c cpp and asm') testdir = os.path.join(self.common_test_dir, '141 c cpp and asm')

@ -0,0 +1,8 @@
#include<simdfuncs.h>
void increment_fallback(float arr[4]) {
int i;
for(i=0; i<4; i++) {
arr[i]++;
}
}

@ -0,0 +1,43 @@
project('simd', 'c')
simd = import('unstable-simd')
cc = meson.get_compiler('c')
cdata = configuration_data()
if not meson.is_cross_build() and host_machine.cpu_family() == 'arm' and cc.get_id() == 'clang'
message('Adding -march=armv7 because assuming that this build happens on Raspbian.')
message('Its Clang seems to be misconfigured and does not support NEON by default.')
add_project_arguments('-march=armv7', language : 'c')
endif
if cc.get_id() == 'msvc' and cc.version().version_compare('<17')
error('MESON_SKIP_TEST VS2010 produces broken binaries on x86.')
endif
# FIXME add [a, b] = function()
rval = simd.check('mysimds',
mmx : 'simd_mmx.c',
sse : 'simd_sse.c',
sse2 : 'simd_sse2.c',
sse3 : 'simd_sse3.c',
ssse3 : 'simd_ssse3.c',
sse41 : 'simd_sse41.c',
sse42 : 'simd_sse42.c',
avx : 'simd_avx.c',
avx2 : 'simd_avx2.c',
neon : 'simd_neon.c',
compiler : cc)
simdlibs = rval[0]
cdata.merge_from(rval[1])
configure_file(output : 'simdconfig.h',
configuration : cdata)
p = executable('simdtest', 'simdchecker.c', 'fallback.c',
link_with : simdlibs)
test('simdtest', p)

@ -0,0 +1,43 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<stdint.h>
#ifdef _MSC_VER
#include<intrin.h>
int avx_available() {
return 1;
}
#else
#include<immintrin.h>
#include<cpuid.h>
#ifdef __APPLE__
/*
* Apple ships a broken __builtin_cpu_supports and
* some machines in the CI farm seem to be too
* old to have AVX so just always return 0 here.
*/
int avx_available() { return 0; }
#else
int avx_available() {
return __builtin_cpu_supports("avx");
}
#endif
#endif
void increment_avx(float arr[4]) {
double darr[4];
darr[0] = arr[0];
darr[1] = arr[1];
darr[2] = arr[2];
darr[3] = arr[3];
__m256d val = _mm256_loadu_pd(darr);
__m256d one = _mm256_set1_pd(1.0);
__m256d result = _mm256_add_pd(val, one);
_mm256_storeu_pd(darr, result);
arr[0] = (float)darr[0];
arr[1] = (float)darr[1];
arr[2] = (float)darr[2];
arr[3] = (float)darr[3];
}

@ -0,0 +1,42 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<stdint.h>
/*
* FIXME add proper runtime detection for VS.
*/
#ifdef _MSC_VER
#include<intrin.h>
int avx2_available() {
return 0;
}
#else
#include<immintrin.h>
#include<cpuid.h>
#if defined(__APPLE__)
int avx2_available() { return 0; }
#else
int avx2_available() {
return __builtin_cpu_supports("avx2");
}
#endif
#endif
void increment_avx2(float arr[4]) {
double darr[4];
darr[0] = arr[0];
darr[1] = arr[1];
darr[2] = arr[2];
darr[3] = arr[3];
__m256d val = _mm256_loadu_pd(darr);
__m256d one = _mm256_set1_pd(1.0);
__m256d result = _mm256_add_pd(val, one);
_mm256_storeu_pd(darr, result);
one = _mm256_permute4x64_pd(one, 66); /* A no-op, just here to use AVX2. */
arr[0] = (float)darr[0];
arr[1] = (float)darr[1];
arr[2] = (float)darr[2];
arr[3] = (float)darr[3];
}

@ -0,0 +1,63 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<stdint.h>
#ifdef _MSC_VER
#include<intrin.h>
int mmx_available() {
return 1;
}
/* Contrary to MSDN documentation, MMX intrinsics
* just plain don't work.
*/
void increment_mmx(float arr[4]) {
arr[0]++;
arr[1]++;
arr[2]++;
arr[3]++;
}
#elif defined(__MINGW32__)
int mmx_available() {
return 1;
}
/* MinGW does not seem to ship with MMX or it is broken.
*/
void increment_mmx(float arr[4]) {
arr[0]++;
arr[1]++;
arr[2]++;
arr[3]++;
}
#else
#include<mmintrin.h>
#include<cpuid.h>
#if defined(__APPLE__)
int mmx_available() { return 1; }
#else
int mmx_available() {
return __builtin_cpu_supports("mmx");
}
#endif
void increment_mmx(float arr[4]) {
/* Super ugly but we know that values in arr are always small
* enough to fit in int16;
*/
int i;
__m64 packed = _mm_set_pi16(arr[3], arr[2], arr[1], arr[0]);
__m64 incr = _mm_set1_pi16(1);
__m64 result = _mm_add_pi16(packed, incr);
/* Should be
* int64_t unpacker = _m_to_int64(result);
* but it does not exist on 32 bit platforms for some reason.
*/
int64_t unpacker = (int64_t)(result);
_mm_empty();
for(i=0; i<4; i++) {
arr[i] = (float)(unpacker & ((1<<16)-1));
unpacker >>= 16;
}
}
#endif

@ -0,0 +1,20 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<arm_neon.h>
#include<stdint.h>
int neon_available() {
return 1; /* Incorrect, but I don't know how to check this properly. */
}
void increment_neon(float arr[4]) {
float32x2_t a1, a2, one;
a1 = vld1_f32(arr);
a2 = vld1_f32(&arr[2]);
one = vdup_n_f32(1.0);
a1 = vadd_f32(a1, one);
a2 = vadd_f32(a2, one);
vst1_f32(arr, a1);
vst1_f32(&arr[2], a2);
}

@ -0,0 +1,29 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#ifdef _MSC_VER
#include<intrin.h>
int sse_available() {
return 1;
}
#else
#include<xmmintrin.h>
#include<cpuid.h>
#include<stdint.h>
#if defined(__APPLE__)
int sse_available() { return 1; }
#else
int sse_available() {
return __builtin_cpu_supports("sse");
}
#endif
#endif
void increment_sse(float arr[4]) {
__m128 val = _mm_load_ps(arr);
__m128 one = _mm_set_ps1(1.0);
__m128 result = _mm_add_ps(val, one);
_mm_storeu_ps(arr, result);
}

@ -0,0 +1,37 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<emmintrin.h>
#ifdef _MSC_VER
int sse2_available() {
return 1;
}
#else
#include<cpuid.h>
#include<stdint.h>
#if defined(__APPLE__)
int sse2_available() { return 1; }
#else
int sse2_available() {
return __builtin_cpu_supports("sse2");
}
#endif
#endif
void increment_sse2(float arr[4]) {
double darr[4];
__m128d val1 = _mm_set_pd(arr[0], arr[1]);
__m128d val2 = _mm_set_pd(arr[2], arr[3]);
__m128d one = _mm_set_pd(1.0, 1.0);
__m128d result = _mm_add_pd(val1, one);
_mm_store_pd(darr, result);
result = _mm_add_pd(val2, one);
_mm_store_pd(&darr[2], result);
arr[0] = (float)darr[1];
arr[1] = (float)darr[0];
arr[2] = (float)darr[3];
arr[3] = (float)darr[2];
}

@ -0,0 +1,38 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#ifdef _MSC_VER
#include<intrin.h>
int sse3_available() {
return 1;
}
#else
#include<pmmintrin.h>
#include<cpuid.h>
#include<stdint.h>
#if defined(__APPLE__)
int sse3_available() { return 1; }
#else
int sse3_available() {
return __builtin_cpu_supports("sse3");
}
#endif
#endif
void increment_sse3(float arr[4]) {
double darr[4];
__m128d val1 = _mm_set_pd(arr[0], arr[1]);
__m128d val2 = _mm_set_pd(arr[2], arr[3]);
__m128d one = _mm_set_pd(1.0, 1.0);
__m128d result = _mm_add_pd(val1, one);
_mm_store_pd(darr, result);
result = _mm_add_pd(val2, one);
_mm_store_pd(&darr[2], result);
result = _mm_hadd_pd(val1, val2); /* This does nothing. Only here so we use an SSE3 instruction. */
arr[0] = (float)darr[1];
arr[1] = (float)darr[0];
arr[2] = (float)darr[3];
arr[3] = (float)darr[2];
}

@ -0,0 +1,40 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<stdint.h>
#ifdef _MSC_VER
#include<intrin.h>
int sse41_available() {
return 1;
}
#else
#include<smmintrin.h>
#include<cpuid.h>
#if defined(__APPLE__)
int sse41_available() { return 1; }
#else
int sse41_available() {
return __builtin_cpu_supports("sse4.1");
}
#endif
#endif
void increment_sse41(float arr[4]) {
double darr[4];
__m128d val1 = _mm_set_pd(arr[0], arr[1]);
__m128d val2 = _mm_set_pd(arr[2], arr[3]);
__m128d one = _mm_set_pd(1.0, 1.0);
__m128d result = _mm_add_pd(val1, one);
result = _mm_ceil_pd(result); /* A no-op, only here to use a SSE4.1 intrinsic. */
_mm_store_pd(darr, result);
result = _mm_add_pd(val2, one);
_mm_store_pd(&darr[2], result);
arr[0] = (float)darr[1];
arr[1] = (float)darr[0];
arr[2] = (float)darr[3];
arr[3] = (float)darr[2];
}

@ -0,0 +1,43 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<stdint.h>
#ifdef _MSC_VER
#include<intrin.h>
int sse42_available() {
return 1;
}
#else
#include<nmmintrin.h>
#include<cpuid.h>
#ifdef __APPLE__
int sse42_available() {
return 1;
}
#else
int sse42_available() {
return __builtin_cpu_supports("sse4.2");
}
#endif
#endif
void increment_sse42(float arr[4]) {
double darr[4];
__m128d val1 = _mm_set_pd(arr[0], arr[1]);
__m128d val2 = _mm_set_pd(arr[2], arr[3]);
__m128d one = _mm_set_pd(1.0, 1.0);
__m128d result = _mm_add_pd(val1, one);
_mm_store_pd(darr, result);
result = _mm_add_pd(val2, one);
_mm_store_pd(&darr[2], result);
_mm_crc32_u32(42, 99); /* A no-op, only here to use an SSE4.2 instruction. */
arr[0] = (float)darr[1];
arr[1] = (float)darr[0];
arr[2] = (float)darr[3];
arr[3] = (float)darr[2];
}

@ -0,0 +1,48 @@
#include<simdconfig.h>
#include<simdfuncs.h>
#include<emmintrin.h>
#include<tmmintrin.h>
#ifdef _MSC_VER
#include<intrin.h>
int ssse3_available() {
return 1;
}
#else
#include<cpuid.h>
#include<stdint.h>
int ssse3_available() {
#ifdef __APPLE__
return 1;
#elif defined(__clang__)
/* https://github.com/numpy/numpy/issues/8130 */
return __builtin_cpu_supports("sse4.1");
#else
return __builtin_cpu_supports("ssse3");
#endif
}
#endif
void increment_ssse3(float arr[4]) {
double darr[4];
__m128d val1 = _mm_set_pd(arr[0], arr[1]);
__m128d val2 = _mm_set_pd(arr[2], arr[3]);
__m128d one = _mm_set_pd(1.0, 1.0);
__m128d result = _mm_add_pd(val1, one);
__m128i tmp1, tmp2;
tmp1 = tmp2 = _mm_set1_epi16(0);
_mm_store_pd(darr, result);
result = _mm_add_pd(val2, one);
_mm_store_pd(&darr[2], result);
tmp1 = _mm_hadd_epi32(tmp1, tmp2); /* This does nothing. Only here so we use an SSSE3 instruction. */
arr[0] = (float)darr[1];
arr[1] = (float)darr[0];
arr[2] = (float)darr[3];
arr[3] = (float)darr[2];
}

@ -0,0 +1,93 @@
#include<simdfuncs.h>
#include<stdio.h>
/*
* A function that checks at runtime which simd accelerations are
* available and calls the best one. Falls
* back to plain C implementation if SIMD is not available.
*/
int main(int argc, char **argv) {
float four[4] = {2.0, 3.0, 4.0, 5.0};
const float expected[4] = {3.0, 4.0, 5.0, 6.0};
void (*fptr)(float[4]) = NULL;
const char *type;
int i;
/* Add here. The first matched one is used so put "better" instruction
* sets at the top.
*/
#if HAVE_NEON
if(fptr == NULL && neon_available()) {
fptr = increment_neon;
type = "NEON";
}
#endif
#if HAVE_AVX2
if(fptr == NULL && avx2_available()) {
fptr = increment_avx2;
type = "AVX2";
}
#endif
#if HAVE_AVX
if(fptr == NULL && avx_available()) {
fptr = increment_avx;
type = "AVX";
}
#endif
#if HAVE_SSE42
if(fptr == NULL && sse42_available()) {
fptr = increment_sse42;
type = "SSE42";
}
#endif
#if HAVE_SSE41
if(fptr == NULL && sse41_available()) {
fptr = increment_sse41;
type = "SSE41";
}
#endif
#if HAVE_SSSE3
if(fptr == NULL && ssse3_available()) {
fptr = increment_ssse3;
type = "SSSE3";
}
#endif
#if HAVE_SSE3
if(fptr == NULL && sse3_available()) {
fptr = increment_sse3;
type = "SSE3";
}
#endif
#if HAVE_SSE2
if(fptr == NULL && sse2_available()) {
fptr = increment_sse2;
type = "SSE2";
}
#endif
#if HAVE_SSE
if(fptr == NULL && sse_available()) {
fptr = increment_sse;
type = "SSE";
}
#endif
#if HAVE_MMX
if(fptr == NULL && mmx_available()) {
fptr = increment_mmx;
type = "MMX";
}
#endif
if(fptr == NULL) {
fptr = increment_fallback;
type = "fallback";
}
printf("Using %s.\n", type);
fptr(four);
for(i=0; i<4; i++) {
if(four[i] != expected[i]) {
printf("Increment function failed, got %f expected %f.\n", four[i], expected[i]);
return 1;
}
}
return 0;
}

@ -0,0 +1,67 @@
#pragma once
#include<simdconfig.h>
/* Yes, I do know that arr[4] decays into a pointer
* as a function argument. Don't do this in real code
* but for this test it is ok.
*/
void increment_fallback(float arr[4]);
#if HAVE_MMX
int mmx_available();
void increment_mmx(float arr[4]);
#endif
#if HAVE_SSE
int sse_available();
void increment_sse(float arr[4]);
#endif
#if HAVE_SSE2
int sse2_available();
void increment_sse2(float arr[4]);
#endif
#if HAVE_SSE3
int sse3_available();
void increment_sse3(float arr[4]);
#endif
#if HAVE_SSSE3
int ssse3_available();
void increment_ssse3(float arr[4]);
#endif
#if HAVE_SSE41
int sse41_available();
void increment_sse41(float arr[4]);
#endif
#if HAVE_SSE42
int sse42_available();
void increment_sse42(float arr[4]);
#endif
#if HAVE_AVX
int avx_available();
void increment_avx(float arr[4]);
#endif
#if HAVE_AVX2
int avx2_available();
void increment_avx2(float arr[4]);
#endif
#if HAVE_NEON
int neon_available();
void increment_neon(float arr[4]);
#endif
#if HAVE_ALTIVEC
int altivec_available();
void increment_altivec(float arr[4]);
#endif
/* And so on. */
Loading…
Cancel
Save