# Fast UTF-8 validation with Range algorithm (NEON+SSE4+AVX2)
This is a brand new algorithm that leverages SIMD for fast UTF-8 string validation. Both **NEON** (armv8a) and **SSE4** versions are implemented. The **AVX2** implementation was contributed by [ioioioio](https://github.com/ioioioio).

Four UTF-8 validation methods are compared on both x86 and Arm platforms. Benchmark results show that the range based algorithm is the best solution on Arm, and achieves the same performance as [Lemire's approach](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/) on x86.
* Range based algorithm
  * range-neon.c: NEON version
  * range-sse.c: SSE4 version
  * range-avx2.c: AVX2 version
  * range2-neon.c, range2-sse.c: Process two blocks in one iteration
* Depending on the First Byte, a legal character is 1, 2, 3 or 4 bytes long
  * For First Byte within 00..7F, character length = 1
  * For First Byte within C0..DF, character length = 2
  * For First Byte within E0..EF, character length = 3
  * For First Byte within F0..F4, character length = 4
* C0, C1 and F5..FF are never allowed
* Second, Third and Fourth Bytes must lie in 80..BF, apart from the special cases below.
* There are four **special cases** for the Second Byte: A0..BF after E0, 80..9F after ED, 90..BF after F0, and 80..8F after F4.
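
The rules above can be written down as a plain scalar check, which is handy as a reference when reading the SIMD code. This is a minimal sketch, not the code from this repository: the name `utf8_check_scalar` is made up for illustration, and the Second Byte limits for E0, ED, F0 and F4 are taken from the Unicode definition of well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference checker: returns 1 if buf[0..len) is
 * well-formed UTF-8, 0 otherwise. */
int utf8_check_scalar(const uint8_t *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = buf[i];
        size_t n;                      /* bytes following the First Byte */
        uint8_t lo = 0x80, hi = 0xBF;  /* allowed range of the Second Byte */

        if (b <= 0x7F) {
            n = 0;
        } else if (b >= 0xC2 && b <= 0xDF) {
            n = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            n = 2;
            if (b == 0xE0) lo = 0xA0;  /* special case: E0 A0..BF */
            if (b == 0xED) hi = 0x9F;  /* special case: ED 80..9F */
        } else if (b >= 0xF0 && b <= 0xF4) {
            n = 3;
            if (b == 0xF0) lo = 0x90;  /* special case: F0 90..BF */
            if (b == 0xF4) hi = 0x8F;  /* special case: F4 80..8F */
        } else {
            return 0;  /* stray 80..BF, or C0, C1, F5..FF */
        }

        if (len - i <= n)              /* character is truncated */
            return 0;
        if (n >= 1 && (buf[i + 1] < lo || buf[i + 1] > hi))
            return 0;
        for (size_t k = 2; k <= n; k++)
            if (buf[i + k] < 0x80 || buf[i + k] > 0xBF)
                return 0;

        i += n + 1;
    }
    return 1;
}
```

Calling `utf8_check_scalar((const uint8_t *)"\xE0\x80\x80", 3)` rejects the input, because 80 is below the A0 lower bound that applies after E0.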
### Range table
The range table maps each range index (0 to 15) to the minimum and maximum byte values allowed. The task is to scan the input string, recognise the pattern, assign the correct range index to each byte, and then validate each byte against its range.
* C0, C1 and F5..FF are not included in the range table, so they are always detected.
* An illegal 80..BF byte gets range index 0 (00..7F) and is detected.
* Based on the First Byte, the corresponding Second, Third and Fourth Bytes get range index 1, 2 or 3, which ensures they lie in 80..BF.
* If non-ASCII First Bytes overlap, the algorithm sets the range index of the latter First Byte to 9, 10 or 11, which are illegal ranges. E.g., Input = F1 80 C2 90 --> Range index = 8 3 10 1, where 10 indicates an error. See the table below.
Overlapped non-ASCII First Byte
Input | F1 | 80 | C2 | 90
:---- | :- | :- | :- | :-
*first_len* |*3* |*0* |*1* |*0*
First Byte | 8 | 0 | 8 | 0
Second Byte | 0 | 3 | 0 | 1
Third Byte | 0 | 0 | 2 | 0
Fourth Byte | 0 | 0 | 0 | 1
Range index | 8 | 3 |***10***| 1
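
The rows of the table above can be reproduced with a scalar model of the per-byte computation. This is only a sketch under assumptions: *first_len* is taken to be the character length minus 1, looked up from the high nibble of each byte. The rows are assumed to be combined with a bitwise OR, which is what yields 1 rather than 2 for the last byte. The buffer is assumed to start at a character boundary. The names `first_len_tbl`, `first_range_tbl` and `range_index` are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Character length minus 1, indexed by the high nibble of a byte:
 * 00..BF -> 0, C0..DF -> 1, E0..EF -> 2, F0..FF -> 3 */
static const uint8_t first_len_tbl[16] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
};

/* Range index 8 for every byte in C0..FF, 0 otherwise */
static const uint8_t first_range_tbl[16] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8,
};

/* Unsigned saturating subtract, as the SIMD instructions would do */
static uint8_t sat_sub(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Scalar model: writes the range index (before special-case
 * adjustment) of each input byte into range[0..len). */
void range_index(const uint8_t *in, size_t len, uint8_t *range)
{
    for (size_t i = 0; i < len; i++) {
        /* "First Byte" row */
        uint8_t r = first_range_tbl[in[i] >> 4];

        /* "Second Byte" row: first_len shifted right by one byte */
        if (i >= 1)
            r |= first_len_tbl[in[i - 1] >> 4];
        /* "Third Byte" row: (first_len - 1) shifted by two bytes */
        if (i >= 2)
            r |= sat_sub(first_len_tbl[in[i - 2] >> 4], 1);
        /* "Fourth Byte" row: (first_len - 2) shifted by three bytes */
        if (i >= 3)
            r |= sat_sub(first_len_tbl[in[i - 3] >> 4], 2);

        range[i] = r;
    }
}
```

For the input F1 80 C2 90 this model produces 8 3 10 1, matching the last row of the table. The SIMD versions would do the same work on 16 bytes at once with table lookups and byte shifts.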
### Adjust Second Byte range for special cases
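Range index adjustment for four special cases

First Byte | Second Byte | Before adjustment | Correct index | Adjustment |
:--------- | :---------- | :---------------- | :------------ | :--------- |
E0 | A0..BF | 2 | 4 | **2** |
ED | 80..9F | 2 | 5 | **3** |
F0 | 90..BF | 3 | 6 | **3** |
F4 | 80..8F | 3 | 7 | **4** |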
* Saturating-add 112 (0x70) to the temporary indices (F0 -> 0x71, F4 -> 0x75; every value of 16 or more becomes 128 or more, i.e. bit 7 is set)
* Use the resulting indices to look up table_ef and get the correct adjustment (indices 0x71 and 0x75 return elements 1 and 5, and any index with bit 7 set returns 0, per ```pshufb``` behaviour)
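
The two steps above can be modelled one byte at a time in portable C, without intrinsics. This is a sketch under assumptions: the temporary index is taken to be the input byte minus EF (wrapping modulo 256), which is consistent with F0 -> 0x71 and F4 -> 0x75. The contents of `table_ef` (3 for F0, 4 for F4) are assumed here. `pshufb_lane`, `sat_add_u8` and `adjust_f0_f4` are illustrative helpers, not functions from this repository.

```c
#include <stdint.h>

/* Scalar model of a single pshufb lane: if bit 7 of the index is set
 * the result is 0, otherwise the low four bits select the element. */
static uint8_t pshufb_lane(const uint8_t table[16], uint8_t idx)
{
    return (idx & 0x80) ? 0 : table[idx & 0x0F];
}

/* Unsigned saturating add */
static uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 0xFF ? 0xFF : (uint8_t)sum;
}

/* Assumed contents: element 1 (F0) -> 3, element 5 (F4) -> 4 */
static const uint8_t table_ef[16] = {
    0, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};

/* Range index adjustment contributed by a First Byte of F0 or F4;
 * every other byte yields 0. */
uint8_t adjust_f0_f4(uint8_t first_byte)
{
    /* Assumed temporary index: byte - EF, so F0 -> 1, F4 -> 5 and
     * anything below EF wraps around to a large value. */
    uint8_t tmp = (uint8_t)(first_byte - 0xEF);

    /* Values >= 16 end up with bit 7 set and therefore look up 0. */
    return pshufb_lane(table_ef, sat_add_u8(tmp, 0x70));
}
```

Bytes EF..FE are the only ones whose index keeps bit 7 clear, and among those only F0 and F4 hit a non-zero element, so a single 16-entry table is enough for this half of the adjustment.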
#### Error handling
* For an overlapped non-ASCII First Byte, the range index before adjustment is 9, 10 or 11. After adjustment (which adds 2, 3, 4 or 0), the range index is somewhere in 9 to 15, which is still illegal in the range table, so the error is detected.
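
To see why an index in 9 to 15 always fails, it helps to look at a possible layout of the range table. The sketch below is an assumption rather than a copy of the repository's tables: indices 4 to 7 are assigned to the four special cases in the order E0, ED, F0, F4, and the illegal entries are encoded with a minimum above the maximum so that no byte can satisfy both comparisons. The names are illustrative.

```c
#include <stdint.h>

/* Assumed range table: minimum and maximum byte value per range index.
 *   0      : 00..7F  ASCII
 *   1,2,3  : 80..BF  Second, Third, Fourth Byte
 *   4..7   : Second Byte after E0, ED, F0, F4
 *   8      : C2..F4  non-ASCII First Byte
 *   9..15  : illegal (min > max, so nothing matches)
 */
static const uint8_t range_min_tbl[16] = {
    0x00, 0x80, 0x80, 0x80, 0xA0, 0x80, 0x90, 0x80,
    0xC2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
};
static const uint8_t range_max_tbl[16] = {
    0x7F, 0xBF, 0xBF, 0xBF, 0xBF, 0x9F, 0xBF, 0x8F,
    0xF4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Returns 1 if `byte` is legal for the given range index, else 0. */
int byte_in_range(uint8_t range_index, uint8_t byte)
{
    uint8_t i = range_index & 0x0F;
    return byte >= range_min_tbl[i] && byte <= range_max_tbl[i];
}
```

With this layout `byte_in_range(10, b)` is 0 for every `b`, so the overlapped First Byte in the F1 80 C2 90 example is rejected regardless of its value.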
### Handling remaining bytes
When fewer than 16 bytes of input remain, we fall back to a naive byte-by-byte approach to validate them, which is actually faster than SIMD processing.
* Look back into the last 16-byte buffer to find the First Byte. At most three bytes need to be examined. Otherwise we either happen to be at a character boundary, or there is an error that has already been detected.
* Validate the string byte by byte, starting from that First Byte.
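
The look-back step might be sketched as follows. This is an illustration only: `find_restart` is a hypothetical helper, `pos` is assumed to be the offset at which the SIMD loop stopped, and the real implementation may read the previous 16-byte vector instead of the original buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the offset from which the byte-by-byte
 * validator should start, given that bytes before `pos` were already
 * processed in 16-byte blocks. */
size_t find_restart(const uint8_t *buf, size_t pos)
{
    /* A character is at most 4 bytes, so an unfinished one can have at
     * most 3 of its bytes before `pos`. */
    for (size_t back = 1; back <= 3 && back <= pos; back++) {
        uint8_t b = buf[pos - back];

        if (b >= 0xC0)      /* non-ASCII First Byte: restart here */
            return pos - back;
        if (b < 0x80)       /* ASCII: nothing spans the boundary */
            return pos;
        /* 80..BF: continuation byte, keep looking back */
    }

    /* No First Byte within three bytes: either a character ended
     * exactly at `pos`, or the error was already detected. */
    return pos;
}
```

If the restart offset lands on a character that is already complete, the scalar validator simply checks it a second time, which costs a few bytes of extra work but keeps the logic simple.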
## Tests
Test cases should cover as many corner cases as possible.
### Positive cases
1. Prepare correct characters
2. Validate correct characters
3. Validate long strings
   * Round concatenate characters, starting from the first character, to 1024 bytes
   * Validate the 1024-byte string
   * Shift 1 byte, validate the 1025-byte string
   * Shift 2 bytes, validate the 1026-byte string
   * ...
   * Shift 16 bytes, validate the 1040-byte string
4. Repeat step 3, test buffer starting from the second character
5. Repeat step 3, test buffer starting from the third character
6. ...
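
The round concatenation in step 3 could look roughly like the sketch below. It is not the repository's test code: `round_concat` is a hypothetical helper, the characters are passed as NUL-terminated strings (so U+0000 cannot be one of them), and the output is allowed to overshoot the target by up to three bytes so that the last character is never cut.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: cycles through chars[start], chars[start + 1],
 * ... (wrapping around) and appends them to `out` until at least
 * `target` bytes are written. `out` must hold target + 3 bytes.
 * Returns the number of bytes written. */
size_t round_concat(const char *const *chars, size_t nchars, size_t start,
                    uint8_t *out, size_t target)
{
    size_t n = 0;
    size_t idx = start % nchars;

    while (n < target) {
        size_t l = strlen(chars[idx]);  /* 1 to 4 bytes per character */

        memcpy(out + n, chars[idx], l);
        n += l;
        idx = (idx + 1) % nchars;
    }
    return n;
}
```

Steps 4 and 5 then correspond to calling the helper with `start` set to 1, 2, and so on. One possible reading of the shifts, assumed here rather than taken from the test source, is to prepend k ASCII bytes for k = 1..16, so that each character ends up at every offset within a 16-byte block.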
### Negative cases
1. Prepare bad characters and bad strings
   * Bad character
   * Bad character crossing a 16-byte boundary
   * Bad character crossing the boundary between the last 16-byte block and the remaining bytes
2. Test long strings
   * Prepare correct long strings, same as in the positive cases
   * Append bad characters
   * Shift one byte for each iteration
   * Validate each shift
## Code breakdown
The table below shows how 16 bytes of input are processed step by step. See [range-neon.c](range-neon.c) for the corresponding code.
![Range based UTF-8 validation algorithm](https://raw.githubusercontent.com/cyb70289/utf8/master/range.png)