The tails of current latency profiles are dominated by the cost of writing
the latency logs themselves, which is highly undesirable.
Now, when a thread's log fills up, it is handed to a background thread that
writes it to disk. At shutdown, we wait for all latency traces to be flushed.
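
As an illustration of the hand-off pattern described above (not the actual C code in this change), here is a minimal Python sketch. The class `BackgroundLogWriter` and its methods are hypothetical names, and it assumes a single writer thread fed by a queue: workers enqueue a full log and return immediately, and `shutdown()` blocks until everything queued has been written.

```python
import queue
import threading


class BackgroundLogWriter:
    """Hypothetical helper: hands full per-thread logs to one writer thread."""

    _STOP = object()  # sentinel that tells the writer thread to exit

    def __init__(self, sink):
        self._sink = sink  # any object with write(str) and flush()
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, entries):
        """Called by a worker whose log is full; does no I/O itself."""
        self._pending.put(list(entries))

    def _run(self):
        # Drain batches in FIFO order until the stop sentinel arrives.
        while True:
            batch = self._pending.get()
            if batch is self._STOP:
                return
            for entry in batch:
                self._sink.write(entry + "\n")

    def shutdown(self):
        """Block until every batch submitted so far has been written."""
        self._pending.put(self._STOP)
        self._thread.join()
        self._sink.flush()
```

Because the sentinel is queued behind all earlier batches, joining the writer thread in `shutdown()` is enough to guarantee that nothing is lost, provided no worker calls `submit()` afterwards.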
Fixed an important bug whereby thread info wasn't being taken into
account for ! marks.
Also dramatically improved performance by getting rid of a silly O(n^2)
loop.
In addition, the 50th, 90th, 95th and 99th percentiles are now reported for important marks.
Example output (for a single ! mark between begin-end marks in grpc_tcp_write()):
```
Important marks:
================
99999@src/core/iomgr/tcp_posix.c:545
Relative mark: 50th p. 90th p. 95th p. 99th p.
205 { (src/core/iomgr/tcp_posix.c:541): 0.037 0.057 0.070 0.087
205 } (src/core/iomgr/tcp_posix.c:556): 15.181 27.021 32.509 41.103
```
For a fabricated example (see https://gist.github.com/dgquintas/026d333815589cc37269) with the same ! mark
in two different frames, the output is:
```
Important marks:
================
999999@src/core/iomgr/tcp_posix.c:5
Relative mark: 50th p. 90th p. 95th p. 99th p.
205 { (src/core/iomgr/tcp_posix.c:1): 9.500 13.900 14.450 14.890
205 } (src/core/iomgr/tcp_posix.c:6): 3.000 4.600 4.800 4.960
999999@src/core/iomgr/tcp_posix.c:3
Relative mark: 50th p. 90th p. 95th p. 99th p.
205 { (src/core/iomgr/tcp_posix.c:1): 2.500 2.900 2.950 2.990
205 { (src/core/iomgr/tcp_posix.c:2): 1.500 1.900 1.950 1.990
205 } (src/core/iomgr/tcp_posix.c:4): 2.000 2.800 2.900 2.980
205 } (src/core/iomgr/tcp_posix.c:6): 10.000 15.600 16.300 16.860
```
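
The percentile columns above are consistent with linearly interpolated percentiles. The sketch below assumes that method; the function names `percentile` and `summarize` are hypothetical and are not taken from the analysis script.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile (0-100) using linear interpolation."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Fractional index into the sorted samples.
    rank = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(samples, pcts=(50, 90, 95, 99)):
    """Return the percentiles reported for one relative mark."""
    return [percentile(samples, p) for p in pcts]
```

For two hypothetical samples, 4 and 15, `summarize([4.0, 15.0])` gives approximately 9.5, 13.9, 14.45 and 14.89, the same figures as the first row of the fabricated example.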