From 3de283c665387a569b139d589c2409d51ceafee8 Mon Sep 17 00:00:00 2001
From: Jared Hance <jhance@dropbox.com>
Date: Wed, 20 Mar 2019 10:23:22 -0700
Subject: [PATCH] Make gil handling in completion queue more robust

It turns out that the code generation for "with gil" is a bit more
complicated than the logic for re-obtaining the gil at the end of
"with nogil." This is because PyGILState_Ensure seems to, during
interpreter finalization, think it needs to call a new thread
(resulting in a call to cpython new_threadstate) which then segfaults.

Because "with nogil" knows that, prior to executing, it already had
the gil, it doesn't need to set up as much state, and thus the segfault
does not occur.

To avoid this, we just only use "with nogil" within the infinite loop,
and then end the "nogil" block before we check signals. This avoids
needing any "with gil" call at all.

I was able to reliably reproduce the segfault within a few minutes
before the patch by running a binary in a loop (with py3) while
maxing out my machines cpu usage. After the patch, I have not
been able to reproduce the segfault after two hours.

Note that this race can only occur when the user does not properly
clean up all their channels, and is relying on garbage collection to
do so (which isn't guaranteed). However, we want to avoid a segfault
on failure to close because this isn't a good user error and makes it
hard to debug.
---
 .../grpc/_cython/_cygrpc/completion_queue.pyx.pxi     | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi b/src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi
index a4d425ac564..212d27dc2b7 100644
--- a/src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi
+++ b/src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi
@@ -30,19 +30,20 @@ cdef grpc_event _next(grpc_completion_queue *c_completion_queue, deadline):
   else:
     c_deadline = _timespec_from_time(deadline)
 
-  with nogil:
-    while True:
+  while True:
+    with nogil:
       c_timeout = gpr_time_add(gpr_now(GPR_CLOCK_REALTIME), c_increment)
       if gpr_time_cmp(c_timeout, c_deadline) > 0:
         c_timeout = c_deadline
+  
       c_event = grpc_completion_queue_next(c_completion_queue, c_timeout, NULL)
+  
       if (c_event.type != GRPC_QUEUE_TIMEOUT or
           gpr_time_cmp(c_timeout, c_deadline) == 0):
         break
 
-      # Handle any signals
-      with gil:
-        cpython.PyErr_CheckSignals()
+    # Handle any signals
+    cpython.PyErr_CheckSignals()
   return c_event