mirror of https://github.com/grpc/grpc.git
[pick_first] fix shutdown bug in new PF impl (#38144)
The bug occurs in the following fairly specific sequence of events:

1. PF gets a resolver update with two or more addresses. It starts connecting to the first address and starts a Happy Eyeballs timer for 250ms.
   - Note that the timer holds a ref to the `SubchannelList`, which is necessary to trigger the bug below. If there was only one address, there would be no Happy Eyeballs timer holding a ref here, so the bug would not occur.
2. The first subchannel reports CONNECTING and is seen by the LB policy.
3. The first subchannel reports READY, and the notification hops into the WorkSerializer but has not yet been executed.
4. The timer fires, and the timer callback hops into the WorkSerializer but has not yet been executed.
5. The LB policy gets shut down. This shuts down the `SubchannelList`, but we fail to actually shut down the underlying `SubchannelState`.
   - This is the bug! We *should* be shutting down the `SubchannelState` here.
   - Note that if the pending timer callback were not holding a ref to the `SubchannelList`, then the bug would not occur: the `SubchannelList` would have been immediately destroyed, which *would* have shut down the `SubchannelState`. In particular, note that if the timer had not yet fired, shutting down the `SubchannelList` would cancel the timer, thus releasing the ref immediately and shutting down the `SubchannelState`. Similarly, if the timer callback had already been seen by the LB policy, then the ref would also no longer be held.
6. The LB policy now sees the READY notification. This should be a no-op, since PF has already been shut down. However, because the `SubchannelState` was not shut down, it selects the subchannel instead.
7. The LB policy now sees the timer fire. This becomes a no-op, but it releases the ref to the `SubchannelList`, thus causing the `SubchannelList` to be destroyed. However, the `SubchannelState` for the selected subchannel from the previous step is no longer owned by the `SubchannelList`, so it is not shut down.
8. The selected subchannel now reports IDLE. This causes PF to call `GoIdle()`, and at this point we are holding the last ref to the LB policy, which we try to access after giving up that ref, thus causing a crash.
   - Note that we're not actually holding this ref in order to keep the LB policy alive at this point; the ref actually exists only due to some [tech debt](pull/38168/head14e077f9bd/src/core/load_balancing/pick_first/pick_first.cc (L196). We should never be executing this code path to begin with after PF has been shut down, so we shouldn't need that ref.

Closes #38144

COPYBARA_INTEGRATE_REVIEW=https://github.com/grpc/grpc/pull/38144 from markdroth:pick_first_new_fix 4ec9f9ea1d
PiperOrigin-RevId: 698807898
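
Steps 3 through 7 of the sequence described in the commit message can be modeled in a small, self-contained C++ sketch. This is not the gRPC code. `WorkQueue`, `Policy`, `explicit_child_shutdown`, and `SelectedAfterShutdown` are hypothetical names invented for illustration, and the `SubchannelList` and `SubchannelState` structs here are heavily simplified stand-ins that only share names with the real classes. The sketch assumes that the fix amounts to shutting down each `SubchannelState` when the list is shut down, rather than relying on the list's destructor.

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a WorkSerializer: callbacks are queued now and
// run later, in order.
struct WorkQueue {
  std::queue<std::function<void()>> pending;
  void Schedule(std::function<void()> cb) { pending.push(std::move(cb)); }
  void Drain() {
    while (!pending.empty()) {
      std::function<void()> cb = std::move(pending.front());
      pending.pop();
      cb();
    }  // cb is destroyed here, releasing any refs it captured.
  }
};

// Hypothetical, minimal model of the LB policy's observable state.
struct Policy {
  bool shut_down = false;
  bool selected = false;  // Set when a subchannel gets selected.
};

struct SubchannelState {
  Policy* policy = nullptr;
  bool shut_down = false;
  void Shutdown() { shut_down = true; }
  // READY notification: a no-op only if this state was shut down.
  void OnReady() {
    if (shut_down) return;
    policy->selected = true;
  }
};

struct SubchannelList {
  std::vector<std::shared_ptr<SubchannelState>> subchannels;
  bool explicit_child_shutdown = false;  // Assumed shape of the fix.
  // Destruction always shuts down the children, but it only happens once
  // the last ref to the list is released.
  ~SubchannelList() {
    for (auto& sc : subchannels) sc->Shutdown();
  }
  void Shutdown() {
    if (!explicit_child_shutdown) return;  // Buggy variant: children untouched.
    for (auto& sc : subchannels) sc->Shutdown();
  }
};

// Replays steps 3-7 and reports whether a subchannel was selected after the
// policy had already been shut down.
bool SelectedAfterShutdown(bool apply_fix) {
  WorkQueue work_queue;
  Policy policy;
  auto list = std::make_shared<SubchannelList>();
  list->explicit_child_shutdown = apply_fix;
  auto sc = std::make_shared<SubchannelState>();
  sc->policy = &policy;
  list->subchannels.push_back(sc);

  // Step 3: READY notification is queued but not yet executed.
  work_queue.Schedule([sc] { sc->OnReady(); });
  // Step 4: timer callback is queued and holds a ref to the list.
  work_queue.Schedule([list_ref = list] { (void)list_ref; });
  // Step 5: the policy shuts down and drops its own ref to the list. The
  // queued timer callback keeps the list alive, so its destructor does not run.
  policy.shut_down = true;
  list->Shutdown();
  list.reset();
  // Steps 6 and 7: the queued callbacks finally run.
  work_queue.Drain();
  return policy.selected;
}
```

In this model, `SelectedAfterShutdown(false)` returns `true` (the READY notification selects a subchannel on a policy that is already shut down), while `SelectedAfterShutdown(true)` returns `false`, because the explicit shutdown no longer depends on when the last ref to the list is released.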
parent 67d82ecbb9
commit a5703a0693
3 changed files with 102 additions and 23 deletions