- `grpc_endpoint` code calls `recvmsg` when the fd is readable and `sendmsg` when the fd is writable
- `tcp_client` connect code issues an asynchronous `connect` and finishes creating the client once the fd is writable (i.e., when the `connect` has actually completed)
- gRPC needed a component that can "efficiently" do the above operations __using the threads provided by the applications (i.e., without creating any new threads)__. By "efficiently" we mean optimized for latency and throughput
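To make the second bullet concrete, here is a minimal POSIX sketch (plain `socket`/`poll` calls, not gRPC code) of how "connect finished" can be detected by waiting for the fd to become writable and then checking `SO_ERROR`. The function name and timeout are illustrative.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative only: issue a non-blocking connect() and treat the fd becoming
// writable as the signal that the connect attempt has finished.
int async_connect_example(const char* ip, uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);  // make fd non-blocking

  sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  inet_pton(AF_INET, ip, &addr.sin_addr);

  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
    return fd;  // connected immediately (e.g. loopback)
  }
  if (errno != EINPROGRESS) { close(fd); return -1; }

  // Wait until the fd is writable; that is when connect() has finished.
  pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLOUT;
  pfd.revents = 0;
  if (poll(&pfd, 1, /*timeout_ms=*/5000) <= 0) { close(fd); return -1; }

  // Writability alone does not mean success: check SO_ERROR for the result.
  int err = 0;
  socklen_t len = sizeof(err);
  getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
  if (err != 0) { close(fd); return -1; }
  return fd;  // connection established
}
```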
### Opaque Structures exposed by the polling engine
The following are the **opaque** structures exposed by the Polling Engine interface (NOTE: different polling engine implementations have different definitions of these structures):
- **grpc_fd:** Structure representing a file descriptor
- **grpc_pollset_set:** A group of `grpc_fd`s, `grpc_pollset`s and `grpc_pollset_set`s (yes, a `grpc_pollset_set` can contain other `grpc_pollset_set`s)
- Register a [closure](https://github.com/grpc/grpc/blob/v1.15.1/src/core/lib/iomgr/closure.h#L67) to be called when the fd becomes readable/writable or has an error (in gRPC parlance, we refer to this act as "arming the fd")
- The closure is called exactly once per event. That is, once the fd becomes readable (or writable, or has an error), the closure is fired and the fd is "unarmed". To be notified again, the fd has to be re-armed (a standalone `epoll` sketch of this one-shot behavior appears after the notes below)
- **grpc_fd_shutdown**
- Signature: `grpc_fd_shutdown(grpc_fd* fd)`
- Any current (or future) closures registered for readable/writable/error events are scheduled immediately with an error
- Release the `grpc_fd` structure and call the `on_done` closure when the operation is complete
- If `release_fd` is set to `nullptr`, then `close()` the underlying fd as well. If not, put the underlying fd in `release_fd` (and do not call `close()`)
> **NOTE**: There is no `grpc_pollset_remove_fd`. This is because calling `grpc_fd_orphan()` will effectively remove the fd from all the pollsets it’s a part of
> **NOTE**: `grpc_pollset_work()` requires the pollset mutex to be locked before calling it. Shortly after being called, `grpc_pollset_work()` populates the `*worker` pointer (among other things) and releases the mutex. Once `grpc_pollset_work()` returns, the `*worker` pointer is **invalid** and should not be used anymore. See the code in `completion_queue.cc` for an example of how this is used.
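The one-shot "arm, fire once, re-arm" behavior described above for fd closures can be pictured with plain `epoll` and `EPOLLONESHOT`. The sketch below is an illustration under that analogy, not the actual `ev_epoll1_linux.cc` code; the names `arm_for_read` and `poll_loop` are made up.

```cpp
#include <cstdio>
#include <sys/epoll.h>
#include <unistd.h>

// "Arm" fd for readability: with EPOLLONESHOT, after one event is delivered the
// fd is disabled until it is re-armed with EPOLL_CTL_MOD (roughly what
// re-registering a read closure achieves in the polling engine).
void arm_for_read(int epfd, int fd, bool first_time) {
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLONESHOT;
  ev.data.fd = fd;
  epoll_ctl(epfd, first_time ? EPOLL_CTL_ADD : EPOLL_CTL_MOD, fd, &ev);
}

void poll_loop(int epfd) {
  epoll_event events[16];
  for (;;) {
    int n = epoll_wait(epfd, events, 16, /*timeout_ms=*/-1);
    for (int i = 0; i < n; i++) {
      int fd = events[i].data.fd;
      // ... run the closure registered for this fd (e.g. do the recvmsg) ...
      printf("fd %d readable; it stays un-armed until explicitly re-armed\n", fd);
      arm_for_read(epfd, fd, /*first_time=*/false);  // explicit re-arm
    }
  }
}
```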
- It means that calling `grpc_pollset_work()` on the pollset will also poll all the fds in the pollset_set; i.e., semantically, it is similar to adding all the fds inside the pollset_set to the pollset.
- The logic to choose a designated poller is quite complicated. Pollsets are internally sharded into what are called `pollset_neighborhood`s (a structure internal to the `epoll1` polling engine implementation). The `grpc_pollset_worker`s that call `grpc_pollset_work` on a given pollset are all queued in a linked list against the `grpc_pollset`. The head of the linked list is called the "root worker"
- There are as many neighborhoods as the number of cores. A pollset is put in a neighborhood based on the CPU core of the root worker thread. When picking the next designated poller, we always try to find another worker on the current pollset. If there are no more workers in the current pollset, the `pollset_neighborhood` list is scanned to pick the next pollset and worker that could be the new designated poller.
- NOTE: There is room to tune this implementation. All we really need is a good way to maintain a list of `grpc_pollset_worker`s, with a way to group them per pollset (needed to implement `grpc_pollset_kick` semantics) and a way to randomly select a new designated poller
- See the [`begin_worker()`](https://github.com/grpc/grpc/blob/v1.15.1/src/core/lib/iomgr/ev_epoll1_linux.cc#L729) function to see how a designated poller is chosen. Similarly, the [`end_worker()`](https://github.com/grpc/grpc/blob/v1.15.1/src/core/lib/iomgr/ev_epoll1_linux.cc#L916) function is called by the worker that just came out of `epoll_wait()` and has to choose a new designated poller
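The following is a rough, self-contained sketch of the designated-poller hand-off idea: one thread blocks in `epoll_wait()` while the others sleep on a condition variable and are promoted one at a time. It is a simplification, not gRPC's actual `begin_worker()`/`end_worker()` logic; the `PollsetSketch` type and its fields are invented for illustration.

```cpp
#include <condition_variable>
#include <mutex>
#include <sys/epoll.h>

struct PollsetSketch {
  std::mutex mu;
  std::condition_variable cv;
  bool has_designated_poller = false;
  int epfd;  // shared epoll set holding all fds of this "pollset"
};

void worker(PollsetSketch* ps) {
  std::unique_lock<std::mutex> lock(ps->mu);
  while (ps->has_designated_poller) {
    // Someone else is polling; sleep until we are kicked or promoted.
    ps->cv.wait(lock);
  }
  ps->has_designated_poller = true;  // we are now the designated poller
  lock.unlock();

  epoll_event ev;
  int n = epoll_wait(ps->epfd, &ev, 1, /*timeout_ms=*/-1);  // block outside the lock

  lock.lock();
  ps->has_designated_poller = false;
  ps->cv.notify_one();  // hand off: wake one waiter to become the next poller
  lock.unlock();

  if (n > 0) {
    // ... run the closures for the ready fd ...
  }
}
```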
- FDs are added to multiple epollsets with the `EPOLLEXCLUSIVE` flag. This prevents multiple worker threads from waking up from polling whenever the fd becomes readable/writable (see the sketch after this list)
- We create one new completion queue per call. If we didn't have the PO_EMPTY and PO_FD type pollables, then every call on a given channel would effectively have to create a `Pollable` and hence an epollset. This is because every completion queue automatically creates a pollset, and the channel fd would have to be put in that pollset, which in turn requires an epollset to hold that fd. Creating an epollset per call (even if we delete the epollset once the call is completed) would mean a lot of syscalls to create/delete epoll fds, which is clearly not a good idea.
- With these new types of `Pollable`s, all pollsets (corresponding to the new per-call completion queues) initially point to the PO_EMPTY global epollset. Then, once the channel fd is added to the pollset, the pollset points to the `Pollable` of type PO_FD containing just that fd (i.e., it reuses an existing `Pollable`). This way, the epoll fd creation/deletion churn is avoided.
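For reference, this is roughly how an fd is registered with `EPOLLEXCLUSIVE` using the raw epoll API (Linux 4.5+). It is a sketch of the underlying kernel facility, not the `ev_epollex_linux.cc` code itself; `add_fd_exclusive` is a made-up helper name.

```cpp
#include <sys/epoll.h>

// Add `fd` to `epfd` so that, when several epoll instances (or several threads
// blocked on the same epoll fd) are watching it, the kernel wakes only a subset
// of the waiters (typically one) instead of all of them.
int add_fd_exclusive(int epfd, int fd) {
  epoll_event ev{};
  // Note: EPOLLEXCLUSIVE may only be used with EPOLL_CTL_ADD, not EPOLL_CTL_MOD.
  ev.events = EPOLLIN | EPOLLEXCLUSIVE;
  ev.data.fd = fd;
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```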
### Other polling engine implementations (`poll` and Windows polling engines)
- **poll** polling engine: gRPC's `poll` polling engine is quite complicated. It uses the `poll()` function to do the polling (and hence it is used on platforms like macOS where epoll is not available)
- The implementation is further complicated by the fact that `poll()` is level-triggered (keep this in mind if you wonder why the code in `src/core/lib/iomgr/ev_poll_posix.cc` is written in a certain, seemingly complicated way :))
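As a tiny illustration of what "level-triggered" means in practice for `poll()`: a ready fd keeps being reported on every call until its data is consumed, so the engine has to track which fds it is currently interested in rather than relying on one-shot notifications. This is plain POSIX code, not an excerpt from `ev_poll_posix.cc`.

```cpp
#include <poll.h>
#include <unistd.h>

// With level-triggered poll(), POLLIN keeps being reported for `fd` as long as
// unread data remains, so the fd must be read (or dropped from the interest
// array) to stop being reported.
void drain_when_ready(int fd) {
  pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  while (poll(&pfd, 1, /*timeout_ms=*/-1) > 0 && (pfd.revents & POLLIN)) {
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0) break;  // once drained (or closed), POLLIN stops being reported
    // ... hand the bytes to the endpoint's read callback ...
  }
}
```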
- **Polling engine on Windows**: The Windows polling engine looks nothing like the other polling engines
- Unlike the gRPC polling engines for Unix systems (`epollex`, `epoll1` and `poll`), the Windows endpoint implementation and the polling engine implementation are very closely tied together
- The Windows endpoint read/write API implementations use the Windows I/O API, which requires specifying an [I/O completion port](https://docs.microsoft.com/en-us/windows/desktop/fileio/i-o-completion-ports)
- In the Windows polling engine's `grpc_pollset_work()` implementation, ONE of the threads is chosen to wait on the I/O completion port while the other threads wait on a condition variable (much like the turnstile polling in epollex/epoll1)
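To give a flavor of the Windows model, here is a hedged sketch of the raw I/O completion port pattern the Windows engine builds on: associate a socket with a completion port, issue an overlapped `WSARecv`, and block in `GetQueuedCompletionStatus()`. It assumes `WSAStartup` has already been called and `sock` is a connected socket; it is not gRPC's actual Windows endpoint/polling code, and error handling is omitted for brevity.

```cpp
#include <winsock2.h>
#include <windows.h>
// Link with ws2_32.lib.

void iocp_sketch(SOCKET sock) {
  // One completion port can serve many sockets; endpoints specify such a port
  // when issuing reads/writes.
  HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
  CreateIoCompletionPort(reinterpret_cast<HANDLE>(sock), iocp,
                         /*CompletionKey=*/1, 0);

  // Start an overlapped (asynchronous) receive.
  char buf[4096];
  WSABUF wsabuf = {static_cast<ULONG>(sizeof(buf)), buf};
  OVERLAPPED overlapped = {};
  DWORD flags = 0;
  WSARecv(sock, &wsabuf, 1, NULL, &flags, &overlapped, NULL);

  // The "one thread waits on the completion port" role: block until some I/O
  // associated with the port completes.
  DWORD bytes = 0;
  ULONG_PTR key = 0;
  LPOVERLAPPED completed = NULL;
  if (GetQueuedCompletionStatus(iocp, &bytes, &key, &completed, INFINITE)) {
    // `bytes` bytes were received into `buf`; run the read callback here.
  }
}
```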