With very little effort we should be able to determine reasonable
timeouts based on prior query history. We track that history so we can
auto-scale when network conditions change (e.g. a provider failover
that shifts timings). Apple appears to do this within its system
resolver on macOS. Obviously we should have a minimum, maximum, and
initial value to make sure the algorithm doesn't somehow go off the
rails.
Values:
- Minimum Timeout: 250ms (approximate RTT half-way around the globe)
- Maximum Timeout: 5000ms (the timeout recommended by RFC 1123); this
can be reduced via ARES_OPT_MAXTIMEOUTMS, and whichever bound is in
effect also caps the retry timeout.
- Initial Timeout: User-specified via configuration or
ARES_OPT_TIMEOUTMS
- Average latency multiplier: 5x (a local DNS server returning a cached
value will respond more quickly than one that needs to recurse, so we
need to account for this)
- Minimum Count for Average: 3. This is the minimum number of queries we
need to form an average for the bucket.
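
Expressed as C constants, the values above might look like the
following sketch; the names are illustrative only, not actual c-ares
identifiers, and the initial timeout comes from the existing
configuration option rather than a constant:

    /* Hypothetical names, not the real c-ares identifiers. */
    #define DNS_TIMEOUT_MIN_MS   250  /* ~RTT half-way around the globe */
    #define DNS_TIMEOUT_MAX_MS  5000  /* recommended by RFC 1123        */
    #define DNS_LATENCY_MULT       5  /* cached vs. recursive lookups   */
    #define DNS_MIN_AVG_COUNT      3  /* samples needed for an average  */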
Per-server buckets track latency over time (these are ephemeral,
meaning they do not persist once a channel is destroyed). We record
both the current timespan for each bucket and the immediately preceding
timespan so that, across roll-overs, we can still maintain recent
metrics for calculations. The intervals (sketched as an enum after this
list) are:
- 1 minute
- 15 minutes
- 1 hour
- 1 day
- since inception
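
For illustration, these intervals might be represented in C roughly as
follows (hypothetical identifiers):

    /* Hypothetical interval identifiers, most recent first; the
     * "since inception" bucket never rolls over. */
    typedef enum {
      BUCKET_1MIN,
      BUCKET_15MIN,
      BUCKET_1HOUR,
      BUCKET_1DAY,
      BUCKET_INCEPTION,
      BUCKET_NUM        /* number of buckets per server */
    } bucket_id_t;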
Each bucket contains:
- timestamp (divided by interval)
- minimum latency
- maximum latency
- total time
- count
NOTE: average latency is (total time / count); we calculate this
dynamically when needed
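
One plausible C layout for these buckets, keeping the current timespan
and the immediately preceding one side by side (hypothetical, not the
actual c-ares structures):

    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    /* Hypothetical per-timespan metrics; not the actual c-ares layout. */
    typedef struct {
      time_t       timestamp;   /* current time divided by the interval */
      unsigned int min_latency; /* lowest observed query time (ms)      */
      unsigned int max_latency; /* highest observed query time (ms)     */
      uint64_t     total_ms;    /* sum of all query times (ms)          */
      size_t       count;       /* number of recorded queries           */
    } metrics_entry_t;

    /* Each per-server bucket holds the current and previous timespans
     * so recent metrics survive a roll-over. */
    typedef struct {
      metrics_entry_t current;
      metrics_entry_t prev;
    } metrics_bucket_t;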
Basic algorithm for calculating timeout to use would be:
- Scan from most recent bucket to least recent
- Check the bucket's timestamp; if it doesn't match the current time,
continue to the next bucket
- Check the bucket's count; if it is not at least the "Minimum Count
for Average", check the bucket's previous timespan, and if that also
has too few samples, continue to the next bucket
- If we reached the end with no bucket match, use "Initial Timeout"
- If a bucket is selected, take ("total time" / count) as the average
latency, multiply it by the "Average Latency Multiplier", and bound the
result by the "Minimum Timeout" and "Maximum Timeout"
NOTE: The timeout calculated may not be the timeout used. If we are
retrying the query on the same server, a larger value will be used.
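
A minimal sketch of the selection logic above, assuming the
hypothetical constants and types from the earlier sketches
(now_by_interval[] holds the current time pre-divided by each bucket's
interval):

    /* Sketch only: pick a timeout from the freshest usable bucket. */
    static unsigned int calc_timeout_ms(const metrics_bucket_t buckets[BUCKET_NUM],
                                        const time_t now_by_interval[BUCKET_NUM],
                                        unsigned int initial_timeout_ms)
    {
      size_t i;

      /* Scan from most recent bucket (1 minute) to least recent. */
      for (i = 0; i < BUCKET_NUM; i++) {
        const metrics_entry_t *e = &buckets[i].current;
        uint64_t               timeout;

        /* Bucket is stale for the current time: try the next one. */
        if (e->timestamp != now_by_interval[i])
          continue;

        /* Too few samples in the current timespan: fall back to the
         * preceding timespan before moving on to the next bucket. */
        if (e->count < DNS_MIN_AVG_COUNT) {
          e = &buckets[i].prev;
          if (e->count < DNS_MIN_AVG_COUNT)
            continue;
        }

        /* Average latency times the multiplier, clamped to the bounds. */
        timeout = (e->total_ms / e->count) * DNS_LATENCY_MULT;
        if (timeout < DNS_TIMEOUT_MIN_MS)
          timeout = DNS_TIMEOUT_MIN_MS;
        if (timeout > DNS_TIMEOUT_MAX_MS)
          timeout = DNS_TIMEOUT_MAX_MS;
        return (unsigned int)timeout;
      }

      /* No bucket had enough fresh data: use the configured initial
       * timeout. */
      return initial_timeout_ms;
    }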
On each query reply where the response is legitimate (proper response or
NXDOMAIN) and not something like a server error:
- Cycle through each bucket in order
- Check the bucket's timestamp against the current timestamp; if it is
out of date, overwrite the previous entry with the current values, then
clear the current values
- Compare current minimum and maximum recorded latency against query
time and adjust if necessary
- Increment "count" by 1 and "total time" by the query time
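
And a sketch of this recording step, reusing the same hypothetical
types (query_time_ms is the measured round-trip time of the reply):

    #include <string.h> /* memset */

    /* Sketch only: fold one legitimate reply into every bucket. */
    static void record_query_time(metrics_bucket_t buckets[BUCKET_NUM],
                                  const time_t now_by_interval[BUCKET_NUM],
                                  unsigned int query_time_ms)
    {
      size_t i;

      for (i = 0; i < BUCKET_NUM; i++) {
        metrics_entry_t *e = &buckets[i].current;

        /* Timespan rolled over: preserve it as "prev" so recent metrics
         * are still available, then start a fresh current entry. */
        if (e->timestamp != now_by_interval[i]) {
          buckets[i].prev = *e;
          memset(e, 0, sizeof(*e));
          e->timestamp = now_by_interval[i];
        }

        /* Track extremes (currently unused, kept for the future). */
        if (e->count == 0 || query_time_ms < e->min_latency)
          e->min_latency = query_time_ms;
        if (query_time_ms > e->max_latency)
          e->max_latency = query_time_ms;

        e->count++;
        e->total_ms += query_time_ms;
      }
    }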
Other Notes:
- This is always-on; the only user-configurable value is the initial
timeout, which simply reuses the current option.
- Minimum and Maximum latencies for a bucket are currently unused but
are there in case we find a need for them in the future.
Fixes Issue: #736
Fix By: Brad House (@bradh352)