[PATCH 0/5] Fix ntpd to not poll frequently

Thu Sep 25 15:47:38 UTC 2014

On Thu, Sep 25, 2014 at 3:25 PM, Miroslav Lichvar <mlichvar at redhat.com> wrote:
>> Keeping this in mind, bbox ntpd currently does a few things to speed up
>> clock sync. Such as "revert to MINPOLL polling interval if we step the clock".
>> The rationale is that if ntpd does discover that step is needed,
>> something unusual happened. Such as my laptop hibernating:
>> apparently my CMOS clock is busted, it doesn't "tick".
>
> Does your system set the RTC to the system time before suspend? The
> busybox ntpd doesn't seem to reset the MAXERROR field in adjtimex, so
> the kernel RTC synchronization (aka 11-minute mode) is disabled and
> something else is needed to set the RTC. Is that intentional?

My RTC clock is simply broken. It can be set, but it will not advance.
Neither when the laptop is on battery nor when it is on AC power.

Which makes it very useful for testing how ntpd behaves
on sudden clock jumps.

>> So after hibernating, the clock is off by at least a few seconds,
>> sometimes much more. ntpd needs to basically start syncing anew.
>> If it would do it with one request per 20 minutes, it won't go
>> "reasonably fast", right?
>
> No, I don't see why the polling interval should be reset in this case.
> After the clock was stepped, the time offset is close to zero, the
> frequency offset should be still good enough and the polling can
> continue as before the system was suspended.

There is no reason to believe the frequency offset is still good.
Assuming that it may be different now (say, different temperature),
keeping a 4096-second poll interval runs the risk that the
clock would drift by as much as a few seconds between
measurements.

> I'd be ok with it if suspending the system was the only reason the
> clock can be stepped, but there are others, including
>
> - another application is messing with the system clock
> - remote clock was stepped
> - network is congested
> - jitter is so large that the measured offset is above the step
>   threshold
> - frequency offset between local and remote clock is so large that
>   the time offset reaches the step threshold
>
> From these, I think polling interval should be shortened only in the
> last case and there is a problem that it's not so easy to reliably
> distinguish it from the other cases.

I don't want to guess about the cause of the stepping
(because I don't see a reliable way to do it).
I prefer to conservatively assume that something went wrong.

> If the local network connection is down, sendto() will fail and the
> code will keep trying to send the packet at 5-second intervals
> (RETRY_INTERVAL) independently from the normal polling interval.
> I think this is the most common case.
>
> If it's a problem somewhere else, I'm not sure what assumptions could
> be made. The network could be congested somewhere close and polling
> frequently could be making it worse.
>
> If ntpd is configured to use only
> one server, perhaps the service was stopped or the access was
> restricted for some reason. How does it help to reset the polling
> interval to the minimum here?
>
>> Who knows how long it lasted? What if it lasted many hours?
>> I do want to synchronize my clock soon after the network problem is fixed,
>> not 20 minutes after that.
>
> If the clock was synchronized before the sources became unreachable
> and they were not reachable for many hours, does it matter much if the
> first clock update after they are reachable again is delayed by one
> long polling interval?

Yes, it does. Needing more than one hour to set the clock after a network
outage is *stupid*.

>> """
>> Keep increasing the polling interval in the following situations:
>> - no replies are received from a peer
>> - no source can be selected
>> - peer claims to be unsynchronized (e.g. we are polling it too
>>   frequently)
>> - recv() returns with an error (e.g. the host doesn't exist or is not
>>   running an NTP service)
>> """
>>
>> I am not sure any of these conditions warrant increasing poll interval.
>>
>> Can you explain why you think it should be done?
>
> To make sure the maximum polling interval is always reached as it
> would normally. Imagine millions of clients not updating their clocks
> for one of the reasons listed above and getting stuck at a short
> polling interval (e.g. after they are restarted), increasing the
> traffic unnecessarily by orders of magnitude.

As I already said, I do understand the desire to keep a long poll interval.
But not at any cost.

There should be BALANCE with other goals.

The goal of providing reasonably fast time synchronization
is important too.

Can we stop thinking in terms of "my way or the highway" and find
a middle ground?

I won't accept patches which keep the poll interval growing
when no responses are arriving, or patches which drop the code
which lowers it when there seems to be some trouble.

But I *will* accept patches which make ntpd less aggressive
in doing that.

IOW: there are places in the code where the poll interval gets dropped
to MINPOLL (32 seconds).
I can accept a patch which would instead drop the poll interval
to, say, ~5 minutes. (Please do provide a rationale with the patch.)
That would be a roughly 10x decrease in NTP traffic in those cases, yet it
would still prevent stupid scenarios where the user has to wait for hours
for clock sync.

-- 
vda