RFD: Rework/extending functionality of mdev

Wed Mar 18 12:34:52 UTC 2015

On 18.03.2015 10:42, Didier Kryn wrote:
>> Long lived daemons should have both startup methods, selectable by a
>> parameter, so you make nobody's work more difficult than required.
>
>      OK, I think you are right, because it is a little more than a fork:
> you want to detach from the controlling terminal and start a new
> session. I agree that it is a pain to do it by hand and it is OK if
> there is a command-line switch to avoid all of it.

> But there must be this switch.

Ack!

>> No, restart is not required, as netlink dies, when fifosvd dies (or
>> later on when the handler dies), the supervisor watching netlink may
>> then fire up a new netlink reader (possibly after failure management),
>> where this startup is always done through a central startup command
>> (e.g. xdev).
>>
>> The supervisor, never starts up the netlink reader directly, but
>> watches the process it starts up for xdev. xdev does its initial
>> action (startup code) then chains (exec) to the netlink reader. This
>> may look ugly and unnecessarily complicated at first glance, but is
>> a known practical trick to drop some memory resources not needed by
>> the long lived daemon, but required by the start up code. For the
>> supervisor instance this looks like a single process, it has started
>> and it may watch until it exits. So from that view it looks, as if
>> netlink has created the pipe and started the fifosvd, but in fact this
>> is done by the startup code (difference between flow of operation and
>> technical placing the code).
>
>      I didn't notice this trick in your description. It is making more
> and more sense :-).

I left it out to avoid making things unnecessarily complicated, and
because I wanted to focus on the netlink / pipe operation.

>      Now look, since nldev (let's call it by its name) is execed by
> xdev, it remains the parent of fifosvd, and therefore it shall receive
> the SIGCLD if fifosvd dies. This is the best way for nldev to watch
> fifosvd. Otherwise it should wait until it receives an event from the
> netlink and tries to write it to the pipe, hence losing the event and
> the possible burst following it. nldev must die on SIGCLD (after piping
> available events, though); this is the only "supervision" logic it must
> implement, but I think it is critical. And it is the same if nldev is
> launched with a long-lived mdev-i without a fifosvd.

The netlink reader (nldev) does not need to watch fifosvd explicitly
via SIGCHLD.

Either that piece of code does its job, or it fails and dies. When
fifosvd dies, the kernel closes its read end of the pipe, unless there
is still a handler process holding it (which shall process the
remaining events from the pipe). As soon as there is neither a fifosvd
nor a handler process, the kernel shuts the pipe down and nldev gets an
error when writing to it, so it knows the other end died.

You won't gain much from watching SIGCHLD and reading the process
status. It only tells you whether the fifosvd process is still running
or has died (failed). You get the same information from the write to
the pipe: when the read end has died, you get EPIPE.
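
A minimal sketch of that check, in C. The helper names forward_event()
and ignore_sigpipe() are made up for this example and are not taken
from any existing nldev code. Note that SIGPIPE has to be ignored (or
handled), otherwise the writer is killed by the signal before write()
can return EPIPE.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE so that write() reports EPIPE instead of killing us. */
void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* Hypothetical helper: write one event to the pipe.
 * Returns 0 on success, 1 if the read end is gone (EPIPE),
 * -1 on any other error. */
int forward_event(int pipe_wr, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(pipe_wr, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, just retry */
            return errno == EPIPE ? 1 : -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A return value of 1 is the point where the reader knows that neither
fifosvd nor a handler holds the read end any more.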

Limiting the time nldev tries to write to the pipe would, however,
allow detecting a stuck fifosvd / handler (which SIGCHLD watching won't
give you) ... but (I discussed this with Laurent in parallel) the
question is how to react when the write to the pipe is stuck (but has
not failed). We can't do much here and are in trouble either way, but
Laurent gave this argument: the netlink socket also has a buffer, which
may hold additional events, so we do not lose them if processing
continues normally. When the kernel buffer fills up to its limit, let
the kernel react to the problem.

... otherwise you are right: nldev's job is to detect failure of the
rest of the chain (that is, to supervise it), and it has to react to
it. The details of the actions taken in this case need to be, and can
be, discussed (and may be adapted later) without much impact on other
operation.

This clearly means I'm open to suggestions on which kind of failure
handling shall be done. Every action that improves the reaction and
benefits the major purpose of the netlink reader, without blowing it up
needlessly, is of interest (keep in mind: long lived daemon, trying to
keep it simple and small).

My suggestion is: let the netlink reader detect relevant errors and
exec (not spawn) a script of a given name when there are failures. This
is small, and gives the invoked script full control over the failure
management (no fixed functionality in a binary). When done, the script
can either die, letting a higher instance do the job of restarting, or
exec back and restart the hotplug system (maybe with a different
mechanism). When the script does not exist, the default action is to
exit the netlink reader process unsuccessfully, giving a higher
instance a failure indication and the possibility to react to it.

>> Not detect? Sure you closed all open file descriptors for the write
>> end (a common caveat)? I have never been hit by such a case, except
>> anyone forgot to close all file descriptors of the write end.

>      You notice that something happened on input (AFAIR) but I'm sure
> you don't know what. It may be data as well. You must read() to know.

That information is all you need. Either the writer process is still
there (good), or it has gone (bad). This is all that is required to
decide what to do. More information may only be of interest for some
kind of logging or error message, but that should have been done before
the writer process died, not afterwards from the back end (which always
has less information than the writer itself).
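
To illustrate the "still there or gone" check on the read side, a small
C sketch (the function name wait_for_writer() and its return codes are
invented for this example). poll() reports POLLHUP on the read end of a
pipe once every write end has been closed; data that is still buffered
is reported with POLLIN first, so nothing is skipped.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper for the read side of the pipe.
 * Returns 1 when data is available, 0 when all writers are gone and
 * the pipe is empty, -1 on error or timeout. */
int wait_for_writer(int pipe_rd, int timeout_ms)
{
    struct pollfd pfd = { .fd = pipe_rd, .events = POLLIN };
    int r = poll(&pfd, 1, timeout_ms);

    if (r <= 0)
        return -1;              /* error, interrupted, or timeout */
    if (pfd.revents & POLLIN)
        return 1;               /* drain remaining data first */
    if (pfd.revents & POLLHUP)
        return 0;               /* every write end has been closed */
    return -1;
}
```

This only works as described if no stray copy of the write end is left
open in the reading process or its children.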

>      Anyway you don't want to poll() the pipe unless mdev-i is dead
> because you don't want to awake fifosvd for every event.

Therefore fifosvd polls the pipe only when there is no running handler
process. As soon as a handler is started (handing over the read end of
the pipe), fifosvd waits not for events on the pipe but for the exit of
the handler process (supervising it). When the handler exits, fifosvd
goes back to watching for more data arriving in the pipe. With a few
simple counter checks, fifosvd shall detect ping-pong play and avoid
endless respawning of a failing handler process. If that happens, it
spawns a failure script, waits until it exits, then retries the pipe /
handler operation.
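
One possible shape for such a counter check, as a C sketch. The struct,
the function name and the idea of counting spawns inside a time window
are my assumptions; the limits would need tuning against real handler
behaviour.

```c
#include <time.h>

/* Hypothetical respawn guard: counts handler starts within a window. */
struct respawn_guard {
    time_t window_start;    /* time of the first start in this window */
    int count;              /* starts seen in this window */
};

/* Call before each handler start. Returns 1 if the handler may be
 * started, 0 if it was started more than max_count times within
 * window_sec seconds (ping-pong detected). */
int respawn_allowed(struct respawn_guard *g, time_t now,
                    int max_count, int window_sec)
{
    if (g->count == 0 || now - g->window_start >= window_sec) {
        /* Start a fresh window. */
        g->window_start = now;
        g->count = 0;
    }
    g->count++;
    return g->count <= max_count;
}
```

When respawn_allowed() returns 0, fifosvd would spawn the failure
script, wait for it to exit, and only then go back to the normal pipe /
handler cycle.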

--
Harald