You can't spell "evil" without "vi".

Rob Landley rob at landley.net
Wed Oct 15 16:36:19 UTC 2008


On Tuesday 14 October 2008 05:34:46 Roberto A. Foglietta wrote:
> 2008/10/14 Denys Vlasenko <vda.linux at googlemail.com>:
> > On Tuesday 14 October 2008 10:48:54 am Rob Landley wrote:
> >> On Sunday 12 October 2008 23:48:22 Rob Landley wrote:
> >> > Next time it reads a buffer, it starts with the last character of a
> >> > cursor left sequence: capital D.  Capital D is "delete to end of
> >> > line", which it does.
> >> >
> >> > So basically, busybox vi is corrupting your data when you cursor
> >> > around in a file on a loaded system.  Wheee...
> >>
> >> Hmmm...  I redid the readit() code to only read ahead when processing an
> >> escape sequence.  (This let me shrink the readahead buffer to 8 bytes,
> >> actually 5 but with 32 bit alignment 8 makes more sense.  Bloatcheck
> >> says I shrunk the code by 17 bytes.)
> >
> > Disregard my previous patch, I just looked at your code and it's
> > as good but it's smaller than mine, so let's use yours.
> >
> >> Unfortunately, this mitigated the problem a bit, but didn't actually
> >> _fix_ it. It happens less often, but I can still trigger it.
> >>
> >> I _think_ this is actually a qemu issue.  The escape sequences are being
> >> generated by the host Linux, which are then sent to the qemu process
> >> over a virtual serial console, which breaks them down into individual
> >> bytes with an interrupt for each.
> >>
> >> This means that the blocking we're depending on to parse escape
> >> sequences doesn't work over a serial console.  You _can get an escape
> >> character by itself with poll saying there's no more data, and then on
> >> the next clock cycle you can get a "[D".
> >>
> >> Hmmm...
> >>
> >> Ok, making poll wait 300 miliseconds before deciding there's no next
> >> character in a pending escape sequence seems to have fixed it.  (At
> >> least I can't reproduce the problem under qemu anymore.)
> >
> > Please document this next time, or someone else might come later
> > and delete the timeout. I did this a few mins ago :( will fix it now.
> >
> > Did you try something smaller than 300ms?
>
>  As far as I understood the problem: considering a 1200bps line, 120
> chars per second, 40 escape sequence per seconds, then the minimum
> timeout should be at least 1/40 sec = 25 ms.

It's not the number of escape sequences per second, it's the delay between 
each character in the escape sequence.  (The problem is when we don't 
recognize that this string of characters should be interpreted as a block, 
and instead interpret them as individual characters.)  So if this was the 
_only_ issue, the minimum delay needed is 1/120th of a second.  (But see 
below.)

> This is for a fixed speed 
> line, considering an asynchronous data line with an average speed of
> 1200bps and a bell curve variance of 25 ms then using a timeout of
> 100ms would catch the 0.999936657516% of the escape sequences.

I'm not following that at all.  If it's just a question of giving the next 
character time to come in at 1200bps, a 25 ms delay should catch 100.00% of 
the escape sequences.  There's no "variance" in the timing of characters 
coming in on a serial line, if they didn't work on a hard clock the data 
would be corrupted at the physical link layer.

> Enlarging the timeout to 150ms 0.999999998027%. Over 200ms should not
> make any sense any more if the variance has been correctly estimated
> in 25 ms.
>
> http://en.wikipedia.org/wiki/Normal_distribution

Where does statistics enter into serial communications?

No, the specific case I'm worried about is virtual systems running under an 
emulator.  (In addition to kvm and xen and such, I bang on qemu a lot to test 
busybox on non-x86 platforms.  A virtual serial console is a standard way to 
talk to that sort of thing.)

The emulator process is a normal userspace process on the host system, subject 
to the process scheduler and thus capable of having its execution suspended 
at any time for an arbitrary length of time on a loaded host system.  When 
the host system schedules away the emulator process, the virtual system's 
processor pauses.  But the poll() timeout is in wall clock time, which 
doesn't pause.  This means that when the emulator _does_ resume, there's a 
race between the timer interrupt (telling the poll to return because there 
was no data within the requested time period) and the serial interrupt 
(queueing the new data).  Which one takes precedence?  I have no idea, that's 
a "dig into the guts of the Linux kernel" thing.

If the timeout take precedence over the servicing the serial interrupt, then 
we need a long enough delay to make sure that the emulator process gets 
scheduled to handle the next character before the timeout.  If the serial 
interrupt takes precedence over the timer, then even a delay of 25 
miliseconds should be plenty.

(The other fun little detail is that the _sending_ process on the other end of 
the serial line can get scheduled away, but it submits data to the kernel as 
a multi-character block so even if your terminal process gets scheduled away, 
it either hasn't sent the escape sequence yet or has queued the whole thing 
into the kernel's serial output buffer which isn't subject to userspace 
scheduling.  At least as long as the write blocks until complete instead of 
returning a short write.  I think.)

I think I can work out a test for this (write a script suspending the qemu 
process with "while true; do kill -STOP $PID; sleep 1; kill -CONT $PID; sleep 
1; done" and then hold down "cursor left" in vi for a couple minutes and see 
if it zaps the line it's one.  Those delays are long enough that even the 
current 300 ms wait to connect a partial sequence won't be long enough, so it 
should either manifest the bug or let us know that we can cut the delay down 
to something unnoticeable like 50ms and still be safe.)

(P.S.  Most network things send data in larger blocks (256 bytes for ppp, 1500 
bytes for 100baseT, 65k for gigabit) so this problem doesn't easily manifest 
via ssh, just via serial connection.  It could theoretically manifest via ssh 
on a poor connection where you queue up 500+ cursor movements, have an escape 
sequence cross packet boundaries due to the MTU size, and then have the 
second packet delayed by retransmission.  But that's kind of unlikely.)

Rob



More information about the busybox mailing list