How do I (unconditionally) enable unicode support in busybox?

Denys Vlasenko vda.linux at googlemail.com
Mon Aug 11 18:28:50 UTC 2014


On Mon, Aug 11, 2014 at 8:12 PM, James Bowlin <bitjam at gmail.com> wrote:
> On Mon, Aug 11, 2014 at 08:02 PM, Denys Vlasenko said:
>> Looks like you want to set CONFIG_UNICODE_SUPPORT=y and unset
>> both of these latter options.  This way, all busybox applets
>> should always work in Unicode mode.
>
> That set up does not work for sed, although it sometimes works.

I just tried your sed example.

busybox's sed code uses regexp routines for s///.
Those routines are part of libc, so they will be Unicode-aware
only if you are using CONFIG_UNICODE_USING_LOCALE=y.
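
To illustrate why the libc locale matters here, below is a rough sketch (not busybox code) that counts how many times the regex "." matches in a string under a given locale. The helper name count_dot_matches is made up for this example; setlocale(), regcomp(), regexec() and regfree() are the standard libc/POSIX calls.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Counts how many times "." matches in text after switching to the
 * given locale. Returns -1 if the locale is unavailable or the regex
 * does not compile.
 */
int count_dot_matches(const char *locale_name, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;

    assert(locale_name != NULL && text != NULL);

    /* The libc regex routines follow the current locale. */
    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    if (regcomp(&re, ".", 0) != 0)
        return -1;

    /* Match repeatedly, advancing past each matched character. */
    while (*text != '\0' && regexec(&re, text, 1, &m, 0) == 0) {
        if (m.rm_eo <= 0)
            break;              /* guard against an empty match */
        count++;
        text += m.rm_eo;
    }

    regfree(&re);
    return count;
}
```

On glibc, calling this with the "C" locale and the UTF-8 bytes of ÀÀÀ should count 6 matches (one per byte). With a UTF-8 locale such as en_US.UTF-8, assuming it is installed, I would expect 3.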

I just built busybox against glibc with these options:

CONFIG_LOCALE_SUPPORT=y
CONFIG_UNICODE_SUPPORT=y
CONFIG_UNICODE_USING_LOCALE=y
# CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is not set
CONFIG_SUBST_WCHAR=63
CONFIG_LAST_SUPPORTED_WCHAR=4351
# CONFIG_UNICODE_COMBINING_WCHARS is not set
# CONFIG_UNICODE_WIDE_WCHARS is not set
# CONFIG_UNICODE_BIDI_SUPPORT is not set
# CONFIG_UNICODE_NEUTRAL_TABLE is not set
# CONFIG_UNICODE_PRESERVE_BROKEN is not set

And I'm getting this (4 bytes: three "x" plus the newline, so sed
replaced each À as a single character):

$ export LANG=en_US.UTF-8
$ echo ÀÀÀ | ./busybox sed 's/./x/g' | wc -c
4

> The most mysterious thing is that it sometimes works and what
> I need to do to get it to work in /init in an initrd.

You may have found a bug: busybox never calls setlocale() when it
runs as init (PID 1). According to git log, this behavior has been
there from the very beginning:


commit e5dfced23a904d08afa5dcee190c3c3d845d9f50
Author: Eric Andersen <andersen at codepoet.org>
Date:   Mon Apr 9 22:48:12 2001 +0000

    Apply Vladimir's latest cleanup patch.

...
...
+#ifdef BB_LOCALE_SUPPORT
+       if(getpid()!=1) /* Do not set locale for `init' */
+               setlocale(LC_ALL, "");
+#endif


This probably should be changed so that init is not special.
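
A minimal sketch of what "init is not special" could look like, as a standalone function rather than an actual busybox patch. The name apply_env_locale is invented for this example, and the fallback to the current locale is my own assumption, not something taken from the busybox source.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Applies the locale from the environment for every process,
 * with no getpid() != 1 exception for init.
 * Returns the name of the locale that is in effect afterwards.
 */
const char *apply_env_locale(void)
{
    /* "" means: take the locale from LANG / LC_* in the environment. */
    const char *name = setlocale(LC_ALL, "");

    /* If the environment names an unavailable locale, setlocale()
     * returns NULL and leaves the locale unchanged; query it instead. */
    if (name == NULL)
        name = setlocale(LC_ALL, NULL);

    assert(name != NULL);
    return name;
}
```

Note that in an initrd this only helps if the locale data is actually present and LANG (or LC_ALL) is set in init's environment.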

As to your other cases, they are interesting too.
For example, you noticed that ${#VAR} handling is buggy.
Even on the above-mentioned build, I get this:

$ export LANG=en_US.UTF-8
$ ./busybox sh
/home/srcdevel/bbox/fix/busybox.4z $ a=ÀÀÀ; echo ${#a}
6

whereas a "standard" shell gives 3: busybox is counting bytes here,
not characters.

This is clearly a bug (or at least "incompatibility").
Please report each such bug separately.
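
For reference, the 6 versus 3 difference is the difference between counting bytes and counting UTF-8 characters. The sketch below is not how the busybox shell computes ${#VAR}; utf8_char_count is a made-up helper, and it assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen(), which gives the byte count */

/*
 * Hypothetical helper, for illustration only.
 * Counts characters (code points) in a valid UTF-8 string by
 * skipping continuation bytes, which have the form 10xxxxxx.
 */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;        /* lead byte or plain ASCII byte */
    }
    return n;
}
```

For the UTF-8 encoding of ÀÀÀ (bytes C3 80 C3 80 C3 80), strlen() gives 6 while utf8_char_count() gives 3.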

> The "wc -m" solution always works (for me) so my problem is solved.
> But there is still a strange problem with sed and unicode that
> Harald was able to reproduce.

Which problem? There are so many mails in this thread...

