How do I (unconditionally) enable unicode support in busybox?
Denys Vlasenko
vda.linux at googlemail.com
Mon Aug 11 18:28:50 UTC 2014
On Mon, Aug 11, 2014 at 8:12 PM, James Bowlin <bitjam at gmail.com> wrote:
> On Mon, Aug 11, 2014 at 08:02 PM, Denys Vlasenko said:
>> Looks like you want to set CONFIG_UNICODE_SUPPORT=y and unset
>> both of these latter options. This way, all busybox applets
>> should always work in Unicode mode.
>
> That set up does not work for sed, although it sometimes works.
I just tried your sed example.
busybox sed code uses regexp routines for s///.
Those routines are part of libc. Therefore, they will be Unicode-aware
only if you are using CONFIG_UNICODE_USING_LOCALE=y.
I just built busybox against glibc with these options:
CONFIG_LOCALE_SUPPORT=y
CONFIG_UNICODE_SUPPORT=y
CONFIG_UNICODE_USING_LOCALE=y
# CONFIG_FEATURE_CHECK_UNICODE_IN_ENV is not set
CONFIG_SUBST_WCHAR=63
CONFIG_LAST_SUPPORTED_WCHAR=4351
# CONFIG_UNICODE_COMBINING_WCHARS is not set
# CONFIG_UNICODE_WIDE_WCHARS is not set
# CONFIG_UNICODE_BIDI_SUPPORT is not set
# CONFIG_UNICODE_NEUTRAL_TABLE is not set
# CONFIG_UNICODE_PRESERVE_BROKEN is not set
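And with this build sed works as expected: each of the three characters
is replaced, giving 4 bytes (three "x" plus the newline):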
$ export LANG=en_US.UTF-8
$ echo ÀÀÀ | ./busybox sed 's/./x/g' | wc -c
4
> The most mysterious thing is that it sometimes works and what
> I need to do to get it to work in /init in an initrd.
You may have found a bug. busybox never calls setlocale()
when it runs as init (PID 1). According to git log, this behavior has
been there from the very beginning:
commit e5dfced23a904d08afa5dcee190c3c3d845d9f50
Author: Eric Andersen <andersen at codepoet.org>
Date: Mon Apr 9 22:48:12 2001 +0000
Apply Vladimir's latest cleanup patch.
...
...
+#ifdef BB_LOCALE_SUPPORT
+ if(getpid()!=1) /* Do not set locale for `init' */
+ setlocale(LC_ALL, "");
+#endif
This probably should be changed so that init is not special.
As to your other cases, they are interesting too.
For example, you noticed that ${#VAR} handling is buggy.
Even on the above-mentioned build, I get this:
$ export LANG=en_US.UTF-8
$ ./busybox sh
/home/srcdevel/bbox/fix/busybox.4z $ a=ÀÀÀ; echo ${#a}
6
whereas a "standard" shell gives 3: busybox sh is counting bytes, not characters.
This is clearly a bug (or at least an "incompatibility").
Please report each such bug separately.
> The "wc -m" solution always works (for me) so my problem is solved.
> But there is still a strange problem with sed and unicode that
> Harald was able to reproduce.
Which problem? There are so many mails in this thread...