How do I (unconditionally) enable unicode support in busybox?

Tanguy Pruvot tanguy.pruvot at gmail.com
Wed Aug 6 12:36:48 UTC 2014


On Android the terminal is always UTF-8 (there is only one locale and it
is forced).

Everything works except the vi cursor on UTF-8 data: the text is displayed
correctly, even multi-column Chinese characters, but the cursor ends up in
the wrong place as soon as simple UTF-8 (European accented letters) is
present in the line.

I was trying to implement this myself; here is a simple implementation of
mblen:

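+#ifdef BIONIC_UTF8_MBLEN
+/* Length in bytes of the UTF-8 sequence that starts at s, judged from
+ * its first byte only. Returns 0 for a continuation byte, -1 for NULL. */
+static int utf_mblen(const char *s)
+{
+       int buf_mb_len = 1; // ASCII, or a byte that is not valid UTF-8
+       unsigned char c;
+
+       if (s == NULL) {
+               return -1;
+       }
+
+       c = (unsigned char)*s;
+       if ((c & 0xF8) == 0xF0)
+               buf_mb_len = 4; // 11110xxx: start of a 4-byte sequence
+       else if ((c & 0xF0) == 0xE0)
+               buf_mb_len = 3; // 1110xxxx: start of a 3-byte sequence
+       else if ((c & 0xE0) == 0xC0)
+               buf_mb_len = 2; // 110xxxxx: covers 0xC0..0xDF, not only 0xC0..0xCF
+       else if ((c & 0xC0) == 0x80)
+               buf_mb_len = 0; // 10xxxxxx: continuation byte, no new char
+
+       return buf_mb_len;
+}
+#endif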
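
To show how such a length function could feed the cursor logic, here is a
minimal standalone sketch. It is not BusyBox code: seq_len() and
char_index_at() are hypothetical names, and it assumes one screen column
per character (so no wide or combining characters) and mostly valid UTF-8.
It turns a byte offset within a line into a 0-based character index.

```c
#include <stddef.h>

/* Hypothetical helper: bytes in the UTF-8 sequence whose first byte is c.
 * Returns 0 when c is a continuation byte (10xxxxxx). */
static int seq_len(unsigned char c)
{
	if ((c & 0xF8) == 0xF0) return 4;
	if ((c & 0xF0) == 0xE0) return 3;
	if ((c & 0xE0) == 0xC0) return 2;
	if ((c & 0xC0) == 0x80) return 0;
	return 1;
}

/* Hypothetical helper: number of characters that start before byte
 * offset `off` in `line`. Assumes one column per character. */
static size_t char_index_at(const char *line, size_t off)
{
	size_t i = 0, chars = 0;

	while (i < off && line[i] != '\0') {
		int n = seq_len((unsigned char)line[i]);

		if (n == 0)
			n = 1; /* stray continuation byte: treat as one char */
		chars++;
		/* Step over the sequence, never past `off` or the final NUL. */
		while (n-- > 0 && i < off && line[i] != '\0')
			i++;
	}
	return chars;
}
```

For the line "h\xc3\xa9llo" (that is, "héllo"), byte offset 3 points at the
first 'l', and char_index_at() returns 2 for it rather than 3.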



2014-08-06 14:04 GMT+02:00 James Bowlin <bitjam at gmail.com>:

> On Wed, Aug 06, 2014 at 11:28 AM, Harald Becker said:
> > Have you ever considered to set an "export LANG=en_US.UTF-8" in
> > /etc/profile?
> > This script file gets run whenever ash is started as a login shell
> > (e.g. via getty/login or with a leading dash from /etc/inittab
> > "-/bin/sh"). Setting this one line enables UTF-8 support for me. For
> > any other shell script, executed during boot, you can just set
> > "export LANG=..." at the start of the script. So what is wrong with
> > this?
>
> Thanks for the reply.  I already have it working in login shells
> just fine.  The problem is mainly when I use a busybox script
> as /init inside an initrd (initramfs).  It runs as process 1.
> The /init script is called directly by the bootloader.  Its
> environment is controlled by the command line parameters which
> are set by the *user*.   Worse, in this situation:
>
>     export LANG=...
>
> DOES NOT WORK for changing subsequent unicode behavior.  The
> behavior is set and locked in by the value of LANG in the initial
> environment which I don't have control over.  It works fine from the
> command line on my development system.  I only see the strange behavior
> when it is run inside an initrd to boot a system which is what I
> actually want to do.  I only run it from the command line on my
> development system for testing.
>
> This is why I am writing to this list.  When busybox is called by
> a bootloader, exporting LANG does not change the unicode behavior.
> If exporting worked then the original code I posted would work:
>
>     echo -n "$x" | LANG=utf sed 's/./x/g' | wc -c
>
> but this does not work when I use it in the initrd.  The sed in
> that code ignores its own environment and goes by whatever the
> very first value of LANG was in the first busybox shell that
> runs.
>
> Perhaps I can phrase it more succinctly as:
>
>     The command "export LANG=..." has no effect on subsequent
>     busybox unicode behavior when a busybox script is called directly
>     from a bootloader.  The only thing that controls the unicode
>     behavior in this case is the initial value of LANG when the
>     first busybox shell script is called as /init.
>
> I admit this is strange, bordering on unbelievable. Especially
> so because this same problem does not exist when running from
> the command line.  In that case "export LANG=..." works just
> as expected for changing subsequent unicode behavior and the
> line of code above works as expected.
>
> Someone else hit this problem back in January:
> http://lists.busybox.net/pipermail/busybox/2014-June/081021.html
>
> They said:
>
> > Exporting LANG in rcS didnt have an effect
>
> They ended up submitting a patch to hard-code the LANG variable
> into busybox.  If "export LANG=..." had worked as we expect then
> there would have been no need for that patch to hard-code a
> default LANG.
>
> BTW: I've found an ugly kludge that fixes my problem but does
> not fix the busybox unicode problem in general. Instead of running:
>
>     echo -n "$x" | LANG=utf sed 's/./x/g' | wc -c
>
> I can use (effectively):
>
>     echo -n "$x" | tr -d '\200-\277' | wc -c
>
> This will give the correct length for all utf-8 sequences.  The
> busybox tr does not allow octal to mix with range, "-", so all 64
> octal sequences must be used.
>
> TR='\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217'
> TR=$TR'\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237'
> TR=$TR'\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257'
> TR=$TR'\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277'
>
>     echo -n "$x" | tr -d $TR | wc -c
>
> I'm still willing to help to find a more general solution to the
> busybox unicode problem.
>
>
> Peace, James
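
The tr workaround quoted above works because every UTF-8 character has
exactly one byte outside the range 0x80..0xBF (octal 200..277), so deleting
those bytes and counting the rest counts characters. The same idea in C, as
a rough sketch only: utf8_strlen() is a hypothetical name, not an existing
BusyBox function, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: number of UTF-8 characters in s, counted as the
 * bytes that are NOT continuation bytes (0x80..0xBF).
 * Mirrors `tr -d '\200-\277' | wc -c`; assumes valid UTF-8 input. */
static size_t utf8_strlen(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	}
	return n;
}
```

Like the tr version, this does not depend on LANG at all, which is why it
is unaffected by whatever the initial environment was.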