How do I (unconditionally) enable unicode support in busybox?

Harald Becker ralda at gmx.de
Tue Aug 12 06:21:18 UTC 2014


Hi James!

>> I assume the cases where you got correct results, was with
>> glibc linked BB.
>
> Yes, but I only just got it working in the chroot today by
> copying a 32 Meg locale-archive file into the chroot.

This is the problem with glibc: it is heavily overdone (meaning big).
That is the reason why so many people link small programs (e.g. for an
initramfs or chroot) statically against a smaller libc. One of those
tested libs is uClibc, but this doesn't mean there are no bugs. Most
functions work, but some flaws may be seen.

> it when I was using glibc.  I believe I was told here that if I
> was using glibc then I need to provide locale files in order to
> get unicode support to work.  But no one would tell me *which*
> files were needed.

Sorry for this: Have you ever tried to google for "glibc locale files"?
... but don't expect to get a simple answer. That's the big trouble with 
glibc. Either use it all (do a full install) or be on your own.
In addition, things changed between glibc versions, so I don't know the
exact files required to get things going. You will need to experiment.


> I couldn't get it to work and I think it is
> really silly anyway because it's horribly wasteful IMO to require
> a bunch of files to provide the system with a single bit of
> information: "treat strings as unicode".

Locale information is a bit more than a single bit, but tell that to
the glibc developers ... or look for an alternative. uClibc is one of
those alternatives supported by Busybox (Rob Landley creates prebuilt
root systems with his tool chains based on uClibc), but it is not the
only alternative. Each libc, however, has its own type of configuration
and trouble. Sorry for this, but we are not living in a perfect world.

> Since both uclibc and busybox have separate config options for
> enabling unicode and using  locales, I thought uclibc was going
> to be the answer but I was wrong.

uClibc is the well-known answer for going into small initramfs/chroot
environments ... but then you hit this sed bug ... sorry. I could detect
that there is a bug in the regexp handling of uClibc, so one of the
uClibc maintainers needs to look into this. You may forward this to them.

> files!  The curve-ball I wasn't expecting was (as you report)
> unicode support in sed using uclibc is unconditionally broken.

No one expected this, so we were confused and looked at the wrong
topics first, especially as your explanation was confusing.

>> If the config options are right, there is the simple option of
>> setting LANG=UTF-8, and that works as you see with wc -m. The
>> sed problem is a different bug. Don't mix those.
>
> No.  That is simply not true.  There is other breakage.
>
>    1) it is claimed that unspecified and potentially large files
>        need to be included in the initrd to get unicode support to
>        work but they are not always needed and they don't always
>        work (except for "wc -m").  IMO until the exact files that
>        are needed are specified and until those files are
>        reasonably small, this is still a problem for users of
>        busybox.

Which files are required depends on the libc you use. As we are here on
the BB list, don't expect to get all information outside the scope of BB
cooked to the well-done state. Much information can be found using a
search engine. Besides this, we try to give as much information as
possible, but I couldn't tell you all the required file names as I do
not know them all.

>    2) sed is still broken but it sometimes works if I include
>       locale files.  The strange thing isn't that it is broken;
>       the strange thing is that it works sometimes even without
>       locale files.  This was the only way I got busybox to count
>       unicode characters at first.  I got it to work sometimes in
>       the initrd without locale files so my focus was on getting
>       it to work there reliably.  I admit this may have been
>       misleading but it certainly seemed reasonable at the time.

I don't know what versions of libc/options you used for this. Maybe
uClibc can be built with a working regexp. Currently I'm able to
reproduce your problem with uClibc versions of BB (which don't need
locale files, but all fail for this sed example). Based on the confusing
descriptions you gave, I assume you mixed things during your tests.
There were at least irregularities in your description.

>    3) Then there was the bug (I think) of not calling setlocale
>       if pid=1.  This was very unexpected and it made things
>       more murker.

Misleading problem. This isn't the reason. Correct setting of the LANG
variable should work even if setlocale is not called in pid 1, as all
UTF-8 handling is done in processes forked from your initial script.

>    4) ${#x} is unconditionally broken unless the spec is for it to
>       count bytes instead of characters.

This is a nasty question: which behavior is right? What does the
specification say? Unconditionally returning the number of characters
may lead to buffers allocated with the wrong number of bytes. You can't
simply change that without looking at the reference.
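
To make the bytes-versus-characters distinction concrete, here is a
minimal sketch in Python (not shell, and not Busybox code). The sample
string and the two helper names are made up for illustration only:

```python
# Hypothetical sample value: "naive cafe" with two accented letters,
# written with escapes so the encoding of this text does not matter.
x = "na\u00efve caf\u00e9"

def char_count(s: str) -> int:
    """Number of characters (Unicode code points) in s."""
    return len(s)

def byte_count(s: str) -> int:
    """Number of bytes s occupies once encoded as UTF-8."""
    return len(s.encode("utf-8"))

print(char_count(x))  # 10
print(byte_count(x))  # 12
```

A script that sizes a buffer from one of these numbers while the shell
reports the other is off by two here, which is the kind of mismatch
meant above.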

>    5) printf "%Ns" seems to be unconditionally broken. On my
>       host system it counts characters not bytes.

No. C printf is specified in terms of the number of bytes. It looks ugly
with UTF-8, and this is indeed a glitch which may lead to changes in the
spec.

Currently I don't know of such changes. So the behavior is correct
according to the spec, even if it sounds strange. I didn't write the spec.
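
The two padding conventions can be compared side by side. The following
is only a Python emulation of a "%6s"-style field, not a call into C
printf; the function names and the sample word are assumptions made for
the illustration:

```python
def pad_by_bytes(s: str, width: int) -> str:
    """Right-justify s so the result is `width` bytes long in UTF-8."""
    raw = s.encode("utf-8")
    return raw.rjust(width, b" ").decode("utf-8")

def pad_by_chars(s: str, width: int) -> str:
    """Right-justify s so the result is `width` characters long."""
    return s.rjust(width)

word = "caf\u00e9"  # 4 characters, 5 bytes in UTF-8

print("[" + pad_by_bytes(word, 6) + "]")  # one space of padding
print("[" + pad_by_chars(word, 6) + "]")  # two spaces of padding
```

With byte counting the column looks one cell too narrow on a UTF-8
terminal, which is the "ugly" effect described above.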

> Things are starting to clear up but they are generally still
> quite murky.

They are "murky," as you call it, as long as you mix things up.

> There is conflicting advice about which busybox
> options to use.  There was (until the last day or two) incorrect
> advice about which library to use.

"incorrect advice"? There were suggestions what you can try to solve 
your problem. The only (more or less) complete recipe known to lead to
working environments is to use prebuilt systems, which you didn't want
to follow.

> It is still unclear to me
> which locale files are needed and how small they can be.

You are asking the wrong people. This depends on glibc.

> Not running setlocale when pid=1 seems bizarre.

Yes, and we need to investigate this further, but this doesn't hit your
problem. It would only hit you if the shell in pid 1 is used
interactively without forking. As soon as you fork, the environment is
exported and may be set using your script.
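
A small sketch of the inheritance part only, written in Python. It says
nothing about whether a given Busybox build calls setlocale; it just
shows that a child process sees the LANG its parent passed on. The
helper name and the value "en_US.UTF-8" are assumptions chosen for the
example:

```python
import os
import subprocess
import sys

def lang_seen_by_child(value: str) -> str:
    """Start a child process with LANG set; return the LANG it reports."""
    env = dict(os.environ, LANG=value)  # copy of our environment plus LANG
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('LANG', ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(lang_seen_by_child("en_US.UTF-8"))
```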


>  It is still not
> clear to me what works and what doesn't (unicode wise) and how
> that changes with which libc you use and how you configure it.

You are looking at it the wrong way. Choose a working libc you like,
then build Busybox. Denys told you the three or four BB options required
for this. Questions about the libc are in the wrong place here. BB is an
application which needs a functioning libc. That's the short answer.

> Below I conclude with the suggestion that I could help work on a
> README.unicode file that tries to supply clear information to
> users of busybox who want unicode support.

I don't assume you fully understand how things are tied together, but
it is not wrong to save your story of success/failure for other users
looking into this problem. UTF-8 support is still not complete in all
systems; they are all stepping slowly towards a more consistent state,
but all this needs to be done in conformance with the specs, which may
not always suit your needs.

>> You mixed tests with glibc and uClibc. Giving us results of
>> tests without telling which versions where involved mislead us
>> wrong assumptions.
>
> I thought I was clear that I was using glibc up until I said I
> was trying to link it with uclibc to see if that fixed the
> problem.

I told you I jumped into the thread somewhere in the middle, as I
noticed you need help which I may be able to give. I didn't get the
details from the beginning.

> Part of the reason I did this is that I thought you
> were telling me it worked on your system and I assumed you were
> using uclibc.

I'm currently in the middle of installing a new system (I got a brand
new ZOTAC ZBOX CI310 with a quad core Celeron N2930, fanless = silent
system). The only working systems I have at the moment are different
distributions, including their BB builds. On the desktop I work with a
full glibc system; for embedded usage I build or pick a static BB with
either uClibc or musl (my older versions are with uClibc, my last try
was with musl), but that doesn't mean every version works for every
application.

> Ever since you gave me the "wc -m" solution, my main motivation
> for pursuing this was to help out busybox by reporting on a
> strange bug or perhaps a series of strange bugs.

Bugs of the uClibc regexp, not BB: the regexp character "." (dot)
matches bytes instead of characters. That is, the regexp doesn't seem to
have UTF-8 support :(
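
The symptom can be imitated with Python's re module, which matches
per byte on bytes input and per character on str input. This is only a
stand-in to show what "dot matches bytes" means; it is not uClibc code,
and the helper name is made up:

```python
import re

text = "\u00e9"             # one character (e with acute accent)
raw = text.encode("utf-8")  # the same character as two bytes: C3 A9

def dot_matches_whole(pattern, subject) -> bool:
    """True if the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None

print(dot_matches_whole(".", text))    # character-aware: True
print(dot_matches_whole(rb".", raw))   # byte-wise, one dot: False
print(dot_matches_whole(rb"..", raw))  # byte-wise, two dots: True
print(re.sub(rb".", b"x", raw))        # b'xx', two replacements
```

A byte-wise sed 's/./x/g' therefore emits two replacements for one
accented letter, where a UTF-8 aware one would emit a single one.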

--
Harald

