[Bug 15748] New: with leading quote, printf prints value of first byte of a character instead of its numeric value in the codeset

bugzilla at busybox.net bugzilla at busybox.net
Thu Aug 31 07:15:14 UTC 2023


https://bugs.busybox.net/show_bug.cgi?id=15748

            Bug ID: 15748
           Summary: with leading quote, printf prints value of first byte
                    of a character instead of its numeric value in the
                    codeset
           Product: Busybox
           Version: 1.35.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Standard Compliance
          Assignee: unassigned at busybox.net
          Reporter: cslycord at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

According to the POSIX standard for printf:
If the leading character is a single-quote or double-quote, the value shall be
the numeric value in the underlying codeset of the character following the
single-quote or double-quote.

This implies that it should be the character's codepoint, which is what is used
in coreutils and bash.

In BusyBox, printf can instead return the value of the first byte of the
character's multibyte encoding.
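
The difference between the two interpretations can be sketched as follows.
This is only an illustration, not BusyBox or coreutils code: the function
names are hypothetical, and it assumes the argument is UTF-8 encoded.

```python
# Illustration only: hypothetical helpers, assuming a UTF-8 codeset.

def _after_quote(arg: str) -> str:
    """Return the text following a leading single or double quote."""
    if len(arg) < 2 or arg[0] not in ("'", '"'):
        raise ValueError("expected a leading quote followed by a character")
    return arg[1:]

def first_byte_value(arg: str) -> int:
    """Value of the first encoded byte after the quote (the behaviour reported here)."""
    return _after_quote(arg).encode("utf-8")[0]

def codepoint_value(arg: str) -> int:
    """Code point of the first character after the quote (what coreutils/bash print)."""
    return ord(_after_quote(arg)[0])

if __name__ == "__main__":
    for arg in ("'바", "'학"):
        print(arg, format(first_byte_value(arg), "X"), format(codepoint_value(arg), "X"))
```

For ASCII characters both functions agree, which is why the difference only
shows up with multibyte characters.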

Examples:
* 바
   HEX codepoint: BC14
   DEC codepoint: 48148
   Hex UTF-8 bytes: EB B0 94
   (UTF-8 bytes converted to DEC): 235 176 148
https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%EB%B0%94&mode=char
* 학
   HEX codepoint: D559
   DEC codepoint: 54617
   Hex UTF-8 bytes: ED 95 99
   (UTF-8 bytes converted to DEC): 237 149 153
https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%ED%95%99&mode=char
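
The relation between the codepoints and the UTF-8 bytes listed above can be
checked by hand. The sketch below (a hypothetical helper, covering only the
three-byte range U+0800 to U+FFFF that these characters fall in) packs the
16 codepoint bits into the pattern 1110xxxx 10xxxxxx 10xxxxxx.

```python
# Illustration only: manual three-byte UTF-8 encoding (hypothetical helper name).

def utf8_three_byte(cp: int) -> bytes:
    """Encode a codepoint in U+0800..U+FFFF (excluding surrogates) as UTF-8."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("codepoint is outside the three-byte UTF-8 range")
    b1 = 0xE0 | (cp >> 12)          # top 4 bits go into 1110xxxx
    b2 = 0x80 | ((cp >> 6) & 0x3F)  # middle 6 bits go into 10xxxxxx
    b3 = 0x80 | (cp & 0x3F)         # low 6 bits go into 10xxxxxx
    return bytes([b1, b2, b3])

if __name__ == "__main__":
    for cp in (0xBC14, 0xD559):
        encoded = utf8_three_byte(cp)
        # Cross-check against Python's built-in encoder.
        assert encoded == chr(cp).encode("utf-8")
        print(format(cp, "X"), encoded.hex(" ").upper(), list(encoded))
```

The first byte (EB or ED) carries only the top four bits of the codepoint,
so it cannot stand in for the character's numeric value.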


# busybox printf '%X' "'바"
EB
# busybox printf '%X' "'학"
ED
# busybox printf '%d' "'바"
235
# busybox printf '%d' "'학"
237
(these are the HEX and DEC values of the first byte of the character)


Compare with the printf from coreutils/bash:
# printf '%X' "'바"
BC14
# printf '%X' "'학"
D559
# printf '%d' "'바"
48148
# printf '%d' "'학"
54617
(which are the HEX and DEC values of the character's codepoint)

The same happens with multibyte Chinese characters.
# (coreutils) printf '%X' "'传"
4F20
# busybox printf '%X' "'传"
E4

传 has HEX codepoint 4F20 and Hex UTF-8 bytes: E4 BC A0

-- 
You are receiving this mail because:
You are on the CC list for the bug.