From bugzilla at busybox.net Tue Feb 1 18:48:38 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 01 Feb 2022 18:48:38 +0000
Subject: [Bug 14041] New: using modprobe/insmod with compressed modules gives scary kernel warnings
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14041

            Bug ID: 14041
           Summary: using modprobe/insmod with compressed modules gives
                    scary kernel warnings
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
          Assignee: unassigned at busybox.net
          Reporter: nolange79 at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

I am using kernel 5.4 on x86_64 for an embedded system.
Loading a compressed module gives this error in the kernel log:

kernel: Module has invalid ELF structures

Steps to reproduce are simply:

# busybox insmod nbd.ko.gz

Some points:
- this happens with both gzip and xz
- the util-linux insmod/modprobe work without a log entry
- the module seems to be loaded correctly and works
- decompressing the module (with the same busybox executable) and then
  loading it produces no log entry:

# (after unloading the module again)
# busybox gzip -d nbd.ko.gz
# busybox insmod nbd.ko

I don't know if there is any functional issue, but I am tempted to raise
the severity since I can't rule one out either.

--- Comment #1 from sylvain.prat at gmail.com ---
I also ran into this problem. I was wondering how Alpine Linux gets along
with compressed kernel modules, and I finally found out that they don't
use busybox's modprobe anymore...

Since the decompression methods already exist in busybox, it shouldn't be
too hard to implement, I guess, but I'm not competent enough to do it
myself.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net Tue Feb 1 20:16:08 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 01 Feb 2022 20:16:08 +0000
Subject: [Bug 14541] sed: s-command with "semi-special" delimiters get wrong behaviour
In-Reply-To: 
References: 
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14541

Christoph Anton Mitterer changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
        Resolution|---         |FIXED
            Status|REOPENED    |RESOLVED

--- Comment #4 from Christoph Anton Mitterer ---
I hadn't seen the 2nd commit, f12fb1e4092900f26f7f8c71cde44b1cd7d26439,
when testing. That also fixes the case from comment #3.

Now, BusyBox sed seems to behave identically to GNU sed in all the cases
I had given in:
https://www.austingroupbugs.net/view.php?id=1551#c5612

In particular, it also seems to consider "un-delimitered" delimiters that
are also special characters as "still special" (at least I tried that
with '.') - which, while IMO not clearly defined by POSIX, is identical
to the behaviour of GNU sed; see
https://www.austingroupbugs.net/view.php?id=1551#c5648 for test cases.

Thus closing again. Thanks.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
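A concrete sketch of the "still special" behaviour described above, with
'.' as the s-command delimiter (illustrative command only; it assumes the
GNU-sed-compatible handling the comment reports, not independently
verified here):

  $ echo ab | sed 's.\..X.'
  Xb

The escaped delimiter '\.' reaches the regex engine as a bare '.', which
keeps its "match any character" meaning and so matches the 'a', rather
than requiring a literal dot in the input.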
From bugzilla at busybox.net Wed Feb 2 05:11:03 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Wed, 02 Feb 2022 05:11:03 +0000
Subject: [Bug 14566] New: ifupdown: Document supported stanzas for interfaces file
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14566

            Bug ID: 14566
           Summary: ifupdown: Document supported stanzas for interfaces
                    file
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Networking
          Assignee: unassigned at busybox.net
          Reporter: michael at cassaniti.id.au
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

Hi,

First, thank you so much for Busybox. It makes my life very easy, I must
say. I'm using Busybox 1.33.1 under Alpine Linux 3.14. The current
configuration should be at this URL:
https://git.alpinelinux.org/aports/tree/main/busybox/busyboxconfig?id=1aa6700d1e4ef810f2319506e48a8b5316d17abe

I've read the man pages for interfaces from these URLs, and they don't
all agree on the supported stanzas:
- https://salsa.debian.org/debian/ifupdown/-/raw/19052e2ecb0a908428813b5bc25d5bd0283c5a18/interfaces.5.pre
- https://manpages.org/etc-network-interfaces/5
- https://www.systutorials.com/docs/linux/man/5-interfaces/

I'm likely not the only one to be confused about which stanzas Busybox
does and does not support. I read the source code and did my best to
determine what is supported. This documentation would cover what is
__natively__ supported, since additional scripts essentially allow
extending the syntax.

So far I have found that the following directives are not supported, and
I assume I've missed some:
- rename
- inherits
- allow- stanzas (e.g.: allow-hotplug)
- no-auto-down
- no-scripts
- description
- template
- source-dir is supported, but not source-directory

Please note I'm not requesting that any of these stanzas be supported as
part of this bug. It would personally be nice to have rename supported,
but I can understand why the others mentioned above are not included.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
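For orientation, a minimal /etc/network/interfaces sketch restricted to
stanzas BusyBox ifupdown is reported above to parse natively (auto, iface
with the loopback/static methods, and up/down hooks); the interface
names and addresses are illustrative only:

  auto lo
  iface lo inet loopback

  auto eth0
  iface eth0 inet static
          address 192.168.1.10
          netmask 255.255.255.0
          gateway 192.168.1.1
          up echo "eth0 configured" > /dev/console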
From vda.linux at googlemail.com Thu Feb 3 13:58:02 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 3 Feb 2022 14:58:02 +0100 Subject: [git commit] libbb/sha256: optional x86 hardware accelerated hashing Message-ID: <20220203135206.237C982911@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=6472ac942898437e040171cec991de1c0b962f72 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master 64 bit: function old new delta sha256_process_block64_shaNI - 730 +730 .rodata 108314 108586 +272 sha256_begin 31 83 +52 ------------------------------------------------------------------------------ (add/remove: 5/1 grow/shrink: 2/0 up/down: 1055/-1) Total: 1054 bytes 32 bit: function old new delta sha256_process_block64_shaNI - 747 +747 .rodata 104318 104590 +272 sha256_begin 29 84 +55 ------------------------------------------------------------------------------ (add/remove: 5/1 grow/shrink: 2/0 up/down: 1075/-1) Total: 1074 bytes Signed-off-by: Denys Vlasenko --- libbb/Config.src | 6 + libbb/Kbuild.src | 2 + libbb/hash_md5_sha.c | 54 ++++--- libbb/hash_md5_sha256_x86-32_shaNI.S | 283 +++++++++++++++++++++++++++++++++++ libbb/hash_md5_sha256_x86-64_shaNI.S | 281 ++++++++++++++++++++++++++++++++++ libbb/hash_md5_sha_x86-32_shaNI.S | 4 +- libbb/hash_md5_sha_x86-64.S | 2 +- libbb/hash_md5_sha_x86-64.S.sh | 2 +- libbb/hash_md5_sha_x86-64_shaNI.S | 4 +- 9 files changed, 612 insertions(+), 26 deletions(-) diff --git a/libbb/Config.src b/libbb/Config.src index 708d3b0c8..0ecd5bd46 100644 --- a/libbb/Config.src +++ b/libbb/Config.src @@ -70,6 +70,12 @@ config SHA1_HWACCEL On x86, this adds ~590 bytes of code. Throughput is about twice as fast as fully-unrolled generic code. +config SHA256_HWACCEL + bool "SHA256: Use hardware accelerated instructions if possible" + default y + help + On x86, this adds ~1k bytes of code. + config SHA3_SMALL int "SHA3: Trade bytes for speed (0:fast, 1:slow)" default 1 # all "fast or small" options default to small diff --git a/libbb/Kbuild.src b/libbb/Kbuild.src index b9d34de8e..653025e56 100644 --- a/libbb/Kbuild.src +++ b/libbb/Kbuild.src @@ -59,6 +59,8 @@ lib-y += hash_md5_sha.o lib-y += hash_md5_sha_x86-64.o lib-y += hash_md5_sha_x86-64_shaNI.o lib-y += hash_md5_sha_x86-32_shaNI.o +lib-y += hash_md5_sha256_x86-64_shaNI.o +lib-y += hash_md5_sha256_x86-32_shaNI.o # Alternative (disabled) MD5 implementation #lib-y += hash_md5prime.o lib-y += messages.o diff --git a/libbb/hash_md5_sha.c b/libbb/hash_md5_sha.c index a23db5152..880ffab01 100644 --- a/libbb/hash_md5_sha.c +++ b/libbb/hash_md5_sha.c @@ -13,6 +13,27 @@ #define NEED_SHA512 (ENABLE_SHA512SUM || ENABLE_USE_BB_CRYPT_SHA) +#if ENABLE_SHA1_HWACCEL || ENABLE_SHA256_HWACCEL +# if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) +static void cpuid(unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx) +{ + asm ("cpuid" + : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx) + : "0"(*eax), "1"(*ebx), "2"(*ecx), "3"(*edx) + ); +} +static smallint shaNI; +void FAST_FUNC sha1_process_block64_shaNI(sha1_ctx_t *ctx); +void FAST_FUNC sha256_process_block64_shaNI(sha256_ctx_t *ctx); +# if defined(__i386__) +struct ASM_expects_76_shaNI { char t[1 - 2*(offsetof(sha256_ctx_t, hash) != 76)]; }; +# endif +# if defined(__x86_64__) +struct ASM_expects_80_shaNI { char t[1 - 2*(offsetof(sha256_ctx_t, hash) != 80)]; }; +# endif +# endif +#endif + /* gcc 4.2.1 optimizes rotr64 better with inline than with macro * (for rotX32, there is no difference). Why? 
My guess is that * macro requires clever common subexpression elimination heuristics @@ -1142,25 +1163,6 @@ static void FAST_FUNC sha512_process_block128(sha512_ctx_t *ctx) } #endif /* NEED_SHA512 */ -#if ENABLE_SHA1_HWACCEL -# if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) -static void cpuid(unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx) -{ - asm ("cpuid" - : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx) - : "0"(*eax), "1"(*ebx), "2"(*ecx), "3"(*edx) - ); -} -void FAST_FUNC sha1_process_block64_shaNI(sha1_ctx_t *ctx); -# if defined(__i386__) -struct ASM_expects_76_shaNI { char t[1 - 2*(offsetof(sha1_ctx_t, hash) != 76)]; }; -# endif -# if defined(__x86_64__) -struct ASM_expects_80_shaNI { char t[1 - 2*(offsetof(sha1_ctx_t, hash) != 80)]; }; -# endif -# endif -#endif - void FAST_FUNC sha1_begin(sha1_ctx_t *ctx) { ctx->hash[0] = 0x67452301; @@ -1173,7 +1175,6 @@ void FAST_FUNC sha1_begin(sha1_ctx_t *ctx) #if ENABLE_SHA1_HWACCEL # if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) { - static smallint shaNI; if (!shaNI) { unsigned eax = 7, ebx = ebx, ecx = 0, edx = edx; cpuid(&eax, &ebx, &ecx, &edx); @@ -1225,6 +1226,19 @@ void FAST_FUNC sha256_begin(sha256_ctx_t *ctx) memcpy(&ctx->total64, init256, sizeof(init256)); /*ctx->total64 = 0; - done by prepending two 32-bit zeros to init256 */ ctx->process_block = sha256_process_block64; +#if ENABLE_SHA256_HWACCEL +# if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) + { + if (!shaNI) { + unsigned eax = 7, ebx = ebx, ecx = 0, edx = edx; + cpuid(&eax, &ebx, &ecx, &edx); + shaNI = ((ebx >> 29) << 1) - 1; + } + if (shaNI > 0) + ctx->process_block = sha256_process_block64_shaNI; + } +# endif +#endif } #if NEED_SHA512 diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S new file mode 100644 index 000000000..56e37fa38 --- /dev/null +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -0,0 +1,283 @@ +#if ENABLE_SHA256_HWACCEL && defined(__GNUC__) && defined(__i386__) +/* The code is adapted from Linux kernel's source */ + +// We use shorter insns, even though they are for "wrong" +// data type (fp, not int). +// For Intel, there is no penalty for doing it at all +// (CPUs which do have such penalty do not support SHA1 insns). +// For AMD, the penalty is one extra cycle +// (allegedly: I failed to find measurable difference). 
+ +//#define mova128 movdqa +#define mova128 movaps +//#define movu128 movdqu +#define movu128 movups +//#define shuf128_32 pshufd +#define shuf128_32 shufps + + .section .text.sha256_process_block64_shaNI, "ax", @progbits + .globl sha256_process_block64_shaNI + .hidden sha256_process_block64_shaNI + .type sha256_process_block64_shaNI, @function + +#define DATA_PTR %eax + +#define SHA256CONSTANTS %ecx + +#define MSG %xmm0 +#define STATE0 %xmm1 +#define STATE1 %xmm2 +#define MSGTMP0 %xmm3 +#define MSGTMP1 %xmm4 +#define MSGTMP2 %xmm5 +#define MSGTMP3 %xmm6 +#define MSGTMP4 %xmm7 + + .balign 8 # allow decoders to fetch at least 3 first insns +sha256_process_block64_shaNI: + pushl %ebp + movl %esp, %ebp + subl $32, %esp + andl $~0xF, %esp # paddd needs aligned memory operand + + movu128 76+0*16(%eax), STATE0 + movu128 76+1*16(%eax), STATE1 + + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, MSGTMP4 + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + +# mova128 PSHUFFLE_BSWAP32_FLIP_MASK, SHUF_MASK + lea K256, SHA256CONSTANTS + + /* Save hash values for addition after rounds */ + mova128 STATE0, 0*16(%esp) + mova128 STATE1, 1*16(%esp) + + /* Rounds 0-3 */ + movu128 0*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP0 + paddd 0*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 4-7 */ + movu128 1*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP1 + paddd 1*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 8-11 */ + movu128 2*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP2 + paddd 2*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 12-15 */ + movu128 3*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP3 + paddd 3*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 16-19 */ + mova128 MSGTMP0, MSG + paddd 4*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 20-23 */ + mova128 MSGTMP1, MSG + paddd 5*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 24-27 */ + mova128 MSGTMP2, MSG + paddd 6*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 28-31 */ + mova128 MSGTMP3, MSG + paddd 7*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + 
shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 32-35 */ + mova128 MSGTMP0, MSG + paddd 8*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 36-39 */ + mova128 MSGTMP1, MSG + paddd 9*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 40-43 */ + mova128 MSGTMP2, MSG + paddd 10*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 44-47 */ + mova128 MSGTMP3, MSG + paddd 11*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 48-51 */ + mova128 MSGTMP0, MSG + paddd 12*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 52-55 */ + mova128 MSGTMP1, MSG + paddd 13*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 56-59 */ + mova128 MSGTMP2, MSG + paddd 14*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 60-63 */ + mova128 MSGTMP3, MSG + paddd 15*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Add current hash values with previously saved */ + paddd 0*16(%esp), STATE0 + paddd 1*16(%esp), STATE1 + + /* Write hash values back in the correct order */ + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + mova128 STATE0, MSGTMP4 + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, MSGTMP4, STATE1 /* HGFE */ + + movu128 STATE0, 76+0*16(%eax) + movu128 STATE1, 76+1*16(%eax) + + movl %ebp, %esp + popl %ebp + ret + .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI + +.section .rodata.cst256.K256, "aM", @progbits, 256 +.balign 16 +K256: + .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 + .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 + .long 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3 + .long 0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174 + .long 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc + .long 0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da + .long 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7 + .long 0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967 + .long 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13 + .long 0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85 + .long 
0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3 + .long 0xd192e819,0xd6990624,0xf40e3585,0x106aa070 + .long 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5 + .long 0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3 + .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 + .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 + +.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 +.balign 16 +PSHUFFLE_BSWAP32_FLIP_MASK: + .octa 0x0c0d0e0f08090a0b0405060700010203 + +#endif diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S new file mode 100644 index 000000000..1c2b75af3 --- /dev/null +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -0,0 +1,281 @@ +#if ENABLE_SHA256_HWACCEL && defined(__GNUC__) && defined(__x86_64__) +/* The code is adapted from Linux kernel's source */ + +// We use shorter insns, even though they are for "wrong" +// data type (fp, not int). +// For Intel, there is no penalty for doing it at all +// (CPUs which do have such penalty do not support SHA1 insns). +// For AMD, the penalty is one extra cycle +// (allegedly: I failed to find measurable difference). + +//#define mova128 movdqa +#define mova128 movaps +//#define movu128 movdqu +#define movu128 movups +//#define shuf128_32 pshufd +#define shuf128_32 shufps + + .section .text.sha256_process_block64_shaNI, "ax", @progbits + .globl sha256_process_block64_shaNI + .hidden sha256_process_block64_shaNI + .type sha256_process_block64_shaNI, @function + +#define DATA_PTR %rdi + +#define SHA256CONSTANTS %rax + +#define MSG %xmm0 +#define STATE0 %xmm1 +#define STATE1 %xmm2 +#define MSGTMP0 %xmm3 +#define MSGTMP1 %xmm4 +#define MSGTMP2 %xmm5 +#define MSGTMP3 %xmm6 +#define MSGTMP4 %xmm7 + +#define SHUF_MASK %xmm8 + +#define ABEF_SAVE %xmm9 +#define CDGH_SAVE %xmm10 + + .balign 8 # allow decoders to fetch at least 2 first insns +sha256_process_block64_shaNI: + movu128 80+0*16(%rdi), STATE0 + movu128 80+1*16(%rdi), STATE1 + + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, MSGTMP4 + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + + mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), SHUF_MASK + lea K256(%rip), SHA256CONSTANTS + + /* Save hash values for addition after rounds */ + mova128 STATE0, ABEF_SAVE + mova128 STATE1, CDGH_SAVE + + /* Rounds 0-3 */ + movu128 0*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP0 + paddd 0*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 4-7 */ + movu128 1*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP1 + paddd 1*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 8-11 */ + movu128 2*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP2 + paddd 2*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 12-15 */ + movu128 3*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP3 + paddd 3*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 16-19 */ + mova128 MSGTMP0, MSG + paddd 4*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 
MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 20-23 */ + mova128 MSGTMP1, MSG + paddd 5*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 24-27 */ + mova128 MSGTMP2, MSG + paddd 6*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 28-31 */ + mova128 MSGTMP3, MSG + paddd 7*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 32-35 */ + mova128 MSGTMP0, MSG + paddd 8*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 36-39 */ + mova128 MSGTMP1, MSG + paddd 9*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 40-43 */ + mova128 MSGTMP2, MSG + paddd 10*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 44-47 */ + mova128 MSGTMP3, MSG + paddd 11*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 48-51 */ + mova128 MSGTMP0, MSG + paddd 12*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 52-55 */ + mova128 MSGTMP1, MSG + paddd 13*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 56-59 */ + mova128 MSGTMP2, MSG + paddd 14*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 60-63 */ + mova128 MSGTMP3, MSG + paddd 15*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Add current hash values with previously saved */ + paddd ABEF_SAVE, STATE0 + paddd CDGH_SAVE, STATE1 + + /* Write hash values back 
in the correct order */ + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + mova128 STATE0, MSGTMP4 + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, MSGTMP4, STATE1 /* HGFE */ + + movu128 STATE0, 80+0*16(%rdi) + movu128 STATE1, 80+1*16(%rdi) + + ret + .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI + +.section .rodata.cst256.K256, "aM", @progbits, 256 +.balign 16 +K256: + .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 + .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 + .long 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3 + .long 0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174 + .long 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc + .long 0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da + .long 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7 + .long 0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967 + .long 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13 + .long 0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85 + .long 0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3 + .long 0xd192e819,0xd6990624,0xf40e3585,0x106aa070 + .long 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5 + .long 0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3 + .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 + .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 + +.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 +.balign 16 +PSHUFFLE_BSWAP32_FLIP_MASK: + .octa 0x0c0d0e0f08090a0b0405060700010203 + +#endif diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 166cfd38a..11b855e26 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -20,7 +20,7 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter - .section .text.sha1_process_block64_shaNI,"ax", at progbits + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI .type sha1_process_block64_shaNI, @function @@ -224,7 +224,7 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.align 16 +.balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 743269d98..47ace60de 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -1394,7 +1394,7 @@ sha1_process_block64: .size sha1_process_block64, .-sha1_process_block64 .section .rodata.cst16.sha1const, "aM", @progbits, 16 - .align 16 + .balign 16 rconst0x5A827999: .long 0x5A827999 .long 0x5A827999 diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index 47c40af0d..656fb5414 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -433,7 +433,7 @@ echo " .size sha1_process_block64, .-sha1_process_block64 .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 - .align 16 + .balign 16 rconst0x5A827999: .long 0x5A827999 .long 0x5A827999 diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index 33cc3bf7f..ba92f09df 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -20,7 +20,7 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter - .section .text.sha1_process_block64_shaNI,"ax", at progbits + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI .type sha1_process_block64_shaNI, @function @@ 
-218,7 +218,7 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.align 16 +.balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f From vda.linux at googlemail.com Thu Feb 3 14:11:23 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 3 Feb 2022 15:11:23 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220203140444.EF21982A68@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=de6cb4bed82356db72af81890c7c26d7e85fb50d branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 747 722 -25 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 35 +++++++++++++++++------------------ 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 56e37fa38..632dab7e6 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -49,8 +49,7 @@ sha256_process_block64_shaNI: palignr $8, STATE1, STATE0 /* ABEF */ pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ -# mova128 PSHUFFLE_BSWAP32_FLIP_MASK, SHUF_MASK - lea K256, SHA256CONSTANTS + movl $K256+8*16, SHA256CONSTANTS /* Save hash values for addition after rounds */ mova128 STATE0, 0*16(%esp) @@ -60,7 +59,7 @@ sha256_process_block64_shaNI: movu128 0*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP0 - paddd 0*16(SHA256CONSTANTS), MSG + paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -69,7 +68,7 @@ sha256_process_block64_shaNI: movu128 1*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP1 - paddd 1*16(SHA256CONSTANTS), MSG + paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -79,7 +78,7 @@ sha256_process_block64_shaNI: movu128 2*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP2 - paddd 2*16(SHA256CONSTANTS), MSG + paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -89,7 +88,7 @@ sha256_process_block64_shaNI: movu128 3*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP3 - paddd 3*16(SHA256CONSTANTS), MSG + paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -101,7 +100,7 @@ sha256_process_block64_shaNI: /* Rounds 16-19 */ mova128 MSGTMP0, MSG - paddd 4*16(SHA256CONSTANTS), MSG + paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -113,7 +112,7 @@ sha256_process_block64_shaNI: /* Rounds 20-23 */ mova128 MSGTMP1, MSG - paddd 5*16(SHA256CONSTANTS), MSG + paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -125,7 +124,7 @@ sha256_process_block64_shaNI: /* Rounds 24-27 */ mova128 MSGTMP2, MSG - paddd 6*16(SHA256CONSTANTS), MSG + paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -137,7 +136,7 @@ sha256_process_block64_shaNI: /* Rounds 28-31 */ mova128 MSGTMP3, MSG - paddd 7*16(SHA256CONSTANTS), MSG + paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 
MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -149,7 +148,7 @@ sha256_process_block64_shaNI: /* Rounds 32-35 */ mova128 MSGTMP0, MSG - paddd 8*16(SHA256CONSTANTS), MSG + paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -161,7 +160,7 @@ sha256_process_block64_shaNI: /* Rounds 36-39 */ mova128 MSGTMP1, MSG - paddd 9*16(SHA256CONSTANTS), MSG + paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -173,7 +172,7 @@ sha256_process_block64_shaNI: /* Rounds 40-43 */ mova128 MSGTMP2, MSG - paddd 10*16(SHA256CONSTANTS), MSG + paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -185,7 +184,7 @@ sha256_process_block64_shaNI: /* Rounds 44-47 */ mova128 MSGTMP3, MSG - paddd 11*16(SHA256CONSTANTS), MSG + paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -197,7 +196,7 @@ sha256_process_block64_shaNI: /* Rounds 48-51 */ mova128 MSGTMP0, MSG - paddd 12*16(SHA256CONSTANTS), MSG + paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -209,7 +208,7 @@ sha256_process_block64_shaNI: /* Rounds 52-55 */ mova128 MSGTMP1, MSG - paddd 13*16(SHA256CONSTANTS), MSG + paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -220,7 +219,7 @@ sha256_process_block64_shaNI: /* Rounds 56-59 */ mova128 MSGTMP2, MSG - paddd 14*16(SHA256CONSTANTS), MSG + paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -231,7 +230,7 @@ sha256_process_block64_shaNI: /* Rounds 60-63 */ mova128 MSGTMP3, MSG - paddd 15*16(SHA256CONSTANTS), MSG + paddd 15*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 From vda.linux at googlemail.com Thu Feb 3 14:17:42 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 3 Feb 2022 15:17:42 +0100 Subject: [git commit] libbb/sha256: code shrink in 64-bit x86 Message-ID: <20220203141101.202F882A2F@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=a1429fbb8ca373efc01939d599f6f65969b1a366 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 730 706 -24 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-64_shaNI.S | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index 1c2b75af3..f3df541e4 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -50,7 +50,7 @@ sha256_process_block64_shaNI: pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), SHUF_MASK - lea K256(%rip), SHA256CONSTANTS + leaq K256+8*16(%rip), SHA256CONSTANTS /* Save hash values for addition after rounds */ mova128 STATE0, ABEF_SAVE @@ -60,7 +60,7 @@ sha256_process_block64_shaNI: movu128 0*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG mova128 MSG, MSGTMP0 - paddd 0*16(SHA256CONSTANTS), MSG + paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -69,7 +69,7 @@ sha256_process_block64_shaNI: movu128 1*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG 
mova128 MSG, MSGTMP1 - paddd 1*16(SHA256CONSTANTS), MSG + paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -79,7 +79,7 @@ sha256_process_block64_shaNI: movu128 2*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG mova128 MSG, MSGTMP2 - paddd 2*16(SHA256CONSTANTS), MSG + paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -89,7 +89,7 @@ sha256_process_block64_shaNI: movu128 3*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG mova128 MSG, MSGTMP3 - paddd 3*16(SHA256CONSTANTS), MSG + paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -101,7 +101,7 @@ sha256_process_block64_shaNI: /* Rounds 16-19 */ mova128 MSGTMP0, MSG - paddd 4*16(SHA256CONSTANTS), MSG + paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -113,7 +113,7 @@ sha256_process_block64_shaNI: /* Rounds 20-23 */ mova128 MSGTMP1, MSG - paddd 5*16(SHA256CONSTANTS), MSG + paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -125,7 +125,7 @@ sha256_process_block64_shaNI: /* Rounds 24-27 */ mova128 MSGTMP2, MSG - paddd 6*16(SHA256CONSTANTS), MSG + paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -137,7 +137,7 @@ sha256_process_block64_shaNI: /* Rounds 28-31 */ mova128 MSGTMP3, MSG - paddd 7*16(SHA256CONSTANTS), MSG + paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -149,7 +149,7 @@ sha256_process_block64_shaNI: /* Rounds 32-35 */ mova128 MSGTMP0, MSG - paddd 8*16(SHA256CONSTANTS), MSG + paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -161,7 +161,7 @@ sha256_process_block64_shaNI: /* Rounds 36-39 */ mova128 MSGTMP1, MSG - paddd 9*16(SHA256CONSTANTS), MSG + paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -173,7 +173,7 @@ sha256_process_block64_shaNI: /* Rounds 40-43 */ mova128 MSGTMP2, MSG - paddd 10*16(SHA256CONSTANTS), MSG + paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -185,7 +185,7 @@ sha256_process_block64_shaNI: /* Rounds 44-47 */ mova128 MSGTMP3, MSG - paddd 11*16(SHA256CONSTANTS), MSG + paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -197,7 +197,7 @@ sha256_process_block64_shaNI: /* Rounds 48-51 */ mova128 MSGTMP0, MSG - paddd 12*16(SHA256CONSTANTS), MSG + paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -209,7 +209,7 @@ sha256_process_block64_shaNI: /* Rounds 52-55 */ mova128 MSGTMP1, MSG - paddd 13*16(SHA256CONSTANTS), MSG + paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -220,7 +220,7 @@ sha256_process_block64_shaNI: /* Rounds 56-59 */ mova128 MSGTMP2, MSG - paddd 14*16(SHA256CONSTANTS), MSG + paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -231,7 +231,7 @@ sha256_process_block64_shaNI: /* Rounds 60-63 */ mova128 MSGTMP3, MSG - paddd 
15*16(SHA256CONSTANTS), MSG + paddd 15*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 From vda.linux at googlemail.com Sat Feb 5 23:33:42 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 00:33:42 +0100 Subject: [git commit] libbb/sha256: code shrink in 64-bit x86 Message-ID: <20220205234357.20D04819E6@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=31c1c310772fa6c897ee1585ea15fc38f3ab3dff branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 706 701 -5 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-64_shaNI.S | 96 ++++++++++++++++++------------------ 1 file changed, 48 insertions(+), 48 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index f3df541e4..dbf391135 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -31,9 +31,7 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define MSGTMP4 %xmm7 - -#define SHUF_MASK %xmm8 +#define XMMTMP4 %xmm7 #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 @@ -45,11 +43,12 @@ sha256_process_block64_shaNI: shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ - mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), SHUF_MASK +/* XMMTMP4 holds flip mask from here... */ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP4 leaq K256+8*16(%rip), SHA256CONSTANTS /* Save hash values for addition after rounds */ @@ -58,7 +57,7 @@ sha256_process_block64_shaNI: /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -67,7 +66,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -77,7 +76,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -87,13 +86,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG +/* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -103,9 +103,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -115,9 +115,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 
MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -127,9 +127,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -139,9 +139,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -151,9 +151,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -163,9 +163,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -175,9 +175,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -187,9 +187,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -199,9 +199,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -211,9 +211,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -222,9 +222,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, 
MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -243,9 +243,9 @@ sha256_process_block64_shaNI: /* Write hash values back in the correct order */ shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, MSGTMP4, STATE1 /* HGFE */ + palignr $8, XMMTMP4, STATE1 /* HGFE */ movu128 STATE0, 80+0*16(%rdi) movu128 STATE1, 80+1*16(%rdi) From vda.linux at googlemail.com Sat Feb 5 23:56:13 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 00:56:13 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220205235036.87148821F1@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=4f40735c87f8292a87c066b3b7099b0be007cf59 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 722 713 -9 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 93 +++++++++++++++++++----------------- 1 file changed, 48 insertions(+), 45 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 632dab7e6..417da37d8 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -31,7 +31,7 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define MSGTMP4 %xmm7 +#define XMMTMP4 %xmm7 .balign 8 # allow decoders to fetch at least 3 first insns sha256_process_block64_shaNI: @@ -45,10 +45,12 @@ sha256_process_block64_shaNI: shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ +/* XMMTMP4 holds flip mask from here... 
*/ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP4 movl $K256+8*16, SHA256CONSTANTS /* Save hash values for addition after rounds */ @@ -57,7 +59,7 @@ sha256_process_block64_shaNI: /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -66,7 +68,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -76,7 +78,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -86,13 +88,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG +/* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -102,9 +105,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -114,9 +117,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -126,9 +129,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -138,9 +141,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -150,9 +153,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -162,9 +165,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, 
MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -174,9 +177,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -186,9 +189,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -198,9 +201,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -210,9 +213,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -221,9 +224,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -242,9 +245,9 @@ sha256_process_block64_shaNI: /* Write hash values back in the correct order */ shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, MSGTMP4, STATE1 /* HGFE */ + palignr $8, XMMTMP4, STATE1 /* HGFE */ movu128 STATE0, 76+0*16(%eax) movu128 STATE1, 76+1*16(%eax) From vda.linux at googlemail.com Sun Feb 6 18:53:10 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 19:53:10 +0100 Subject: [git commit] *: slap on a few ALIGN* where appropriate Message-ID: <20220206184644.726FD82C83@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=ca466f385ac985a8b3491daa9f326dc480cdee70 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master The result of looking at "grep -F -B2 '*fill*' busybox_unstripped.map" function old new delta .rodata 108586 108460 -126 ------------------------------------------------------------------------------ (add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-126) Total: -126 bytes text data bss dec hex filename 970412 4219 1848 976479 ee65f busybox_old 970286 4219 1848 976353 ee5e1 busybox_unstripped Signed-off-by: Denys Vlasenko --- console-tools/reset.c | 2 +- coreutils/od.c | 2 +- include/platform.h | 1 + libbb/appletlib.c | 2 +- libbb/get_console.c | 2 +- miscutils/bc.c | 2 +- miscutils/man.c | 2 +- networking/ifupdown.c | 8 ++++---- 
networking/interface.c | 6 +++--- networking/libiproute/ipaddress.c | 2 +- networking/udhcp/common.c | 2 +- networking/udhcp/d6_dhcpc.c | 2 +- shell/ash.c | 2 +- util-linux/hexdump.c | 2 +- util-linux/nsenter.c | 2 +- util-linux/unshare.c | 2 +- 16 files changed, 21 insertions(+), 20 deletions(-) diff --git a/console-tools/reset.c b/console-tools/reset.c index b3acf69f8..cc04e4fcc 100644 --- a/console-tools/reset.c +++ b/console-tools/reset.c @@ -36,7 +36,7 @@ int stty_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; int reset_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; int reset_main(int argc UNUSED_PARAM, char **argv UNUSED_PARAM) { - static const char *const args[] = { + static const char *const args[] ALIGN_PTR = { "stty", "sane", NULL }; diff --git a/coreutils/od.c b/coreutils/od.c index 9a888dd5f..6f22331e0 100644 --- a/coreutils/od.c +++ b/coreutils/od.c @@ -144,7 +144,7 @@ odoffset(dumper_t *dumper, int argc, char ***argvp) } } -static const char *const add_strings[] = { +static const char *const add_strings[] ALIGN_PTR = { "16/1 \"%3_u \" \"\\n\"", /* a */ "8/2 \" %06o \" \"\\n\"", /* B, o */ "16/1 \"%03o \" \"\\n\"", /* b */ diff --git a/include/platform.h b/include/platform.h index ad27bb31a..ea0512f36 100644 --- a/include/platform.h +++ b/include/platform.h @@ -346,6 +346,7 @@ typedef unsigned smalluint; # define ALIGN4 #endif #define ALIGN8 __attribute__((aligned(8))) +#define ALIGN_INT __attribute__((aligned(sizeof(int)))) #define ALIGN_PTR __attribute__((aligned(sizeof(void*)))) /* diff --git a/libbb/appletlib.c b/libbb/appletlib.c index 03389f541..841b3b873 100644 --- a/libbb/appletlib.c +++ b/libbb/appletlib.c @@ -651,7 +651,7 @@ static void check_suid(int applet_no) # if ENABLE_FEATURE_INSTALLER static const char usr_bin [] ALIGN1 = "/usr/bin/"; static const char usr_sbin[] ALIGN1 = "/usr/sbin/"; -static const char *const install_dir[] = { +static const char *const install_dir[] ALIGN_PTR = { &usr_bin [8], /* "/" */ &usr_bin [4], /* "/bin/" */ &usr_sbin[4] /* "/sbin/" */ diff --git a/libbb/get_console.c b/libbb/get_console.c index 7f2c75332..9044efea1 100644 --- a/libbb/get_console.c +++ b/libbb/get_console.c @@ -37,7 +37,7 @@ static int open_a_console(const char *fnam) */ int FAST_FUNC get_console_fd_or_die(void) { - static const char *const console_names[] = { + static const char *const console_names[] ALIGN_PTR = { DEV_CONSOLE, CURRENT_VC, CURRENT_TTY }; diff --git a/miscutils/bc.c b/miscutils/bc.c index ae370ff55..ab785bbc8 100644 --- a/miscutils/bc.c +++ b/miscutils/bc.c @@ -6011,7 +6011,7 @@ static BC_STATUS zxc_program_assign(char inst) #endif if (ib || sc || left->t == XC_RESULT_OBASE) { - static const char *const msg[] = { + static const char *const msg[] ALIGN_PTR = { "bad ibase; must be [2,16]", //XC_RESULT_IBASE "bad obase; must be [2,"BC_MAX_OBASE_STR"]", //XC_RESULT_OBASE "bad scale; must be [0,"BC_MAX_SCALE_STR"]", //XC_RESULT_SCALE diff --git a/miscutils/man.c b/miscutils/man.c index d319e8bba..deaf9e5ab 100644 --- a/miscutils/man.c +++ b/miscutils/man.c @@ -303,7 +303,7 @@ int man_main(int argc UNUSED_PARAM, char **argv) config_close(parser); if (!man_path_list) { - static const char *const mpl[] = { "/usr/man", "/usr/share/man", NULL }; + static const char *const mpl[] ALIGN_PTR = { "/usr/man", "/usr/share/man", NULL }; man_path_list = (char**)mpl; /*count_mp = 2; - not used below anyway */ } diff --git a/networking/ifupdown.c b/networking/ifupdown.c index 737113dd4..6c4ae27f2 100644 --- a/networking/ifupdown.c +++ b/networking/ifupdown.c @@ 
-532,7 +532,7 @@ static int FAST_FUNC v4tunnel_down(struct interface_defn_t * ifd, execfn * exec) } # endif -static const struct method_t methods6[] = { +static const struct method_t methods6[] ALIGN_PTR = { # if ENABLE_FEATURE_IFUPDOWN_IP { "v4tunnel" , v4tunnel_up , v4tunnel_down , }, # endif @@ -627,7 +627,7 @@ struct dhcp_client_t { const char *stopcmd; }; -static const struct dhcp_client_t ext_dhcp_clients[] = { +static const struct dhcp_client_t ext_dhcp_clients[] ALIGN_PTR = { { "dhcpcd", "dhcpcd[[ -h %hostname%]][[ -i %vendor%]][[ -I %client%]][[ -l %leasetime%]] %iface%", "dhcpcd -k %iface%", @@ -774,7 +774,7 @@ static int FAST_FUNC wvdial_down(struct interface_defn_t *ifd, execfn *exec) "-p /var/run/wvdial.%iface% -s 2", ifd, exec); } -static const struct method_t methods[] = { +static const struct method_t methods[] ALIGN_PTR = { { "manual" , manual_up_down, manual_up_down, }, { "wvdial" , wvdial_up , wvdial_down , }, { "ppp" , ppp_up , ppp_down , }, @@ -797,7 +797,7 @@ static int FAST_FUNC link_up_down(struct interface_defn_t *ifd UNUSED_PARAM, exe return 1; } -static const struct method_t link_methods[] = { +static const struct method_t link_methods[] ALIGN_PTR = { { "none", link_up_down, link_up_down } }; diff --git a/networking/interface.c b/networking/interface.c index ea6a2c8a8..6b6c0944a 100644 --- a/networking/interface.c +++ b/networking/interface.c @@ -446,13 +446,13 @@ static char *get_name(char name[IFNAMSIZ], char *p) * %n specifiers (even the size of integers may not match). */ #if INT_MAX == LONG_MAX -static const char *const ss_fmt[] = { +static const char *const ss_fmt[] ALIGN_PTR = { "%n%llu%u%u%u%u%n%n%n%llu%u%u%u%u%u", "%llu%llu%u%u%u%u%n%n%llu%llu%u%u%u%u%u", "%llu%llu%u%u%u%u%u%u%llu%llu%u%u%u%u%u%u" }; #else -static const char *const ss_fmt[] = { +static const char *const ss_fmt[] ALIGN_PTR = { "%n%llu%lu%lu%lu%lu%n%n%n%llu%lu%lu%lu%lu%lu", "%llu%llu%lu%lu%lu%lu%n%n%llu%llu%lu%lu%lu%lu%lu", "%llu%llu%lu%lu%lu%lu%lu%lu%llu%llu%lu%lu%lu%lu%lu%lu" @@ -731,7 +731,7 @@ static const struct hwtype ib_hwtype = { #endif -static const struct hwtype *const hwtypes[] = { +static const struct hwtype *const hwtypes[] ALIGN_PTR = { &loop_hwtype, ðer_hwtype, &ppp_hwtype, diff --git a/networking/libiproute/ipaddress.c b/networking/libiproute/ipaddress.c index 17a838411..ecc3848ff 100644 --- a/networking/libiproute/ipaddress.c +++ b/networking/libiproute/ipaddress.c @@ -58,7 +58,7 @@ typedef struct filter_t filter_t; static void print_link_flags(unsigned flags, unsigned mdown) { - static const int flag_masks[] = { + static const int flag_masks[] ALIGN_INT = { IFF_LOOPBACK, IFF_BROADCAST, IFF_POINTOPOINT, IFF_MULTICAST, IFF_NOARP, IFF_UP, IFF_LOWER_UP }; static const char flag_labels[] ALIGN1 = diff --git a/networking/udhcp/common.c b/networking/udhcp/common.c index 8e9b93655..ae818db05 100644 --- a/networking/udhcp/common.c +++ b/networking/udhcp/common.c @@ -19,7 +19,7 @@ const uint8_t MAC_BCAST_ADDR[6] ALIGN2 = { * See RFC2132 for more options. * OPTION_REQ: these options are requested by udhcpc (unless -o). 
*/ -const struct dhcp_optflag dhcp_optflags[] = { +const struct dhcp_optflag dhcp_optflags[] ALIGN2 = { /* flags code */ { OPTION_IP | OPTION_REQ, 0x01 }, /* DHCP_SUBNET */ { OPTION_S32 , 0x02 }, /* DHCP_TIME_OFFSET */ diff --git a/networking/udhcp/d6_dhcpc.c b/networking/udhcp/d6_dhcpc.c index 9d2a8f5d3..9fc690315 100644 --- a/networking/udhcp/d6_dhcpc.c +++ b/networking/udhcp/d6_dhcpc.c @@ -65,7 +65,7 @@ /* "struct client_data_t client_data" is in bb_common_bufsiz1 */ -static const struct dhcp_optflag d6_optflags[] = { +static const struct dhcp_optflag d6_optflags[] ALIGN2 = { #if ENABLE_FEATURE_UDHCPC6_RFC3646 { OPTION_6RD | OPTION_LIST | OPTION_REQ, D6_OPT_DNS_SERVERS }, { OPTION_DNS_STRING | OPTION_LIST | OPTION_REQ, D6_OPT_DOMAIN_LIST }, diff --git a/shell/ash.c b/shell/ash.c index 55df54bd0..adb0f223a 100644 --- a/shell/ash.c +++ b/shell/ash.c @@ -313,7 +313,7 @@ typedef long arith_t; /* ============ Shell options */ /* If you add/change options hare, update --help text too */ -static const char *const optletters_optnames[] = { +static const char *const optletters_optnames[] ALIGN_PTR = { "e" "errexit", "f" "noglob", /* bash has '-o ignoreeof', but no short synonym -I for it */ diff --git a/util-linux/hexdump.c b/util-linux/hexdump.c index 57e7e8db7..307a84803 100644 --- a/util-linux/hexdump.c +++ b/util-linux/hexdump.c @@ -71,7 +71,7 @@ static void bb_dump_addfile(dumper_t *dumper, char *name) fclose(fp); } -static const char *const add_strings[] = { +static const char *const add_strings[] ALIGN_PTR = { "\"%07.7_ax \"16/1 \"%03o \"\"\n\"", /* b */ "\"%07.7_ax \"16/1 \"%3_c \"\"\n\"", /* c */ "\"%07.7_ax \"8/2 \" %05u \"\"\n\"", /* d */ diff --git a/util-linux/nsenter.c b/util-linux/nsenter.c index e6339da2f..1aa045b35 100644 --- a/util-linux/nsenter.c +++ b/util-linux/nsenter.c @@ -93,7 +93,7 @@ enum { * The user namespace comes first, so that it is entered first. * This gives an unprivileged user the potential to enter other namespaces. */ -static const struct namespace_descr ns_list[] = { +static const struct namespace_descr ns_list[] ALIGN_INT = { { CLONE_NEWUSER, "ns/user", }, { CLONE_NEWIPC, "ns/ipc", }, { CLONE_NEWUTS, "ns/uts", }, diff --git a/util-linux/unshare.c b/util-linux/unshare.c index 68ccdd874..06b938074 100644 --- a/util-linux/unshare.c +++ b/util-linux/unshare.c @@ -120,7 +120,7 @@ enum { NS_USR_POS, /* OPT_user, NS_USR_POS, and ns_list[] index must match! 
*/ NS_COUNT, }; -static const struct namespace_descr ns_list[] = { +static const struct namespace_descr ns_list[] ALIGN_INT = { { CLONE_NEWNS, "mnt" }, { CLONE_NEWUTS, "uts" }, { CLONE_NEWIPC, "ipc" }, From vda.linux at googlemail.com Sun Feb 6 19:07:12 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 20:07:12 +0100 Subject: [git commit] *: slap on a few ALIGN_PTR where appropriate Message-ID: <20220206190009.C99FD82B4D@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=987be932ed3cbea56b68bbe85649191c13b66015 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- coreutils/test.c | 2 +- e2fsprogs/fsck.c | 2 +- libbb/getopt32.c | 2 +- miscutils/devfsd.c | 4 ++-- modutils/modutils-24.c | 4 ++-- networking/inetd.c | 2 +- procps/nmeter.c | 2 +- selinux/setenforce.c | 2 +- shell/hush.c | 10 +++++----- 9 files changed, 15 insertions(+), 15 deletions(-) diff --git a/coreutils/test.c b/coreutils/test.c index a914c7490..840a0daaf 100644 --- a/coreutils/test.c +++ b/coreutils/test.c @@ -242,7 +242,7 @@ int depth; depth--; \ return __res; \ } while (0) -static const char *const TOKSTR[] = { +static const char *const TOKSTR[] ALIGN_PTR = { "EOI", "FILRD", "FILWR", diff --git a/e2fsprogs/fsck.c b/e2fsprogs/fsck.c index 96c1e51e0..028f8a803 100644 --- a/e2fsprogs/fsck.c +++ b/e2fsprogs/fsck.c @@ -190,7 +190,7 @@ struct globals { * Required for the uber-silly devfs /dev/ide/host1/bus2/target3/lun3 * pathames. */ -static const char *const devfs_hier[] = { +static const char *const devfs_hier[] ALIGN_PTR = { "host", "bus", "target", "lun", NULL }; #endif diff --git a/libbb/getopt32.c b/libbb/getopt32.c index 5ab4d66f1..e861d0567 100644 --- a/libbb/getopt32.c +++ b/libbb/getopt32.c @@ -296,7 +296,7 @@ Special characters: /* Code here assumes that 'unsigned' is at least 32 bits wide */ -const char *const bb_argv_dash[] = { "-", NULL }; +const char *const bb_argv_dash[] ALIGN_PTR = { "-", NULL }; enum { PARAM_STRING, diff --git a/miscutils/devfsd.c b/miscutils/devfsd.c index 839d00fd0..fb9ebcf60 100644 --- a/miscutils/devfsd.c +++ b/miscutils/devfsd.c @@ -928,7 +928,7 @@ static void action_compat(const struct devfsd_notify_struct *info, unsigned int unsigned int i; char rewind_; /* 1 to 5 "scsi/" , 6 to 9 "ide/host" */ - static const char *const fmt[] = { + static const char *const fmt[] ALIGN_PTR = { NULL , "sg/c%db%dt%du%d", /* scsi/generic */ "sd/c%db%dt%du%d", /* scsi/disc */ @@ -1468,7 +1468,7 @@ const char *get_old_name(const char *devname, unsigned int namelen, const char *pty1; const char *pty2; /* 1 to 5 "scsi/" , 6 to 9 "ide/host", 10 sbp/, 11 vcc/, 12 pty/ */ - static const char *const fmt[] = { + static const char *const fmt[] ALIGN_PTR = { NULL , "sg%u", /* scsi/generic */ NULL, /* scsi/disc */ diff --git a/modutils/modutils-24.c b/modutils/modutils-24.c index ac8632481..d0bc2a6ef 100644 --- a/modutils/modutils-24.c +++ b/modutils/modutils-24.c @@ -3458,7 +3458,7 @@ static int obj_load_progbits(char *image, size_t image_size, struct obj_file *f, static void hide_special_symbols(struct obj_file *f) { - static const char *const specials[] = { + static const char *const specials[] ALIGN_PTR = { SPFX "cleanup_module", SPFX "init_module", SPFX "kernel_version", @@ -3484,7 +3484,7 @@ static int obj_gpl_license(struct obj_file *f, const char **license) * linux/include/linux/module.h. Checking for leading "GPL" will not * work, somebody will use "GPL sucks, this is proprietary". 
*/ - static const char *const gpl_licenses[] = { + static const char *const gpl_licenses[] ALIGN_PTR = { "GPL", "GPL v2", "GPL and additional rights", diff --git a/networking/inetd.c b/networking/inetd.c index e71be51c3..fb2fbe323 100644 --- a/networking/inetd.c +++ b/networking/inetd.c @@ -1538,7 +1538,7 @@ int inetd_main(int argc UNUSED_PARAM, char **argv) #if ENABLE_FEATURE_INETD_SUPPORT_BUILTIN_ECHO \ || ENABLE_FEATURE_INETD_SUPPORT_BUILTIN_DISCARD # if !BB_MMU -static const char *const cat_args[] = { "cat", NULL }; +static const char *const cat_args[] ALIGN_PTR = { "cat", NULL }; # endif #endif diff --git a/procps/nmeter.c b/procps/nmeter.c index 2310e9844..088d366bf 100644 --- a/procps/nmeter.c +++ b/procps/nmeter.c @@ -70,7 +70,7 @@ typedef struct proc_file { smallint last_gen; } proc_file; -static const char *const proc_name[] = { +static const char *const proc_name[] ALIGN_PTR = { "stat", // Must match the order of proc_file's! "loadavg", "net/dev", diff --git a/selinux/setenforce.c b/selinux/setenforce.c index 996034f8e..2267be451 100644 --- a/selinux/setenforce.c +++ b/selinux/setenforce.c @@ -26,7 +26,7 @@ /* These strings are arranged so that odd ones * result in security_setenforce(1) being done, * the rest will do security_setenforce(0) */ -static const char *const setenforce_cmd[] = { +static const char *const setenforce_cmd[] ALIGN_PTR = { "0", "1", "permissive", diff --git a/shell/hush.c b/shell/hush.c index 6dc2ecaac..ae81f0da5 100644 --- a/shell/hush.c +++ b/shell/hush.c @@ -564,7 +564,7 @@ enum { #define NULL_O_STRING { NULL } #ifndef debug_printf_parse -static const char *const assignment_flag[] = { +static const char *const assignment_flag[] ALIGN_PTR = { "MAYBE_ASSIGNMENT", "DEFINITELY_ASSIGNMENT", "NOT_ASSIGNMENT", @@ -3682,7 +3682,7 @@ static void free_pipe_list(struct pipe *pi) #ifndef debug_print_tree static void debug_print_tree(struct pipe *pi, int lvl) { - static const char *const PIPE[] = { + static const char *const PIPE[] ALIGN_PTR = { [PIPE_SEQ] = "SEQ", [PIPE_AND] = "AND", [PIPE_OR ] = "OR" , @@ -3717,7 +3717,7 @@ static void debug_print_tree(struct pipe *pi, int lvl) [RES_XXXX ] = "XXXX" , [RES_SNTX ] = "SNTX" , }; - static const char *const CMDTYPE[] = { + static const char *const CMDTYPE[] ALIGN_PTR = { "{}", "()", "[noglob]", @@ -7659,7 +7659,7 @@ static int generate_stream_from_string(const char *s, pid_t *pid_p) if (is_prefixed_with(s, "trap") && skip_whitespace(s + 4)[0] == '\0' ) { - static const char *const argv[] = { NULL, NULL }; + static const char *const argv[] ALIGN_PTR = { NULL, NULL }; builtin_trap((char**)argv); fflush_all(); /* important */ _exit(0); @@ -9826,7 +9826,7 @@ static int run_list(struct pipe *pi) static const char encoded_dollar_at[] ALIGN1 = { SPECIAL_VAR_SYMBOL, '@' | 0x80, SPECIAL_VAR_SYMBOL, '\0' }; /* encoded representation of "$@" */ - static const char *const encoded_dollar_at_argv[] = { + static const char *const encoded_dollar_at_argv[] ALIGN_PTR = { encoded_dollar_at, NULL }; /* argv list with one element: "$@" */ char **vals; From vda.linux at googlemail.com Tue Feb 8 02:29:16 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 03:29:16 +0100 Subject: [git commit] libbb/sha1: shrink unrolled x86-64 code Message-ID: <20220208022853.C4572831C9@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=4923f74e5873b25b8205a4059964cff75ee731a8 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64 3482 3481 -1 .rodata 
108460 108412 -48 ------------------------------------------------------------------------------ (add/remove: 1/4 grow/shrink: 0/2 up/down: 0/-49) Total: -49 bytes Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-64.S | 33 ++++++++++----------------------- libbb/hash_md5_sha_x86-64.S.sh | 34 +++++++++++----------------------- 2 files changed, 21 insertions(+), 46 deletions(-) diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index e26c46f25..287cfe547 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -24,6 +24,7 @@ sha1_process_block64: # xmm0..xmm3: W[] # xmm4,xmm5: temps # xmm6: current round constant +# xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units movl 80(%rdi), %eax # a = ctx->hash[0] @@ -32,16 +33,17 @@ sha1_process_block64: movl 92(%rdi), %edx # d = ctx->hash[3] movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps rconst0x5A827999(%rip), %xmm6 + movaps sha1const(%rip), %xmm7 + pshufd $0x00, %xmm7, %xmm6 # Load W[] to xmm registers, byteswapping on the fly. # # For iterations 0..15, we pass W[] in rsi,r8..r14 - # for use in RD1A's instead of spilling them to stack. + # for use in RD1As instead of spilling them to stack. # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it's probably a wash. + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1A's shorter by one byte). + # LEAs in two first RD1As shorter by one byte). movq 4*0(%rdi), %rsi movq 4*2(%rdi), %r8 bswapq %rsi @@ -253,7 +255,7 @@ sha1_process_block64: roll $5, %edi # rotl32(a,5) addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) - movaps rconst0x6ED9EBA1(%rip), %xmm6 + pshufd $0x55, %xmm7, %xmm6 # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) @@ -614,7 +616,7 @@ sha1_process_block64: roll $5, %esi # rotl32(a,5) addl %esi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) - movaps rconst0x8F1BBCDC(%rip), %xmm6 + pshufd $0xaa, %xmm7, %xmm6 # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) @@ -1001,7 +1003,7 @@ sha1_process_block64: roll $5, %esi # rotl32(a,5) addl %esi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) - movaps rconst0xCA62C1D6(%rip), %xmm6 + pshufd $0xff, %xmm7, %xmm6 # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) @@ -1475,25 +1477,10 @@ sha1_process_block64: .section .rodata.cst16.sha1const, "aM", @progbits, 16 .balign 16 -rconst0x5A827999: +sha1const: .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 -rconst0x6ED9EBA1: - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 .long 0x6ED9EBA1 -rconst0x8F1BBCDC: .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC -rconst0xCA62C1D6: - .long 0xCA62C1D6 - .long 0xCA62C1D6 - .long 0xCA62C1D6 .long 0xCA62C1D6 #endif diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index fb1e4b57e..a10ac411d 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -34,6 +34,7 @@ exec >hash_md5_sha_x86-64.S xmmT1="%xmm4" xmmT2="%xmm5" xmmRCONST="%xmm6" +xmmALLRCONST="%xmm7" T=`printf '\t'` # SSE instructions are longer than 4 bytes on average. 
@@ -125,6 +126,7 @@ sha1_process_block64: # xmm0..xmm3: W[] # xmm4,xmm5: temps # xmm6: current round constant +# xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units movl 80(%rdi), %eax # a = ctx->hash[0] @@ -133,16 +135,17 @@ sha1_process_block64: movl 92(%rdi), %edx # d = ctx->hash[3] movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps rconst0x5A827999(%rip), $xmmRCONST + movaps sha1const(%rip), $xmmALLRCONST + pshufd \$0x00, $xmmALLRCONST, $xmmRCONST # Load W[] to xmm registers, byteswapping on the fly. # # For iterations 0..15, we pass W[] in rsi,r8..r14 - # for use in RD1A's instead of spilling them to stack. + # for use in RD1As instead of spilling them to stack. # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it's probably a wash. + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1A's shorter by one byte). + # LEAs in two first RD1As shorter by one byte). movq 4*0(%rdi), %rsi movq 4*2(%rdi), %r8 bswapq %rsi @@ -359,7 +362,7 @@ RD1A bx cx dx bp ax 4; RD1A ax bx cx dx bp 5; RD1A bp ax bx cx dx 6; RD1A dx a=`PREP %xmm0 %xmm1 %xmm2 %xmm3 "-64+16*0(%rsp)"` b=`RD1A cx dx bp ax bx 8; RD1A bx cx dx bp ax 9; RD1A ax bx cx dx bp 10; RD1A bp ax bx cx dx 11;` INTERLEAVE "$a" "$b" -a=`echo " movaps rconst0x6ED9EBA1(%rip), $xmmRCONST" +a=`echo " pshufd \\$0x55, $xmmALLRCONST, $xmmRCONST" PREP %xmm1 %xmm2 %xmm3 %xmm0 "-64+16*1(%rsp)"` b=`RD1A dx bp ax bx cx 12; RD1A cx dx bp ax bx 13; RD1A bx cx dx bp ax 14; RD1A ax bx cx dx bp 15;` INTERLEAVE "$a" "$b" @@ -378,7 +381,7 @@ INTERLEAVE "$a" "$b" a=`PREP %xmm1 %xmm2 %xmm3 %xmm0 "-64+16*1(%rsp)"` b=`RD2 cx dx bp ax bx 28; RD2 bx cx dx bp ax 29; RD2 ax bx cx dx bp 30; RD2 bp ax bx cx dx 31;` INTERLEAVE "$a" "$b" -a=`echo " movaps rconst0x8F1BBCDC(%rip), $xmmRCONST" +a=`echo " pshufd \\$0xaa, $xmmALLRCONST, $xmmRCONST" PREP %xmm2 %xmm3 %xmm0 %xmm1 "-64+16*2(%rsp)"` b=`RD2 dx bp ax bx cx 32; RD2 cx dx bp ax bx 33; RD2 bx cx dx bp ax 34; RD2 ax bx cx dx bp 35;` INTERLEAVE "$a" "$b" @@ -397,7 +400,7 @@ INTERLEAVE "$a" "$b" a=`PREP %xmm2 %xmm3 %xmm0 %xmm1 "-64+16*2(%rsp)"` b=`RD3 cx dx bp ax bx 48; RD3 bx cx dx bp ax 49; RD3 ax bx cx dx bp 50; RD3 bp ax bx cx dx 51;` INTERLEAVE "$a" "$b" -a=`echo " movaps rconst0xCA62C1D6(%rip), $xmmRCONST" +a=`echo " pshufd \\$0xff, $xmmALLRCONST, $xmmRCONST" PREP %xmm3 %xmm0 %xmm1 %xmm2 "-64+16*3(%rsp)"` b=`RD3 dx bp ax bx cx 52; RD3 cx dx bp ax bx 53; RD3 bx cx dx bp ax 54; RD3 ax bx cx dx bp 55;` INTERLEAVE "$a" "$b" @@ -439,25 +442,10 @@ echo " .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 .balign 16 -rconst0x5A827999: +sha1const: .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 -rconst0x6ED9EBA1: - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 .long 0x6ED9EBA1 -rconst0x8F1BBCDC: .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC -rconst0xCA62C1D6: - .long 0xCA62C1D6 - .long 0xCA62C1D6 - .long 0xCA62C1D6 .long 0xCA62C1D6 #endif" From vda.linux at googlemail.com Mon Feb 7 01:34:04 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Mon, 7 Feb 2022 02:34:04 +0100 Subject: [git commit] libbb/sha1: shrink and speed up unrolled x86-64 code Message-ID: <20220208022853.B8D66831C4@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=c193cbd6dfd095c6b8346bab1ea6ba7106b3e5bb branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta 
sha1_process_block64 3514 3482 -32 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 8 +- libbb/hash_md5_sha256_x86-64_shaNI.S | 8 +- libbb/hash_md5_sha_x86-32_shaNI.S | 4 +- libbb/hash_md5_sha_x86-64.S | 144 +++++++++++++++++++++++++++-------- libbb/hash_md5_sha_x86-64.S.sh | 9 ++- libbb/hash_md5_sha_x86-64_shaNI.S | 4 +- 6 files changed, 131 insertions(+), 46 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 417da37d8..39e2baf41 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -257,8 +257,8 @@ sha256_process_block64_shaNI: ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI -.section .rodata.cst256.K256, "aM", @progbits, 256 -.balign 16 + .section .rodata.cst256.K256, "aM", @progbits, 256 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -277,8 +277,8 @@ K256: .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 -.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: .octa 0x0c0d0e0f08090a0b0405060700010203 diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index dbf391135..c6c931341 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -253,8 +253,8 @@ sha256_process_block64_shaNI: ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI -.section .rodata.cst256.K256, "aM", @progbits, 256 -.balign 16 + .section .rodata.cst256.K256, "aM", @progbits, 256 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -273,8 +273,8 @@ K256: .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 -.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: .octa 0x0c0d0e0f08090a0b0405060700010203 diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 11b855e26..5d082ebfb 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -223,8 +223,8 @@ sha1_process_block64_shaNI: ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI -.section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 47ace60de..e26c46f25 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -180,8 +180,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from 
source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -252,8 +257,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -323,8 +333,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -392,8 +407,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ @@ -457,8 +477,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -522,8 +547,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -588,8 +618,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -653,8 +688,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ @@ -718,8 +758,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -795,8 +840,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -872,8 +922,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -950,8 +1005,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ @@ -1027,8 +1087,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -1104,8 +1169,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -1169,8 +1239,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -1234,8 +1309,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index 656fb5414..fb1e4b57e 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -203,8 +203,13 @@ echo "# PREP $@ movaps $xmmW12, $xmmT1 psrldq \$4, $xmmT1 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd \$0x4e, $xmmW0, $xmmT2 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq $xmmW4, $xmmT2 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd \$0x4e, $xmmW0, $xmmT2 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq $xmmW4, $xmmT2 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps $xmmW0, $xmmT2 + shufps \$0x4e, $xmmW4, $xmmT2 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps $xmmW8, $xmmW0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps $xmmT1, $xmmT2 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index ba92f09df..8ddec87ce 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -217,8 +217,8 @@ sha1_process_block64_shaNI: ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI -.section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f From vda.linux at googlemail.com Tue Feb 8 07:22:17 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 08:22:17 +0100 Subject: [git commit] libbb/sha1: shrink x86 hardware accelerated hashing Message-ID: <20220208073205.50AE1813BB@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=71a1cccaad679bd102f87283f78c581a8fb0e255 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64_shaNI 32-bit 524 517 -7 sha1_process_block64_shaNI 64-bit 510 508 -2 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-32_shaNI.S | 37 +++++++++++++++++-------------------- libbb/hash_md5_sha_x86-64_shaNI.S | 24 ++++++++++++------------ 2 files changed, 29 insertions(+), 32 deletions(-) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 5d082ebfb..0f3fe57ca 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -32,14 +32,10 @@ #define MSG1 %xmm4 #define MSG2 %xmm5 #define MSG3 %xmm6 -#define SHUF_MASK %xmm7 - .balign 8 # allow decoders to fetch at least 3 first insns + .balign 8 # allow decoders to fetch at least 2 first insns sha1_process_block64_shaNI: - pushl %ebp - movl %esp, %ebp - subl $32, %esp - andl $~0xF, %esp # paddd needs aligned memory operand + subl $16, %esp /* load initial hash values */ xor128 E0, E0 @@ -47,30 +43,33 @@ sha1_process_block64_shaNI: pinsrd $3, 76+4*4(%eax), E0 # load to uppermost 32-bit word shuf128_32 $0x1B, ABCD, ABCD # DCBA -> ABCD - mova128 PSHUFFLE_BYTE_FLIP_MASK, SHUF_MASK + mova128 PSHUFFLE_BYTE_FLIP_MASK, %xmm7 + + movu128 0*16(%eax), MSG0 + pshufb 
%xmm7, MSG0 + movu128 1*16(%eax), MSG1 + pshufb %xmm7, MSG1 + movu128 2*16(%eax), MSG2 + pshufb %xmm7, MSG2 + movu128 3*16(%eax), MSG3 + pshufb %xmm7, MSG3 /* Save hash values for addition after rounds */ - movu128 E0, 16(%esp) + movu128 E0, %xmm7 movu128 ABCD, (%esp) /* Rounds 0-3 */ - movu128 0*16(%eax), MSG0 - pshufb SHUF_MASK, MSG0 paddd MSG0, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD /* Rounds 4-7 */ - movu128 1*16(%eax), MSG1 - pshufb SHUF_MASK, MSG1 sha1nexte MSG1, E1 mova128 ABCD, E0 sha1rnds4 $0, E1, ABCD sha1msg1 MSG1, MSG0 /* Rounds 8-11 */ - movu128 2*16(%eax), MSG2 - pshufb SHUF_MASK, MSG2 sha1nexte MSG2, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD @@ -78,8 +77,6 @@ sha1_process_block64_shaNI: xor128 MSG2, MSG0 /* Rounds 12-15 */ - movu128 3*16(%eax), MSG3 - pshufb SHUF_MASK, MSG3 sha1nexte MSG3, E1 mova128 ABCD, E0 sha1msg2 MSG3, MSG0 @@ -210,16 +207,16 @@ sha1_process_block64_shaNI: sha1rnds4 $3, E1, ABCD /* Add current hash values with previously saved */ - sha1nexte 16(%esp), E0 - paddd (%esp), ABCD + sha1nexte %xmm7, E0 + movu128 (%esp), %xmm7 + paddd %xmm7, ABCD /* Write hash values back in the correct order */ shuf128_32 $0x1B, ABCD, ABCD movu128 ABCD, 76(%eax) extr128_32 $3, E0, 76+4*4(%eax) - movl %ebp, %esp - popl %ebp + addl $16, %esp ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index 8ddec87ce..fc2ca92e8 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -32,7 +32,6 @@ #define MSG1 %xmm4 #define MSG2 %xmm5 #define MSG3 %xmm6 -#define SHUF_MASK %xmm7 .balign 8 # allow decoders to fetch at least 2 first insns sha1_process_block64_shaNI: @@ -43,30 +42,33 @@ sha1_process_block64_shaNI: pinsrd $3, 80+4*4(%rdi), E0 # load to uppermost 32-bit word shuf128_32 $0x1B, ABCD, ABCD # DCBA -> ABCD - mova128 PSHUFFLE_BYTE_FLIP_MASK(%rip), SHUF_MASK + mova128 PSHUFFLE_BYTE_FLIP_MASK(%rip), %xmm7 + + movu128 0*16(%rdi), MSG0 + pshufb %xmm7, MSG0 + movu128 1*16(%rdi), MSG1 + pshufb %xmm7, MSG1 + movu128 2*16(%rdi), MSG2 + pshufb %xmm7, MSG2 + movu128 3*16(%rdi), MSG3 + pshufb %xmm7, MSG3 /* Save hash values for addition after rounds */ - mova128 E0, %xmm9 + mova128 E0, %xmm7 mova128 ABCD, %xmm8 /* Rounds 0-3 */ - movu128 0*16(%rdi), MSG0 - pshufb SHUF_MASK, MSG0 paddd MSG0, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD /* Rounds 4-7 */ - movu128 1*16(%rdi), MSG1 - pshufb SHUF_MASK, MSG1 sha1nexte MSG1, E1 mova128 ABCD, E0 sha1rnds4 $0, E1, ABCD sha1msg1 MSG1, MSG0 /* Rounds 8-11 */ - movu128 2*16(%rdi), MSG2 - pshufb SHUF_MASK, MSG2 sha1nexte MSG2, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD @@ -74,8 +76,6 @@ sha1_process_block64_shaNI: xor128 MSG2, MSG0 /* Rounds 12-15 */ - movu128 3*16(%rdi), MSG3 - pshufb SHUF_MASK, MSG3 sha1nexte MSG3, E1 mova128 ABCD, E0 sha1msg2 MSG3, MSG0 @@ -206,7 +206,7 @@ sha1_process_block64_shaNI: sha1rnds4 $3, E1, ABCD /* Add current hash values with previously saved */ - sha1nexte %xmm9, E0 + sha1nexte %xmm7, E0 paddd %xmm8, ABCD /* Write hash values back in the correct order */ From vda.linux at googlemail.com Tue Feb 8 14:23:26 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 15:23:26 +0100 Subject: [git commit] libbb/sha1: shrink x86 hardware accelerated hashing (32-bit) Message-ID: <20220208141656.F168E829FC@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=eb52e7fa522d829fb400461ca4c808ee5c1d6428 branch: 
https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64_shaNI 517 511 -6 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-32_shaNI.S | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 0f3fe57ca..ad814a21b 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -35,11 +35,9 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha1_process_block64_shaNI: - subl $16, %esp - /* load initial hash values */ - xor128 E0, E0 movu128 76(%eax), ABCD + xor128 E0, E0 pinsrd $3, 76+4*4(%eax), E0 # load to uppermost 32-bit word shuf128_32 $0x1B, ABCD, ABCD # DCBA -> ABCD @@ -56,7 +54,7 @@ sha1_process_block64_shaNI: /* Save hash values for addition after rounds */ movu128 E0, %xmm7 - movu128 ABCD, (%esp) + /*movu128 ABCD, %xmm8 - NOPE, 32bit has no xmm8 */ /* Rounds 0-3 */ paddd MSG0, E0 @@ -208,7 +206,9 @@ sha1_process_block64_shaNI: /* Add current hash values with previously saved */ sha1nexte %xmm7, E0 - movu128 (%esp), %xmm7 + /*paddd %xmm8, ABCD - 32-bit mode has no xmm8 */ + movu128 76(%eax), %xmm7 # recreate original ABCD + shuf128_32 $0x1B, %xmm7, %xmm7 # DCBA -> ABCD paddd %xmm7, ABCD /* Write hash values back in the correct order */ @@ -216,7 +216,6 @@ sha1_process_block64_shaNI: movu128 ABCD, 76(%eax) extr128_32 $3, E0, 76+4*4(%eax) - addl $16, %esp ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI From vda.linux at googlemail.com Tue Feb 8 14:34:02 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 15:34:02 +0100 Subject: [git commit] libbb/sha1: shrink x86 hardware accelerated hashing (32-bit) Message-ID: <20220208142920.E0AE082B5D@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=eb8d5f3b8f3c91f3ed82a52b4ce52a154c146ede branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64_shaNI 511 507 -4 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-32_shaNI.S | 9 ++++----- libbb/hash_md5_sha_x86-64_shaNI.S | 3 +-- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index ad814a21b..a61b3cbed 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -53,8 +53,8 @@ sha1_process_block64_shaNI: pshufb %xmm7, MSG3 /* Save hash values for addition after rounds */ - movu128 E0, %xmm7 - /*movu128 ABCD, %xmm8 - NOPE, 32bit has no xmm8 */ + mova128 E0, %xmm7 + /*mova128 ABCD, %xmm8 - NOPE, 32bit has no xmm8 */ /* Rounds 0-3 */ paddd MSG0, E0 @@ -207,12 +207,11 @@ sha1_process_block64_shaNI: /* Add current hash values with previously saved */ sha1nexte %xmm7, E0 /*paddd %xmm8, ABCD - 32-bit mode has no xmm8 */ - movu128 76(%eax), %xmm7 # recreate original ABCD - shuf128_32 $0x1B, %xmm7, %xmm7 # DCBA -> ABCD - paddd %xmm7, ABCD + movu128 76(%eax), %xmm7 # get original ABCD (not shuffled)... 
 	/* Write hash values back in the correct order */
 	shuf128_32	$0x1B, ABCD, ABCD
+	paddd	%xmm7, ABCD	# ...add it to final ABCD
 	movu128	ABCD, 76(%eax)
 	extr128_32	$3, E0, 76+4*4(%eax)

diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S
index fc2ca92e8..b32029360 100644
--- a/libbb/hash_md5_sha_x86-64_shaNI.S
+++ b/libbb/hash_md5_sha_x86-64_shaNI.S
@@ -36,9 +36,8 @@
 	.balign	8	# allow decoders to fetch at least 2 first insns
 sha1_process_block64_shaNI:
 	/* load initial hash values */
-
-	xor128	E0, E0
 	movu128	80(%rdi), ABCD
+	xor128	E0, E0
 	pinsrd	$3, 80+4*4(%rdi), E0	# load to uppermost 32-bit word
 	shuf128_32	$0x1B, ABCD, ABCD	# DCBA -> ABCD

From bugzilla at busybox.net  Tue Feb 8 15:48:37 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 08 Feb 2022 15:48:37 +0000
Subject: [Bug 14571] New: ash crashes with fork (&) and stty -echo
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14571

            Bug ID: 14571
           Summary: ash crashes with fork (&) and stty -echo
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
          Assignee: unassigned at busybox.net
          Reporter: cyrilbur at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

Setting -echo and leaving a dangling fork results in an ash crash.

I have a relatively stripped-down busybox; I am using the busybox coreutils.

Reproduce:

stty -echo
sleep 1 &
ps &
   ^ This is the problem: ash will crash.

I have done the same thing with bash and dash, and neither crashes.

If I have time I will endeavour to get more information.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net  Tue Feb 8 18:41:34 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 08 Feb 2022 18:41:34 +0000
Subject: [Bug 14576] New: unzip: test skipped with bad archive
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14576

            Bug ID: 14576
           Summary: unzip: test skipped with bad archive
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: Standard Compliance
          Assignee: unassigned at busybox.net
          Reporter: dharanendiran at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

When I run the testsuite, the unzip (bad archive) test was skipped. Is the
expected result always "skipped" here, or does this test require further
components to succeed?

# ./runtest -v unzip
======================
echo -ne '' >input
echo -ne '' | unzip -q foo.zip foo/ && test -d foo && test ! -f foo/bar && echo yes
PASS: unzip (subdir only)
SKIPPED: unzip (bad archive)
======================
echo -ne '' >input
echo -ne '' | unzip -p ../unzip_bad_lzma_1.zip 2>&1; echo $?
PASS: unzip (archive with corrupted lzma 1)
======================
echo -ne '' >input
echo -ne '' | unzip -p ../unzip_bad_lzma_2.zip 2>&1; echo $?
PASS: unzip (archive with corrupted lzma 2)
#

The following config options are enabled in busybox:
FEATURE_UNZIP_CDF
CONFIG_UNICODE_SUPPORT
UUDECODE

-- 
You are receiving this mail because:
You are on the CC list for the bug.
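
The pshufd/shufps subtlety called out repeatedly in the SHA-1 commits above
is easy to verify from C. The following is a minimal standalone sketch, not
from the BusyBox tree; the names w0/w4 merely mirror the W[] vectors in the
commit comments. It shows that movaps+shufps $0x4e yields the same
([2],[3],[4],[5]) result as the longer pshufd+punpcklqdq pair, and how
pshufd with immediates 0x00/0x55/0xaa/0xff broadcasts one round constant
out of a packed sha1const-style vector:

/*
 * shuf_demo.c - standalone demo, not BusyBox code.
 * Build: cc -O2 -msse2 shuf_demo.c
 */
#include <stdio.h>
#include <emmintrin.h>	/* SSE2 intrinsics: pshufd, punpcklqdq, shufps */

static void print128(const char *tag, __m128i v)
{
	unsigned d[4];
	_mm_storeu_si128((__m128i *)d, v);
	printf("%-19s %08x %08x %08x %08x\n", tag, d[0], d[1], d[2], d[3]);
}

int main(void)
{
	__m128i w0 = _mm_setr_epi32(0, 1, 2, 3);	/* dwords [0],[1],[2],[3] */
	__m128i w4 = _mm_setr_epi32(4, 5, 6, 7);	/* dwords [4],[5],[6],[7] */

	/* Long form: pshufd $0x4e, then punpcklqdq */
	__m128i t = _mm_shuffle_epi32(w0, 0x4e);	/* ([2],[3],[0],[1]) */
	t = _mm_unpacklo_epi64(t, w4);			/* ([2],[3],[4],[5]) */

	/* Short form: shufps $0x4e. Result dwords 0,1 come from the
	 * destination operand, dwords 2,3 from the source operand. */
	__m128i s = _mm_castps_si128(_mm_shuffle_ps(
			_mm_castsi128_ps(w0), _mm_castsi128_ps(w4), 0x4e));

	print128("pshufd+punpcklqdq:", t);	/* 00000002 00000003 00000004 00000005 */
	print128("shufps:", s);			/* same */

	/* pshufd as a broadcast: pull round constant 1 (0x6ED9EBA1) out of
	 * a packed vector of all four SHA-1 constants, as pshufd $0x55 does
	 * with sha1const in the commit above. */
	__m128i k = _mm_setr_epi32(0x5A827999, 0x6ED9EBA1,
				(int)0x8F1BBCDC, (int)0xCA62C1D6);
	print128("pshufd $0x55:", _mm_shuffle_epi32(k, 0x55));
	return 0;
}

If built and run, both forms should print the same 2,3,4,5 pattern; the
shufps form needs one instruction fewer per PREP step, which is consistent
with the size deltas reported in those commits.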
From vda.linux at googlemail.com Wed Feb 9 00:30:23 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 01:30:23 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220209003846.5A4608148C@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=461a994b09c5022b93bccccf903b39438d61bbf1 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 697 676 -21 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index a849dfcc2..846230e3e 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -34,16 +34,18 @@ #define XMMTMP %xmm7 +#define SHUF(a,b,c,d) $(a+(b<<2)+(c<<4)+(d<<6)) + .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 76+0*16(%eax), STATE0 - movu128 76+1*16(%eax), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + movu128 76+0*16(%eax), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 76+1*16(%eax), STATE0 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE0, XMMTMP - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ + shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ + mova128 XMMTMP, STATE1 /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP @@ -231,18 +233,19 @@ sha256_process_block64_shaNI: sha256rnds2 STATE1, STATE0 /* Write hash values back in the correct order */ - shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ + /* STATE1: CDGH */ mova128 STATE0, XMMTMP - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP, STATE1 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ /* add current hash values to previous ones */ + movu128 76+1*16(%eax), STATE1 + paddd XMMTMP, STATE1 + movu128 STATE1, 76+1*16(%eax) movu128 76+0*16(%eax), XMMTMP paddd XMMTMP, STATE0 - movu128 76+1*16(%eax), XMMTMP movu128 STATE0, 76+0*16(%eax) - paddd XMMTMP, STATE1 - movu128 STATE1, 76+1*16(%eax) ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI From vda.linux at googlemail.com Tue Feb 8 23:33:39 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 00:33:39 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220209003846.4FB3582DFD@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=c0ff0d4528d718c20b9ca2290bd10d59e9f794a3 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 713 697 -16 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 130 ++++++++++++++++------------------- libbb/hash_md5_sha256_x86-64_shaNI.S | 107 ++++++++++++++-------------- 2 files changed, 114 insertions(+), 123 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 39e2baf41..a849dfcc2 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ 
b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -31,35 +31,27 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define XMMTMP4 %xmm7 - .balign 8 # allow decoders to fetch at least 3 first insns -sha256_process_block64_shaNI: - pushl %ebp - movl %esp, %ebp - subl $32, %esp - andl $~0xF, %esp # paddd needs aligned memory operand +#define XMMTMP %xmm7 + .balign 8 # allow decoders to fetch at least 2 first insns +sha256_process_block64_shaNI: movu128 76+0*16(%eax), STATE0 movu128 76+1*16(%eax), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, XMMTMP4 - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, XMMTMP + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ -/* XMMTMP4 holds flip mask from here... */ - mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP4 +/* XMMTMP holds flip mask from here... */ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP movl $K256+8*16, SHA256CONSTANTS - /* Save hash values for addition after rounds */ - mova128 STATE0, 0*16(%esp) - mova128 STATE1, 1*16(%esp) - /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -68,7 +60,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -78,7 +70,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -88,14 +80,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG /* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -105,9 +97,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -117,9 +109,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -129,9 +121,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -141,9 +133,9 @@ sha256_process_block64_shaNI: mova128 
MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -153,9 +145,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -165,9 +157,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -177,9 +169,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -189,9 +181,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -201,9 +193,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -213,9 +205,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -224,9 +216,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -238,22 +230,20 @@ sha256_process_block64_shaNI: shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 - /* Add current hash values with previously saved */ - paddd 0*16(%esp), STATE0 - paddd 1*16(%esp), STATE1 - /* Write hash values back in the correct order */ - shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, XMMTMP4 - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP4, STATE1 /* HGFE */ - + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG 
*/ + mova128 STATE0, XMMTMP + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, XMMTMP, STATE1 /* HGFE */ + /* add current hash values to previous ones */ + movu128 76+0*16(%eax), XMMTMP + paddd XMMTMP, STATE0 + movu128 76+1*16(%eax), XMMTMP movu128 STATE0, 76+0*16(%eax) + paddd XMMTMP, STATE1 movu128 STATE1, 76+1*16(%eax) - movl %ebp, %esp - popl %ebp ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index c6c931341..b5c950a9a 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -31,7 +31,8 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define XMMTMP4 %xmm7 + +#define XMMTMP %xmm7 #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 @@ -41,14 +42,14 @@ sha256_process_block64_shaNI: movu128 80+0*16(%rdi), STATE0 movu128 80+1*16(%rdi), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, XMMTMP4 - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, XMMTMP + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ -/* XMMTMP4 holds flip mask from here... */ - mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP4 +/* XMMTMP holds flip mask from here... */ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP leaq K256+8*16(%rip), SHA256CONSTANTS /* Save hash values for addition after rounds */ @@ -57,7 +58,7 @@ sha256_process_block64_shaNI: /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -66,7 +67,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -76,7 +77,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -86,14 +87,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG /* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -103,9 +104,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -115,9 +116,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -127,9 +128,9 @@ 
sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -139,9 +140,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -151,9 +152,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -163,9 +164,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -175,9 +176,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -187,9 +188,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -199,9 +200,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -211,9 +212,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -222,9 +223,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -241,11 +242,11 @@ sha256_process_block64_shaNI: paddd CDGH_SAVE, STATE1 /* Write hash values back in the correct order */ - shuf128_32 
$0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, XMMTMP4 - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP4, STATE1 /* HGFE */ + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + mova128 STATE0, XMMTMP + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, XMMTMP, STATE1 /* HGFE */ movu128 STATE0, 80+0*16(%rdi) movu128 STATE1, 80+1*16(%rdi) From vda.linux at googlemail.com Wed Feb 9 00:42:49 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 01:42:49 +0100 Subject: [git commit] libbb/sha256: code shrink in 64-bit x86 Message-ID: <20220209003846.64EAB8315B@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=11bcea7ac0ac4b2156c1b2d53f926d789b9792b4 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 701 680 -21 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-64_shaNI.S | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index b5c950a9a..bc063b9cc 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -37,16 +37,18 @@ #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 +#define SHUF(a,b,c,d) $(a+(b<<2)+(c<<4)+(d<<6)) + .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 80+0*16(%rdi), STATE0 - movu128 80+1*16(%rdi), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + movu128 80+0*16(%rdi), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 80+1*16(%rdi), STATE0 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE0, XMMTMP - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ + shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ + mova128 XMMTMP, STATE1 /* XMMTMP holds flip mask from here... 
*/ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP @@ -242,14 +244,15 @@ sha256_process_block64_shaNI: paddd CDGH_SAVE, STATE1 /* Write hash values back in the correct order */ - shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ + /* STATE1: CDGH */ mova128 STATE0, XMMTMP - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP, STATE1 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ movu128 STATE0, 80+0*16(%rdi) - movu128 STATE1, 80+1*16(%rdi) + movu128 XMMTMP, 80+1*16(%rdi) ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI From vda.linux at googlemail.com Wed Feb 9 00:50:22 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 01:50:22 +0100 Subject: [git commit] libbb/sha256: code shrink in x86 assembly Message-ID: <20220209004444.C2B4182B8E@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=caa9c4f707b661cf398f2c2d66f54f5b0d8adfe2 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 32-bit 676 673 -3 sha256_process_block64_shaNI 64-bit 680 677 -3 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 11 +++++------ libbb/hash_md5_sha256_x86-64_shaNI.S | 11 +++++------ 2 files changed, 10 insertions(+), 12 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 846230e3e..aa68193bd 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -39,13 +39,12 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 76+0*16(%eax), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 76+1*16(%eax), STATE0 /* HGFE */ + movu128 76+0*16(%eax), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 76+1*16(%eax), STATE1 /* HGFE */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - mova128 STATE0, XMMTMP - shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ - mova128 XMMTMP, STATE1 + mova128 STATE1, STATE0 + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index bc063b9cc..4663f750a 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -42,13 +42,12 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 80+0*16(%rdi), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 80+1*16(%rdi), STATE0 /* HGFE */ + movu128 80+0*16(%rdi), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 80+1*16(%rdi), STATE1 /* HGFE */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - mova128 STATE0, XMMTMP - shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ - mova128 XMMTMP, STATE1 + mova128 STATE1, STATE0 + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ /* XMMTMP holds flip mask from here... 
*/ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP From vda.linux at googlemail.com Wed Feb 9 10:29:23 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 11:29:23 +0100 Subject: [git commit] whitespace fix Message-ID: <20220209102223.E965181D55@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=6a6c1c0ea91edeeb18736190feb5a7278d3d1141 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 6 +++--- libbb/hash_md5_sha256_x86-64_shaNI.S | 6 +++--- libbb/hash_md5_sha_x86-32_shaNI.S | 4 ++-- libbb/hash_md5_sha_x86-64_shaNI.S | 4 ++-- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index aa68193bd..413e2df9e 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -250,7 +250,7 @@ sha256_process_block64_shaNI: .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI .section .rodata.cst256.K256, "aM", @progbits, 256 - .balign 16 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -270,8 +270,8 @@ K256: .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: - .octa 0x0c0d0e0f08090a0b0405060700010203 + .octa 0x0c0d0e0f08090a0b0405060700010203 #endif diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index 4663f750a..c246762aa 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -257,7 +257,7 @@ sha256_process_block64_shaNI: .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI .section .rodata.cst256.K256, "aM", @progbits, 256 - .balign 16 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -277,8 +277,8 @@ K256: .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: - .octa 0x0c0d0e0f08090a0b0405060700010203 + .octa 0x0c0d0e0f08090a0b0405060700010203 #endif diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index a61b3cbed..afca98a62 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -219,8 +219,8 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: - .octa 0x000102030405060708090a0b0c0d0e0f + .octa 0x000102030405060708090a0b0c0d0e0f #endif diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index b32029360..54d122788 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -217,8 +217,8 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: - .octa 0x000102030405060708090a0b0c0d0e0f + .octa 0x000102030405060708090a0b0c0d0e0f #endif From vda.linux at googlemail.com Thu Feb 10 14:38:10 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 10 Feb 2022 15:38:10 +0100 Subject: [git commit] libbb/sha: improve 
comments Message-ID: <20220210143100.BAFC48142B@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=6f56fa17131b3cbb84e887c6c5fb202f2492169e branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 18 +++++++++--------- libbb/hash_md5_sha256_x86-64_shaNI.S | 19 +++++++++---------- libbb/hash_md5_sha_x86-32_shaNI.S | 2 +- libbb/hash_md5_sha_x86-64_shaNI.S | 2 +- 4 files changed, 20 insertions(+), 21 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 413e2df9e..4b33449d4 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). @@ -39,12 +39,13 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 76+0*16(%eax), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 76+1*16(%eax), STATE1 /* HGFE */ + movu128 76+0*16(%eax), XMMTMP /* ABCD (little-endian dword order) */ + movu128 76+1*16(%eax), STATE1 /* EFGH */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE1, STATE0 - shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ + /* --- -------------- ABCD -- EFGH */ + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* FEBA */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* HGDC */ /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP @@ -232,12 +233,11 @@ sha256_process_block64_shaNI: sha256rnds2 STATE1, STATE0 /* Write hash values back in the correct order */ - /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ - /* STATE1: CDGH */ mova128 STATE0, XMMTMP /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ - shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ + /* --- -------------- HGDC -- FEBA */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* ABCD */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* EFGH */ /* add current hash values to previous ones */ movu128 76+1*16(%eax), STATE1 paddd XMMTMP, STATE1 diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index c246762aa..5ed80c2ef 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). 
@@ -42,12 +42,13 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 80+0*16(%rdi), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 80+1*16(%rdi), STATE1 /* HGFE */ + movu128 80+0*16(%rdi), XMMTMP /* ABCD (little-endian dword order) */ + movu128 80+1*16(%rdi), STATE1 /* EFGH */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE1, STATE0 - shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ + /* --- -------------- ABCD -- EFGH */ + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* FEBA */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* HGDC */ /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP @@ -243,13 +244,11 @@ sha256_process_block64_shaNI: paddd CDGH_SAVE, STATE1 /* Write hash values back in the correct order */ - /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ - /* STATE1: CDGH */ mova128 STATE0, XMMTMP /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ - shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ - + /* --- -------------- HGDC -- FEBA */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* ABCD */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* EFGH */ movu128 STATE0, 80+0*16(%rdi) movu128 XMMTMP, 80+1*16(%rdi) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index afca98a62..c7fb243ce 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index 54d122788..c13cdec07 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). 
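The SHUF() macro these patches rely on packs four 2-bit dword selectors into a shufps immediate; in AT&T syntax, shufps fills result dwords 0,1 from its destination (second) operand and dwords 2,3 from its source (first) operand, exactly as the added comments say. A small self-contained C model (an illustrative sketch, not code from the busybox tree) reproduces the two shufps at function entry that turn the ABCD/EFGH hash layout into the layout sha256rnds2 expects:

#include <stdio.h>
#include <stdint.h>

/* Model of AT&T-syntax "shufps $SHUF(a,b,c,d), src, dst":
 * result dwords 0,1 are picked from dst, dwords 2,3 from src.
 * Illustrative sketch mirroring the SHUF() macro above.
 */
#define SHUF(a,b,c,d) ((a) + ((b)<<2) + ((c)<<4) + ((d)<<6))

typedef struct { uint32_t d[4]; } xmm; /* d[0] is the least significant dword */

static xmm shufps(int imm, xmm src, xmm dst)
{
	xmm r;
	r.d[0] = dst.d[ imm       & 3];
	r.d[1] = dst.d[(imm >> 2) & 3];
	r.d[2] = src.d[(imm >> 4) & 3];
	r.d[3] = src.d[(imm >> 6) & 3];
	return r;
}

int main(void)
{
	/* movu128 loads hash[0..3] and hash[4..7] in A,B,C,D / E,F,G,H dword order */
	xmm abcd = {{ 'A', 'B', 'C', 'D' }};
	xmm efgh = {{ 'E', 'F', 'G', 'H' }};
	xmm state0 = shufps(SHUF(1,0,1,0), abcd, efgh); /* dwords 0..3 = F,E,B,A */
	xmm state1 = shufps(SHUF(3,2,3,2), abcd, efgh); /* dwords 0..3 = H,G,D,C */
	int i;
	for (i = 3; i >= 0; i--) putchar(state0.d[i]); /* prints ABEF (msb-to-lsb) */
	putchar(' ');
	for (i = 3; i >= 0; i--) putchar(state1.d[i]); /* prints CDGH */
	putchar('\n');
	return 0;
}

Run through this model, SHUF(1,0,1,0) and SHUF(3,2,3,2) yield exactly the FEBA and HGDC dword orderings the new comments describe, i.e. STATE0 = ABEF and STATE1 = CDGH read msb-to-lsb.
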
From vda.linux at googlemail.com Fri Feb 11 05:08:27 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Fri, 11 Feb 2022 06:08:27 +0100 Subject: [git commit] libbb/sha1: shrink unrolled x86-64 code Message-ID: <20220211050806.E034782212@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=8154146be491bc66ab34d5d5f2a2466ddbdcff52 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64 3481 3384 -97 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-64.S | 129 ++++++++++++++++++++--------------------- libbb/hash_md5_sha_x86-64.S.sh | 111 +++++++++++++++++------------------ 2 files changed, 117 insertions(+), 123 deletions(-) diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 287cfe547..51fde082a 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -27,68 +27,60 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps sha1const(%rip), %xmm7 + movaps bswap32_mask(%rip), %xmm4 pshufd $0x00, %xmm7, %xmm6 - # Load W[] to xmm registers, byteswapping on the fly. + # Load W[] to xmm0..3, byteswapping on the fly. # - # For iterations 0..15, we pass W[] in rsi,r8..r14 + # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. - # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1As shorter by one byte). - movq 4*0(%rdi), %rsi - movq 4*2(%rdi), %r8 - bswapq %rsi - bswapq %r8 - rolq $32, %rsi # rsi = W[1]:W[0] - rolq $32, %r8 # r8 = W[3]:W[2] - movq %rsi, %xmm0 - movq %r8, %xmm4 - punpcklqdq %xmm4, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) -# movaps %xmm0, %xmm4 # add RCONST, spill to stack -# paddd %xmm6, %xmm4 -# movups %xmm4, -64+16*0(%rsp) + # ADDs in two first RD1As shorter by one byte). 
+ movups 16*0(%rdi), %xmm0 + pshufb %xmm4, %xmm0 + movaps %xmm0, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %rsi +# pextrq $1, %xmm5, %r8 #SSE4.1 insn +# movhpd %xmm5, %r8 #can only move to mem, not to reg + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r8 + + movups 16*1(%rdi), %xmm1 + pshufb %xmm4, %xmm1 + movaps %xmm1, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %r9 + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r10 - movq 4*4(%rdi), %r9 - movq 4*6(%rdi), %r10 - bswapq %r9 - bswapq %r10 - rolq $32, %r9 # r9 = W[5]:W[4] - rolq $32, %r10 # r10 = W[7]:W[6] - movq %r9, %xmm1 - movq %r10, %xmm4 - punpcklqdq %xmm4, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) + movups 16*2(%rdi), %xmm2 + pshufb %xmm4, %xmm2 + movaps %xmm2, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %r11 + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r12 - movq 4*8(%rdi), %r11 - movq 4*10(%rdi), %r12 - bswapq %r11 - bswapq %r12 - rolq $32, %r11 # r11 = W[9]:W[8] - rolq $32, %r12 # r12 = W[11]:W[10] - movq %r11, %xmm2 - movq %r12, %xmm4 - punpcklqdq %xmm4, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) + movups 16*3(%rdi), %xmm3 + pshufb %xmm4, %xmm3 + movaps %xmm3, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %r13 + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r14 - movq 4*12(%rdi), %r13 - movq 4*14(%rdi), %r14 - bswapq %r13 - bswapq %r14 - rolq $32, %r13 # r13 = W[13]:W[12] - rolq $32, %r14 # r14 = W[15]:W[14] - movq %r13, %xmm3 - movq %r14, %xmm4 - punpcklqdq %xmm4, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) + # MOVQs to GPRs (above) have somewhat high latency. + # Load hash[] while they are completing: + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] # 0 - leal 0x5A827999(%rbp,%rsi), %ebp # e += RCONST + W[n] + addl %esi, %ebp # e += RCONST + W[n] shrq $32, %rsi movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -100,7 +92,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 1 - leal 0x5A827999(%rdx,%rsi), %edx # e += RCONST + W[n] + addl %esi, %edx # e += RCONST + W[n] movl %ebx, %edi # c xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -111,7 +103,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 2 - leal 0x5A827999(%rcx,%r8), %ecx # e += RCONST + W[n] + addl %r8d, %ecx # e += RCONST + W[n] shrq $32, %r8 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -123,7 +115,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 3 - leal 0x5A827999(%rbx,%r8), %ebx # e += RCONST + W[n] + addl %r8d, %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -134,7 +126,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 4 - leal 0x5A827999(%rax,%r9), %eax # e += RCONST + W[n] + addl %r9d, %eax # e += RCONST + W[n] shrq $32, %r9 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -146,7 +138,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 5 - leal 0x5A827999(%rbp,%r9), %ebp # e += RCONST + W[n] + addl %r9d, %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -157,7 +149,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 6 - leal 0x5A827999(%rdx,%r10), %edx # e += RCONST + W[n] + addl %r10d, %edx # e += RCONST + W[n] shrq $32, %r10 movl %ebx, %edi # c xorl %ecx, %edi # ^d @@ -169,7 
+161,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 7 - leal 0x5A827999(%rcx,%r10), %ecx # e += RCONST + W[n] + addl %r10d, %ecx # e += RCONST + W[n] movl %eax, %edi # c xorl %ebx, %edi # ^d andl %ebp, %edi # &b @@ -210,7 +202,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*0(%rsp) # 8 - leal 0x5A827999(%rbx,%r11), %ebx # e += RCONST + W[n] + addl %r11d, %ebx # e += RCONST + W[n] shrq $32, %r11 movl %ebp, %edi # c xorl %eax, %edi # ^d @@ -222,7 +214,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 9 - leal 0x5A827999(%rax,%r11), %eax # e += RCONST + W[n] + addl %r11d, %eax # e += RCONST + W[n] movl %edx, %edi # c xorl %ebp, %edi # ^d andl %ecx, %edi # &b @@ -233,7 +225,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 10 - leal 0x5A827999(%rbp,%r12), %ebp # e += RCONST + W[n] + addl %r12d, %ebp # e += RCONST + W[n] shrq $32, %r12 movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -245,7 +237,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 11 - leal 0x5A827999(%rdx,%r12), %edx # e += RCONST + W[n] + addl %r12d, %edx # e += RCONST + W[n] movl %ebx, %edi # c xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -287,7 +279,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*1(%rsp) # 12 - leal 0x5A827999(%rcx,%r13), %ecx # e += RCONST + W[n] + addl %r13d, %ecx # e += RCONST + W[n] shrq $32, %r13 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -299,7 +291,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 13 - leal 0x5A827999(%rbx,%r13), %ebx # e += RCONST + W[n] + addl %r13d, %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -310,7 +302,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 14 - leal 0x5A827999(%rax,%r14), %eax # e += RCONST + W[n] + addl %r14d, %eax # e += RCONST + W[n] shrq $32, %r14 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -322,7 +314,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 15 - leal 0x5A827999(%rbp,%r14), %ebp # e += RCONST + W[n] + addl %r14d, %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -1475,6 +1467,11 @@ sha1_process_block64: ret .size sha1_process_block64, .-sha1_process_block64 + .section .rodata.cst16.bswap32_mask, "aM", @progbits, 16 + .balign 16 +bswap32_mask: + .octa 0x0c0d0e0f08090a0b0405060700010203 + .section .rodata.cst16.sha1const, "aM", @progbits, 16 .balign 16 sha1const: diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index a10ac411d..f34e6e6fa 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -129,65 +129,57 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps sha1const(%rip), $xmmALLRCONST + movaps bswap32_mask(%rip), $xmmT1 pshufd \$0x00, $xmmALLRCONST, $xmmRCONST - # Load W[] to xmm registers, byteswapping on the fly. + # Load W[] to xmm0..3, byteswapping on the fly. 
# - # For iterations 0..15, we pass W[] in rsi,r8..r14 + # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. - # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1As shorter by one byte). - movq 4*0(%rdi), %rsi - movq 4*2(%rdi), %r8 - bswapq %rsi - bswapq %r8 - rolq \$32, %rsi # rsi = W[1]:W[0] - rolq \$32, %r8 # r8 = W[3]:W[2] - movq %rsi, %xmm0 - movq %r8, $xmmT1 - punpcklqdq $xmmT1, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) -# movaps %xmm0, $xmmT1 # add RCONST, spill to stack -# paddd $xmmRCONST, $xmmT1 -# movups $xmmT1, -64+16*0(%rsp) - - movq 4*4(%rdi), %r9 - movq 4*6(%rdi), %r10 - bswapq %r9 - bswapq %r10 - rolq \$32, %r9 # r9 = W[5]:W[4] - rolq \$32, %r10 # r10 = W[7]:W[6] - movq %r9, %xmm1 - movq %r10, $xmmT1 - punpcklqdq $xmmT1, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) - - movq 4*8(%rdi), %r11 - movq 4*10(%rdi), %r12 - bswapq %r11 - bswapq %r12 - rolq \$32, %r11 # r11 = W[9]:W[8] - rolq \$32, %r12 # r12 = W[11]:W[10] - movq %r11, %xmm2 - movq %r12, $xmmT1 - punpcklqdq $xmmT1, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) - - movq 4*12(%rdi), %r13 - movq 4*14(%rdi), %r14 - bswapq %r13 - bswapq %r14 - rolq \$32, %r13 # r13 = W[13]:W[12] - rolq \$32, %r14 # r14 = W[15]:W[14] - movq %r13, %xmm3 - movq %r14, $xmmT1 - punpcklqdq $xmmT1, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) + # ADDs in two first RD1As shorter by one byte). + movups 16*0(%rdi), %xmm0 + pshufb $xmmT1, %xmm0 + movaps %xmm0, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %rsi +# pextrq \$1, $xmmT2, %r8 #SSE4.1 insn +# movhpd $xmmT2, %r8 #can only move to mem, not to reg + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r8 + + movups 16*1(%rdi), %xmm1 + pshufb $xmmT1, %xmm1 + movaps %xmm1, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %r9 + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r10 + + movups 16*2(%rdi), %xmm2 + pshufb $xmmT1, %xmm2 + movaps %xmm2, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %r11 + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r12 + + movups 16*3(%rdi), %xmm3 + pshufb $xmmT1, %xmm3 + movaps %xmm3, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %r13 + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r14 + + # MOVQs to GPRs (above) have somewhat high latency. 
+ # Load hash[] while they are completing: + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] " PREP() { @@ -266,15 +258,15 @@ local rN=$((7+n0/2)) echo " # $n ";test $n0 = 0 && echo " - leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] + addl %esi, %e$e # e += RCONST + W[n] shrq \$32, %rsi ";test $n0 = 1 && echo " - leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] + addl %esi, %e$e # e += RCONST + W[n] ";test $n0 -ge 2 && test $((n0 & 1)) = 0 && echo " - leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] + addl %r${rN}d, %e$e # e += RCONST + W[n] shrq \$32, %r$rN ";test $n0 -ge 2 && test $((n0 & 1)) = 1 && echo " - leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] + addl %r${rN}d, %e$e # e += RCONST + W[n] ";echo " movl %e$c, %edi # c xorl %e$d, %edi # ^d @@ -440,6 +432,11 @@ echo " ret .size sha1_process_block64, .-sha1_process_block64 + .section .rodata.cst16.bswap32_mask, \"aM\", @progbits, 16 + .balign 16 +bswap32_mask: + .octa 0x0c0d0e0f08090a0b0405060700010203 + .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 .balign 16 sha1const: From vda.linux at googlemail.com Fri Feb 11 13:53:26 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Fri, 11 Feb 2022 14:53:26 +0100 Subject: [git commit] libbb/sha1: revert last commit: pshufb is a SSSE3 insn, can't use it Message-ID: <20220211134649.1F2D782E01@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=dda77e83762861b52d62f0f161e2b4bf8092eacf branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 4 ++ libbb/hash_md5_sha256_x86-64_shaNI.S | 4 ++ libbb/hash_md5_sha_x86-32_shaNI.S | 5 ++ libbb/hash_md5_sha_x86-64.S | 127 +++++++++++++++++---------------- libbb/hash_md5_sha_x86-64.S.sh | 133 +++++++++++++++++++++-------------- libbb/hash_md5_sha_x86-64_shaNI.S | 5 ++ 6 files changed, 163 insertions(+), 115 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 4b33449d4..c059fb18d 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -15,6 +15,10 @@ //#define shuf128_32 pshufd #define shuf128_32 shufps +// pshufb and palignr are SSSE3 insns. +// We do not check SSSE3 in cpuid, +// all SHA-capable CPUs support it as well. + .section .text.sha256_process_block64_shaNI, "ax", @progbits .globl sha256_process_block64_shaNI .hidden sha256_process_block64_shaNI diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index 5ed80c2ef..9578441f8 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -15,6 +15,10 @@ //#define shuf128_32 pshufd #define shuf128_32 shufps +// pshufb and palignr are SSSE3 insns. +// We do not check SSSE3 in cpuid, +// all SHA-capable CPUs support it as well. + .section .text.sha256_process_block64_shaNI, "ax", @progbits .globl sha256_process_block64_shaNI .hidden sha256_process_block64_shaNI diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index c7fb243ce..2366b046a 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -20,6 +20,11 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter +// pshufb is a SSSE3 insn. +// pinsrd, pextrd, extractps are SSE4.1 insns. 
+// We do not check SSSE3/SSE4.1 in cpuid, +// all SHA-capable CPUs support them as well. + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 51fde082a..f0daa30f6 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -27,60 +27,68 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] + movaps sha1const(%rip), %xmm7 - movaps bswap32_mask(%rip), %xmm4 pshufd $0x00, %xmm7, %xmm6 # Load W[] to xmm0..3, byteswapping on the fly. # - # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 + # For iterations 0..15, we pass W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. + # We lose parallelized addition of RCONST, but LEA + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # ADDs in two first RD1As shorter by one byte). - movups 16*0(%rdi), %xmm0 - pshufb %xmm4, %xmm0 - movaps %xmm0, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %rsi -# pextrq $1, %xmm5, %r8 #SSE4.1 insn -# movhpd %xmm5, %r8 #can only move to mem, not to reg - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r8 - - movups 16*1(%rdi), %xmm1 - pshufb %xmm4, %xmm1 - movaps %xmm1, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %r9 - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r10 + # LEAs in two first RD1As shorter by one byte). + movq 4*0(%rdi), %rsi + movq 4*2(%rdi), %r8 + bswapq %rsi + bswapq %r8 + rolq $32, %rsi # rsi = W[1]:W[0] + rolq $32, %r8 # r8 = W[3]:W[2] + movq %rsi, %xmm0 + movq %r8, %xmm4 + punpcklqdq %xmm4, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) +# movaps %xmm0, %xmm4 # add RCONST, spill to stack +# paddd %xmm6, %xmm4 +# movups %xmm4, -64+16*0(%rsp) - movups 16*2(%rdi), %xmm2 - pshufb %xmm4, %xmm2 - movaps %xmm2, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %r11 - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r12 + movq 4*4(%rdi), %r9 + movq 4*6(%rdi), %r10 + bswapq %r9 + bswapq %r10 + rolq $32, %r9 # r9 = W[5]:W[4] + rolq $32, %r10 # r10 = W[7]:W[6] + movq %r9, %xmm1 + movq %r10, %xmm4 + punpcklqdq %xmm4, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) - movups 16*3(%rdi), %xmm3 - pshufb %xmm4, %xmm3 - movaps %xmm3, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %r13 - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r14 + movq 4*8(%rdi), %r11 + movq 4*10(%rdi), %r12 + bswapq %r11 + bswapq %r12 + rolq $32, %r11 # r11 = W[9]:W[8] + rolq $32, %r12 # r12 = W[11]:W[10] + movq %r11, %xmm2 + movq %r12, %xmm4 + punpcklqdq %xmm4, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) - # MOVQs to GPRs (above) have somewhat high latency. 
- # Load hash[] while they are completing: - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] + movq 4*12(%rdi), %r13 + movq 4*14(%rdi), %r14 + bswapq %r13 + bswapq %r14 + rolq $32, %r13 # r13 = W[13]:W[12] + rolq $32, %r14 # r14 = W[15]:W[14] + movq %r13, %xmm3 + movq %r14, %xmm4 + punpcklqdq %xmm4, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) # 0 - addl %esi, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%rsi), %ebp # e += RCONST + W[n] shrq $32, %rsi movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -92,7 +100,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 1 - addl %esi, %edx # e += RCONST + W[n] + leal 0x5A827999(%rdx,%rsi), %edx # e += RCONST + W[n] movl %ebx, %edi # c xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -103,7 +111,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 2 - addl %r8d, %ecx # e += RCONST + W[n] + leal 0x5A827999(%rcx,%r8), %ecx # e += RCONST + W[n] shrq $32, %r8 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -115,7 +123,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 3 - addl %r8d, %ebx # e += RCONST + W[n] + leal 0x5A827999(%rbx,%r8), %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -126,7 +134,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 4 - addl %r9d, %eax # e += RCONST + W[n] + leal 0x5A827999(%rax,%r9), %eax # e += RCONST + W[n] shrq $32, %r9 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -138,7 +146,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 5 - addl %r9d, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%r9), %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -149,7 +157,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 6 - addl %r10d, %edx # e += RCONST + W[n] + leal 0x5A827999(%rdx,%r10), %edx # e += RCONST + W[n] shrq $32, %r10 movl %ebx, %edi # c xorl %ecx, %edi # ^d @@ -161,7 +169,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 7 - addl %r10d, %ecx # e += RCONST + W[n] + leal 0x5A827999(%rcx,%r10), %ecx # e += RCONST + W[n] movl %eax, %edi # c xorl %ebx, %edi # ^d andl %ebp, %edi # &b @@ -202,7 +210,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*0(%rsp) # 8 - addl %r11d, %ebx # e += RCONST + W[n] + leal 0x5A827999(%rbx,%r11), %ebx # e += RCONST + W[n] shrq $32, %r11 movl %ebp, %edi # c xorl %eax, %edi # ^d @@ -214,7 +222,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 9 - addl %r11d, %eax # e += RCONST + W[n] + leal 0x5A827999(%rax,%r11), %eax # e += RCONST + W[n] movl %edx, %edi # c xorl %ebp, %edi # ^d andl %ecx, %edi # &b @@ -225,7 +233,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 10 - addl %r12d, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%r12), %ebp # e += RCONST + W[n] shrq $32, %r12 movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -237,7 +245,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 11 - addl %r12d, %edx # e += RCONST + W[n] + leal 0x5A827999(%rdx,%r12), %edx # e += RCONST + W[n] movl %ebx, %edi # c 
xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -279,7 +287,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*1(%rsp) # 12 - addl %r13d, %ecx # e += RCONST + W[n] + leal 0x5A827999(%rcx,%r13), %ecx # e += RCONST + W[n] shrq $32, %r13 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -291,7 +299,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 13 - addl %r13d, %ebx # e += RCONST + W[n] + leal 0x5A827999(%rbx,%r13), %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -302,7 +310,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 14 - addl %r14d, %eax # e += RCONST + W[n] + leal 0x5A827999(%rax,%r14), %eax # e += RCONST + W[n] shrq $32, %r14 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -314,7 +322,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 15 - addl %r14d, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%r14), %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -1467,11 +1475,6 @@ sha1_process_block64: ret .size sha1_process_block64, .-sha1_process_block64 - .section .rodata.cst16.bswap32_mask, "aM", @progbits, 16 - .balign 16 -bswap32_mask: - .octa 0x0c0d0e0f08090a0b0405060700010203 - .section .rodata.cst16.sha1const, "aM", @progbits, 16 .balign 16 sha1const: diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index f34e6e6fa..57e77b118 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -99,6 +99,30 @@ INTERLEAVE() { ) } +# movaps bswap32_mask(%rip), $xmmT1 +# Load W[] to xmm0..3, byteswapping on the fly. +# For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 +# for use in RD1As instead of spilling them to stack. +# (We use rsi instead of rN because this makes two +# ADDs in two first RD1As shorter by one byte). +# movups 16*0(%rdi), %xmm0 +# pshufb $xmmT1, %xmm0 #SSSE3 insn +# movaps %xmm0, $xmmT2 +# paddd $xmmRCONST, $xmmT2 +# movq $xmmT2, %rsi +# #pextrq \$1, $xmmT2, %r8 #SSE4.1 insn +# #movhpd $xmmT2, %r8 #can only move to mem, not to reg +# shufps \$0x0e, $xmmT2, $xmmT2 # have to use two-insn sequence +# movq $xmmT2, %r8 # instead +# ... +# +# ... +#- leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] +#+ addl %esi, %e$e # e += RCONST + W[n] +# ^^^^^^^^^^^^^^^^^^^^^^^^ +# The above is -97 bytes of code... +# ...but pshufb is a SSSE3 insn. Can't use it. + echo \ "### Generated by hash_md5_sha_x86-64.S.sh ### @@ -129,57 +153,65 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] + movaps sha1const(%rip), $xmmALLRCONST - movaps bswap32_mask(%rip), $xmmT1 pshufd \$0x00, $xmmALLRCONST, $xmmRCONST # Load W[] to xmm0..3, byteswapping on the fly. # - # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 + # For iterations 0..15, we pass W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. + # We lose parallelized addition of RCONST, but LEA + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # ADDs in two first RD1As shorter by one byte). 
- movups 16*0(%rdi), %xmm0 - pshufb $xmmT1, %xmm0 - movaps %xmm0, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %rsi -# pextrq \$1, $xmmT2, %r8 #SSE4.1 insn -# movhpd $xmmT2, %r8 #can only move to mem, not to reg - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r8 - - movups 16*1(%rdi), %xmm1 - pshufb $xmmT1, %xmm1 - movaps %xmm1, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %r9 - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r10 - - movups 16*2(%rdi), %xmm2 - pshufb $xmmT1, %xmm2 - movaps %xmm2, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %r11 - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r12 - - movups 16*3(%rdi), %xmm3 - pshufb $xmmT1, %xmm3 - movaps %xmm3, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %r13 - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r14 - - # MOVQs to GPRs (above) have somewhat high latency. - # Load hash[] while they are completing: - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] + # LEAs in two first RD1As shorter by one byte). + movq 4*0(%rdi), %rsi + movq 4*2(%rdi), %r8 + bswapq %rsi + bswapq %r8 + rolq \$32, %rsi # rsi = W[1]:W[0] + rolq \$32, %r8 # r8 = W[3]:W[2] + movq %rsi, %xmm0 + movq %r8, $xmmT1 + punpcklqdq $xmmT1, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) +# movaps %xmm0, $xmmT1 # add RCONST, spill to stack +# paddd $xmmRCONST, $xmmT1 +# movups $xmmT1, -64+16*0(%rsp) + + movq 4*4(%rdi), %r9 + movq 4*6(%rdi), %r10 + bswapq %r9 + bswapq %r10 + rolq \$32, %r9 # r9 = W[5]:W[4] + rolq \$32, %r10 # r10 = W[7]:W[6] + movq %r9, %xmm1 + movq %r10, $xmmT1 + punpcklqdq $xmmT1, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) + + movq 4*8(%rdi), %r11 + movq 4*10(%rdi), %r12 + bswapq %r11 + bswapq %r12 + rolq \$32, %r11 # r11 = W[9]:W[8] + rolq \$32, %r12 # r12 = W[11]:W[10] + movq %r11, %xmm2 + movq %r12, $xmmT1 + punpcklqdq $xmmT1, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) + + movq 4*12(%rdi), %r13 + movq 4*14(%rdi), %r14 + bswapq %r13 + bswapq %r14 + rolq \$32, %r13 # r13 = W[13]:W[12] + rolq \$32, %r14 # r14 = W[15]:W[14] + movq %r13, %xmm3 + movq %r14, $xmmT1 + punpcklqdq $xmmT1, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) " PREP() { @@ -258,15 +290,15 @@ local rN=$((7+n0/2)) echo " # $n ";test $n0 = 0 && echo " - addl %esi, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] shrq \$32, %rsi ";test $n0 = 1 && echo " - addl %esi, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] ";test $n0 -ge 2 && test $((n0 & 1)) = 0 && echo " - addl %r${rN}d, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] shrq \$32, %r$rN ";test $n0 -ge 2 && test $((n0 & 1)) = 1 && echo " - addl %r${rN}d, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] ";echo " movl %e$c, %edi # c xorl %e$d, %edi # ^d @@ -432,11 +464,6 @@ echo " ret .size sha1_process_block64, .-sha1_process_block64 - .section .rodata.cst16.bswap32_mask, \"aM\", @progbits, 16 - .balign 16 -bswap32_mask: - .octa 0x0c0d0e0f08090a0b0405060700010203 - .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 .balign 16 sha1const: diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index c13cdec07..794e97040 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -20,6 +20,11 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter +// pshufb is a 
SSSE3 insn. +// pinsrd, pextrd, extractps are SSE4.1 insns. +// We do not check SSSE3/SSE4.1 in cpuid, +// all SHA-capable CPUs support them as well. + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI From vda.linux at googlemail.com Fri Feb 11 22:03:27 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Fri, 11 Feb 2022 23:03:27 +0100 Subject: [git commit] whitespace fixes Message-ID: <20220211215609.0CC91831C4@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=1f272c06d02e7c7f0f3af1f97165722255c8828d branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-64.S | 8 ++++---- libbb/hash_md5_sha_x86-64.S.sh | 14 +++++++------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index f0daa30f6..1d55b91f8 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -71,8 +71,8 @@ sha1_process_block64: movq 4*10(%rdi), %r12 bswapq %r11 bswapq %r12 - rolq $32, %r11 # r11 = W[9]:W[8] - rolq $32, %r12 # r12 = W[11]:W[10] + rolq $32, %r11 # r11 = W[9]:W[8] + rolq $32, %r12 # r12 = W[11]:W[10] movq %r11, %xmm2 movq %r12, %xmm4 punpcklqdq %xmm4, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) @@ -81,8 +81,8 @@ sha1_process_block64: movq 4*14(%rdi), %r14 bswapq %r13 bswapq %r14 - rolq $32, %r13 # r13 = W[13]:W[12] - rolq $32, %r14 # r14 = W[15]:W[14] + rolq $32, %r13 # r13 = W[13]:W[12] + rolq $32, %r14 # r14 = W[15]:W[14] movq %r13, %xmm3 movq %r14, %xmm4 punpcklqdq %xmm4, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index 57e77b118..40c979d35 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -99,7 +99,7 @@ INTERLEAVE() { ) } -# movaps bswap32_mask(%rip), $xmmT1 +# movaps bswap32_mask(%rip), $xmmT1 # Load W[] to xmm0..3, byteswapping on the fly. # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. @@ -110,8 +110,8 @@ INTERLEAVE() { # movaps %xmm0, $xmmT2 # paddd $xmmRCONST, $xmmT2 # movq $xmmT2, %rsi -# #pextrq \$1, $xmmT2, %r8 #SSE4.1 insn -# #movhpd $xmmT2, %r8 #can only move to mem, not to reg +# #pextrq \$1, $xmmT2, %r8 #SSE4.1 insn +# #movhpd $xmmT2, %r8 #can only move to mem, not to reg # shufps \$0x0e, $xmmT2, $xmmT2 # have to use two-insn sequence # movq $xmmT2, %r8 # instead # ... 
@@ -197,8 +197,8 @@ sha1_process_block64: movq 4*10(%rdi), %r12 bswapq %r11 bswapq %r12 - rolq \$32, %r11 # r11 = W[9]:W[8] - rolq \$32, %r12 # r12 = W[11]:W[10] + rolq \$32, %r11 # r11 = W[9]:W[8] + rolq \$32, %r12 # r12 = W[11]:W[10] movq %r11, %xmm2 movq %r12, $xmmT1 punpcklqdq $xmmT1, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) @@ -207,8 +207,8 @@ sha1_process_block64: movq 4*14(%rdi), %r14 bswapq %r13 bswapq %r14 - rolq \$32, %r13 # r13 = W[13]:W[12] - rolq \$32, %r14 # r14 = W[15]:W[14] + rolq \$32, %r13 # r13 = W[13]:W[12] + rolq \$32, %r14 # r14 = W[15]:W[14] movq %r13, %xmm3 movq %r14, $xmmT1 punpcklqdq $xmmT1, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) From bugzilla at busybox.net Fri Feb 11 22:39:50 2022 From: bugzilla at busybox.net (bugzilla at busybox.net) Date: Fri, 11 Feb 2022 22:39:50 +0000 Subject: [Bug 14586] lsof missing from command description page In-Reply-To: References: Message-ID: https://bugs.busybox.net/show_bug.cgi?id=14586 Mike Frysinger changed: What |Removed |Added ---------------------------------------------------------------------------- CC|vapier at gentoo.org |busybox-cvs at busybox.net Assignee|unassigned at buildroot.uclibc |unassigned at busybox.net |.org | Component|Website |Website Product|Infrastructure |Busybox -- You are receiving this mail because: You are on the CC list for the bug. From vda.linux at googlemail.com Fri Feb 11 23:52:12 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sat, 12 Feb 2022 00:52:12 +0100 Subject: [git commit] libbb/sha256: explicitly use sha256rnds2's %xmm0 (MSG) argument Message-ID: <20220211234704.AFEFE82DB5@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=c2e7780e526b0f421c3b43367a53019d1dc5f2d6 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Else, the code seemingly does not use MSG. 
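sha256rnds2 reads its two W[i]+K[i] dwords from %xmm0 as a fixed implicit operand, so spelling MSG out in the operand list only makes that data flow visible in the source. For reference, a rough scalar model of the two SHA-256 rounds each sha256rnds2 executes, following the FIPS 180-4 round definitions (an illustrative sketch, not code from the busybox tree):

#include <stdint.h>

/* Two SHA-256 rounds, as performed by one sha256rnds2.
 * s[0..7] = the a..h working variables; wk0, wk1 are the W[i]+K[i]
 * dwords the CPU takes from the implicit %xmm0 operand.
 * Illustrative sketch per FIPS 180-4.
 */
static uint32_t rotr(uint32_t x, int n)
{
	return (x >> n) | (x << (32 - n));
}

static void sha256_two_rounds(uint32_t s[8], uint32_t wk0, uint32_t wk1)
{
	uint32_t wk[2] = { wk0, wk1 };
	int i;
	for (i = 0; i < 2; i++) {
		uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
		uint32_t e = s[4], f = s[5], g = s[6], h = s[7];
		uint32_t ch   = (e & f) ^ (~e & g);
		uint32_t maj  = (a & b) ^ (a & c) ^ (b & c);
		uint32_t sum1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25);
		uint32_t sum0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22);
		uint32_t t1 = h + sum1 + ch + wk[i];
		uint32_t t2 = sum0 + maj;
		s[7] = g; s[6] = f; s[5] = e; s[4] = d + t1;
		s[3] = c; s[2] = b; s[1] = a; s[0] = t1 + t2;
	}
}

This is why the assembly adds the round constants into MSG with paddd and then shuffles dwords 2,3 down with shuf128_32 $0x0E before the second sha256rnds2: each invocation consumes only the low two wk dwords of %xmm0.
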
Signed-off-by: Denys Vlasenko
---
 libbb/hash_md5_sha256_x86-32_shaNI.S | 64 +++++++++++++++---------------
 libbb/hash_md5_sha256_x86-64_shaNI.S | 76 ++++++++++++++++++------------------
 2 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S
index c059fb18d..3905bad9a 100644
--- a/libbb/hash_md5_sha256_x86-32_shaNI.S
+++ b/libbb/hash_md5_sha256_x86-32_shaNI.S
@@ -60,18 +60,18 @@ sha256_process_block64_shaNI:
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP0
        paddd   0*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 4-7 */
        movu128 1*16(DATA_PTR), MSG
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP1
        paddd   1*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 8-11 */
@@ -79,9 +79,9 @@ sha256_process_block64_shaNI:
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP2
        paddd   2*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 12-15 */
        movu128 3*16(DATA_PTR), MSG
@@ -90,151 +90,151 @@
        /* ...to here */
        mova128 MSG, MSGTMP3
        paddd   3*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 16-19 */
        mova128 MSGTMP0, MSG
        paddd   4*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 20-23 */
        mova128 MSGTMP1, MSG
        paddd   5*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 24-27 */
        mova128 MSGTMP2, MSG
        paddd   6*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 28-31 */
        mova128 MSGTMP3, MSG
        paddd   7*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 32-35 */
        mova128 MSGTMP0, MSG
        paddd   8*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 36-39 */
        mova128 MSGTMP1, MSG
        paddd   9*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 40-43 */
        mova128 MSGTMP2, MSG
        paddd   10*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 44-47 */
        mova128 MSGTMP3, MSG
        paddd   11*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 48-51 */
        mova128 MSGTMP0, MSG
        paddd   12*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 52-55 */
        mova128 MSGTMP1, MSG
        paddd   13*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 56-59 */
        mova128 MSGTMP2, MSG
        paddd   14*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 60-63 */
        mova128 MSGTMP3, MSG
        paddd   15*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Write hash values back in the correct order */
        mova128 STATE0, XMMTMP
diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S
index 9578441f8..082ceafe4 100644
--- a/libbb/hash_md5_sha256_x86-64_shaNI.S
+++ b/libbb/hash_md5_sha256_x86-64_shaNI.S
@@ -38,8 +38,8 @@
 #define XMMTMP %xmm7

-#define ABEF_SAVE %xmm9
-#define CDGH_SAVE %xmm10
+#define SAVE0 %xmm8
+#define SAVE1 %xmm9

 #define SHUF(a,b,c,d) $(a+(b<<2)+(c<<4)+(d<<6))
@@ -59,26 +59,26 @@ sha256_process_block64_shaNI:
        leaq    K256+8*16(%rip), SHA256CONSTANTS

        /* Save hash values for addition after rounds */
-       mova128 STATE0, ABEF_SAVE
-       mova128 STATE1, CDGH_SAVE
+       mova128 STATE0, SAVE0
+       mova128 STATE1, SAVE1

        /* Rounds 0-3 */
        movu128 0*16(DATA_PTR), MSG
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP0
        paddd   0*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 4-7 */
        movu128 1*16(DATA_PTR), MSG
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP1
        paddd   1*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 8-11 */
@@ -86,9 +86,9 @@ sha256_process_block64_shaNI:
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP2
        paddd   2*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 12-15 */
@@ -97,155 +97,155 @@
        /* ...to here */
        mova128 MSG, MSGTMP3
        paddd   3*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 16-19 */
        mova128 MSGTMP0, MSG
        paddd   4*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 20-23 */
        mova128 MSGTMP1, MSG
        paddd   5*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 24-27 */
        mova128 MSGTMP2, MSG
        paddd   6*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 28-31 */
        mova128 MSGTMP3, MSG
        paddd   7*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 32-35 */
        mova128 MSGTMP0, MSG
        paddd   8*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 36-39 */
        mova128 MSGTMP1, MSG
        paddd   9*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 40-43 */
        mova128 MSGTMP2, MSG
        paddd   10*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 44-47 */
        mova128 MSGTMP3, MSG
        paddd   11*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 48-51 */
        mova128 MSGTMP0, MSG
        paddd   12*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 52-55 */
        mova128 MSGTMP1, MSG
        paddd   13*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 56-59 */
        mova128 MSGTMP2, MSG
        paddd   14*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 60-63 */
        mova128 MSGTMP3, MSG
        paddd   15*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Add current hash values with previously saved */
-       paddd   ABEF_SAVE, STATE0
-       paddd   CDGH_SAVE, STATE1
+       paddd   SAVE0, STATE0
+       paddd   SAVE1, STATE1

        /* Write hash values back in the correct order */
        mova128 STATE0, XMMTMP
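A note on the commit above: per the Intel SDM, sha256rnds2 reads %xmm0 as an
implicit third source (its low two dwords supply the message-plus-constant
inputs for the two rounds being computed, which is why each pair of rounds is
followed by shuf128_32 to rotate the high half of MSG down). In these files
MSG is defined to be %xmm0, which is the only reason the explicit three-operand
spelling assembles. The following stand-alone sketch is illustrative only (it
is not from the commit; the register assignments are hypothetical, and it
assumes a GNU assembler new enough to know the SHA extension). Both spellings
encode the same instruction, but only the second makes the %xmm0 data flow
visible in the source:

        /* hypothetical .S fragment, mirroring the #define convention
         * of hash_md5_sha256_x86-64_shaNI.S; run through cpp as usual */
        #define MSG     %xmm0
        #define STATE0  %xmm1
        #define STATE1  %xmm2

        sha256rnds2 STATE0, STATE1          /* implicit: %xmm0 is a hidden source */
        sha256rnds2 MSG, STATE0, STATE1     /* explicit: same encoding */

A useful side effect of the explicit form: the assembler rejects it unless the
first operand really is %xmm0, so the MSG-must-be-%xmm0 assumption is checked
at build time instead of being invisible.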
From vda.linux at googlemail.com Fri Feb 18 16:09:51 2022
From: vda.linux at googlemail.com (Denys Vlasenko)
Date: Fri, 18 Feb 2022 17:09:51 +0100
Subject: [git commit] libbb/sha1: update config help text with new performance numbers
Message-ID: <20220218161429.8F9DB813EE@busybox.osuosl.org>

commit: https://git.busybox.net/busybox/commit/?id=1891fdda59092a215d3a407d9108bbbe6ab8df7a
branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master

Signed-off-by: Denys Vlasenko
---
 libbb/Config.src | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/libbb/Config.src b/libbb/Config.src
index 0ecd5bd46..66a3ffa23 100644
--- a/libbb/Config.src
+++ b/libbb/Config.src
@@ -57,11 +57,12 @@ config SHA1_SMALL
        range 0 3
        help
        Trade binary size versus speed for the sha1 algorithm.
+       With FEATURE_COPYBUF_KB=64:
        throughput MB/s                size of sha1_process_block64
        value   486    x86-64          486     x86-64
-       0       367    375             3657    3502
-       1       224    229             654     732
-       2,3     200    195             358     380
+       0       440    485             3481    3502
+       1       265    265             641     696
+       2,3     220    210             342     364

 config SHA1_HWACCEL
        bool "SHA1: Use hardware accelerated instructions if possible"

From bugzilla at busybox.net Sun Feb 20 04:57:25 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Sun, 20 Feb 2022 04:57:25 +0000
Subject: [Bug 14576] unzip: test skipped with bad archive
In-Reply-To: 
References: 
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14576

--- Comment #1 from dharan ---
Hi Team,

Can you please share an update on the SKIPPED test case reported above?

SKIPPED: unzip (bad archive)

Regards,
-Dharan

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net Wed Feb 23 16:19:45 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Wed, 23 Feb 2022 16:19:45 +0000
Subject: [Bug 11736] KCONFIG_ALLCONFIG does not apply passed config (regression in 0b1c62934)
In-Reply-To: 
References: 
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=11736

--- Comment #1 from Axel Fontaine ---
This issue is still present in the latest release. Is there any workaround?

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net Mon Feb 28 19:15:08 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Mon, 28 Feb 2022 19:15:08 +0000
Subject: [Bug 14616] New: Printf format code and data type do not match in taskset
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14616

            Bug ID: 14616
           Summary: Printf format code and data type do not match in
                    taskset
           Product: Busybox
           Version: 1.33.x
          Hardware: Other
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
          Assignee: unassigned at busybox.net
          Reporter: pdvb at yahoo.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

The following code uses an (unsigned long) "%lx" format code, but passes an
(unsigned long long) value to printf. The result is that on architectures
which use 32 bits for (unsigned long) and 64 bits for (unsigned long long),
the printf produces incorrect output.

#define TASKSET_PRINTF_MASK "%lx"

static unsigned long long from_mask(ul *mask, unsigned sz_in_bytes UNUSED_PARAM)
{
        return *mask;
}

This was broken by commit ef0e76cc on 1/29/2017.

The quick fix is to define the function as:

static unsigned long from_mask()

-- 
You are receiving this mail because:
You are on the CC list for the bug.
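The mismatch described in the report is easy to reproduce outside of busybox.
The program below is a minimal stand-alone sketch, not the actual taskset
source: the ul typedef and the from_mask_* names are hypothetical and merely
mirror the shapes quoted in the bug. On an ILP32 target the "%lx" conversion
consumes only 32 of the 64 bits that the (unsigned long long) argument placed
in the variadic list, so the output (and any conversions after it) is wrong;
on LP64 it happens to work because both types are 64-bit, which is why the bug
only shows on 32-bit builds.

#include <stdio.h>

typedef unsigned long ul;

#define TASKSET_PRINTF_MASK "%lx"

/* Buggy shape: 64-bit return value, 32-bit "%lx" conversion on ILP32 */
static unsigned long long from_mask_bad(ul *mask)
{
        return *mask;
}

/* Fixed shape, per the report: return type now matches "%lx" */
static unsigned long from_mask_fixed(ul *mask)
{
        return *mask;
}

int main(void)
{
        ul mask = 0xdeadbeef;

        /* Undefined behavior on 32-bit: argument/format size mismatch */
        printf(TASKSET_PRINTF_MASK "\n", from_mask_bad(&mask));

        /* Well-defined everywhere: prints deadbeef */
        printf(TASKSET_PRINTF_MASK "\n", from_mask_fixed(&mask));
        return 0;
}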