From bugzilla at busybox.net Tue Feb 1 18:48:38 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 01 Feb 2022 18:48:38 +0000
Subject: [Bug 14041] New: using modprobe/insmod with compressed modules gives scary kernel warnings
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14041

            Bug ID: 14041
           Summary: using modprobe/insmod with compressed modules gives
                    scary kernel warnings
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
          Assignee: unassigned at busybox.net
          Reporter: nolange79 at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

I am using kernel 5.4 on x86_64 for an embedded system.
Loading a compressed module gives this error in the kernel log:

kernel: Module has invalid ELF structures

Steps to reproduce are simply:

# busybox insmod nbd.ko.gz

Some points:
- this happens with both gzip and xz
- the util-linux insmod/modprobe work without a log entry
- the module seems to be loaded correctly and works
- decompressing the module (with the same busybox executable) and then
  loading it produces no log entry:

# (after unloading the module again)
# busybox gzip -d nbd.ko.gz
# busybox insmod nbd.ko

I don't know if there is any functional issue, but I am tempted to raise
the severity since I can't rule one out either.

--- Comment #1 from sylvain.prat at gmail.com ---
I also ran into this problem. I was wondering how Alpine Linux gets along
with compressed kernel modules, and I finally found out that they don't
use busybox's modprobe anymore...

Since the decompression methods already exist in busybox, it shouldn't be
too hard to implement, I guess, but I'm not competent enough to do it
myself.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net Tue Feb 1 20:16:08 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 01 Feb 2022 20:16:08 +0000
Subject: [Bug 14541] sed: s-command with "semi-special" delimiters get wrong behaviour
In-Reply-To: 
References: 
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14541

Christoph Anton Mitterer changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
        Resolution|---         |FIXED
            Status|REOPENED    |RESOLVED

--- Comment #4 from Christoph Anton Mitterer ---
I hadn't seen the 2nd commit, f12fb1e4092900f26f7f8c71cde44b1cd7d26439,
when testing. That also fixes the case from comment #3.

Now, BusyBox sed seems to behave identically to GNU sed in all the cases
I had given in:
https://www.austingroupbugs.net/view.php?id=1551#c5612

In particular, it also seems to consider "un-delimitered" delimiters that
are also special characters as "still special" (at least I tried that
with '.') - which, while IMO not clearly defined by POSIX, is identical
to the behaviour of GNU sed; see
https://www.austingroupbugs.net/view.php?id=1551#c5648 for test cases.

Thus closing again. Thanks.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
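A concrete sketch of the "still special" behaviour described above, with
'.' as the s-command delimiter (illustrative command only; it assumes the
GNU-sed-compatible handling the comment reports, not independently
verified here):

  $ echo ab | sed 's.\..X.'
  Xb

The escaped delimiter '\.' reaches the regex engine as a bare '.', which
keeps its "match any character" meaning and so matches the 'a', rather
than requiring a literal dot in the input.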
From bugzilla at busybox.net Wed Feb 2 05:11:03 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Wed, 02 Feb 2022 05:11:03 +0000
Subject: [Bug 14566] New: ifupdown: Document supported stanzas for interfaces file
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14566

            Bug ID: 14566
           Summary: ifupdown: Document supported stanzas for interfaces
                    file
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Networking
          Assignee: unassigned at busybox.net
          Reporter: michael at cassaniti.id.au
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

Hi,

First, thank you so much for Busybox. It makes my life very easy, I must
say. I'm using Busybox 1.33.1 under Alpine Linux 3.14. The current
configuration should be at this URL:
https://git.alpinelinux.org/aports/tree/main/busybox/busyboxconfig?id=1aa6700d1e4ef810f2319506e48a8b5316d17abe

I've read the man pages for interfaces from these URLs, and they don't
all agree on the supported stanzas:
- https://salsa.debian.org/debian/ifupdown/-/raw/19052e2ecb0a908428813b5bc25d5bd0283c5a18/interfaces.5.pre
- https://manpages.org/etc-network-interfaces/5
- https://www.systutorials.com/docs/linux/man/5-interfaces/

I'm likely not the only one to be confused about which stanzas Busybox
does and does not support. I read the source code and did my best to
determine what is supported. This documentation would cover what is
__natively__ supported, since additional scripts essentially allow
extending the syntax.

So far I have found that the following directives are not supported, and
I assume I've missed some:
- rename
- inherits
- allow- stanzas (e.g.: allow-hotplug)
- no-auto-down
- no-scripts
- description
- template
- source-dir is supported, but not source-directory

Please note I'm not requesting that any of these stanzas be supported as
part of this bug. It would personally be nice to have rename supported,
but I can understand why the others mentioned above are not included.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
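For orientation, a minimal /etc/network/interfaces sketch restricted to
stanzas BusyBox ifupdown is reported above to parse natively (auto, iface
with the loopback/static methods, and up/down hooks); the interface
names and addresses are illustrative only:

  auto lo
  iface lo inet loopback

  auto eth0
  iface eth0 inet static
          address 192.168.1.10
          netmask 255.255.255.0
          gateway 192.168.1.1
          up echo "eth0 configured" > /dev/console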
From vda.linux at googlemail.com Thu Feb 3 13:58:02 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 3 Feb 2022 14:58:02 +0100 Subject: [git commit] libbb/sha256: optional x86 hardware accelerated hashing Message-ID: <20220203135206.237C982911@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=6472ac942898437e040171cec991de1c0b962f72 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master 64 bit: function old new delta sha256_process_block64_shaNI - 730 +730 .rodata 108314 108586 +272 sha256_begin 31 83 +52 ------------------------------------------------------------------------------ (add/remove: 5/1 grow/shrink: 2/0 up/down: 1055/-1) Total: 1054 bytes 32 bit: function old new delta sha256_process_block64_shaNI - 747 +747 .rodata 104318 104590 +272 sha256_begin 29 84 +55 ------------------------------------------------------------------------------ (add/remove: 5/1 grow/shrink: 2/0 up/down: 1075/-1) Total: 1074 bytes Signed-off-by: Denys Vlasenko --- libbb/Config.src | 6 + libbb/Kbuild.src | 2 + libbb/hash_md5_sha.c | 54 ++++--- libbb/hash_md5_sha256_x86-32_shaNI.S | 283 +++++++++++++++++++++++++++++++++++ libbb/hash_md5_sha256_x86-64_shaNI.S | 281 ++++++++++++++++++++++++++++++++++ libbb/hash_md5_sha_x86-32_shaNI.S | 4 +- libbb/hash_md5_sha_x86-64.S | 2 +- libbb/hash_md5_sha_x86-64.S.sh | 2 +- libbb/hash_md5_sha_x86-64_shaNI.S | 4 +- 9 files changed, 612 insertions(+), 26 deletions(-) diff --git a/libbb/Config.src b/libbb/Config.src index 708d3b0c8..0ecd5bd46 100644 --- a/libbb/Config.src +++ b/libbb/Config.src @@ -70,6 +70,12 @@ config SHA1_HWACCEL On x86, this adds ~590 bytes of code. Throughput is about twice as fast as fully-unrolled generic code. +config SHA256_HWACCEL + bool "SHA256: Use hardware accelerated instructions if possible" + default y + help + On x86, this adds ~1k bytes of code. + config SHA3_SMALL int "SHA3: Trade bytes for speed (0:fast, 1:slow)" default 1 # all "fast or small" options default to small diff --git a/libbb/Kbuild.src b/libbb/Kbuild.src index b9d34de8e..653025e56 100644 --- a/libbb/Kbuild.src +++ b/libbb/Kbuild.src @@ -59,6 +59,8 @@ lib-y += hash_md5_sha.o lib-y += hash_md5_sha_x86-64.o lib-y += hash_md5_sha_x86-64_shaNI.o lib-y += hash_md5_sha_x86-32_shaNI.o +lib-y += hash_md5_sha256_x86-64_shaNI.o +lib-y += hash_md5_sha256_x86-32_shaNI.o # Alternative (disabled) MD5 implementation #lib-y += hash_md5prime.o lib-y += messages.o diff --git a/libbb/hash_md5_sha.c b/libbb/hash_md5_sha.c index a23db5152..880ffab01 100644 --- a/libbb/hash_md5_sha.c +++ b/libbb/hash_md5_sha.c @@ -13,6 +13,27 @@ #define NEED_SHA512 (ENABLE_SHA512SUM || ENABLE_USE_BB_CRYPT_SHA) +#if ENABLE_SHA1_HWACCEL || ENABLE_SHA256_HWACCEL +# if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) +static void cpuid(unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx) +{ + asm ("cpuid" + : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx) + : "0"(*eax), "1"(*ebx), "2"(*ecx), "3"(*edx) + ); +} +static smallint shaNI; +void FAST_FUNC sha1_process_block64_shaNI(sha1_ctx_t *ctx); +void FAST_FUNC sha256_process_block64_shaNI(sha256_ctx_t *ctx); +# if defined(__i386__) +struct ASM_expects_76_shaNI { char t[1 - 2*(offsetof(sha256_ctx_t, hash) != 76)]; }; +# endif +# if defined(__x86_64__) +struct ASM_expects_80_shaNI { char t[1 - 2*(offsetof(sha256_ctx_t, hash) != 80)]; }; +# endif +# endif +#endif + /* gcc 4.2.1 optimizes rotr64 better with inline than with macro * (for rotX32, there is no difference). Why? 
My guess is that * macro requires clever common subexpression elimination heuristics @@ -1142,25 +1163,6 @@ static void FAST_FUNC sha512_process_block128(sha512_ctx_t *ctx) } #endif /* NEED_SHA512 */ -#if ENABLE_SHA1_HWACCEL -# if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) -static void cpuid(unsigned *eax, unsigned *ebx, unsigned *ecx, unsigned *edx) -{ - asm ("cpuid" - : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx) - : "0"(*eax), "1"(*ebx), "2"(*ecx), "3"(*edx) - ); -} -void FAST_FUNC sha1_process_block64_shaNI(sha1_ctx_t *ctx); -# if defined(__i386__) -struct ASM_expects_76_shaNI { char t[1 - 2*(offsetof(sha1_ctx_t, hash) != 76)]; }; -# endif -# if defined(__x86_64__) -struct ASM_expects_80_shaNI { char t[1 - 2*(offsetof(sha1_ctx_t, hash) != 80)]; }; -# endif -# endif -#endif - void FAST_FUNC sha1_begin(sha1_ctx_t *ctx) { ctx->hash[0] = 0x67452301; @@ -1173,7 +1175,6 @@ void FAST_FUNC sha1_begin(sha1_ctx_t *ctx) #if ENABLE_SHA1_HWACCEL # if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) { - static smallint shaNI; if (!shaNI) { unsigned eax = 7, ebx = ebx, ecx = 0, edx = edx; cpuid(&eax, &ebx, &ecx, &edx); @@ -1225,6 +1226,19 @@ void FAST_FUNC sha256_begin(sha256_ctx_t *ctx) memcpy(&ctx->total64, init256, sizeof(init256)); /*ctx->total64 = 0; - done by prepending two 32-bit zeros to init256 */ ctx->process_block = sha256_process_block64; +#if ENABLE_SHA256_HWACCEL +# if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__)) + { + if (!shaNI) { + unsigned eax = 7, ebx = ebx, ecx = 0, edx = edx; + cpuid(&eax, &ebx, &ecx, &edx); + shaNI = ((ebx >> 29) << 1) - 1; + } + if (shaNI > 0) + ctx->process_block = sha256_process_block64_shaNI; + } +# endif +#endif } #if NEED_SHA512 diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S new file mode 100644 index 000000000..56e37fa38 --- /dev/null +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -0,0 +1,283 @@ +#if ENABLE_SHA256_HWACCEL && defined(__GNUC__) && defined(__i386__) +/* The code is adapted from Linux kernel's source */ + +// We use shorter insns, even though they are for "wrong" +// data type (fp, not int). +// For Intel, there is no penalty for doing it at all +// (CPUs which do have such penalty do not support SHA1 insns). +// For AMD, the penalty is one extra cycle +// (allegedly: I failed to find measurable difference). 
+ +//#define mova128 movdqa +#define mova128 movaps +//#define movu128 movdqu +#define movu128 movups +//#define shuf128_32 pshufd +#define shuf128_32 shufps + + .section .text.sha256_process_block64_shaNI, "ax", @progbits + .globl sha256_process_block64_shaNI + .hidden sha256_process_block64_shaNI + .type sha256_process_block64_shaNI, @function + +#define DATA_PTR %eax + +#define SHA256CONSTANTS %ecx + +#define MSG %xmm0 +#define STATE0 %xmm1 +#define STATE1 %xmm2 +#define MSGTMP0 %xmm3 +#define MSGTMP1 %xmm4 +#define MSGTMP2 %xmm5 +#define MSGTMP3 %xmm6 +#define MSGTMP4 %xmm7 + + .balign 8 # allow decoders to fetch at least 3 first insns +sha256_process_block64_shaNI: + pushl %ebp + movl %esp, %ebp + subl $32, %esp + andl $~0xF, %esp # paddd needs aligned memory operand + + movu128 76+0*16(%eax), STATE0 + movu128 76+1*16(%eax), STATE1 + + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, MSGTMP4 + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + +# mova128 PSHUFFLE_BSWAP32_FLIP_MASK, SHUF_MASK + lea K256, SHA256CONSTANTS + + /* Save hash values for addition after rounds */ + mova128 STATE0, 0*16(%esp) + mova128 STATE1, 1*16(%esp) + + /* Rounds 0-3 */ + movu128 0*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP0 + paddd 0*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 4-7 */ + movu128 1*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP1 + paddd 1*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 8-11 */ + movu128 2*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP2 + paddd 2*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 12-15 */ + movu128 3*16(DATA_PTR), MSG + pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + mova128 MSG, MSGTMP3 + paddd 3*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 16-19 */ + mova128 MSGTMP0, MSG + paddd 4*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 20-23 */ + mova128 MSGTMP1, MSG + paddd 5*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 24-27 */ + mova128 MSGTMP2, MSG + paddd 6*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 28-31 */ + mova128 MSGTMP3, MSG + paddd 7*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + 
shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 32-35 */ + mova128 MSGTMP0, MSG + paddd 8*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 36-39 */ + mova128 MSGTMP1, MSG + paddd 9*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 40-43 */ + mova128 MSGTMP2, MSG + paddd 10*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 44-47 */ + mova128 MSGTMP3, MSG + paddd 11*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 48-51 */ + mova128 MSGTMP0, MSG + paddd 12*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 52-55 */ + mova128 MSGTMP1, MSG + paddd 13*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 56-59 */ + mova128 MSGTMP2, MSG + paddd 14*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 60-63 */ + mova128 MSGTMP3, MSG + paddd 15*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Add current hash values with previously saved */ + paddd 0*16(%esp), STATE0 + paddd 1*16(%esp), STATE1 + + /* Write hash values back in the correct order */ + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + mova128 STATE0, MSGTMP4 + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, MSGTMP4, STATE1 /* HGFE */ + + movu128 STATE0, 76+0*16(%eax) + movu128 STATE1, 76+1*16(%eax) + + movl %ebp, %esp + popl %ebp + ret + .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI + +.section .rodata.cst256.K256, "aM", @progbits, 256 +.balign 16 +K256: + .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 + .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 + .long 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3 + .long 0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174 + .long 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc + .long 0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da + .long 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7 + .long 0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967 + .long 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13 + .long 0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85 + .long 
0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3 + .long 0xd192e819,0xd6990624,0xf40e3585,0x106aa070 + .long 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5 + .long 0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3 + .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 + .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 + +.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 +.balign 16 +PSHUFFLE_BSWAP32_FLIP_MASK: + .octa 0x0c0d0e0f08090a0b0405060700010203 + +#endif diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S new file mode 100644 index 000000000..1c2b75af3 --- /dev/null +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -0,0 +1,281 @@ +#if ENABLE_SHA256_HWACCEL && defined(__GNUC__) && defined(__x86_64__) +/* The code is adapted from Linux kernel's source */ + +// We use shorter insns, even though they are for "wrong" +// data type (fp, not int). +// For Intel, there is no penalty for doing it at all +// (CPUs which do have such penalty do not support SHA1 insns). +// For AMD, the penalty is one extra cycle +// (allegedly: I failed to find measurable difference). + +//#define mova128 movdqa +#define mova128 movaps +//#define movu128 movdqu +#define movu128 movups +//#define shuf128_32 pshufd +#define shuf128_32 shufps + + .section .text.sha256_process_block64_shaNI, "ax", @progbits + .globl sha256_process_block64_shaNI + .hidden sha256_process_block64_shaNI + .type sha256_process_block64_shaNI, @function + +#define DATA_PTR %rdi + +#define SHA256CONSTANTS %rax + +#define MSG %xmm0 +#define STATE0 %xmm1 +#define STATE1 %xmm2 +#define MSGTMP0 %xmm3 +#define MSGTMP1 %xmm4 +#define MSGTMP2 %xmm5 +#define MSGTMP3 %xmm6 +#define MSGTMP4 %xmm7 + +#define SHUF_MASK %xmm8 + +#define ABEF_SAVE %xmm9 +#define CDGH_SAVE %xmm10 + + .balign 8 # allow decoders to fetch at least 2 first insns +sha256_process_block64_shaNI: + movu128 80+0*16(%rdi), STATE0 + movu128 80+1*16(%rdi), STATE1 + + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, MSGTMP4 + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + + mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), SHUF_MASK + lea K256(%rip), SHA256CONSTANTS + + /* Save hash values for addition after rounds */ + mova128 STATE0, ABEF_SAVE + mova128 STATE1, CDGH_SAVE + + /* Rounds 0-3 */ + movu128 0*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP0 + paddd 0*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 4-7 */ + movu128 1*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP1 + paddd 1*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 8-11 */ + movu128 2*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP2 + paddd 2*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 12-15 */ + movu128 3*16(DATA_PTR), MSG + pshufb SHUF_MASK, MSG + mova128 MSG, MSGTMP3 + paddd 3*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 16-19 */ + mova128 MSGTMP0, MSG + paddd 4*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 
MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 20-23 */ + mova128 MSGTMP1, MSG + paddd 5*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 24-27 */ + mova128 MSGTMP2, MSG + paddd 6*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 28-31 */ + mova128 MSGTMP3, MSG + paddd 7*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 32-35 */ + mova128 MSGTMP0, MSG + paddd 8*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 36-39 */ + mova128 MSGTMP1, MSG + paddd 9*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP1, MSGTMP0 + + /* Rounds 40-43 */ + mova128 MSGTMP2, MSG + paddd 10*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP2, MSGTMP1 + + /* Rounds 44-47 */ + mova128 MSGTMP3, MSG + paddd 11*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP3, MSGTMP4 + palignr $4, MSGTMP2, MSGTMP4 + paddd MSGTMP4, MSGTMP0 + sha256msg2 MSGTMP3, MSGTMP0 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP3, MSGTMP2 + + /* Rounds 48-51 */ + mova128 MSGTMP0, MSG + paddd 12*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP0, MSGTMP4 + palignr $4, MSGTMP3, MSGTMP4 + paddd MSGTMP4, MSGTMP1 + sha256msg2 MSGTMP0, MSGTMP1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + sha256msg1 MSGTMP0, MSGTMP3 + + /* Rounds 52-55 */ + mova128 MSGTMP1, MSG + paddd 13*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP1, MSGTMP4 + palignr $4, MSGTMP0, MSGTMP4 + paddd MSGTMP4, MSGTMP2 + sha256msg2 MSGTMP1, MSGTMP2 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 56-59 */ + mova128 MSGTMP2, MSG + paddd 14*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + mova128 MSGTMP2, MSGTMP4 + palignr $4, MSGTMP1, MSGTMP4 + paddd MSGTMP4, MSGTMP3 + sha256msg2 MSGTMP2, MSGTMP3 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Rounds 60-63 */ + mova128 MSGTMP3, MSG + paddd 15*16(SHA256CONSTANTS), MSG + sha256rnds2 STATE0, STATE1 + shuf128_32 $0x0E, MSG, MSG + sha256rnds2 STATE1, STATE0 + + /* Add current hash values with previously saved */ + paddd ABEF_SAVE, STATE0 + paddd CDGH_SAVE, STATE1 + + /* Write hash values back 
in the correct order */ + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + mova128 STATE0, MSGTMP4 + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, MSGTMP4, STATE1 /* HGFE */ + + movu128 STATE0, 80+0*16(%rdi) + movu128 STATE1, 80+1*16(%rdi) + + ret + .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI + +.section .rodata.cst256.K256, "aM", @progbits, 256 +.balign 16 +K256: + .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 + .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 + .long 0xd807aa98,0x12835b01,0x243185be,0x550c7dc3 + .long 0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174 + .long 0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc + .long 0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da + .long 0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7 + .long 0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967 + .long 0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13 + .long 0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85 + .long 0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3 + .long 0xd192e819,0xd6990624,0xf40e3585,0x106aa070 + .long 0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5 + .long 0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3 + .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 + .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 + +.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 +.balign 16 +PSHUFFLE_BSWAP32_FLIP_MASK: + .octa 0x0c0d0e0f08090a0b0405060700010203 + +#endif diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 166cfd38a..11b855e26 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -20,7 +20,7 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter - .section .text.sha1_process_block64_shaNI,"ax", at progbits + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI .type sha1_process_block64_shaNI, @function @@ -224,7 +224,7 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.align 16 +.balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 743269d98..47ace60de 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -1394,7 +1394,7 @@ sha1_process_block64: .size sha1_process_block64, .-sha1_process_block64 .section .rodata.cst16.sha1const, "aM", @progbits, 16 - .align 16 + .balign 16 rconst0x5A827999: .long 0x5A827999 .long 0x5A827999 diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index 47c40af0d..656fb5414 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -433,7 +433,7 @@ echo " .size sha1_process_block64, .-sha1_process_block64 .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 - .align 16 + .balign 16 rconst0x5A827999: .long 0x5A827999 .long 0x5A827999 diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index 33cc3bf7f..ba92f09df 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -20,7 +20,7 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter - .section .text.sha1_process_block64_shaNI,"ax", at progbits + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI .type sha1_process_block64_shaNI, @function @@ 
-218,7 +218,7 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.align 16 +.balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f From vda.linux at googlemail.com Thu Feb 3 14:11:23 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 3 Feb 2022 15:11:23 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220203140444.EF21982A68@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=de6cb4bed82356db72af81890c7c26d7e85fb50d branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 747 722 -25 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 35 +++++++++++++++++------------------ 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 56e37fa38..632dab7e6 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -49,8 +49,7 @@ sha256_process_block64_shaNI: palignr $8, STATE1, STATE0 /* ABEF */ pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ -# mova128 PSHUFFLE_BSWAP32_FLIP_MASK, SHUF_MASK - lea K256, SHA256CONSTANTS + movl $K256+8*16, SHA256CONSTANTS /* Save hash values for addition after rounds */ mova128 STATE0, 0*16(%esp) @@ -60,7 +59,7 @@ sha256_process_block64_shaNI: movu128 0*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP0 - paddd 0*16(SHA256CONSTANTS), MSG + paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -69,7 +68,7 @@ sha256_process_block64_shaNI: movu128 1*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP1 - paddd 1*16(SHA256CONSTANTS), MSG + paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -79,7 +78,7 @@ sha256_process_block64_shaNI: movu128 2*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP2 - paddd 2*16(SHA256CONSTANTS), MSG + paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -89,7 +88,7 @@ sha256_process_block64_shaNI: movu128 3*16(DATA_PTR), MSG pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG mova128 MSG, MSGTMP3 - paddd 3*16(SHA256CONSTANTS), MSG + paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -101,7 +100,7 @@ sha256_process_block64_shaNI: /* Rounds 16-19 */ mova128 MSGTMP0, MSG - paddd 4*16(SHA256CONSTANTS), MSG + paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -113,7 +112,7 @@ sha256_process_block64_shaNI: /* Rounds 20-23 */ mova128 MSGTMP1, MSG - paddd 5*16(SHA256CONSTANTS), MSG + paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -125,7 +124,7 @@ sha256_process_block64_shaNI: /* Rounds 24-27 */ mova128 MSGTMP2, MSG - paddd 6*16(SHA256CONSTANTS), MSG + paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -137,7 +136,7 @@ sha256_process_block64_shaNI: /* Rounds 28-31 */ mova128 MSGTMP3, MSG - paddd 7*16(SHA256CONSTANTS), MSG + paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 
MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -149,7 +148,7 @@ sha256_process_block64_shaNI: /* Rounds 32-35 */ mova128 MSGTMP0, MSG - paddd 8*16(SHA256CONSTANTS), MSG + paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -161,7 +160,7 @@ sha256_process_block64_shaNI: /* Rounds 36-39 */ mova128 MSGTMP1, MSG - paddd 9*16(SHA256CONSTANTS), MSG + paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -173,7 +172,7 @@ sha256_process_block64_shaNI: /* Rounds 40-43 */ mova128 MSGTMP2, MSG - paddd 10*16(SHA256CONSTANTS), MSG + paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -185,7 +184,7 @@ sha256_process_block64_shaNI: /* Rounds 44-47 */ mova128 MSGTMP3, MSG - paddd 11*16(SHA256CONSTANTS), MSG + paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -197,7 +196,7 @@ sha256_process_block64_shaNI: /* Rounds 48-51 */ mova128 MSGTMP0, MSG - paddd 12*16(SHA256CONSTANTS), MSG + paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -209,7 +208,7 @@ sha256_process_block64_shaNI: /* Rounds 52-55 */ mova128 MSGTMP1, MSG - paddd 13*16(SHA256CONSTANTS), MSG + paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -220,7 +219,7 @@ sha256_process_block64_shaNI: /* Rounds 56-59 */ mova128 MSGTMP2, MSG - paddd 14*16(SHA256CONSTANTS), MSG + paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -231,7 +230,7 @@ sha256_process_block64_shaNI: /* Rounds 60-63 */ mova128 MSGTMP3, MSG - paddd 15*16(SHA256CONSTANTS), MSG + paddd 15*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 From vda.linux at googlemail.com Thu Feb 3 14:17:42 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 3 Feb 2022 15:17:42 +0100 Subject: [git commit] libbb/sha256: code shrink in 64-bit x86 Message-ID: <20220203141101.202F882A2F@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=a1429fbb8ca373efc01939d599f6f65969b1a366 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 730 706 -24 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-64_shaNI.S | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index 1c2b75af3..f3df541e4 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -50,7 +50,7 @@ sha256_process_block64_shaNI: pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), SHUF_MASK - lea K256(%rip), SHA256CONSTANTS + leaq K256+8*16(%rip), SHA256CONSTANTS /* Save hash values for addition after rounds */ mova128 STATE0, ABEF_SAVE @@ -60,7 +60,7 @@ sha256_process_block64_shaNI: movu128 0*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG mova128 MSG, MSGTMP0 - paddd 0*16(SHA256CONSTANTS), MSG + paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -69,7 +69,7 @@ sha256_process_block64_shaNI: movu128 1*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG 
mova128 MSG, MSGTMP1 - paddd 1*16(SHA256CONSTANTS), MSG + paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -79,7 +79,7 @@ sha256_process_block64_shaNI: movu128 2*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG mova128 MSG, MSGTMP2 - paddd 2*16(SHA256CONSTANTS), MSG + paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -89,7 +89,7 @@ sha256_process_block64_shaNI: movu128 3*16(DATA_PTR), MSG pshufb SHUF_MASK, MSG mova128 MSG, MSGTMP3 - paddd 3*16(SHA256CONSTANTS), MSG + paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -101,7 +101,7 @@ sha256_process_block64_shaNI: /* Rounds 16-19 */ mova128 MSGTMP0, MSG - paddd 4*16(SHA256CONSTANTS), MSG + paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -113,7 +113,7 @@ sha256_process_block64_shaNI: /* Rounds 20-23 */ mova128 MSGTMP1, MSG - paddd 5*16(SHA256CONSTANTS), MSG + paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -125,7 +125,7 @@ sha256_process_block64_shaNI: /* Rounds 24-27 */ mova128 MSGTMP2, MSG - paddd 6*16(SHA256CONSTANTS), MSG + paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -137,7 +137,7 @@ sha256_process_block64_shaNI: /* Rounds 28-31 */ mova128 MSGTMP3, MSG - paddd 7*16(SHA256CONSTANTS), MSG + paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -149,7 +149,7 @@ sha256_process_block64_shaNI: /* Rounds 32-35 */ mova128 MSGTMP0, MSG - paddd 8*16(SHA256CONSTANTS), MSG + paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -161,7 +161,7 @@ sha256_process_block64_shaNI: /* Rounds 36-39 */ mova128 MSGTMP1, MSG - paddd 9*16(SHA256CONSTANTS), MSG + paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -173,7 +173,7 @@ sha256_process_block64_shaNI: /* Rounds 40-43 */ mova128 MSGTMP2, MSG - paddd 10*16(SHA256CONSTANTS), MSG + paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -185,7 +185,7 @@ sha256_process_block64_shaNI: /* Rounds 44-47 */ mova128 MSGTMP3, MSG - paddd 11*16(SHA256CONSTANTS), MSG + paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP3, MSGTMP4 palignr $4, MSGTMP2, MSGTMP4 @@ -197,7 +197,7 @@ sha256_process_block64_shaNI: /* Rounds 48-51 */ mova128 MSGTMP0, MSG - paddd 12*16(SHA256CONSTANTS), MSG + paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP0, MSGTMP4 palignr $4, MSGTMP3, MSGTMP4 @@ -209,7 +209,7 @@ sha256_process_block64_shaNI: /* Rounds 52-55 */ mova128 MSGTMP1, MSG - paddd 13*16(SHA256CONSTANTS), MSG + paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP1, MSGTMP4 palignr $4, MSGTMP0, MSGTMP4 @@ -220,7 +220,7 @@ sha256_process_block64_shaNI: /* Rounds 56-59 */ mova128 MSGTMP2, MSG - paddd 14*16(SHA256CONSTANTS), MSG + paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 mova128 MSGTMP2, MSGTMP4 palignr $4, MSGTMP1, MSGTMP4 @@ -231,7 +231,7 @@ sha256_process_block64_shaNI: /* Rounds 60-63 */ mova128 MSGTMP3, MSG - paddd 
15*16(SHA256CONSTANTS), MSG + paddd 15*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 From vda.linux at googlemail.com Sat Feb 5 23:33:42 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 00:33:42 +0100 Subject: [git commit] libbb/sha256: code shrink in 64-bit x86 Message-ID: <20220205234357.20D04819E6@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=31c1c310772fa6c897ee1585ea15fc38f3ab3dff branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 706 701 -5 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-64_shaNI.S | 96 ++++++++++++++++++------------------ 1 file changed, 48 insertions(+), 48 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index f3df541e4..dbf391135 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -31,9 +31,7 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define MSGTMP4 %xmm7 - -#define SHUF_MASK %xmm8 +#define XMMTMP4 %xmm7 #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 @@ -45,11 +43,12 @@ sha256_process_block64_shaNI: shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ - mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), SHUF_MASK +/* XMMTMP4 holds flip mask from here... */ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP4 leaq K256+8*16(%rip), SHA256CONSTANTS /* Save hash values for addition after rounds */ @@ -58,7 +57,7 @@ sha256_process_block64_shaNI: /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -67,7 +66,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -77,7 +76,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -87,13 +86,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb SHUF_MASK, MSG + pshufb XMMTMP4, MSG +/* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -103,9 +103,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -115,9 +115,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 
MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -127,9 +127,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -139,9 +139,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -151,9 +151,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -163,9 +163,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -175,9 +175,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -187,9 +187,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -199,9 +199,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -211,9 +211,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -222,9 +222,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, 
MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -243,9 +243,9 @@ sha256_process_block64_shaNI: /* Write hash values back in the correct order */ shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, MSGTMP4, STATE1 /* HGFE */ + palignr $8, XMMTMP4, STATE1 /* HGFE */ movu128 STATE0, 80+0*16(%rdi) movu128 STATE1, 80+1*16(%rdi) From vda.linux at googlemail.com Sat Feb 5 23:56:13 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 00:56:13 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220205235036.87148821F1@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=4f40735c87f8292a87c066b3b7099b0be007cf59 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 722 713 -9 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 93 +++++++++++++++++++----------------- 1 file changed, 48 insertions(+), 45 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 632dab7e6..417da37d8 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -31,7 +31,7 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define MSGTMP4 %xmm7 +#define XMMTMP4 %xmm7 .balign 8 # allow decoders to fetch at least 3 first insns sha256_process_block64_shaNI: @@ -45,10 +45,12 @@ sha256_process_block64_shaNI: shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, MSGTMP4, STATE1 /* CDGH */ + pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ +/* XMMTMP4 holds flip mask from here... 
*/ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP4 movl $K256+8*16, SHA256CONSTANTS /* Save hash values for addition after rounds */ @@ -57,7 +59,7 @@ sha256_process_block64_shaNI: /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -66,7 +68,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -76,7 +78,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -86,13 +88,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb PSHUFFLE_BSWAP32_FLIP_MASK, MSG + pshufb XMMTMP4, MSG +/* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -102,9 +105,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -114,9 +117,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -126,9 +129,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -138,9 +141,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -150,9 +153,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -162,9 +165,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, 
MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -174,9 +177,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -186,9 +189,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, MSGTMP4 - palignr $4, MSGTMP2, MSGTMP4 - paddd MSGTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP4 + palignr $4, MSGTMP2, XMMTMP4 + paddd XMMTMP4, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -198,9 +201,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, MSGTMP4 - palignr $4, MSGTMP3, MSGTMP4 - paddd MSGTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP4 + palignr $4, MSGTMP3, XMMTMP4 + paddd XMMTMP4, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -210,9 +213,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, MSGTMP4 - palignr $4, MSGTMP0, MSGTMP4 - paddd MSGTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP4 + palignr $4, MSGTMP0, XMMTMP4 + paddd XMMTMP4, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -221,9 +224,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, MSGTMP4 - palignr $4, MSGTMP1, MSGTMP4 - paddd MSGTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP4 + palignr $4, MSGTMP1, XMMTMP4 + paddd XMMTMP4, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -242,9 +245,9 @@ sha256_process_block64_shaNI: /* Write hash values back in the correct order */ shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, MSGTMP4 + mova128 STATE0, XMMTMP4 pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, MSGTMP4, STATE1 /* HGFE */ + palignr $8, XMMTMP4, STATE1 /* HGFE */ movu128 STATE0, 76+0*16(%eax) movu128 STATE1, 76+1*16(%eax) From vda.linux at googlemail.com Sun Feb 6 18:53:10 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 19:53:10 +0100 Subject: [git commit] *: slap on a few ALIGN* where appropriate Message-ID: <20220206184644.726FD82C83@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=ca466f385ac985a8b3491daa9f326dc480cdee70 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master The result of looking at "grep -F -B2 '*fill*' busybox_unstripped.map" function old new delta .rodata 108586 108460 -126 ------------------------------------------------------------------------------ (add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-126) Total: -126 bytes text data bss dec hex filename 970412 4219 1848 976479 ee65f busybox_old 970286 4219 1848 976353 ee5e1 busybox_unstripped Signed-off-by: Denys Vlasenko --- console-tools/reset.c | 2 +- coreutils/od.c | 2 +- include/platform.h | 1 + libbb/appletlib.c | 2 +- libbb/get_console.c | 2 +- miscutils/bc.c | 2 +- miscutils/man.c | 2 +- networking/ifupdown.c | 8 ++++---- 
networking/interface.c | 6 +++--- networking/libiproute/ipaddress.c | 2 +- networking/udhcp/common.c | 2 +- networking/udhcp/d6_dhcpc.c | 2 +- shell/ash.c | 2 +- util-linux/hexdump.c | 2 +- util-linux/nsenter.c | 2 +- util-linux/unshare.c | 2 +- 16 files changed, 21 insertions(+), 20 deletions(-) diff --git a/console-tools/reset.c b/console-tools/reset.c index b3acf69f8..cc04e4fcc 100644 --- a/console-tools/reset.c +++ b/console-tools/reset.c @@ -36,7 +36,7 @@ int stty_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; int reset_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; int reset_main(int argc UNUSED_PARAM, char **argv UNUSED_PARAM) { - static const char *const args[] = { + static const char *const args[] ALIGN_PTR = { "stty", "sane", NULL }; diff --git a/coreutils/od.c b/coreutils/od.c index 9a888dd5f..6f22331e0 100644 --- a/coreutils/od.c +++ b/coreutils/od.c @@ -144,7 +144,7 @@ odoffset(dumper_t *dumper, int argc, char ***argvp) } } -static const char *const add_strings[] = { +static const char *const add_strings[] ALIGN_PTR = { "16/1 \"%3_u \" \"\\n\"", /* a */ "8/2 \" %06o \" \"\\n\"", /* B, o */ "16/1 \"%03o \" \"\\n\"", /* b */ diff --git a/include/platform.h b/include/platform.h index ad27bb31a..ea0512f36 100644 --- a/include/platform.h +++ b/include/platform.h @@ -346,6 +346,7 @@ typedef unsigned smalluint; # define ALIGN4 #endif #define ALIGN8 __attribute__((aligned(8))) +#define ALIGN_INT __attribute__((aligned(sizeof(int)))) #define ALIGN_PTR __attribute__((aligned(sizeof(void*)))) /* diff --git a/libbb/appletlib.c b/libbb/appletlib.c index 03389f541..841b3b873 100644 --- a/libbb/appletlib.c +++ b/libbb/appletlib.c @@ -651,7 +651,7 @@ static void check_suid(int applet_no) # if ENABLE_FEATURE_INSTALLER static const char usr_bin [] ALIGN1 = "/usr/bin/"; static const char usr_sbin[] ALIGN1 = "/usr/sbin/"; -static const char *const install_dir[] = { +static const char *const install_dir[] ALIGN_PTR = { &usr_bin [8], /* "/" */ &usr_bin [4], /* "/bin/" */ &usr_sbin[4] /* "/sbin/" */ diff --git a/libbb/get_console.c b/libbb/get_console.c index 7f2c75332..9044efea1 100644 --- a/libbb/get_console.c +++ b/libbb/get_console.c @@ -37,7 +37,7 @@ static int open_a_console(const char *fnam) */ int FAST_FUNC get_console_fd_or_die(void) { - static const char *const console_names[] = { + static const char *const console_names[] ALIGN_PTR = { DEV_CONSOLE, CURRENT_VC, CURRENT_TTY }; diff --git a/miscutils/bc.c b/miscutils/bc.c index ae370ff55..ab785bbc8 100644 --- a/miscutils/bc.c +++ b/miscutils/bc.c @@ -6011,7 +6011,7 @@ static BC_STATUS zxc_program_assign(char inst) #endif if (ib || sc || left->t == XC_RESULT_OBASE) { - static const char *const msg[] = { + static const char *const msg[] ALIGN_PTR = { "bad ibase; must be [2,16]", //XC_RESULT_IBASE "bad obase; must be [2,"BC_MAX_OBASE_STR"]", //XC_RESULT_OBASE "bad scale; must be [0,"BC_MAX_SCALE_STR"]", //XC_RESULT_SCALE diff --git a/miscutils/man.c b/miscutils/man.c index d319e8bba..deaf9e5ab 100644 --- a/miscutils/man.c +++ b/miscutils/man.c @@ -303,7 +303,7 @@ int man_main(int argc UNUSED_PARAM, char **argv) config_close(parser); if (!man_path_list) { - static const char *const mpl[] = { "/usr/man", "/usr/share/man", NULL }; + static const char *const mpl[] ALIGN_PTR = { "/usr/man", "/usr/share/man", NULL }; man_path_list = (char**)mpl; /*count_mp = 2; - not used below anyway */ } diff --git a/networking/ifupdown.c b/networking/ifupdown.c index 737113dd4..6c4ae27f2 100644 --- a/networking/ifupdown.c +++ b/networking/ifupdown.c @@ 
-532,7 +532,7 @@ static int FAST_FUNC v4tunnel_down(struct interface_defn_t * ifd, execfn * exec) } # endif -static const struct method_t methods6[] = { +static const struct method_t methods6[] ALIGN_PTR = { # if ENABLE_FEATURE_IFUPDOWN_IP { "v4tunnel" , v4tunnel_up , v4tunnel_down , }, # endif @@ -627,7 +627,7 @@ struct dhcp_client_t { const char *stopcmd; }; -static const struct dhcp_client_t ext_dhcp_clients[] = { +static const struct dhcp_client_t ext_dhcp_clients[] ALIGN_PTR = { { "dhcpcd", "dhcpcd[[ -h %hostname%]][[ -i %vendor%]][[ -I %client%]][[ -l %leasetime%]] %iface%", "dhcpcd -k %iface%", @@ -774,7 +774,7 @@ static int FAST_FUNC wvdial_down(struct interface_defn_t *ifd, execfn *exec) "-p /var/run/wvdial.%iface% -s 2", ifd, exec); } -static const struct method_t methods[] = { +static const struct method_t methods[] ALIGN_PTR = { { "manual" , manual_up_down, manual_up_down, }, { "wvdial" , wvdial_up , wvdial_down , }, { "ppp" , ppp_up , ppp_down , }, @@ -797,7 +797,7 @@ static int FAST_FUNC link_up_down(struct interface_defn_t *ifd UNUSED_PARAM, exe return 1; } -static const struct method_t link_methods[] = { +static const struct method_t link_methods[] ALIGN_PTR = { { "none", link_up_down, link_up_down } }; diff --git a/networking/interface.c b/networking/interface.c index ea6a2c8a8..6b6c0944a 100644 --- a/networking/interface.c +++ b/networking/interface.c @@ -446,13 +446,13 @@ static char *get_name(char name[IFNAMSIZ], char *p) * %n specifiers (even the size of integers may not match). */ #if INT_MAX == LONG_MAX -static const char *const ss_fmt[] = { +static const char *const ss_fmt[] ALIGN_PTR = { "%n%llu%u%u%u%u%n%n%n%llu%u%u%u%u%u", "%llu%llu%u%u%u%u%n%n%llu%llu%u%u%u%u%u", "%llu%llu%u%u%u%u%u%u%llu%llu%u%u%u%u%u%u" }; #else -static const char *const ss_fmt[] = { +static const char *const ss_fmt[] ALIGN_PTR = { "%n%llu%lu%lu%lu%lu%n%n%n%llu%lu%lu%lu%lu%lu", "%llu%llu%lu%lu%lu%lu%n%n%llu%llu%lu%lu%lu%lu%lu", "%llu%llu%lu%lu%lu%lu%lu%lu%llu%llu%lu%lu%lu%lu%lu%lu" @@ -731,7 +731,7 @@ static const struct hwtype ib_hwtype = { #endif -static const struct hwtype *const hwtypes[] = { +static const struct hwtype *const hwtypes[] ALIGN_PTR = { &loop_hwtype, ðer_hwtype, &ppp_hwtype, diff --git a/networking/libiproute/ipaddress.c b/networking/libiproute/ipaddress.c index 17a838411..ecc3848ff 100644 --- a/networking/libiproute/ipaddress.c +++ b/networking/libiproute/ipaddress.c @@ -58,7 +58,7 @@ typedef struct filter_t filter_t; static void print_link_flags(unsigned flags, unsigned mdown) { - static const int flag_masks[] = { + static const int flag_masks[] ALIGN_INT = { IFF_LOOPBACK, IFF_BROADCAST, IFF_POINTOPOINT, IFF_MULTICAST, IFF_NOARP, IFF_UP, IFF_LOWER_UP }; static const char flag_labels[] ALIGN1 = diff --git a/networking/udhcp/common.c b/networking/udhcp/common.c index 8e9b93655..ae818db05 100644 --- a/networking/udhcp/common.c +++ b/networking/udhcp/common.c @@ -19,7 +19,7 @@ const uint8_t MAC_BCAST_ADDR[6] ALIGN2 = { * See RFC2132 for more options. * OPTION_REQ: these options are requested by udhcpc (unless -o). 
*/ -const struct dhcp_optflag dhcp_optflags[] = { +const struct dhcp_optflag dhcp_optflags[] ALIGN2 = { /* flags code */ { OPTION_IP | OPTION_REQ, 0x01 }, /* DHCP_SUBNET */ { OPTION_S32 , 0x02 }, /* DHCP_TIME_OFFSET */ diff --git a/networking/udhcp/d6_dhcpc.c b/networking/udhcp/d6_dhcpc.c index 9d2a8f5d3..9fc690315 100644 --- a/networking/udhcp/d6_dhcpc.c +++ b/networking/udhcp/d6_dhcpc.c @@ -65,7 +65,7 @@ /* "struct client_data_t client_data" is in bb_common_bufsiz1 */ -static const struct dhcp_optflag d6_optflags[] = { +static const struct dhcp_optflag d6_optflags[] ALIGN2 = { #if ENABLE_FEATURE_UDHCPC6_RFC3646 { OPTION_6RD | OPTION_LIST | OPTION_REQ, D6_OPT_DNS_SERVERS }, { OPTION_DNS_STRING | OPTION_LIST | OPTION_REQ, D6_OPT_DOMAIN_LIST }, diff --git a/shell/ash.c b/shell/ash.c index 55df54bd0..adb0f223a 100644 --- a/shell/ash.c +++ b/shell/ash.c @@ -313,7 +313,7 @@ typedef long arith_t; /* ============ Shell options */ /* If you add/change options hare, update --help text too */ -static const char *const optletters_optnames[] = { +static const char *const optletters_optnames[] ALIGN_PTR = { "e" "errexit", "f" "noglob", /* bash has '-o ignoreeof', but no short synonym -I for it */ diff --git a/util-linux/hexdump.c b/util-linux/hexdump.c index 57e7e8db7..307a84803 100644 --- a/util-linux/hexdump.c +++ b/util-linux/hexdump.c @@ -71,7 +71,7 @@ static void bb_dump_addfile(dumper_t *dumper, char *name) fclose(fp); } -static const char *const add_strings[] = { +static const char *const add_strings[] ALIGN_PTR = { "\"%07.7_ax \"16/1 \"%03o \"\"\n\"", /* b */ "\"%07.7_ax \"16/1 \"%3_c \"\"\n\"", /* c */ "\"%07.7_ax \"8/2 \" %05u \"\"\n\"", /* d */ diff --git a/util-linux/nsenter.c b/util-linux/nsenter.c index e6339da2f..1aa045b35 100644 --- a/util-linux/nsenter.c +++ b/util-linux/nsenter.c @@ -93,7 +93,7 @@ enum { * The user namespace comes first, so that it is entered first. * This gives an unprivileged user the potential to enter other namespaces. */ -static const struct namespace_descr ns_list[] = { +static const struct namespace_descr ns_list[] ALIGN_INT = { { CLONE_NEWUSER, "ns/user", }, { CLONE_NEWIPC, "ns/ipc", }, { CLONE_NEWUTS, "ns/uts", }, diff --git a/util-linux/unshare.c b/util-linux/unshare.c index 68ccdd874..06b938074 100644 --- a/util-linux/unshare.c +++ b/util-linux/unshare.c @@ -120,7 +120,7 @@ enum { NS_USR_POS, /* OPT_user, NS_USR_POS, and ns_list[] index must match! 
*/ NS_COUNT, }; -static const struct namespace_descr ns_list[] = { +static const struct namespace_descr ns_list[] ALIGN_INT = { { CLONE_NEWNS, "mnt" }, { CLONE_NEWUTS, "uts" }, { CLONE_NEWIPC, "ipc" }, From vda.linux at googlemail.com Sun Feb 6 19:07:12 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sun, 6 Feb 2022 20:07:12 +0100 Subject: [git commit] *: slap on a few ALIGN_PTR where appropriate Message-ID: <20220206190009.C99FD82B4D@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=987be932ed3cbea56b68bbe85649191c13b66015 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- coreutils/test.c | 2 +- e2fsprogs/fsck.c | 2 +- libbb/getopt32.c | 2 +- miscutils/devfsd.c | 4 ++-- modutils/modutils-24.c | 4 ++-- networking/inetd.c | 2 +- procps/nmeter.c | 2 +- selinux/setenforce.c | 2 +- shell/hush.c | 10 +++++----- 9 files changed, 15 insertions(+), 15 deletions(-) diff --git a/coreutils/test.c b/coreutils/test.c index a914c7490..840a0daaf 100644 --- a/coreutils/test.c +++ b/coreutils/test.c @@ -242,7 +242,7 @@ int depth; depth--; \ return __res; \ } while (0) -static const char *const TOKSTR[] = { +static const char *const TOKSTR[] ALIGN_PTR = { "EOI", "FILRD", "FILWR", diff --git a/e2fsprogs/fsck.c b/e2fsprogs/fsck.c index 96c1e51e0..028f8a803 100644 --- a/e2fsprogs/fsck.c +++ b/e2fsprogs/fsck.c @@ -190,7 +190,7 @@ struct globals { * Required for the uber-silly devfs /dev/ide/host1/bus2/target3/lun3 * pathames. */ -static const char *const devfs_hier[] = { +static const char *const devfs_hier[] ALIGN_PTR = { "host", "bus", "target", "lun", NULL }; #endif diff --git a/libbb/getopt32.c b/libbb/getopt32.c index 5ab4d66f1..e861d0567 100644 --- a/libbb/getopt32.c +++ b/libbb/getopt32.c @@ -296,7 +296,7 @@ Special characters: /* Code here assumes that 'unsigned' is at least 32 bits wide */ -const char *const bb_argv_dash[] = { "-", NULL }; +const char *const bb_argv_dash[] ALIGN_PTR = { "-", NULL }; enum { PARAM_STRING, diff --git a/miscutils/devfsd.c b/miscutils/devfsd.c index 839d00fd0..fb9ebcf60 100644 --- a/miscutils/devfsd.c +++ b/miscutils/devfsd.c @@ -928,7 +928,7 @@ static void action_compat(const struct devfsd_notify_struct *info, unsigned int unsigned int i; char rewind_; /* 1 to 5 "scsi/" , 6 to 9 "ide/host" */ - static const char *const fmt[] = { + static const char *const fmt[] ALIGN_PTR = { NULL , "sg/c%db%dt%du%d", /* scsi/generic */ "sd/c%db%dt%du%d", /* scsi/disc */ @@ -1468,7 +1468,7 @@ const char *get_old_name(const char *devname, unsigned int namelen, const char *pty1; const char *pty2; /* 1 to 5 "scsi/" , 6 to 9 "ide/host", 10 sbp/, 11 vcc/, 12 pty/ */ - static const char *const fmt[] = { + static const char *const fmt[] ALIGN_PTR = { NULL , "sg%u", /* scsi/generic */ NULL, /* scsi/disc */ diff --git a/modutils/modutils-24.c b/modutils/modutils-24.c index ac8632481..d0bc2a6ef 100644 --- a/modutils/modutils-24.c +++ b/modutils/modutils-24.c @@ -3458,7 +3458,7 @@ static int obj_load_progbits(char *image, size_t image_size, struct obj_file *f, static void hide_special_symbols(struct obj_file *f) { - static const char *const specials[] = { + static const char *const specials[] ALIGN_PTR = { SPFX "cleanup_module", SPFX "init_module", SPFX "kernel_version", @@ -3484,7 +3484,7 @@ static int obj_gpl_license(struct obj_file *f, const char **license) * linux/include/linux/module.h. Checking for leading "GPL" will not * work, somebody will use "GPL sucks, this is proprietary". 
*/ - static const char *const gpl_licenses[] = { + static const char *const gpl_licenses[] ALIGN_PTR = { "GPL", "GPL v2", "GPL and additional rights", diff --git a/networking/inetd.c b/networking/inetd.c index e71be51c3..fb2fbe323 100644 --- a/networking/inetd.c +++ b/networking/inetd.c @@ -1538,7 +1538,7 @@ int inetd_main(int argc UNUSED_PARAM, char **argv) #if ENABLE_FEATURE_INETD_SUPPORT_BUILTIN_ECHO \ || ENABLE_FEATURE_INETD_SUPPORT_BUILTIN_DISCARD # if !BB_MMU -static const char *const cat_args[] = { "cat", NULL }; +static const char *const cat_args[] ALIGN_PTR = { "cat", NULL }; # endif #endif diff --git a/procps/nmeter.c b/procps/nmeter.c index 2310e9844..088d366bf 100644 --- a/procps/nmeter.c +++ b/procps/nmeter.c @@ -70,7 +70,7 @@ typedef struct proc_file { smallint last_gen; } proc_file; -static const char *const proc_name[] = { +static const char *const proc_name[] ALIGN_PTR = { "stat", // Must match the order of proc_file's! "loadavg", "net/dev", diff --git a/selinux/setenforce.c b/selinux/setenforce.c index 996034f8e..2267be451 100644 --- a/selinux/setenforce.c +++ b/selinux/setenforce.c @@ -26,7 +26,7 @@ /* These strings are arranged so that odd ones * result in security_setenforce(1) being done, * the rest will do security_setenforce(0) */ -static const char *const setenforce_cmd[] = { +static const char *const setenforce_cmd[] ALIGN_PTR = { "0", "1", "permissive", diff --git a/shell/hush.c b/shell/hush.c index 6dc2ecaac..ae81f0da5 100644 --- a/shell/hush.c +++ b/shell/hush.c @@ -564,7 +564,7 @@ enum { #define NULL_O_STRING { NULL } #ifndef debug_printf_parse -static const char *const assignment_flag[] = { +static const char *const assignment_flag[] ALIGN_PTR = { "MAYBE_ASSIGNMENT", "DEFINITELY_ASSIGNMENT", "NOT_ASSIGNMENT", @@ -3682,7 +3682,7 @@ static void free_pipe_list(struct pipe *pi) #ifndef debug_print_tree static void debug_print_tree(struct pipe *pi, int lvl) { - static const char *const PIPE[] = { + static const char *const PIPE[] ALIGN_PTR = { [PIPE_SEQ] = "SEQ", [PIPE_AND] = "AND", [PIPE_OR ] = "OR" , @@ -3717,7 +3717,7 @@ static void debug_print_tree(struct pipe *pi, int lvl) [RES_XXXX ] = "XXXX" , [RES_SNTX ] = "SNTX" , }; - static const char *const CMDTYPE[] = { + static const char *const CMDTYPE[] ALIGN_PTR = { "{}", "()", "[noglob]", @@ -7659,7 +7659,7 @@ static int generate_stream_from_string(const char *s, pid_t *pid_p) if (is_prefixed_with(s, "trap") && skip_whitespace(s + 4)[0] == '\0' ) { - static const char *const argv[] = { NULL, NULL }; + static const char *const argv[] ALIGN_PTR = { NULL, NULL }; builtin_trap((char**)argv); fflush_all(); /* important */ _exit(0); @@ -9826,7 +9826,7 @@ static int run_list(struct pipe *pi) static const char encoded_dollar_at[] ALIGN1 = { SPECIAL_VAR_SYMBOL, '@' | 0x80, SPECIAL_VAR_SYMBOL, '\0' }; /* encoded representation of "$@" */ - static const char *const encoded_dollar_at_argv[] = { + static const char *const encoded_dollar_at_argv[] ALIGN_PTR = { encoded_dollar_at, NULL }; /* argv list with one element: "$@" */ char **vals; From vda.linux at googlemail.com Tue Feb 8 02:29:16 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 03:29:16 +0100 Subject: [git commit] libbb/sha1: shrink unrolled x86-64 code Message-ID: <20220208022853.C4572831C9@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=4923f74e5873b25b8205a4059964cff75ee731a8 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64 3482 3481 -1 .rodata 
108460 108412 -48 ------------------------------------------------------------------------------ (add/remove: 1/4 grow/shrink: 0/2 up/down: 0/-49) Total: -49 bytes Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-64.S | 33 ++++++++++----------------------- libbb/hash_md5_sha_x86-64.S.sh | 34 +++++++++++----------------------- 2 files changed, 21 insertions(+), 46 deletions(-) diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index e26c46f25..287cfe547 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -24,6 +24,7 @@ sha1_process_block64: # xmm0..xmm3: W[] # xmm4,xmm5: temps # xmm6: current round constant +# xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units movl 80(%rdi), %eax # a = ctx->hash[0] @@ -32,16 +33,17 @@ sha1_process_block64: movl 92(%rdi), %edx # d = ctx->hash[3] movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps rconst0x5A827999(%rip), %xmm6 + movaps sha1const(%rip), %xmm7 + pshufd $0x00, %xmm7, %xmm6 # Load W[] to xmm registers, byteswapping on the fly. # # For iterations 0..15, we pass W[] in rsi,r8..r14 - # for use in RD1A's instead of spilling them to stack. + # for use in RD1As instead of spilling them to stack. # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it's probably a wash. + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1A's shorter by one byte). + # LEAs in two first RD1As shorter by one byte). movq 4*0(%rdi), %rsi movq 4*2(%rdi), %r8 bswapq %rsi @@ -253,7 +255,7 @@ sha1_process_block64: roll $5, %edi # rotl32(a,5) addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) - movaps rconst0x6ED9EBA1(%rip), %xmm6 + pshufd $0x55, %xmm7, %xmm6 # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) @@ -614,7 +616,7 @@ sha1_process_block64: roll $5, %esi # rotl32(a,5) addl %esi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) - movaps rconst0x8F1BBCDC(%rip), %xmm6 + pshufd $0xaa, %xmm7, %xmm6 # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) @@ -1001,7 +1003,7 @@ sha1_process_block64: roll $5, %esi # rotl32(a,5) addl %esi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) - movaps rconst0xCA62C1D6(%rip), %xmm6 + pshufd $0xff, %xmm7, %xmm6 # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) @@ -1475,25 +1477,10 @@ sha1_process_block64: .section .rodata.cst16.sha1const, "aM", @progbits, 16 .balign 16 -rconst0x5A827999: +sha1const: .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 -rconst0x6ED9EBA1: - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 .long 0x6ED9EBA1 -rconst0x8F1BBCDC: .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC -rconst0xCA62C1D6: - .long 0xCA62C1D6 - .long 0xCA62C1D6 - .long 0xCA62C1D6 .long 0xCA62C1D6 #endif diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index fb1e4b57e..a10ac411d 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -34,6 +34,7 @@ exec >hash_md5_sha_x86-64.S xmmT1="%xmm4" xmmT2="%xmm5" xmmRCONST="%xmm6" +xmmALLRCONST="%xmm7" T=`printf '\t'` # SSE instructions are longer than 4 bytes on average. 
@@ -125,6 +126,7 @@ sha1_process_block64: # xmm0..xmm3: W[] # xmm4,xmm5: temps # xmm6: current round constant +# xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units movl 80(%rdi), %eax # a = ctx->hash[0] @@ -133,16 +135,17 @@ sha1_process_block64: movl 92(%rdi), %edx # d = ctx->hash[3] movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps rconst0x5A827999(%rip), $xmmRCONST + movaps sha1const(%rip), $xmmALLRCONST + pshufd \$0x00, $xmmALLRCONST, $xmmRCONST # Load W[] to xmm registers, byteswapping on the fly. # # For iterations 0..15, we pass W[] in rsi,r8..r14 - # for use in RD1A's instead of spilling them to stack. + # for use in RD1As instead of spilling them to stack. # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it's probably a wash. + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1A's shorter by one byte). + # LEAs in two first RD1As shorter by one byte). movq 4*0(%rdi), %rsi movq 4*2(%rdi), %r8 bswapq %rsi @@ -359,7 +362,7 @@ RD1A bx cx dx bp ax 4; RD1A ax bx cx dx bp 5; RD1A bp ax bx cx dx 6; RD1A dx a=`PREP %xmm0 %xmm1 %xmm2 %xmm3 "-64+16*0(%rsp)"` b=`RD1A cx dx bp ax bx 8; RD1A bx cx dx bp ax 9; RD1A ax bx cx dx bp 10; RD1A bp ax bx cx dx 11;` INTERLEAVE "$a" "$b" -a=`echo " movaps rconst0x6ED9EBA1(%rip), $xmmRCONST" +a=`echo " pshufd \\$0x55, $xmmALLRCONST, $xmmRCONST" PREP %xmm1 %xmm2 %xmm3 %xmm0 "-64+16*1(%rsp)"` b=`RD1A dx bp ax bx cx 12; RD1A cx dx bp ax bx 13; RD1A bx cx dx bp ax 14; RD1A ax bx cx dx bp 15;` INTERLEAVE "$a" "$b" @@ -378,7 +381,7 @@ INTERLEAVE "$a" "$b" a=`PREP %xmm1 %xmm2 %xmm3 %xmm0 "-64+16*1(%rsp)"` b=`RD2 cx dx bp ax bx 28; RD2 bx cx dx bp ax 29; RD2 ax bx cx dx bp 30; RD2 bp ax bx cx dx 31;` INTERLEAVE "$a" "$b" -a=`echo " movaps rconst0x8F1BBCDC(%rip), $xmmRCONST" +a=`echo " pshufd \\$0xaa, $xmmALLRCONST, $xmmRCONST" PREP %xmm2 %xmm3 %xmm0 %xmm1 "-64+16*2(%rsp)"` b=`RD2 dx bp ax bx cx 32; RD2 cx dx bp ax bx 33; RD2 bx cx dx bp ax 34; RD2 ax bx cx dx bp 35;` INTERLEAVE "$a" "$b" @@ -397,7 +400,7 @@ INTERLEAVE "$a" "$b" a=`PREP %xmm2 %xmm3 %xmm0 %xmm1 "-64+16*2(%rsp)"` b=`RD3 cx dx bp ax bx 48; RD3 bx cx dx bp ax 49; RD3 ax bx cx dx bp 50; RD3 bp ax bx cx dx 51;` INTERLEAVE "$a" "$b" -a=`echo " movaps rconst0xCA62C1D6(%rip), $xmmRCONST" +a=`echo " pshufd \\$0xff, $xmmALLRCONST, $xmmRCONST" PREP %xmm3 %xmm0 %xmm1 %xmm2 "-64+16*3(%rsp)"` b=`RD3 dx bp ax bx cx 52; RD3 cx dx bp ax bx 53; RD3 bx cx dx bp ax 54; RD3 ax bx cx dx bp 55;` INTERLEAVE "$a" "$b" @@ -439,25 +442,10 @@ echo " .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 .balign 16 -rconst0x5A827999: +sha1const: .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 - .long 0x5A827999 -rconst0x6ED9EBA1: - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 - .long 0x6ED9EBA1 .long 0x6ED9EBA1 -rconst0x8F1BBCDC: .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC - .long 0x8F1BBCDC -rconst0xCA62C1D6: - .long 0xCA62C1D6 - .long 0xCA62C1D6 - .long 0xCA62C1D6 .long 0xCA62C1D6 #endif" From vda.linux at googlemail.com Mon Feb 7 01:34:04 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Mon, 7 Feb 2022 02:34:04 +0100 Subject: [git commit] libbb/sha1: shrink and speed up unrolled x86-64 code Message-ID: <20220208022853.B8D66831C4@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=c193cbd6dfd095c6b8346bab1ea6ba7106b3e5bb branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta 
sha1_process_block64 3514 3482 -32 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 8 +- libbb/hash_md5_sha256_x86-64_shaNI.S | 8 +- libbb/hash_md5_sha_x86-32_shaNI.S | 4 +- libbb/hash_md5_sha_x86-64.S | 144 +++++++++++++++++++++++++++-------- libbb/hash_md5_sha_x86-64.S.sh | 9 ++- libbb/hash_md5_sha_x86-64_shaNI.S | 4 +- 6 files changed, 131 insertions(+), 46 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 417da37d8..39e2baf41 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -257,8 +257,8 @@ sha256_process_block64_shaNI: ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI -.section .rodata.cst256.K256, "aM", @progbits, 256 -.balign 16 + .section .rodata.cst256.K256, "aM", @progbits, 256 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -277,8 +277,8 @@ K256: .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 -.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: .octa 0x0c0d0e0f08090a0b0405060700010203 diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index dbf391135..c6c931341 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -253,8 +253,8 @@ sha256_process_block64_shaNI: ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI -.section .rodata.cst256.K256, "aM", @progbits, 256 -.balign 16 + .section .rodata.cst256.K256, "aM", @progbits, 256 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -273,8 +273,8 @@ K256: .long 0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208 .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 -.section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: .octa 0x0c0d0e0f08090a0b0405060700010203 diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 11b855e26..5d082ebfb 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -223,8 +223,8 @@ sha1_process_block64_shaNI: ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI -.section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 47ace60de..e26c46f25 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -180,8 +180,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from 
source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -252,8 +257,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -323,8 +333,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -392,8 +407,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ @@ -457,8 +477,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -522,8 +547,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -588,8 +618,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -653,8 +688,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ @@ -718,8 +758,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -795,8 +840,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -872,8 +922,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -950,8 +1005,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ @@ -1027,8 +1087,13 @@ sha1_process_block64: # PREP %xmm0 %xmm1 %xmm2 %xmm3 -64+16*0(%rsp) movaps %xmm3, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm0, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm1, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm0, %xmm5 + shufps $0x4e, %xmm1, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm2, %xmm0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm0 # ^ @@ -1104,8 +1169,13 @@ sha1_process_block64: # PREP %xmm1 %xmm2 %xmm3 %xmm0 -64+16*1(%rsp) movaps %xmm0, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm1, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm2, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm1, %xmm5 + shufps $0x4e, %xmm2, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm3, %xmm1 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm1 # ^ @@ -1169,8 +1239,13 @@ sha1_process_block64: # PREP %xmm2 %xmm3 %xmm0 %xmm1 -64+16*2(%rsp) movaps %xmm1, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm2, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm3, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps %xmm2, %xmm5 + shufps $0x4e, %xmm3, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm0, %xmm2 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm2 # ^ @@ -1234,8 +1309,13 @@ sha1_process_block64: # PREP %xmm3 %xmm0 %xmm1 %xmm2 -64+16*3(%rsp) movaps %xmm2, %xmm4 psrldq $4, %xmm4 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd $0x4e, %xmm3, %xmm5 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq %xmm0, %xmm5 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! 
+ movaps %xmm3, %xmm5 + shufps $0x4e, %xmm0, %xmm5 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps %xmm1, %xmm3 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps %xmm4, %xmm5 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) xorps %xmm5, %xmm3 # ^ diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index 656fb5414..fb1e4b57e 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -203,8 +203,13 @@ echo "# PREP $@ movaps $xmmW12, $xmmT1 psrldq \$4, $xmmT1 # rshift by 4 bytes: T1 = ([13],[14],[15],0) - pshufd \$0x4e, $xmmW0, $xmmT2 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) - punpcklqdq $xmmW4, $xmmT2 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# pshufd \$0x4e, $xmmW0, $xmmT2 # 01001110=2,3,0,1 shuffle, ([2],[3],x,x) +# punpcklqdq $xmmW4, $xmmT2 # T2 = W4[0..63]:T2[0..63] = ([2],[3],[4],[5]) +# same result as above, but shorter and faster: +# pshufd/shufps are subtly different: pshufd takes all dwords from source operand, +# shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one! + movaps $xmmW0, $xmmT2 + shufps \$0x4e, $xmmW4, $xmmT2 # 01001110=(T2.dw[2], T2.dw[3], W4.dw[0], W4.dw[1]) = ([2],[3],[4],[5]) xorps $xmmW8, $xmmW0 # ([8],[9],[10],[11]) ^ ([0],[1],[2],[3]) xorps $xmmT1, $xmmT2 # ([13],[14],[15],0) ^ ([2],[3],[4],[5]) diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index ba92f09df..8ddec87ce 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -217,8 +217,8 @@ sha1_process_block64_shaNI: ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI -.section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 -.balign 16 + .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: .octa 0x000102030405060708090a0b0c0d0e0f From vda.linux at googlemail.com Tue Feb 8 07:22:17 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 08:22:17 +0100 Subject: [git commit] libbb/sha1: shrink x86 hardware accelerated hashing Message-ID: <20220208073205.50AE1813BB@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=71a1cccaad679bd102f87283f78c581a8fb0e255 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64_shaNI 32-bit 524 517 -7 sha1_process_block64_shaNI 64-bit 510 508 -2 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-32_shaNI.S | 37 +++++++++++++++++-------------------- libbb/hash_md5_sha_x86-64_shaNI.S | 24 ++++++++++++------------ 2 files changed, 29 insertions(+), 32 deletions(-) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 5d082ebfb..0f3fe57ca 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -32,14 +32,10 @@ #define MSG1 %xmm4 #define MSG2 %xmm5 #define MSG3 %xmm6 -#define SHUF_MASK %xmm7 - .balign 8 # allow decoders to fetch at least 3 first insns + .balign 8 # allow decoders to fetch at least 2 first insns sha1_process_block64_shaNI: - pushl %ebp - movl %esp, %ebp - subl $32, %esp - andl $~0xF, %esp # paddd needs aligned memory operand + subl $16, %esp /* load initial hash values */ xor128 E0, E0 @@ -47,30 +43,33 @@ sha1_process_block64_shaNI: pinsrd $3, 76+4*4(%eax), E0 # load to uppermost 32-bit word shuf128_32 $0x1B, ABCD, ABCD # DCBA -> ABCD - mova128 PSHUFFLE_BYTE_FLIP_MASK, SHUF_MASK + mova128 PSHUFFLE_BYTE_FLIP_MASK, %xmm7 + + movu128 0*16(%eax), MSG0 + pshufb 
%xmm7, MSG0 + movu128 1*16(%eax), MSG1 + pshufb %xmm7, MSG1 + movu128 2*16(%eax), MSG2 + pshufb %xmm7, MSG2 + movu128 3*16(%eax), MSG3 + pshufb %xmm7, MSG3 /* Save hash values for addition after rounds */ - movu128 E0, 16(%esp) + movu128 E0, %xmm7 movu128 ABCD, (%esp) /* Rounds 0-3 */ - movu128 0*16(%eax), MSG0 - pshufb SHUF_MASK, MSG0 paddd MSG0, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD /* Rounds 4-7 */ - movu128 1*16(%eax), MSG1 - pshufb SHUF_MASK, MSG1 sha1nexte MSG1, E1 mova128 ABCD, E0 sha1rnds4 $0, E1, ABCD sha1msg1 MSG1, MSG0 /* Rounds 8-11 */ - movu128 2*16(%eax), MSG2 - pshufb SHUF_MASK, MSG2 sha1nexte MSG2, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD @@ -78,8 +77,6 @@ sha1_process_block64_shaNI: xor128 MSG2, MSG0 /* Rounds 12-15 */ - movu128 3*16(%eax), MSG3 - pshufb SHUF_MASK, MSG3 sha1nexte MSG3, E1 mova128 ABCD, E0 sha1msg2 MSG3, MSG0 @@ -210,16 +207,16 @@ sha1_process_block64_shaNI: sha1rnds4 $3, E1, ABCD /* Add current hash values with previously saved */ - sha1nexte 16(%esp), E0 - paddd (%esp), ABCD + sha1nexte %xmm7, E0 + movu128 (%esp), %xmm7 + paddd %xmm7, ABCD /* Write hash values back in the correct order */ shuf128_32 $0x1B, ABCD, ABCD movu128 ABCD, 76(%eax) extr128_32 $3, E0, 76+4*4(%eax) - movl %ebp, %esp - popl %ebp + addl $16, %esp ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index 8ddec87ce..fc2ca92e8 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -32,7 +32,6 @@ #define MSG1 %xmm4 #define MSG2 %xmm5 #define MSG3 %xmm6 -#define SHUF_MASK %xmm7 .balign 8 # allow decoders to fetch at least 2 first insns sha1_process_block64_shaNI: @@ -43,30 +42,33 @@ sha1_process_block64_shaNI: pinsrd $3, 80+4*4(%rdi), E0 # load to uppermost 32-bit word shuf128_32 $0x1B, ABCD, ABCD # DCBA -> ABCD - mova128 PSHUFFLE_BYTE_FLIP_MASK(%rip), SHUF_MASK + mova128 PSHUFFLE_BYTE_FLIP_MASK(%rip), %xmm7 + + movu128 0*16(%rdi), MSG0 + pshufb %xmm7, MSG0 + movu128 1*16(%rdi), MSG1 + pshufb %xmm7, MSG1 + movu128 2*16(%rdi), MSG2 + pshufb %xmm7, MSG2 + movu128 3*16(%rdi), MSG3 + pshufb %xmm7, MSG3 /* Save hash values for addition after rounds */ - mova128 E0, %xmm9 + mova128 E0, %xmm7 mova128 ABCD, %xmm8 /* Rounds 0-3 */ - movu128 0*16(%rdi), MSG0 - pshufb SHUF_MASK, MSG0 paddd MSG0, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD /* Rounds 4-7 */ - movu128 1*16(%rdi), MSG1 - pshufb SHUF_MASK, MSG1 sha1nexte MSG1, E1 mova128 ABCD, E0 sha1rnds4 $0, E1, ABCD sha1msg1 MSG1, MSG0 /* Rounds 8-11 */ - movu128 2*16(%rdi), MSG2 - pshufb SHUF_MASK, MSG2 sha1nexte MSG2, E0 mova128 ABCD, E1 sha1rnds4 $0, E0, ABCD @@ -74,8 +76,6 @@ sha1_process_block64_shaNI: xor128 MSG2, MSG0 /* Rounds 12-15 */ - movu128 3*16(%rdi), MSG3 - pshufb SHUF_MASK, MSG3 sha1nexte MSG3, E1 mova128 ABCD, E0 sha1msg2 MSG3, MSG0 @@ -206,7 +206,7 @@ sha1_process_block64_shaNI: sha1rnds4 $3, E1, ABCD /* Add current hash values with previously saved */ - sha1nexte %xmm9, E0 + sha1nexte %xmm7, E0 paddd %xmm8, ABCD /* Write hash values back in the correct order */ From vda.linux at googlemail.com Tue Feb 8 14:23:26 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 15:23:26 +0100 Subject: [git commit] libbb/sha1: shrink x86 hardware accelerated hashing (32-bit) Message-ID: <20220208141656.F168E829FC@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=eb52e7fa522d829fb400461ca4c808ee5c1d6428 branch: 
https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64_shaNI 517 511 -6 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-32_shaNI.S | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index 0f3fe57ca..ad814a21b 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -35,11 +35,9 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha1_process_block64_shaNI: - subl $16, %esp - /* load initial hash values */ - xor128 E0, E0 movu128 76(%eax), ABCD + xor128 E0, E0 pinsrd $3, 76+4*4(%eax), E0 # load to uppermost 32-bit word shuf128_32 $0x1B, ABCD, ABCD # DCBA -> ABCD @@ -56,7 +54,7 @@ sha1_process_block64_shaNI: /* Save hash values for addition after rounds */ movu128 E0, %xmm7 - movu128 ABCD, (%esp) + /*movu128 ABCD, %xmm8 - NOPE, 32bit has no xmm8 */ /* Rounds 0-3 */ paddd MSG0, E0 @@ -208,7 +206,9 @@ sha1_process_block64_shaNI: /* Add current hash values with previously saved */ sha1nexte %xmm7, E0 - movu128 (%esp), %xmm7 + /*paddd %xmm8, ABCD - 32-bit mode has no xmm8 */ + movu128 76(%eax), %xmm7 # recreate original ABCD + shuf128_32 $0x1B, %xmm7, %xmm7 # DCBA -> ABCD paddd %xmm7, ABCD /* Write hash values back in the correct order */ @@ -216,7 +216,6 @@ sha1_process_block64_shaNI: movu128 ABCD, 76(%eax) extr128_32 $3, E0, 76+4*4(%eax) - addl $16, %esp ret .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI From vda.linux at googlemail.com Tue Feb 8 14:34:02 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Tue, 8 Feb 2022 15:34:02 +0100 Subject: [git commit] libbb/sha1: shrink x86 hardware accelerated hashing (32-bit) Message-ID: <20220208142920.E0AE082B5D@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=eb8d5f3b8f3c91f3ed82a52b4ce52a154c146ede branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64_shaNI 511 507 -4 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-32_shaNI.S | 9 ++++----- libbb/hash_md5_sha_x86-64_shaNI.S | 3 +-- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index ad814a21b..a61b3cbed 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -53,8 +53,8 @@ sha1_process_block64_shaNI: pshufb %xmm7, MSG3 /* Save hash values for addition after rounds */ - movu128 E0, %xmm7 - /*movu128 ABCD, %xmm8 - NOPE, 32bit has no xmm8 */ + mova128 E0, %xmm7 + /*mova128 ABCD, %xmm8 - NOPE, 32bit has no xmm8 */ /* Rounds 0-3 */ paddd MSG0, E0 @@ -207,12 +207,11 @@ sha1_process_block64_shaNI: /* Add current hash values with previously saved */ sha1nexte %xmm7, E0 /*paddd %xmm8, ABCD - 32-bit mode has no xmm8 */ - movu128 76(%eax), %xmm7 # recreate original ABCD - shuf128_32 $0x1B, %xmm7, %xmm7 # DCBA -> ABCD - paddd %xmm7, ABCD + movu128 76(%eax), %xmm7 # get original ABCD (not shuffled)... 
 	/* Write hash values back in the correct order */
 	shuf128_32	$0x1B, ABCD, ABCD
+	paddd	%xmm7, ABCD	# ...add it to final ABCD
 	movu128	ABCD, 76(%eax)
 	extr128_32	$3, E0, 76+4*4(%eax)

diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S
index fc2ca92e8..b32029360 100644
--- a/libbb/hash_md5_sha_x86-64_shaNI.S
+++ b/libbb/hash_md5_sha_x86-64_shaNI.S
@@ -36,9 +36,8 @@
 	.balign	8	# allow decoders to fetch at least 2 first insns
 sha1_process_block64_shaNI:
 	/* load initial hash values */
-
-	xor128	E0, E0
 	movu128	80(%rdi), ABCD
+	xor128	E0, E0
 	pinsrd	$3, 80+4*4(%rdi), E0	# load to uppermost 32-bit word
 	shuf128_32	$0x1B, ABCD, ABCD	# DCBA -> ABCD

From bugzilla at busybox.net  Tue Feb 8 15:48:37 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 08 Feb 2022 15:48:37 +0000
Subject: [Bug 14571] New: ash crashes with fork (&) and stty -echo
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14571

            Bug ID: 14571
           Summary: ash crashes with fork (&) and stty -echo
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
          Assignee: unassigned at busybox.net
          Reporter: cyrilbur at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

Setting -echo and leaving a dangling fork results in an ash crash.

I have a relatively stripped-down busybox; I am using the busybox coreutils.

Reproduce:

stty -echo
sleep 1 &
ps &
   ^ This is the problem: ash will crash.

I have done the same thing with bash and dash, and neither crashes.

If I have time I will endeavour to get more information.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net  Tue Feb 8 18:41:34 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Tue, 08 Feb 2022 18:41:34 +0000
Subject: [Bug 14576] New: unzip: test skipped with bad archive
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14576

            Bug ID: 14576
           Summary: unzip: test skipped with bad archive
           Product: Busybox
           Version: 1.33.x
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: major
          Priority: P5
         Component: Standard Compliance
          Assignee: unassigned at busybox.net
          Reporter: dharanendiran at gmail.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

When I run the testsuite, the unzip (bad archive) test was skipped. Is the
expected result always "skipped" here, or does this test require further
components to succeed?

# ./runtest -v unzip
======================
echo -ne '' >input
echo -ne '' | unzip -q foo.zip foo/ && test -d foo && test ! -f foo/bar && echo yes
PASS: unzip (subdir only)
SKIPPED: unzip (bad archive)
======================
echo -ne '' >input
echo -ne '' | unzip -p ../unzip_bad_lzma_1.zip 2>&1; echo $?
PASS: unzip (archive with corrupted lzma 1)
======================
echo -ne '' >input
echo -ne '' | unzip -p ../unzip_bad_lzma_2.zip 2>&1; echo $?
PASS: unzip (archive with corrupted lzma 2)
#

The following config options are enabled in busybox:
FEATURE_UNZIP_CDF
CONFIG_UNICODE_SUPPORT
UUDECODE

-- 
You are receiving this mail because:
You are on the CC list for the bug.
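
The pshufd/shufps subtlety called out repeatedly in the SHA-1 commits above
is easy to verify from C. The following is a minimal standalone sketch, not
from the BusyBox tree; the names w0/w4 merely mirror the W[] vectors in the
commit comments. It shows that movaps+shufps $0x4e yields the same
([2],[3],[4],[5]) result as the longer pshufd+punpcklqdq pair, and how
pshufd with immediates 0x00/0x55/0xaa/0xff broadcasts one round constant
out of a packed sha1const-style vector:

/*
 * shuf_demo.c - standalone demo, not BusyBox code.
 * Build: cc -O2 -msse2 shuf_demo.c
 */
#include <stdio.h>
#include <emmintrin.h>	/* SSE2 intrinsics: pshufd, punpcklqdq, shufps */

static void print128(const char *tag, __m128i v)
{
	unsigned d[4];
	_mm_storeu_si128((__m128i *)d, v);
	printf("%-19s %08x %08x %08x %08x\n", tag, d[0], d[1], d[2], d[3]);
}

int main(void)
{
	__m128i w0 = _mm_setr_epi32(0, 1, 2, 3);	/* dwords [0],[1],[2],[3] */
	__m128i w4 = _mm_setr_epi32(4, 5, 6, 7);	/* dwords [4],[5],[6],[7] */

	/* Long form: pshufd $0x4e, then punpcklqdq */
	__m128i t = _mm_shuffle_epi32(w0, 0x4e);	/* ([2],[3],[0],[1]) */
	t = _mm_unpacklo_epi64(t, w4);			/* ([2],[3],[4],[5]) */

	/* Short form: shufps $0x4e. Result dwords 0,1 come from the
	 * destination operand, dwords 2,3 from the source operand. */
	__m128i s = _mm_castps_si128(_mm_shuffle_ps(
			_mm_castsi128_ps(w0), _mm_castsi128_ps(w4), 0x4e));

	print128("pshufd+punpcklqdq:", t);	/* 00000002 00000003 00000004 00000005 */
	print128("shufps:", s);			/* same */

	/* pshufd as a broadcast: pull round constant 1 (0x6ED9EBA1) out of
	 * a packed vector of all four SHA-1 constants, as pshufd $0x55 does
	 * with sha1const in the commit above. */
	__m128i k = _mm_setr_epi32(0x5A827999, 0x6ED9EBA1,
				(int)0x8F1BBCDC, (int)0xCA62C1D6);
	print128("pshufd $0x55:", _mm_shuffle_epi32(k, 0x55));
	return 0;
}

If built and run, both forms should print the same 2,3,4,5 pattern; the
shufps form needs one instruction fewer per PREP step, which is consistent
with the size deltas reported in those commits.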
From vda.linux at googlemail.com Wed Feb 9 00:30:23 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 01:30:23 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220209003846.5A4608148C@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=461a994b09c5022b93bccccf903b39438d61bbf1 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 697 676 -21 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index a849dfcc2..846230e3e 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -34,16 +34,18 @@ #define XMMTMP %xmm7 +#define SHUF(a,b,c,d) $(a+(b<<2)+(c<<4)+(d<<6)) + .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 76+0*16(%eax), STATE0 - movu128 76+1*16(%eax), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + movu128 76+0*16(%eax), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 76+1*16(%eax), STATE0 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE0, XMMTMP - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ + shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ + mova128 XMMTMP, STATE1 /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP @@ -231,18 +233,19 @@ sha256_process_block64_shaNI: sha256rnds2 STATE1, STATE0 /* Write hash values back in the correct order */ - shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ + /* STATE1: CDGH */ mova128 STATE0, XMMTMP - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP, STATE1 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ /* add current hash values to previous ones */ + movu128 76+1*16(%eax), STATE1 + paddd XMMTMP, STATE1 + movu128 STATE1, 76+1*16(%eax) movu128 76+0*16(%eax), XMMTMP paddd XMMTMP, STATE0 - movu128 76+1*16(%eax), XMMTMP movu128 STATE0, 76+0*16(%eax) - paddd XMMTMP, STATE1 - movu128 STATE1, 76+1*16(%eax) ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI From vda.linux at googlemail.com Tue Feb 8 23:33:39 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 00:33:39 +0100 Subject: [git commit] libbb/sha256: code shrink in 32-bit x86 Message-ID: <20220209003846.4FB3582DFD@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=c0ff0d4528d718c20b9ca2290bd10d59e9f794a3 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 713 697 -16 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 130 ++++++++++++++++------------------- libbb/hash_md5_sha256_x86-64_shaNI.S | 107 ++++++++++++++-------------- 2 files changed, 114 insertions(+), 123 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 39e2baf41..a849dfcc2 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ 
b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -31,35 +31,27 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define XMMTMP4 %xmm7 - .balign 8 # allow decoders to fetch at least 3 first insns -sha256_process_block64_shaNI: - pushl %ebp - movl %esp, %ebp - subl $32, %esp - andl $~0xF, %esp # paddd needs aligned memory operand +#define XMMTMP %xmm7 + .balign 8 # allow decoders to fetch at least 2 first insns +sha256_process_block64_shaNI: movu128 76+0*16(%eax), STATE0 movu128 76+1*16(%eax), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, XMMTMP4 - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, XMMTMP + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ -/* XMMTMP4 holds flip mask from here... */ - mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP4 +/* XMMTMP holds flip mask from here... */ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP movl $K256+8*16, SHA256CONSTANTS - /* Save hash values for addition after rounds */ - mova128 STATE0, 0*16(%esp) - mova128 STATE1, 1*16(%esp) - /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -68,7 +60,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -78,7 +70,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -88,14 +80,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG /* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -105,9 +97,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -117,9 +109,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -129,9 +121,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -141,9 +133,9 @@ sha256_process_block64_shaNI: mova128 
MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -153,9 +145,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -165,9 +157,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -177,9 +169,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -189,9 +181,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -201,9 +193,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -213,9 +205,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -224,9 +216,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -238,22 +230,20 @@ sha256_process_block64_shaNI: shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 - /* Add current hash values with previously saved */ - paddd 0*16(%esp), STATE0 - paddd 1*16(%esp), STATE1 - /* Write hash values back in the correct order */ - shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, XMMTMP4 - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP4, STATE1 /* HGFE */ - + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG 
*/ + mova128 STATE0, XMMTMP + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, XMMTMP, STATE1 /* HGFE */ + /* add current hash values to previous ones */ + movu128 76+0*16(%eax), XMMTMP + paddd XMMTMP, STATE0 + movu128 76+1*16(%eax), XMMTMP movu128 STATE0, 76+0*16(%eax) + paddd XMMTMP, STATE1 movu128 STATE1, 76+1*16(%eax) - movl %ebp, %esp - popl %ebp ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index c6c931341..b5c950a9a 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -31,7 +31,8 @@ #define MSGTMP1 %xmm4 #define MSGTMP2 %xmm5 #define MSGTMP3 %xmm6 -#define XMMTMP4 %xmm7 + +#define XMMTMP %xmm7 #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 @@ -41,14 +42,14 @@ sha256_process_block64_shaNI: movu128 80+0*16(%rdi), STATE0 movu128 80+1*16(%rdi), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ - mova128 STATE0, XMMTMP4 - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP4, STATE1 /* CDGH */ + shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ + shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + mova128 STATE0, XMMTMP + palignr $8, STATE1, STATE0 /* ABEF */ + pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ -/* XMMTMP4 holds flip mask from here... */ - mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP4 +/* XMMTMP holds flip mask from here... */ + mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP leaq K256+8*16(%rip), SHA256CONSTANTS /* Save hash values for addition after rounds */ @@ -57,7 +58,7 @@ sha256_process_block64_shaNI: /* Rounds 0-3 */ movu128 0*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP0 paddd 0*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -66,7 +67,7 @@ sha256_process_block64_shaNI: /* Rounds 4-7 */ movu128 1*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP1 paddd 1*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -76,7 +77,7 @@ sha256_process_block64_shaNI: /* Rounds 8-11 */ movu128 2*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG mova128 MSG, MSGTMP2 paddd 2*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 @@ -86,14 +87,14 @@ sha256_process_block64_shaNI: /* Rounds 12-15 */ movu128 3*16(DATA_PTR), MSG - pshufb XMMTMP4, MSG + pshufb XMMTMP, MSG /* ...to here */ mova128 MSG, MSGTMP3 paddd 3*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -103,9 +104,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 4*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -115,9 +116,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 5*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -127,9 +128,9 @@ 
sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 6*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -139,9 +140,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 7*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -151,9 +152,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 8*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -163,9 +164,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 9*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -175,9 +176,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 10*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -187,9 +188,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP3, MSG paddd 11*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP3, XMMTMP4 - palignr $4, MSGTMP2, XMMTMP4 - paddd XMMTMP4, MSGTMP0 + mova128 MSGTMP3, XMMTMP + palignr $4, MSGTMP2, XMMTMP + paddd XMMTMP, MSGTMP0 sha256msg2 MSGTMP3, MSGTMP0 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -199,9 +200,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP0, MSG paddd 12*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP0, XMMTMP4 - palignr $4, MSGTMP3, XMMTMP4 - paddd XMMTMP4, MSGTMP1 + mova128 MSGTMP0, XMMTMP + palignr $4, MSGTMP3, XMMTMP + paddd XMMTMP, MSGTMP1 sha256msg2 MSGTMP0, MSGTMP1 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -211,9 +212,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP1, MSG paddd 13*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP1, XMMTMP4 - palignr $4, MSGTMP0, XMMTMP4 - paddd XMMTMP4, MSGTMP2 + mova128 MSGTMP1, XMMTMP + palignr $4, MSGTMP0, XMMTMP + paddd XMMTMP, MSGTMP2 sha256msg2 MSGTMP1, MSGTMP2 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -222,9 +223,9 @@ sha256_process_block64_shaNI: mova128 MSGTMP2, MSG paddd 14*16-8*16(SHA256CONSTANTS), MSG sha256rnds2 STATE0, STATE1 - mova128 MSGTMP2, XMMTMP4 - palignr $4, MSGTMP1, XMMTMP4 - paddd XMMTMP4, MSGTMP3 + mova128 MSGTMP2, XMMTMP + palignr $4, MSGTMP1, XMMTMP + paddd XMMTMP, MSGTMP3 sha256msg2 MSGTMP2, MSGTMP3 shuf128_32 $0x0E, MSG, MSG sha256rnds2 STATE1, STATE0 @@ -241,11 +242,11 @@ sha256_process_block64_shaNI: paddd CDGH_SAVE, STATE1 /* Write hash values back in the correct order */ - shuf128_32 
$0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ - mova128 STATE0, XMMTMP4 - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP4, STATE1 /* HGFE */ + shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ + shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + mova128 STATE0, XMMTMP + pblendw $0xF0, STATE1, STATE0 /* DCBA */ + palignr $8, XMMTMP, STATE1 /* HGFE */ movu128 STATE0, 80+0*16(%rdi) movu128 STATE1, 80+1*16(%rdi) From vda.linux at googlemail.com Wed Feb 9 00:42:49 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 01:42:49 +0100 Subject: [git commit] libbb/sha256: code shrink in 64-bit x86 Message-ID: <20220209003846.64EAB8315B@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=11bcea7ac0ac4b2156c1b2d53f926d789b9792b4 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 701 680 -21 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-64_shaNI.S | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index b5c950a9a..bc063b9cc 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -37,16 +37,18 @@ #define ABEF_SAVE %xmm9 #define CDGH_SAVE %xmm10 +#define SHUF(a,b,c,d) $(a+(b<<2)+(c<<4)+(d<<6)) + .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 80+0*16(%rdi), STATE0 - movu128 80+1*16(%rdi), STATE1 - shuf128_32 $0xB1, STATE0, STATE0 /* CDAB */ - shuf128_32 $0x1B, STATE1, STATE1 /* EFGH */ + movu128 80+0*16(%rdi), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 80+1*16(%rdi), STATE0 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE0, XMMTMP - palignr $8, STATE1, STATE0 /* ABEF */ - pblendw $0xF0, XMMTMP, STATE1 /* CDGH */ + shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ + mova128 XMMTMP, STATE1 /* XMMTMP holds flip mask from here... 
*/ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP @@ -242,14 +244,15 @@ sha256_process_block64_shaNI: paddd CDGH_SAVE, STATE1 /* Write hash values back in the correct order */ - shuf128_32 $0x1B, STATE0, STATE0 /* FEBA */ - shuf128_32 $0xB1, STATE1, STATE1 /* DCHG */ + /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ + /* STATE1: CDGH */ mova128 STATE0, XMMTMP - pblendw $0xF0, STATE1, STATE0 /* DCBA */ - palignr $8, XMMTMP, STATE1 /* HGFE */ +/* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ movu128 STATE0, 80+0*16(%rdi) - movu128 STATE1, 80+1*16(%rdi) + movu128 XMMTMP, 80+1*16(%rdi) ret .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI From vda.linux at googlemail.com Wed Feb 9 00:50:22 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 01:50:22 +0100 Subject: [git commit] libbb/sha256: code shrink in x86 assembly Message-ID: <20220209004444.C2B4182B8E@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=caa9c4f707b661cf398f2c2d66f54f5b0d8adfe2 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha256_process_block64_shaNI 32-bit 676 673 -3 sha256_process_block64_shaNI 64-bit 680 677 -3 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 11 +++++------ libbb/hash_md5_sha256_x86-64_shaNI.S | 11 +++++------ 2 files changed, 10 insertions(+), 12 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 846230e3e..aa68193bd 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -39,13 +39,12 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 76+0*16(%eax), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 76+1*16(%eax), STATE0 /* HGFE */ + movu128 76+0*16(%eax), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 76+1*16(%eax), STATE1 /* HGFE */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - mova128 STATE0, XMMTMP - shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ - mova128 XMMTMP, STATE1 + mova128 STATE1, STATE0 + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index bc063b9cc..4663f750a 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -42,13 +42,12 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 80+0*16(%rdi), STATE1 /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 80+1*16(%rdi), STATE0 /* HGFE */ + movu128 80+0*16(%rdi), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ + movu128 80+1*16(%rdi), STATE1 /* HGFE */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - mova128 STATE0, XMMTMP - shufps SHUF(1,0,1,0), STATE1, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), STATE1, XMMTMP /* CDGH */ - mova128 XMMTMP, STATE1 + mova128 STATE1, STATE0 + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ /* XMMTMP holds flip mask from here... 
*/ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP From vda.linux at googlemail.com Wed Feb 9 10:29:23 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Wed, 9 Feb 2022 11:29:23 +0100 Subject: [git commit] whitespace fix Message-ID: <20220209102223.E965181D55@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=6a6c1c0ea91edeeb18736190feb5a7278d3d1141 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 6 +++--- libbb/hash_md5_sha256_x86-64_shaNI.S | 6 +++--- libbb/hash_md5_sha_x86-32_shaNI.S | 4 ++-- libbb/hash_md5_sha_x86-64_shaNI.S | 4 ++-- 4 files changed, 10 insertions(+), 10 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index aa68193bd..413e2df9e 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -250,7 +250,7 @@ sha256_process_block64_shaNI: .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI .section .rodata.cst256.K256, "aM", @progbits, 256 - .balign 16 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -270,8 +270,8 @@ K256: .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: - .octa 0x0c0d0e0f08090a0b0405060700010203 + .octa 0x0c0d0e0f08090a0b0405060700010203 #endif diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index 4663f750a..c246762aa 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -257,7 +257,7 @@ sha256_process_block64_shaNI: .size sha256_process_block64_shaNI, .-sha256_process_block64_shaNI .section .rodata.cst256.K256, "aM", @progbits, 256 - .balign 16 + .balign 16 K256: .long 0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5 .long 0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5 @@ -277,8 +277,8 @@ K256: .long 0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2 .section .rodata.cst16.PSHUFFLE_BSWAP32_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BSWAP32_FLIP_MASK: - .octa 0x0c0d0e0f08090a0b0405060700010203 + .octa 0x0c0d0e0f08090a0b0405060700010203 #endif diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index a61b3cbed..afca98a62 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -219,8 +219,8 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: - .octa 0x000102030405060708090a0b0c0d0e0f + .octa 0x000102030405060708090a0b0c0d0e0f #endif diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index b32029360..54d122788 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -217,8 +217,8 @@ sha1_process_block64_shaNI: .size sha1_process_block64_shaNI, .-sha1_process_block64_shaNI .section .rodata.cst16.PSHUFFLE_BYTE_FLIP_MASK, "aM", @progbits, 16 - .balign 16 + .balign 16 PSHUFFLE_BYTE_FLIP_MASK: - .octa 0x000102030405060708090a0b0c0d0e0f + .octa 0x000102030405060708090a0b0c0d0e0f #endif From vda.linux at googlemail.com Thu Feb 10 14:38:10 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Thu, 10 Feb 2022 15:38:10 +0100 Subject: [git commit] libbb/sha: improve 
comments Message-ID: <20220210143100.BAFC48142B@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=6f56fa17131b3cbb84e887c6c5fb202f2492169e branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 18 +++++++++--------- libbb/hash_md5_sha256_x86-64_shaNI.S | 19 +++++++++---------- libbb/hash_md5_sha_x86-32_shaNI.S | 2 +- libbb/hash_md5_sha_x86-64_shaNI.S | 2 +- 4 files changed, 20 insertions(+), 21 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 413e2df9e..4b33449d4 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). @@ -39,12 +39,13 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 76+0*16(%eax), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 76+1*16(%eax), STATE1 /* HGFE */ + movu128 76+0*16(%eax), XMMTMP /* ABCD (little-endian dword order) */ + movu128 76+1*16(%eax), STATE1 /* EFGH */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE1, STATE0 - shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ + /* --- -------------- ABCD -- EFGH */ + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* FEBA */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* HGDC */ /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK, XMMTMP @@ -232,12 +233,11 @@ sha256_process_block64_shaNI: sha256rnds2 STATE1, STATE0 /* Write hash values back in the correct order */ - /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ - /* STATE1: CDGH */ mova128 STATE0, XMMTMP /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ - shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ + /* --- -------------- HGDC -- FEBA */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* ABCD */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* EFGH */ /* add current hash values to previous ones */ movu128 76+1*16(%eax), STATE1 paddd XMMTMP, STATE1 diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index c246762aa..5ed80c2ef 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). 
@@ -42,12 +42,13 @@ .balign 8 # allow decoders to fetch at least 2 first insns sha256_process_block64_shaNI: - movu128 80+0*16(%rdi), XMMTMP /* DCBA (msb-to-lsb: 3,2,1,0) */ - movu128 80+1*16(%rdi), STATE1 /* HGFE */ + movu128 80+0*16(%rdi), XMMTMP /* ABCD (little-endian dword order) */ + movu128 80+1*16(%rdi), STATE1 /* EFGH */ /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ mova128 STATE1, STATE0 - shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* ABEF */ - shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* CDGH */ + /* --- -------------- ABCD -- EFGH */ + shufps SHUF(1,0,1,0), XMMTMP, STATE0 /* FEBA */ + shufps SHUF(3,2,3,2), XMMTMP, STATE1 /* HGDC */ /* XMMTMP holds flip mask from here... */ mova128 PSHUFFLE_BSWAP32_FLIP_MASK(%rip), XMMTMP @@ -243,13 +244,11 @@ sha256_process_block64_shaNI: paddd CDGH_SAVE, STATE1 /* Write hash values back in the correct order */ - /* STATE0: ABEF (msb-to-lsb: 3,2,1,0) */ - /* STATE1: CDGH */ mova128 STATE0, XMMTMP /* shufps takes dwords 0,1 from *2nd* operand, and dwords 2,3 from 1st one */ - shufps SHUF(3,2,3,2), STATE1, STATE0 /* DCBA */ - shufps SHUF(1,0,1,0), STATE1, XMMTMP /* HGFE */ - + /* --- -------------- HGDC -- FEBA */ + shufps SHUF(3,2,3,2), STATE1, STATE0 /* ABCD */ + shufps SHUF(1,0,1,0), STATE1, XMMTMP /* EFGH */ movu128 STATE0, 80+0*16(%rdi) movu128 XMMTMP, 80+1*16(%rdi) diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index afca98a62..c7fb243ce 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index 54d122788..c13cdec07 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -4,7 +4,7 @@ // We use shorter insns, even though they are for "wrong" // data type (fp, not int). // For Intel, there is no penalty for doing it at all -// (CPUs which do have such penalty do not support SHA1 insns). +// (CPUs which do have such penalty do not support SHA insns). // For AMD, the penalty is one extra cycle // (allegedly: I failed to find measurable difference). 
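The SHUF() macro these patches rely on packs four 2-bit dword selectors into a shufps immediate; in AT&T syntax, shufps fills result dwords 0,1 from its destination (second) operand and dwords 2,3 from its source (first) operand, exactly as the added comments say. A small self-contained C model (an illustrative sketch, not code from the busybox tree) reproduces the two shufps at function entry that turn the ABCD/EFGH hash layout into the layout sha256rnds2 expects:

#include <stdio.h>
#include <stdint.h>

/* Model of AT&T-syntax "shufps $SHUF(a,b,c,d), src, dst":
 * result dwords 0,1 are picked from dst, dwords 2,3 from src.
 * Illustrative sketch mirroring the SHUF() macro above.
 */
#define SHUF(a,b,c,d) ((a) + ((b)<<2) + ((c)<<4) + ((d)<<6))

typedef struct { uint32_t d[4]; } xmm; /* d[0] is the least significant dword */

static xmm shufps(int imm, xmm src, xmm dst)
{
	xmm r;
	r.d[0] = dst.d[ imm       & 3];
	r.d[1] = dst.d[(imm >> 2) & 3];
	r.d[2] = src.d[(imm >> 4) & 3];
	r.d[3] = src.d[(imm >> 6) & 3];
	return r;
}

int main(void)
{
	/* movu128 loads hash[0..3] and hash[4..7] in A,B,C,D / E,F,G,H dword order */
	xmm abcd = {{ 'A', 'B', 'C', 'D' }};
	xmm efgh = {{ 'E', 'F', 'G', 'H' }};
	xmm state0 = shufps(SHUF(1,0,1,0), abcd, efgh); /* dwords 0..3 = F,E,B,A */
	xmm state1 = shufps(SHUF(3,2,3,2), abcd, efgh); /* dwords 0..3 = H,G,D,C */
	int i;
	for (i = 3; i >= 0; i--) putchar(state0.d[i]); /* prints ABEF (msb-to-lsb) */
	putchar(' ');
	for (i = 3; i >= 0; i--) putchar(state1.d[i]); /* prints CDGH */
	putchar('\n');
	return 0;
}

Run through this model, SHUF(1,0,1,0) and SHUF(3,2,3,2) yield exactly the FEBA and HGDC dword orderings the new comments describe, i.e. STATE0 = ABEF and STATE1 = CDGH read msb-to-lsb.
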
From vda.linux at googlemail.com Fri Feb 11 05:08:27 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Fri, 11 Feb 2022 06:08:27 +0100 Subject: [git commit] libbb/sha1: shrink unrolled x86-64 code Message-ID: <20220211050806.E034782212@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=8154146be491bc66ab34d5d5f2a2466ddbdcff52 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master function old new delta sha1_process_block64 3481 3384 -97 Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-64.S | 129 ++++++++++++++++++++--------------------- libbb/hash_md5_sha_x86-64.S.sh | 111 +++++++++++++++++------------------ 2 files changed, 117 insertions(+), 123 deletions(-) diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 287cfe547..51fde082a 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -27,68 +27,60 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps sha1const(%rip), %xmm7 + movaps bswap32_mask(%rip), %xmm4 pshufd $0x00, %xmm7, %xmm6 - # Load W[] to xmm registers, byteswapping on the fly. + # Load W[] to xmm0..3, byteswapping on the fly. # - # For iterations 0..15, we pass W[] in rsi,r8..r14 + # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. - # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1As shorter by one byte). - movq 4*0(%rdi), %rsi - movq 4*2(%rdi), %r8 - bswapq %rsi - bswapq %r8 - rolq $32, %rsi # rsi = W[1]:W[0] - rolq $32, %r8 # r8 = W[3]:W[2] - movq %rsi, %xmm0 - movq %r8, %xmm4 - punpcklqdq %xmm4, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) -# movaps %xmm0, %xmm4 # add RCONST, spill to stack -# paddd %xmm6, %xmm4 -# movups %xmm4, -64+16*0(%rsp) + # ADDs in two first RD1As shorter by one byte). 
+ movups 16*0(%rdi), %xmm0 + pshufb %xmm4, %xmm0 + movaps %xmm0, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %rsi +# pextrq $1, %xmm5, %r8 #SSE4.1 insn +# movhpd %xmm5, %r8 #can only move to mem, not to reg + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r8 + + movups 16*1(%rdi), %xmm1 + pshufb %xmm4, %xmm1 + movaps %xmm1, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %r9 + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r10 - movq 4*4(%rdi), %r9 - movq 4*6(%rdi), %r10 - bswapq %r9 - bswapq %r10 - rolq $32, %r9 # r9 = W[5]:W[4] - rolq $32, %r10 # r10 = W[7]:W[6] - movq %r9, %xmm1 - movq %r10, %xmm4 - punpcklqdq %xmm4, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) + movups 16*2(%rdi), %xmm2 + pshufb %xmm4, %xmm2 + movaps %xmm2, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %r11 + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r12 - movq 4*8(%rdi), %r11 - movq 4*10(%rdi), %r12 - bswapq %r11 - bswapq %r12 - rolq $32, %r11 # r11 = W[9]:W[8] - rolq $32, %r12 # r12 = W[11]:W[10] - movq %r11, %xmm2 - movq %r12, %xmm4 - punpcklqdq %xmm4, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) + movups 16*3(%rdi), %xmm3 + pshufb %xmm4, %xmm3 + movaps %xmm3, %xmm5 + paddd %xmm6, %xmm5 + movq %xmm5, %r13 + shufps $0x0e, %xmm5, %xmm5 + movq %xmm5, %r14 - movq 4*12(%rdi), %r13 - movq 4*14(%rdi), %r14 - bswapq %r13 - bswapq %r14 - rolq $32, %r13 # r13 = W[13]:W[12] - rolq $32, %r14 # r14 = W[15]:W[14] - movq %r13, %xmm3 - movq %r14, %xmm4 - punpcklqdq %xmm4, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) + # MOVQs to GPRs (above) have somewhat high latency. + # Load hash[] while they are completing: + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] # 0 - leal 0x5A827999(%rbp,%rsi), %ebp # e += RCONST + W[n] + addl %esi, %ebp # e += RCONST + W[n] shrq $32, %rsi movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -100,7 +92,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 1 - leal 0x5A827999(%rdx,%rsi), %edx # e += RCONST + W[n] + addl %esi, %edx # e += RCONST + W[n] movl %ebx, %edi # c xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -111,7 +103,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 2 - leal 0x5A827999(%rcx,%r8), %ecx # e += RCONST + W[n] + addl %r8d, %ecx # e += RCONST + W[n] shrq $32, %r8 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -123,7 +115,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 3 - leal 0x5A827999(%rbx,%r8), %ebx # e += RCONST + W[n] + addl %r8d, %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -134,7 +126,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 4 - leal 0x5A827999(%rax,%r9), %eax # e += RCONST + W[n] + addl %r9d, %eax # e += RCONST + W[n] shrq $32, %r9 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -146,7 +138,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 5 - leal 0x5A827999(%rbp,%r9), %ebp # e += RCONST + W[n] + addl %r9d, %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -157,7 +149,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 6 - leal 0x5A827999(%rdx,%r10), %edx # e += RCONST + W[n] + addl %r10d, %edx # e += RCONST + W[n] shrq $32, %r10 movl %ebx, %edi # c xorl %ecx, %edi # ^d @@ -169,7 
+161,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 7 - leal 0x5A827999(%rcx,%r10), %ecx # e += RCONST + W[n] + addl %r10d, %ecx # e += RCONST + W[n] movl %eax, %edi # c xorl %ebx, %edi # ^d andl %ebp, %edi # &b @@ -210,7 +202,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*0(%rsp) # 8 - leal 0x5A827999(%rbx,%r11), %ebx # e += RCONST + W[n] + addl %r11d, %ebx # e += RCONST + W[n] shrq $32, %r11 movl %ebp, %edi # c xorl %eax, %edi # ^d @@ -222,7 +214,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 9 - leal 0x5A827999(%rax,%r11), %eax # e += RCONST + W[n] + addl %r11d, %eax # e += RCONST + W[n] movl %edx, %edi # c xorl %ebp, %edi # ^d andl %ecx, %edi # &b @@ -233,7 +225,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 10 - leal 0x5A827999(%rbp,%r12), %ebp # e += RCONST + W[n] + addl %r12d, %ebp # e += RCONST + W[n] shrq $32, %r12 movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -245,7 +237,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 11 - leal 0x5A827999(%rdx,%r12), %edx # e += RCONST + W[n] + addl %r12d, %edx # e += RCONST + W[n] movl %ebx, %edi # c xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -287,7 +279,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*1(%rsp) # 12 - leal 0x5A827999(%rcx,%r13), %ecx # e += RCONST + W[n] + addl %r13d, %ecx # e += RCONST + W[n] shrq $32, %r13 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -299,7 +291,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 13 - leal 0x5A827999(%rbx,%r13), %ebx # e += RCONST + W[n] + addl %r13d, %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -310,7 +302,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 14 - leal 0x5A827999(%rax,%r14), %eax # e += RCONST + W[n] + addl %r14d, %eax # e += RCONST + W[n] shrq $32, %r14 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -322,7 +314,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 15 - leal 0x5A827999(%rbp,%r14), %ebp # e += RCONST + W[n] + addl %r14d, %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -1475,6 +1467,11 @@ sha1_process_block64: ret .size sha1_process_block64, .-sha1_process_block64 + .section .rodata.cst16.bswap32_mask, "aM", @progbits, 16 + .balign 16 +bswap32_mask: + .octa 0x0c0d0e0f08090a0b0405060700010203 + .section .rodata.cst16.sha1const, "aM", @progbits, 16 .balign 16 sha1const: diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index a10ac411d..f34e6e6fa 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -129,65 +129,57 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] - movaps sha1const(%rip), $xmmALLRCONST + movaps bswap32_mask(%rip), $xmmT1 pshufd \$0x00, $xmmALLRCONST, $xmmRCONST - # Load W[] to xmm registers, byteswapping on the fly. + # Load W[] to xmm0..3, byteswapping on the fly. 
# - # For iterations 0..15, we pass W[] in rsi,r8..r14 + # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. - # We lose parallelized addition of RCONST, but LEA - # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # LEAs in two first RD1As shorter by one byte). - movq 4*0(%rdi), %rsi - movq 4*2(%rdi), %r8 - bswapq %rsi - bswapq %r8 - rolq \$32, %rsi # rsi = W[1]:W[0] - rolq \$32, %r8 # r8 = W[3]:W[2] - movq %rsi, %xmm0 - movq %r8, $xmmT1 - punpcklqdq $xmmT1, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) -# movaps %xmm0, $xmmT1 # add RCONST, spill to stack -# paddd $xmmRCONST, $xmmT1 -# movups $xmmT1, -64+16*0(%rsp) - - movq 4*4(%rdi), %r9 - movq 4*6(%rdi), %r10 - bswapq %r9 - bswapq %r10 - rolq \$32, %r9 # r9 = W[5]:W[4] - rolq \$32, %r10 # r10 = W[7]:W[6] - movq %r9, %xmm1 - movq %r10, $xmmT1 - punpcklqdq $xmmT1, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) - - movq 4*8(%rdi), %r11 - movq 4*10(%rdi), %r12 - bswapq %r11 - bswapq %r12 - rolq \$32, %r11 # r11 = W[9]:W[8] - rolq \$32, %r12 # r12 = W[11]:W[10] - movq %r11, %xmm2 - movq %r12, $xmmT1 - punpcklqdq $xmmT1, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) - - movq 4*12(%rdi), %r13 - movq 4*14(%rdi), %r14 - bswapq %r13 - bswapq %r14 - rolq \$32, %r13 # r13 = W[13]:W[12] - rolq \$32, %r14 # r14 = W[15]:W[14] - movq %r13, %xmm3 - movq %r14, $xmmT1 - punpcklqdq $xmmT1, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) + # ADDs in two first RD1As shorter by one byte). + movups 16*0(%rdi), %xmm0 + pshufb $xmmT1, %xmm0 + movaps %xmm0, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %rsi +# pextrq \$1, $xmmT2, %r8 #SSE4.1 insn +# movhpd $xmmT2, %r8 #can only move to mem, not to reg + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r8 + + movups 16*1(%rdi), %xmm1 + pshufb $xmmT1, %xmm1 + movaps %xmm1, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %r9 + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r10 + + movups 16*2(%rdi), %xmm2 + pshufb $xmmT1, %xmm2 + movaps %xmm2, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %r11 + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r12 + + movups 16*3(%rdi), %xmm3 + pshufb $xmmT1, %xmm3 + movaps %xmm3, $xmmT2 + paddd $xmmRCONST, $xmmT2 + movq $xmmT2, %r13 + shufps \$0x0e, $xmmT2, $xmmT2 + movq $xmmT2, %r14 + + # MOVQs to GPRs (above) have somewhat high latency. 
+ # Load hash[] while they are completing: + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] " PREP() { @@ -266,15 +258,15 @@ local rN=$((7+n0/2)) echo " # $n ";test $n0 = 0 && echo " - leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] + addl %esi, %e$e # e += RCONST + W[n] shrq \$32, %rsi ";test $n0 = 1 && echo " - leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] + addl %esi, %e$e # e += RCONST + W[n] ";test $n0 -ge 2 && test $((n0 & 1)) = 0 && echo " - leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] + addl %r${rN}d, %e$e # e += RCONST + W[n] shrq \$32, %r$rN ";test $n0 -ge 2 && test $((n0 & 1)) = 1 && echo " - leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] + addl %r${rN}d, %e$e # e += RCONST + W[n] ";echo " movl %e$c, %edi # c xorl %e$d, %edi # ^d @@ -440,6 +432,11 @@ echo " ret .size sha1_process_block64, .-sha1_process_block64 + .section .rodata.cst16.bswap32_mask, \"aM\", @progbits, 16 + .balign 16 +bswap32_mask: + .octa 0x0c0d0e0f08090a0b0405060700010203 + .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 .balign 16 sha1const: From vda.linux at googlemail.com Fri Feb 11 13:53:26 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Fri, 11 Feb 2022 14:53:26 +0100 Subject: [git commit] libbb/sha1: revert last commit: pshufb is a SSSE3 insn, can't use it Message-ID: <20220211134649.1F2D782E01@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=dda77e83762861b52d62f0f161e2b4bf8092eacf branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha256_x86-32_shaNI.S | 4 ++ libbb/hash_md5_sha256_x86-64_shaNI.S | 4 ++ libbb/hash_md5_sha_x86-32_shaNI.S | 5 ++ libbb/hash_md5_sha_x86-64.S | 127 +++++++++++++++++---------------- libbb/hash_md5_sha_x86-64.S.sh | 133 +++++++++++++++++++++-------------- libbb/hash_md5_sha_x86-64_shaNI.S | 5 ++ 6 files changed, 163 insertions(+), 115 deletions(-) diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S index 4b33449d4..c059fb18d 100644 --- a/libbb/hash_md5_sha256_x86-32_shaNI.S +++ b/libbb/hash_md5_sha256_x86-32_shaNI.S @@ -15,6 +15,10 @@ //#define shuf128_32 pshufd #define shuf128_32 shufps +// pshufb and palignr are SSSE3 insns. +// We do not check SSSE3 in cpuid, +// all SHA-capable CPUs support it as well. + .section .text.sha256_process_block64_shaNI, "ax", @progbits .globl sha256_process_block64_shaNI .hidden sha256_process_block64_shaNI diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S index 5ed80c2ef..9578441f8 100644 --- a/libbb/hash_md5_sha256_x86-64_shaNI.S +++ b/libbb/hash_md5_sha256_x86-64_shaNI.S @@ -15,6 +15,10 @@ //#define shuf128_32 pshufd #define shuf128_32 shufps +// pshufb and palignr are SSSE3 insns. +// We do not check SSSE3 in cpuid, +// all SHA-capable CPUs support it as well. + .section .text.sha256_process_block64_shaNI, "ax", @progbits .globl sha256_process_block64_shaNI .hidden sha256_process_block64_shaNI diff --git a/libbb/hash_md5_sha_x86-32_shaNI.S b/libbb/hash_md5_sha_x86-32_shaNI.S index c7fb243ce..2366b046a 100644 --- a/libbb/hash_md5_sha_x86-32_shaNI.S +++ b/libbb/hash_md5_sha_x86-32_shaNI.S @@ -20,6 +20,11 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter +// pshufb is a SSSE3 insn. +// pinsrd, pextrd, extractps are SSE4.1 insns. 
+// We do not check SSSE3/SSE4.1 in cpuid, +// all SHA-capable CPUs support them as well. + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index 51fde082a..f0daa30f6 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -27,60 +27,68 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] + movaps sha1const(%rip), %xmm7 - movaps bswap32_mask(%rip), %xmm4 pshufd $0x00, %xmm7, %xmm6 # Load W[] to xmm0..3, byteswapping on the fly. # - # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 + # For iterations 0..15, we pass W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. + # We lose parallelized addition of RCONST, but LEA + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # ADDs in two first RD1As shorter by one byte). - movups 16*0(%rdi), %xmm0 - pshufb %xmm4, %xmm0 - movaps %xmm0, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %rsi -# pextrq $1, %xmm5, %r8 #SSE4.1 insn -# movhpd %xmm5, %r8 #can only move to mem, not to reg - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r8 - - movups 16*1(%rdi), %xmm1 - pshufb %xmm4, %xmm1 - movaps %xmm1, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %r9 - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r10 + # LEAs in two first RD1As shorter by one byte). + movq 4*0(%rdi), %rsi + movq 4*2(%rdi), %r8 + bswapq %rsi + bswapq %r8 + rolq $32, %rsi # rsi = W[1]:W[0] + rolq $32, %r8 # r8 = W[3]:W[2] + movq %rsi, %xmm0 + movq %r8, %xmm4 + punpcklqdq %xmm4, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) +# movaps %xmm0, %xmm4 # add RCONST, spill to stack +# paddd %xmm6, %xmm4 +# movups %xmm4, -64+16*0(%rsp) - movups 16*2(%rdi), %xmm2 - pshufb %xmm4, %xmm2 - movaps %xmm2, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %r11 - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r12 + movq 4*4(%rdi), %r9 + movq 4*6(%rdi), %r10 + bswapq %r9 + bswapq %r10 + rolq $32, %r9 # r9 = W[5]:W[4] + rolq $32, %r10 # r10 = W[7]:W[6] + movq %r9, %xmm1 + movq %r10, %xmm4 + punpcklqdq %xmm4, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) - movups 16*3(%rdi), %xmm3 - pshufb %xmm4, %xmm3 - movaps %xmm3, %xmm5 - paddd %xmm6, %xmm5 - movq %xmm5, %r13 - shufps $0x0e, %xmm5, %xmm5 - movq %xmm5, %r14 + movq 4*8(%rdi), %r11 + movq 4*10(%rdi), %r12 + bswapq %r11 + bswapq %r12 + rolq $32, %r11 # r11 = W[9]:W[8] + rolq $32, %r12 # r12 = W[11]:W[10] + movq %r11, %xmm2 + movq %r12, %xmm4 + punpcklqdq %xmm4, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) - # MOVQs to GPRs (above) have somewhat high latency. 
- # Load hash[] while they are completing: - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] + movq 4*12(%rdi), %r13 + movq 4*14(%rdi), %r14 + bswapq %r13 + bswapq %r14 + rolq $32, %r13 # r13 = W[13]:W[12] + rolq $32, %r14 # r14 = W[15]:W[14] + movq %r13, %xmm3 + movq %r14, %xmm4 + punpcklqdq %xmm4, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) # 0 - addl %esi, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%rsi), %ebp # e += RCONST + W[n] shrq $32, %rsi movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -92,7 +100,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 1 - addl %esi, %edx # e += RCONST + W[n] + leal 0x5A827999(%rdx,%rsi), %edx # e += RCONST + W[n] movl %ebx, %edi # c xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -103,7 +111,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 2 - addl %r8d, %ecx # e += RCONST + W[n] + leal 0x5A827999(%rcx,%r8), %ecx # e += RCONST + W[n] shrq $32, %r8 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -115,7 +123,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 3 - addl %r8d, %ebx # e += RCONST + W[n] + leal 0x5A827999(%rbx,%r8), %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -126,7 +134,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 4 - addl %r9d, %eax # e += RCONST + W[n] + leal 0x5A827999(%rax,%r9), %eax # e += RCONST + W[n] shrq $32, %r9 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -138,7 +146,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 5 - addl %r9d, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%r9), %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -149,7 +157,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 6 - addl %r10d, %edx # e += RCONST + W[n] + leal 0x5A827999(%rdx,%r10), %edx # e += RCONST + W[n] shrq $32, %r10 movl %ebx, %edi # c xorl %ecx, %edi # ^d @@ -161,7 +169,7 @@ sha1_process_block64: addl %edi, %edx # e += rotl32(a,5) rorl $2, %eax # b = rotl32(b,30) # 7 - addl %r10d, %ecx # e += RCONST + W[n] + leal 0x5A827999(%rcx,%r10), %ecx # e += RCONST + W[n] movl %eax, %edi # c xorl %ebx, %edi # ^d andl %ebp, %edi # &b @@ -202,7 +210,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*0(%rsp) # 8 - addl %r11d, %ebx # e += RCONST + W[n] + leal 0x5A827999(%rbx,%r11), %ebx # e += RCONST + W[n] shrq $32, %r11 movl %ebp, %edi # c xorl %eax, %edi # ^d @@ -214,7 +222,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 9 - addl %r11d, %eax # e += RCONST + W[n] + leal 0x5A827999(%rax,%r11), %eax # e += RCONST + W[n] movl %edx, %edi # c xorl %ebp, %edi # ^d andl %ecx, %edi # &b @@ -225,7 +233,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 10 - addl %r12d, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%r12), %ebp # e += RCONST + W[n] shrq $32, %r12 movl %ecx, %edi # c xorl %edx, %edi # ^d @@ -237,7 +245,7 @@ sha1_process_block64: addl %edi, %ebp # e += rotl32(a,5) rorl $2, %ebx # b = rotl32(b,30) # 11 - addl %r12d, %edx # e += RCONST + W[n] + leal 0x5A827999(%rdx,%r12), %edx # e += RCONST + W[n] movl %ebx, %edi # c 
xorl %ecx, %edi # ^d andl %eax, %edi # &b @@ -279,7 +287,7 @@ sha1_process_block64: paddd %xmm6, %xmm5 movups %xmm5, -64+16*1(%rsp) # 12 - addl %r13d, %ecx # e += RCONST + W[n] + leal 0x5A827999(%rcx,%r13), %ecx # e += RCONST + W[n] shrq $32, %r13 movl %eax, %edi # c xorl %ebx, %edi # ^d @@ -291,7 +299,7 @@ sha1_process_block64: addl %edi, %ecx # e += rotl32(a,5) rorl $2, %ebp # b = rotl32(b,30) # 13 - addl %r13d, %ebx # e += RCONST + W[n] + leal 0x5A827999(%rbx,%r13), %ebx # e += RCONST + W[n] movl %ebp, %edi # c xorl %eax, %edi # ^d andl %edx, %edi # &b @@ -302,7 +310,7 @@ sha1_process_block64: addl %edi, %ebx # e += rotl32(a,5) rorl $2, %edx # b = rotl32(b,30) # 14 - addl %r14d, %eax # e += RCONST + W[n] + leal 0x5A827999(%rax,%r14), %eax # e += RCONST + W[n] shrq $32, %r14 movl %edx, %edi # c xorl %ebp, %edi # ^d @@ -314,7 +322,7 @@ sha1_process_block64: addl %edi, %eax # e += rotl32(a,5) rorl $2, %ecx # b = rotl32(b,30) # 15 - addl %r14d, %ebp # e += RCONST + W[n] + leal 0x5A827999(%rbp,%r14), %ebp # e += RCONST + W[n] movl %ecx, %edi # c xorl %edx, %edi # ^d andl %ebx, %edi # &b @@ -1467,11 +1475,6 @@ sha1_process_block64: ret .size sha1_process_block64, .-sha1_process_block64 - .section .rodata.cst16.bswap32_mask, "aM", @progbits, 16 - .balign 16 -bswap32_mask: - .octa 0x0c0d0e0f08090a0b0405060700010203 - .section .rodata.cst16.sha1const, "aM", @progbits, 16 .balign 16 sha1const: diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index f34e6e6fa..57e77b118 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -99,6 +99,30 @@ INTERLEAVE() { ) } +# movaps bswap32_mask(%rip), $xmmT1 +# Load W[] to xmm0..3, byteswapping on the fly. +# For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 +# for use in RD1As instead of spilling them to stack. +# (We use rsi instead of rN because this makes two +# ADDs in two first RD1As shorter by one byte). +# movups 16*0(%rdi), %xmm0 +# pshufb $xmmT1, %xmm0 #SSSE3 insn +# movaps %xmm0, $xmmT2 +# paddd $xmmRCONST, $xmmT2 +# movq $xmmT2, %rsi +# #pextrq \$1, $xmmT2, %r8 #SSE4.1 insn +# #movhpd $xmmT2, %r8 #can only move to mem, not to reg +# shufps \$0x0e, $xmmT2, $xmmT2 # have to use two-insn sequence +# movq $xmmT2, %r8 # instead +# ... +# +# ... +#- leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] +#+ addl %esi, %e$e # e += RCONST + W[n] +# ^^^^^^^^^^^^^^^^^^^^^^^^ +# The above is -97 bytes of code... +# ...but pshufb is a SSSE3 insn. Can't use it. + echo \ "### Generated by hash_md5_sha_x86-64.S.sh ### @@ -129,57 +153,65 @@ sha1_process_block64: # xmm7: all round constants # -64(%rsp): area for passing RCONST + W[] from vector to integer units + movl 80(%rdi), %eax # a = ctx->hash[0] + movl 84(%rdi), %ebx # b = ctx->hash[1] + movl 88(%rdi), %ecx # c = ctx->hash[2] + movl 92(%rdi), %edx # d = ctx->hash[3] + movl 96(%rdi), %ebp # e = ctx->hash[4] + movaps sha1const(%rip), $xmmALLRCONST - movaps bswap32_mask(%rip), $xmmT1 pshufd \$0x00, $xmmALLRCONST, $xmmRCONST # Load W[] to xmm0..3, byteswapping on the fly. # - # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 + # For iterations 0..15, we pass W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. + # We lose parallelized addition of RCONST, but LEA + # can do two additions at once, so it is probably a wash. # (We use rsi instead of rN because this makes two - # ADDs in two first RD1As shorter by one byte). 
- movups 16*0(%rdi), %xmm0 - pshufb $xmmT1, %xmm0 - movaps %xmm0, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %rsi -# pextrq \$1, $xmmT2, %r8 #SSE4.1 insn -# movhpd $xmmT2, %r8 #can only move to mem, not to reg - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r8 - - movups 16*1(%rdi), %xmm1 - pshufb $xmmT1, %xmm1 - movaps %xmm1, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %r9 - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r10 - - movups 16*2(%rdi), %xmm2 - pshufb $xmmT1, %xmm2 - movaps %xmm2, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %r11 - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r12 - - movups 16*3(%rdi), %xmm3 - pshufb $xmmT1, %xmm3 - movaps %xmm3, $xmmT2 - paddd $xmmRCONST, $xmmT2 - movq $xmmT2, %r13 - shufps \$0x0e, $xmmT2, $xmmT2 - movq $xmmT2, %r14 - - # MOVQs to GPRs (above) have somewhat high latency. - # Load hash[] while they are completing: - movl 80(%rdi), %eax # a = ctx->hash[0] - movl 84(%rdi), %ebx # b = ctx->hash[1] - movl 88(%rdi), %ecx # c = ctx->hash[2] - movl 92(%rdi), %edx # d = ctx->hash[3] - movl 96(%rdi), %ebp # e = ctx->hash[4] + # LEAs in two first RD1As shorter by one byte). + movq 4*0(%rdi), %rsi + movq 4*2(%rdi), %r8 + bswapq %rsi + bswapq %r8 + rolq \$32, %rsi # rsi = W[1]:W[0] + rolq \$32, %r8 # r8 = W[3]:W[2] + movq %rsi, %xmm0 + movq %r8, $xmmT1 + punpcklqdq $xmmT1, %xmm0 # xmm0 = r8:rsi = (W[0],W[1],W[2],W[3]) +# movaps %xmm0, $xmmT1 # add RCONST, spill to stack +# paddd $xmmRCONST, $xmmT1 +# movups $xmmT1, -64+16*0(%rsp) + + movq 4*4(%rdi), %r9 + movq 4*6(%rdi), %r10 + bswapq %r9 + bswapq %r10 + rolq \$32, %r9 # r9 = W[5]:W[4] + rolq \$32, %r10 # r10 = W[7]:W[6] + movq %r9, %xmm1 + movq %r10, $xmmT1 + punpcklqdq $xmmT1, %xmm1 # xmm1 = r10:r9 = (W[4],W[5],W[6],W[7]) + + movq 4*8(%rdi), %r11 + movq 4*10(%rdi), %r12 + bswapq %r11 + bswapq %r12 + rolq \$32, %r11 # r11 = W[9]:W[8] + rolq \$32, %r12 # r12 = W[11]:W[10] + movq %r11, %xmm2 + movq %r12, $xmmT1 + punpcklqdq $xmmT1, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) + + movq 4*12(%rdi), %r13 + movq 4*14(%rdi), %r14 + bswapq %r13 + bswapq %r14 + rolq \$32, %r13 # r13 = W[13]:W[12] + rolq \$32, %r14 # r14 = W[15]:W[14] + movq %r13, %xmm3 + movq %r14, $xmmT1 + punpcklqdq $xmmT1, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) " PREP() { @@ -258,15 +290,15 @@ local rN=$((7+n0/2)) echo " # $n ";test $n0 = 0 && echo " - addl %esi, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] shrq \$32, %rsi ";test $n0 = 1 && echo " - addl %esi, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%rsi), %e$e # e += RCONST + W[n] ";test $n0 -ge 2 && test $((n0 & 1)) = 0 && echo " - addl %r${rN}d, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] shrq \$32, %r$rN ";test $n0 -ge 2 && test $((n0 & 1)) = 1 && echo " - addl %r${rN}d, %e$e # e += RCONST + W[n] + leal $RCONST(%r$e,%r$rN), %e$e # e += RCONST + W[n] ";echo " movl %e$c, %edi # c xorl %e$d, %edi # ^d @@ -432,11 +464,6 @@ echo " ret .size sha1_process_block64, .-sha1_process_block64 - .section .rodata.cst16.bswap32_mask, \"aM\", @progbits, 16 - .balign 16 -bswap32_mask: - .octa 0x0c0d0e0f08090a0b0405060700010203 - .section .rodata.cst16.sha1const, \"aM\", @progbits, 16 .balign 16 sha1const: diff --git a/libbb/hash_md5_sha_x86-64_shaNI.S b/libbb/hash_md5_sha_x86-64_shaNI.S index c13cdec07..794e97040 100644 --- a/libbb/hash_md5_sha_x86-64_shaNI.S +++ b/libbb/hash_md5_sha_x86-64_shaNI.S @@ -20,6 +20,11 @@ #define extr128_32 pextrd //#define extr128_32 extractps # not shorter +// pshufb is a 
SSSE3 insn. +// pinsrd, pextrd, extractps are SSE4.1 insns. +// We do not check SSSE3/SSE4.1 in cpuid, +// all SHA-capable CPUs support them as well. + .section .text.sha1_process_block64_shaNI, "ax", @progbits .globl sha1_process_block64_shaNI .hidden sha1_process_block64_shaNI From vda.linux at googlemail.com Fri Feb 11 22:03:27 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Fri, 11 Feb 2022 23:03:27 +0100 Subject: [git commit] whitespace fixes Message-ID: <20220211215609.0CC91831C4@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=1f272c06d02e7c7f0f3af1f97165722255c8828d branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Signed-off-by: Denys Vlasenko --- libbb/hash_md5_sha_x86-64.S | 8 ++++---- libbb/hash_md5_sha_x86-64.S.sh | 14 +++++++------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/libbb/hash_md5_sha_x86-64.S b/libbb/hash_md5_sha_x86-64.S index f0daa30f6..1d55b91f8 100644 --- a/libbb/hash_md5_sha_x86-64.S +++ b/libbb/hash_md5_sha_x86-64.S @@ -71,8 +71,8 @@ sha1_process_block64: movq 4*10(%rdi), %r12 bswapq %r11 bswapq %r12 - rolq $32, %r11 # r11 = W[9]:W[8] - rolq $32, %r12 # r12 = W[11]:W[10] + rolq $32, %r11 # r11 = W[9]:W[8] + rolq $32, %r12 # r12 = W[11]:W[10] movq %r11, %xmm2 movq %r12, %xmm4 punpcklqdq %xmm4, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) @@ -81,8 +81,8 @@ sha1_process_block64: movq 4*14(%rdi), %r14 bswapq %r13 bswapq %r14 - rolq $32, %r13 # r13 = W[13]:W[12] - rolq $32, %r14 # r14 = W[15]:W[14] + rolq $32, %r13 # r13 = W[13]:W[12] + rolq $32, %r14 # r14 = W[15]:W[14] movq %r13, %xmm3 movq %r14, %xmm4 punpcklqdq %xmm4, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) diff --git a/libbb/hash_md5_sha_x86-64.S.sh b/libbb/hash_md5_sha_x86-64.S.sh index 57e77b118..40c979d35 100755 --- a/libbb/hash_md5_sha_x86-64.S.sh +++ b/libbb/hash_md5_sha_x86-64.S.sh @@ -99,7 +99,7 @@ INTERLEAVE() { ) } -# movaps bswap32_mask(%rip), $xmmT1 +# movaps bswap32_mask(%rip), $xmmT1 # Load W[] to xmm0..3, byteswapping on the fly. # For iterations 0..15, we pass RCONST+W[] in rsi,r8..r14 # for use in RD1As instead of spilling them to stack. @@ -110,8 +110,8 @@ INTERLEAVE() { # movaps %xmm0, $xmmT2 # paddd $xmmRCONST, $xmmT2 # movq $xmmT2, %rsi -# #pextrq \$1, $xmmT2, %r8 #SSE4.1 insn -# #movhpd $xmmT2, %r8 #can only move to mem, not to reg +# #pextrq \$1, $xmmT2, %r8 #SSE4.1 insn +# #movhpd $xmmT2, %r8 #can only move to mem, not to reg # shufps \$0x0e, $xmmT2, $xmmT2 # have to use two-insn sequence # movq $xmmT2, %r8 # instead # ... 
@@ -197,8 +197,8 @@ sha1_process_block64: movq 4*10(%rdi), %r12 bswapq %r11 bswapq %r12 - rolq \$32, %r11 # r11 = W[9]:W[8] - rolq \$32, %r12 # r12 = W[11]:W[10] + rolq \$32, %r11 # r11 = W[9]:W[8] + rolq \$32, %r12 # r12 = W[11]:W[10] movq %r11, %xmm2 movq %r12, $xmmT1 punpcklqdq $xmmT1, %xmm2 # xmm2 = r12:r11 = (W[8],W[9],W[10],W[11]) @@ -207,8 +207,8 @@ sha1_process_block64: movq 4*14(%rdi), %r14 bswapq %r13 bswapq %r14 - rolq \$32, %r13 # r13 = W[13]:W[12] - rolq \$32, %r14 # r14 = W[15]:W[14] + rolq \$32, %r13 # r13 = W[13]:W[12] + rolq \$32, %r14 # r14 = W[15]:W[14] movq %r13, %xmm3 movq %r14, $xmmT1 punpcklqdq $xmmT1, %xmm3 # xmm3 = r14:r13 = (W[12],W[13],W[14],W[15]) From bugzilla at busybox.net Fri Feb 11 22:39:50 2022 From: bugzilla at busybox.net (bugzilla at busybox.net) Date: Fri, 11 Feb 2022 22:39:50 +0000 Subject: [Bug 14586] lsof missing from command description page In-Reply-To: References: Message-ID: https://bugs.busybox.net/show_bug.cgi?id=14586 Mike Frysinger changed: What |Removed |Added ---------------------------------------------------------------------------- CC|vapier at gentoo.org |busybox-cvs at busybox.net Assignee|unassigned at buildroot.uclibc |unassigned at busybox.net |.org | Component|Website |Website Product|Infrastructure |Busybox -- You are receiving this mail because: You are on the CC list for the bug. From vda.linux at googlemail.com Fri Feb 11 23:52:12 2022 From: vda.linux at googlemail.com (Denys Vlasenko) Date: Sat, 12 Feb 2022 00:52:12 +0100 Subject: [git commit] libbb/sha256: explicitly use sha256rnds2's %xmm0 (MSG) argument Message-ID: <20220211234704.AFEFE82DB5@busybox.osuosl.org> commit: https://git.busybox.net/busybox/commit/?id=c2e7780e526b0f421c3b43367a53019d1dc5f2d6 branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master Else, the code seemingly does not use MSG. 
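sha256rnds2 reads its two W[i]+K[i] dwords from %xmm0 as a fixed implicit operand, so spelling MSG out in the operand list only makes that data flow visible in the source. For reference, a rough scalar model of the two SHA-256 rounds each sha256rnds2 executes, following the FIPS 180-4 round definitions (an illustrative sketch, not code from the busybox tree):

#include <stdint.h>

/* Two SHA-256 rounds, as performed by one sha256rnds2.
 * s[0..7] = the a..h working variables; wk0, wk1 are the W[i]+K[i]
 * dwords the CPU takes from the implicit %xmm0 operand.
 * Illustrative sketch per FIPS 180-4.
 */
static uint32_t rotr(uint32_t x, int n)
{
	return (x >> n) | (x << (32 - n));
}

static void sha256_two_rounds(uint32_t s[8], uint32_t wk0, uint32_t wk1)
{
	uint32_t wk[2] = { wk0, wk1 };
	int i;
	for (i = 0; i < 2; i++) {
		uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
		uint32_t e = s[4], f = s[5], g = s[6], h = s[7];
		uint32_t ch   = (e & f) ^ (~e & g);
		uint32_t maj  = (a & b) ^ (a & c) ^ (b & c);
		uint32_t sum1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25);
		uint32_t sum0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22);
		uint32_t t1 = h + sum1 + ch + wk[i];
		uint32_t t2 = sum0 + maj;
		s[7] = g; s[6] = f; s[5] = e; s[4] = d + t1;
		s[3] = c; s[2] = b; s[1] = a; s[0] = t1 + t2;
	}
}

This is why the assembly adds the round constants into MSG with paddd and then shuffles dwords 2,3 down with shuf128_32 $0x0E before the second sha256rnds2: each invocation consumes only the low two wk dwords of %xmm0.
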
Signed-off-by: Denys Vlasenko
---
 libbb/hash_md5_sha256_x86-32_shaNI.S | 64 +++++++++++++++---------------
 libbb/hash_md5_sha256_x86-64_shaNI.S | 76 ++++++++++++++++++------------------
 2 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/libbb/hash_md5_sha256_x86-32_shaNI.S b/libbb/hash_md5_sha256_x86-32_shaNI.S
index c059fb18d..3905bad9a 100644
--- a/libbb/hash_md5_sha256_x86-32_shaNI.S
+++ b/libbb/hash_md5_sha256_x86-32_shaNI.S
@@ -60,18 +60,18 @@ sha256_process_block64_shaNI:
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP0
        paddd   0*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 4-7 */
        movu128 1*16(DATA_PTR), MSG
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP1
        paddd   1*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 8-11 */
@@ -79,9 +79,9 @@ sha256_process_block64_shaNI:
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP2
        paddd   2*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 12-15 */
        movu128 3*16(DATA_PTR), MSG
@@ -90,151 +90,151 @@
        /* ...to here */
        mova128 MSG, MSGTMP3
        paddd   3*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 16-19 */
        mova128 MSGTMP0, MSG
        paddd   4*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 20-23 */
        mova128 MSGTMP1, MSG
        paddd   5*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 24-27 */
        mova128 MSGTMP2, MSG
        paddd   6*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 28-31 */
        mova128 MSGTMP3, MSG
        paddd   7*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 32-35 */
        mova128 MSGTMP0, MSG
        paddd   8*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 36-39 */
        mova128 MSGTMP1, MSG
        paddd   9*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 40-43 */
        mova128 MSGTMP2, MSG
        paddd   10*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 44-47 */
        mova128 MSGTMP3, MSG
        paddd   11*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 48-51 */
        mova128 MSGTMP0, MSG
        paddd   12*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 52-55 */
        mova128 MSGTMP1, MSG
        paddd   13*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 56-59 */
        mova128 MSGTMP2, MSG
        paddd   14*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 60-63 */
        mova128 MSGTMP3, MSG
        paddd   15*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Write hash values back in the correct order */
        mova128 STATE0, XMMTMP
diff --git a/libbb/hash_md5_sha256_x86-64_shaNI.S b/libbb/hash_md5_sha256_x86-64_shaNI.S
index 9578441f8..082ceafe4 100644
--- a/libbb/hash_md5_sha256_x86-64_shaNI.S
+++ b/libbb/hash_md5_sha256_x86-64_shaNI.S
@@ -38,8 +38,8 @@
 #define XMMTMP %xmm7

-#define ABEF_SAVE %xmm9
-#define CDGH_SAVE %xmm10
+#define SAVE0 %xmm8
+#define SAVE1 %xmm9

 #define SHUF(a,b,c,d) $(a+(b<<2)+(c<<4)+(d<<6))
@@ -59,26 +59,26 @@ sha256_process_block64_shaNI:
        leaq    K256+8*16(%rip), SHA256CONSTANTS

        /* Save hash values for addition after rounds */
-       mova128 STATE0, ABEF_SAVE
-       mova128 STATE1, CDGH_SAVE
+       mova128 STATE0, SAVE0
+       mova128 STATE1, SAVE1

        /* Rounds 0-3 */
        movu128 0*16(DATA_PTR), MSG
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP0
        paddd   0*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 4-7 */
        movu128 1*16(DATA_PTR), MSG
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP1
        paddd   1*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 8-11 */
@@ -86,9 +86,9 @@ sha256_process_block64_shaNI:
        pshufb  XMMTMP, MSG
        mova128 MSG, MSGTMP2
        paddd   2*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 12-15 */
@@ -97,155 +97,155 @@
        /* ...to here */
        mova128 MSG, MSGTMP3
        paddd   3*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 16-19 */
        mova128 MSGTMP0, MSG
        paddd   4*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 20-23 */
        mova128 MSGTMP1, MSG
        paddd   5*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 24-27 */
        mova128 MSGTMP2, MSG
        paddd   6*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 28-31 */
        mova128 MSGTMP3, MSG
        paddd   7*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 32-35 */
        mova128 MSGTMP0, MSG
        paddd   8*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 36-39 */
        mova128 MSGTMP1, MSG
        paddd   9*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP1, MSGTMP0

        /* Rounds 40-43 */
        mova128 MSGTMP2, MSG
        paddd   10*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP2, MSGTMP1

        /* Rounds 44-47 */
        mova128 MSGTMP3, MSG
        paddd   11*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP3, XMMTMP
        palignr $4, MSGTMP2, XMMTMP
        paddd   XMMTMP, MSGTMP0
        sha256msg2 MSGTMP3, MSGTMP0
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP3, MSGTMP2

        /* Rounds 48-51 */
        mova128 MSGTMP0, MSG
        paddd   12*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP0, XMMTMP
        palignr $4, MSGTMP3, XMMTMP
        paddd   XMMTMP, MSGTMP1
        sha256msg2 MSGTMP0, MSGTMP1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0
        sha256msg1 MSGTMP0, MSGTMP3

        /* Rounds 52-55 */
        mova128 MSGTMP1, MSG
        paddd   13*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP1, XMMTMP
        palignr $4, MSGTMP0, XMMTMP
        paddd   XMMTMP, MSGTMP2
        sha256msg2 MSGTMP1, MSGTMP2
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 56-59 */
        mova128 MSGTMP2, MSG
        paddd   14*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        mova128 MSGTMP2, XMMTMP
        palignr $4, MSGTMP1, XMMTMP
        paddd   XMMTMP, MSGTMP3
        sha256msg2 MSGTMP2, MSGTMP3
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Rounds 60-63 */
        mova128 MSGTMP3, MSG
        paddd   15*16-8*16(SHA256CONSTANTS), MSG
-       sha256rnds2 STATE0, STATE1
+       sha256rnds2 MSG, STATE0, STATE1
        shuf128_32 $0x0E, MSG, MSG
-       sha256rnds2 STATE1, STATE0
+       sha256rnds2 MSG, STATE1, STATE0

        /* Add current hash values with previously saved */
-       paddd   ABEF_SAVE, STATE0
-       paddd   CDGH_SAVE, STATE1
+       paddd   SAVE0, STATE0
+       paddd   SAVE1, STATE1

        /* Write hash values back in the correct order */
        mova128 STATE0, XMMTMP
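A note on the commit above: per the Intel SDM, sha256rnds2 reads %xmm0 as an
implicit third source (its low two dwords supply the message-plus-constant
inputs for the two rounds being computed, which is why each pair of rounds is
followed by shuf128_32 to rotate the high half of MSG down). In these files
MSG is defined to be %xmm0, which is the only reason the explicit three-operand
spelling assembles. The following stand-alone sketch is illustrative only (it
is not from the commit; the register assignments are hypothetical, and it
assumes a GNU assembler new enough to know the SHA extension). Both spellings
encode the same instruction, but only the second makes the %xmm0 data flow
visible in the source:

        /* hypothetical .S fragment, mirroring the #define convention
         * of hash_md5_sha256_x86-64_shaNI.S; run through cpp as usual */
        #define MSG     %xmm0
        #define STATE0  %xmm1
        #define STATE1  %xmm2

        sha256rnds2 STATE0, STATE1          /* implicit: %xmm0 is a hidden source */
        sha256rnds2 MSG, STATE0, STATE1     /* explicit: same encoding */

A useful side effect of the explicit form: the assembler rejects it unless the
first operand really is %xmm0, so the MSG-must-be-%xmm0 assumption is checked
at build time instead of being invisible.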
From vda.linux at googlemail.com Fri Feb 18 16:09:51 2022
From: vda.linux at googlemail.com (Denys Vlasenko)
Date: Fri, 18 Feb 2022 17:09:51 +0100
Subject: [git commit] libbb/sha1: update config help text with new performance numbers
Message-ID: <20220218161429.8F9DB813EE@busybox.osuosl.org>

commit: https://git.busybox.net/busybox/commit/?id=1891fdda59092a215d3a407d9108bbbe6ab8df7a
branch: https://git.busybox.net/busybox/commit/?id=refs/heads/master

Signed-off-by: Denys Vlasenko
---
 libbb/Config.src | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/libbb/Config.src b/libbb/Config.src
index 0ecd5bd46..66a3ffa23 100644
--- a/libbb/Config.src
+++ b/libbb/Config.src
@@ -57,11 +57,12 @@ config SHA1_SMALL
        range 0 3
        help
        Trade binary size versus speed for the sha1 algorithm.
+       With FEATURE_COPYBUF_KB=64:
        throughput MB/s                size of sha1_process_block64
        value   486    x86-64          486     x86-64
-       0       367    375             3657    3502
-       1       224    229             654     732
-       2,3     200    195             358     380
+       0       440    485             3481    3502
+       1       265    265             641     696
+       2,3     220    210             342     364

 config SHA1_HWACCEL
        bool "SHA1: Use hardware accelerated instructions if possible"

From bugzilla at busybox.net Sun Feb 20 04:57:25 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Sun, 20 Feb 2022 04:57:25 +0000
Subject: [Bug 14576] unzip: test skipped with bad archive
In-Reply-To: 
References: 
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14576

--- Comment #1 from dharan ---
Hi Team,

Can you please share an update on the SKIPPED test case reported above?

SKIPPED: unzip (bad archive)

Regards,
-Dharan

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net Wed Feb 23 16:19:45 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Wed, 23 Feb 2022 16:19:45 +0000
Subject: [Bug 11736] KCONFIG_ALLCONFIG does not apply passed config (regression in 0b1c62934)
In-Reply-To: 
References: 
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=11736

--- Comment #1 from Axel Fontaine ---
This issue is still present in the latest release. Is there any workaround?

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From bugzilla at busybox.net Mon Feb 28 19:15:08 2022
From: bugzilla at busybox.net (bugzilla at busybox.net)
Date: Mon, 28 Feb 2022 19:15:08 +0000
Subject: [Bug 14616] New: Printf format code and data type do not match in taskset
Message-ID: 

https://bugs.busybox.net/show_bug.cgi?id=14616

            Bug ID: 14616
           Summary: Printf format code and data type do not match in
                    taskset
           Product: Busybox
           Version: 1.33.x
          Hardware: Other
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Other
          Assignee: unassigned at busybox.net
          Reporter: pdvb at yahoo.com
                CC: busybox-cvs at busybox.net
  Target Milestone: ---

The following code uses an (unsigned long) "%lx" format code, but passes an
(unsigned long long) value to printf. The result is that on architectures
which use 32 bits for (unsigned long) and 64 bits for (unsigned long long),
the printf produces incorrect output.

#define TASKSET_PRINTF_MASK "%lx"

static unsigned long long from_mask(ul *mask, unsigned sz_in_bytes UNUSED_PARAM)
{
        return *mask;
}

This was broken by commit ef0e76cc on 1/29/2017.

The quick fix is to define the function as:

static unsigned long from_mask()

-- 
You are receiving this mail because:
You are on the CC list for the bug.
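The mismatch described in the report is easy to reproduce outside of busybox.
The program below is a minimal stand-alone sketch, not the actual taskset
source: the ul typedef and the from_mask_* names are hypothetical and merely
mirror the shapes quoted in the bug. On an ILP32 target the "%lx" conversion
consumes only 32 of the 64 bits that the (unsigned long long) argument placed
in the variadic list, so the output (and any conversions after it) is wrong;
on LP64 it happens to work because both types are 64-bit, which is why the bug
only shows on 32-bit builds.

#include <stdio.h>

typedef unsigned long ul;

#define TASKSET_PRINTF_MASK "%lx"

/* Buggy shape: 64-bit return value, 32-bit "%lx" conversion on ILP32 */
static unsigned long long from_mask_bad(ul *mask)
{
        return *mask;
}

/* Fixed shape, per the report: return type now matches "%lx" */
static unsigned long from_mask_fixed(ul *mask)
{
        return *mask;
}

int main(void)
{
        ul mask = 0xdeadbeef;

        /* Undefined behavior on 32-bit: argument/format size mismatch */
        printf(TASKSET_PRINTF_MASK "\n", from_mask_bad(&mask));

        /* Well-defined everywhere: prints deadbeef */
        printf(TASKSET_PRINTF_MASK "\n", from_mask_fixed(&mask));
        return 0;
}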