Huacai Chen chenhuacai@kernel.org 于2021年4月8日周四 下午12:56写道:
Hi, Maciej,
On Wed, Apr 7, 2021 at 9:38 PM Maciej W. Rozycki macro@orcam.me.uk wrote:
On Wed, 7 Apr 2021, Huacai Chen wrote:
This code is rather broken in an obvious way, starting from:
unsigned long long __n; \ \ __high = *__n >> 32; \ __low = __n; \
where `__n' is used uninitialised. Since this is my code originally I'll look into it; we may want to reinstate `do_div' too, which didn't have to be removed in the first place.
I think we can reuse the generic do_div().
We can, but it's not clear to me if this is optimal. We have a DIVMOD instruction which original code took advantage of (although I can see potential in reusing bits from include/asm-generic/div64.h). The two implementations would have to be benchmarked against each other across a couple of different CPUs.
The original MIPS do_div() has "h" constraint, and this is also the reason why Ralf rewrote this file. How can we reintroduce do_div() without "h" constraint?
I try to figure out a new version:
uint32_t __attribute__ ((noinline)) div64_32n(uint64_t *x, uint32_t b) { uint64_t a = *x;
uint64_t t1 = ((a>>32)/b)<<32; uint32_t t2 = (a>>32)%b;
uint32_t res = (uint32_t)a; uint32_t t1lo = 0;
uint32_t t3 = 0xffffffffu/b; uint32_t t4 = t3*b; uint32_t hi, lo;
while(t2>0) { __asm__ ( "multu %2, %3\n" "mfhi %0\n" "mflo %1\n" : "=r" (hi), "=r"(lo) : "r" (t4), "r"(t2) );
// yes, we are sure that t2*t3 will not overflow t1lo += (t3*t2); t2 -= hi; if (lo > 0) { t2 --; // we are sure that t2 > 0 lo = 0xffffffff - lo + 1; unsigned tmp = lo + res; // overflow if (tmp < lo || tmp < res) { t2 ++; } res = tmp; } } if (res >= b) { t1lo += (res/b); res = (res%b); }
t1 += t1lo; *x = t1; return res; }
With some test the performace: ((uint64_t)(-1))/3 with 0xfffff times
GCC: 5555555555555555, 0, seconds: 5 SYQ: 5555555555555555, 0, seconds: 4 KER: 5555555555555555, 0, seconds: 8 RAL: ffffffff, 2, seconds: 4
1. the MIPS current asm version cost 4s (and wrong result) 2. the simplest C code : a/b && a % b, cost 5s 3. the asm-generic version cost 8s. 4. my version cost 4s.
And the question is why asm-generic version exists since it has bad performance than the code generated by GCC?
Huacai
Huacai, thanks for your investigation! Please be more careful in verifying your future submissions however.
Sorry, I thought there is only one bug in div64.h, but in fact there are three...
This just shows the verification you made was not good enough, hence my observation.
Maciej