20210717, 16:22  #1 
Jul 2021
3×11 Posts 
NTT faster than FFT?
As far as I know Prime95 uses Fast Fourier Transform to multiply (square) large numbers, right?
Just for fun, recently I decided to implement from scratch both Fast Fourier Transform (FFT) and Number Theoretical Transform (NTT) for numbers multiplication. In C++ (and using SIMD instructions). According to my time measurements NTT appeared 1015 times faster than FFT for doing multiplications of numbers around 16 MegaBytes in size. Both NTT and FFT code I tried to optimize as much as possible. Also tested that both algorithms gave correct results, by doing naive multiplication. FFT was tweaked in such a way that error is around 0.070.1. NTT of cause gives no error as it is precise. Chinese Reminder Theorem was used at final step of NTT multiplication. Does anyone know if there was any trial to use NTT in Prime95 engine instead of FFT? Was it successful and/or gave any speed or other benefits? 
20210717, 18:01  #2 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×11×269 Posts 
Welcome.
If you're feeling brave, show some source code. The accomplished code smiths may have some helpful comments. How do your multiplication times compare to GWNUM /prime95 timings for same number size, processor, number of cores, etc.? I assume George did some sort of analysis if not actual experimentation with alternate code for prime95. His code is much faster than GMP multiple precision routines that avoid use of floating point because of error concerns. Prime95 uses the IBDWT and multiple instructionsettailored code paths. Preda experimented with DP, SP, M31, and M61 back in gpuowl V1.9. M61 was substantially slower than DP on AMD GPUs but gave slightly higher maximum exponent for the same fft length. SP and M31 were not useful; the usable bits/word were too small. See https://www.mersenneforum.org/showpo...35&postcount=2 including the first attachment. There's a lot of GIMPS computing reference info at https://www.mersenneforum.org/showthread.php?t=24607. Last fiddled with by kriesel on 20210717 at 18:09 
20210717, 18:13  #3  
"Viliam Furík"
Jul 2018
Martin, Slovakia
2·353 Posts 
Quote:
Quote:
What times did you get with your code, and what times do you get with Prime95 at similar exponents? 

20210717, 18:20  #4 
Jul 2021
3×11 Posts 
@kriesel Thanks for quick reply!
Right now my code is quite dirty draft. Is not very well suited for usage or reading. Also didn't do comarison to current's Prime95 code. Sure Prime95's FFT is much faster than mine. I just wanted to create a Proof of Concept about comparison of FFT and NTT for myself. Actually I quite accidentally found out that NTT was much faster. Before implementing I didn't even expect this. Just to make comparison of 2 my algos fair I made as much optimizations as possible and used SIMD instructions. Actually only FFT is currently using AVX1 SIMD in my code. So FFT is even in better position. Still NTT is 1015 times faster than FFT. Actually by posting this Thread first of all I wanted to find out if anyone knows if there were any efforts or comparisons to NTT implementations? Maybe there were already such exepriments and they didn't show such good improvement of NTT compared to FFT. I can do speed comparisons if somebody can point out to me how to do that inside gwnum. If there is a ready made function that does multiplication of two large numbers. I.e. if there is some imaginary function `mult(char const * a, char const * b, int size, char * c)` that is when given two arrays of bytes computes resulting multiplication of those two. I can call such function to compare to my NTT. Just of curiosity I compared my NTT implementation to some random optimized NTT inside GitHub like this one https://github.com/Mysticial/ProtoNT...aster/Binaries , this implementation is said to be a part of YCruncher, a famous one. So my implementation is 57% slower than this, hence I can conclude that mine is quite fast. 
20210717, 18:27  #5 
"Viliam Furík"
Jul 2018
Martin, Slovakia
2·353 Posts 
I think the easiest way to do a speed comparison is to find a similar size exponent of a Mersenne number, and run its PRP test in Prime95 for a few iterations. The reported time per iteration is your result for Prime95, and thus gwnum.
Make sure your code is using the same number of cores as Prime95. 
20210717, 18:32  #6  
Jul 2021
3×11 Posts 
Quote:
So for this two numbers each 2^25 bytes I get 9.8 seconds on my NTT and 9.1 seconds on YCruncher's NTT on my old 1Ghz laptop, these two a measured singlethreaded. If somebody can point me to how I can do multiply using `gwnum` lib or how to use Prime95 to test one multiplication of given size, then I can do measurements on my laptop. But just to tell in advance  I actually wanted to tell not that my implementation is fast but wanted to stress out that NTT is maybe faster compared to FFT. 

20210717, 18:52  #7  
"Viliam Furík"
Jul 2018
Martin, Slovakia
2×353 Posts 
Quote:
Code:
PRP=N/A,1,2,268435399,1,85,0 If you are testing your NTT and ycruncher's NTT on one core, one thread, set it in the Prime95 in the upper menu Test > Worker Windows (number of workers = 1, CPU cores to use = 1). Last fiddled with by Viliam Furik on 20210717 at 19:00 Reason: Tf level too low + correction 

20210717, 18:59  #8 
Sep 2002
Database er0rr
7536_{8} Posts 

20210717, 19:01  #9 
"Viliam Furík"
Jul 2018
Martin, Slovakia
1011000010_{2} Posts 

20210717, 19:14  #10  
Jul 2021
41_{8} Posts 
Quote:
Can you tell approximately how often print out to worker window is done for exponent of this size at 1 thread (1 core of 1Ghz)? 

20210717, 19:44  #11  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·11·269 Posts 
Quote:
To run prime95 for comparison timings: In the reference info link I gave you earlier, scroll down to the thread that is specific to prime95. Read and follow the how to get started info in that thread's first post. Create worktodo entries manually. Set workers to one, cores to one. Set prime95 to print out interim residues and timing every 100 iterations. This timing will include the error detection's slight overhead. Prime95 can be configured for fraction of a second resolution on time stamp of screen output also. There are several good documentation files included in the package, recommended reading. Try to replicate in prime95, with timing, while nothing else runs, the 100iteration residues for larger known Mersenne primes in the attachment of https://www.mersenneforum.org/showpo...83&postcount=5. Then attempt to compute the same type 1 PRP sequence with a=3 using your FFT or NTT code, with output of timing and least significant 64 bits of the 100iteration results, on the same exponents, iterations, and hardware. It's not clear to me how you avoid issues with carry propagation or loss, apparently using no guard bits. Perhaps I misunderstood your reference to 2^{25} byte operands (32MB each) in 2^{22} 8byte words each. As a point of reference, from one of the benchmark attachments in that prime95 reference thread, on a dualcore E8200 Core2Duo, in older prime95 v29.4b8, 32M 64bit words fft length, max ~596M exponent, I get ~2.53 iter/sec. If I pessimistically scale that down 2:1 for core count and 2.66:1 for clock rate, I extrapolate 1/2.53 * 2 *2.66 = 2.1 seconds*coreGHz/iter. Same hardware, scaled pessimistically to 4M DPwords fft length, 1 core, 1 GHz, max ~79M exponent, ~23.5 iter/sec; 1/23.5*2*2.66 ~0.226 sec*coreGHz/iter. The E8200 is SSE2 at best as I recall; https://ark.intel.com/content/www/us...3mhzfsb.html does not say. (Prime95 is usually more memory bandwidth limited than cpu limited.) Last fiddled with by kriesel on 20210717 at 19:45 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
PRP on gpu is faster that on cpu  indomit  Information & Answers  4  20201007 10:50 
faster than LL?  paulunderwood  Miscellaneous Math  13  20160802 00:05 
My CPU is getting faster and faster ;)  lidocorc  Software  2  20081108 09:26 
Faster way to do LLT?  1260  Miscellaneous Math  23  20050904 07:12 
Faster than LL?  clowns789  Miscellaneous Math  3  20040527 23:39 