Numerous wave applications (e.g. sound, radio) rely on the Fast Fourier Transform (FFT) including signal processing, communications, and multi-media. However, it is a very challenging problem to parallelize effectively. This is because for big FFT datasets running on large clusters, 50%-90% of time can be spent waiting on node-node data transfers rather than useful calculation.
It was just announced at the awards session at SC12 that Intel scientists have won the “best paper” for research into more efficient processing for a fundamental calculation in high performance computing, entitled “A Framework for Low-Communication 1-D FFT.”
Intel’s Software and Services Group & Intel Labs devised a new framework for distributed 1-D FFT problems which traditionally require three costly all-to-all inter-node data exchanges. The new approach delivers multiple 1D FFT algorithms requiring just a single all-to-all inter-node data exchange. According to the research, for large-scale problems this can double FFT performance (see the paper for details). Another key feature is that users can opt to further increase FFT performance by accepting reduced-accuracy results, so the algorithms scale to fit the needs of the particular application.
In high-performance computing on distributed-memory systems, communication often represents a significant part of the overall execution time. The relative cost of communication will certainly continue to rise as compute density growth follows the current technology and industry trends. Design of lower-communication alternatives to fundamental computational algorithms has become an important field of research. For distributed 1-D FFT, communication cost has hitherto remained high as all industry-standard implementations perform three all-to-all internode data exchanges (also called global transpose). These communication steps indeed dominate execution time. In this paper, we present a mathematical framework from which many single-all-to-all and easy-to-implement 1-D FFT algorithms can be derived. For large-scale problems, our implementation can be twice as fast as leading FFT libraries on state-of-the-art computer clusters. Moreover, our framework allows tradeoff between accuracy and performance, further boosting performance if reduced accuracy is acceptable.
Ping Tak, Peter Tang, Jongsoo Park, Daehyun Kim, and Vladimir Petrov. 2012. A framework for low-communication 1-D FFT. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12). IEEE Computer Society Press, Los Alamitos, CA, USA [DOI] [Free PDF]
[via Intel blog]