GTC2013

Proceedings of 2011 International Conference for High Performance Computing (SC11)

29 November, 2011
Table of Contents
SC 2011 Keynote
Jen-Hsun Huang
doi>10.1145/2063384.2070751
Full text: Mp4 Awards, Mp4 Full Keynote, Mp4 General Chair
The supercomputing industry is in a global race for better science. This race has some parallels to the space race in the 1960s. Much like then, today there are daunting challenges facing today’s supercomputer designers in their pursuit of exascale and …
SESSION: ACM Gordon Bell finalist
First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer
Yukihiro Hasegawa, Jun-Ichi Iwata, Miwako Tsuji, Daisuke Takahashi, Atsushi Oshiyama, Kazuo Minami, Taisuke Boku, Fumiyoshi Shoji, Atsuya Uno, Motoyoshi Kurokawa, Hikaru Inoue, Ikuo Miyoshi, Mitsuo Yokokawa
Article No.: 1
doi>10.1145/2063384.2063386
Full text: PDF
Real space DFT (RSDFT) is a simulation technique most suitable for massively-parallel architectures to perform first-principles electronic-structure calculations based on density functional theory. We here report unprecedented simulations on the electron …
Atomistic nanoelectronic device engineering with sustained performances up to 1.44 PFlop/s
Mathieu Luisier, Timothy B. Boykin, Gerhard Klimeck, Wolfgang Fichtner
Article No.: 2
doi>10.1145/2063384.2063387
Full text: PDF
We present a multi-dimensional, atomistic, quantum transport simulation approach to investigate the performances of realistic nanoscale transistors for various geometries and material systems. The central computation consists in solving the Schrödinger …
Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer
Takashi Shimokawabe, Takayuki Aoki, Tomohiro Takaki, Toshio Endo, Akinori Yamanaka, Naoya Maruyama, Akira Nukada, Satoshi Matsuoka
Article No.: 3
doi>10.1145/2063384.2063388
Full text: PDF
The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting patterns in solidified metals would be indispensable. The phase-field simulation …
Petaflop biofluidics simulations on a two million-core system
Massimo Bernaschi, Mauro Bisson, Toshio Endo, Satoshi Matsuoka, Massimiliano Fatica, Simone Melchionna
Article No.: 4
doi>10.1145/2063384.2063389
Full text: PDF
We present a computational framework for multi-scale simulations of real-life biofluidic problems. The framework allows to simulate suspensions composed by hundreds of millions of bodies interacting with each other and with a surrounding fluid in complex …
A new computational paradigm in multiscale simulations: application to brain blood flow
Leopold Grinberg, Joseph A. Insley, Vitali Morozov, Michael E. Papka, George Em Karniadakis, Dmitry Fedosov, Kalyan Kumaran
Article No.: 5
doi>10.1145/2063384.2063390
Full text: PDF
Interfacing atomistic-based with continuum-based simulation codes is now required in many multiscale physical and biological systems. We present the computational advances that have enabled the first multiscale simulation on 190,740 processors by coupling …
SESSION: Dense linear algebra
Optimizing symmetric dense matrix-vector multiplication on GPUs
Rajib Nath, Stanimire Tomov, Tingxing “Tim” Dong, Jack Dongarra
Article No.: 6
doi>10.1145/2063384.2063392
Full text: PDF
GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product …
Tiled QR factorization algorithms
Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, Yves Robert
Article No.: 7
doi>10.1145/2063384.2063393
Full text: PDF
This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p × q tiles, where p ≥ q. Within this framework, we study the critical paths and performance of algorithms such as …
Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels
Azzam Haidar, Hatem Ltaief, Jack Dongarra
Article No.: 8
doi>10.1145/2063384.2063394
Full text: PDF
This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, …
SESSION: Domain specific languages
Liszt: a domain specific language for building portable mesh-based PDE solvers
Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley, Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken, Karthik Duraisamy, Eric Darve, Juan Alonso, Pat Hanrahan
Article No.: 9
doi>10.1145/2063384.2063396
Full text: PDF
Heterogeneous computers with processors and accelerators are becoming widespread in scientific computing. However, it is difficult to program hybrid architectures and there is no commonly accepted programming model. Ideally, applications should be written …
Simplified parallel domain traversal
Wesley Kendall, Jingyuan Wang, Melissa Allen, Tom Peterka, Jian Huang, David Erickson
Article No.: 10
doi>10.1145/2063384.2063397
Full text: PDF
Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel …
Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
Naoya Maruyama, Tatsuo Nomura, Kento Sato, Satoshi Matsuoka
Article No.: 11
doi>10.1145/2063384.2063398
Full text: PDF
This paper proposes a compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters. To enable such automatic translations, we design a small …
SESSION: GPU optimizations
CudaDMA: optimizing GPU memory bandwidth via warp specialization
Michael Bauer, Henry Cook, Brucek Khailany
Article No.: 12
doi>10.1145/2063384.2063400
Full text: PDF
As the computational power of GPUs continues to scale with Moore’s Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing …
Dymaxion: optimizing memory access patterns for heterogeneous systems
Shuai Che, Jeremy W. Sheaffer, Kevin Skadron
Article No.: 13
doi>10.1145/2063384.2063401
Full text: PDF
Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal …
GROPHECY: GPU performance projection from CPU code skeletons
Jiayuan Meng, Vitali A. Morozov, Kalyan Kumaran, Venkatram Vishwanath, Thomas D. Uram
Article No.: 14
doi>10.1145/2063384.2063402
Full text: PDF
We propose GROPHECY, a GPU performance projection framework that can estimate the performance benefit of GPU acceleration without actual GPU programming or hardware. Users need only to skeletonize pieces of CPU code that are targets for …
SESSION: Best paper finalists
Parallel random numbers: as easy as 1, 2, 3
John K. Salmon, Mark A. Moraes, Ron O. Dror, David E. Shaw
Article No.: 16
doi>10.1145/2063384.2063405
Full text: PDF
Most pseudorandom number generators (PRNGs) scale poorly to massively parallel high-performance computation because they are designed as sequentially dependent state transformations. We demonstrate that independent, keyed transformations of counters …
SESSION: Coordinating I/O
Server-side I/O coordination for parallel file systems
Huaiming Song, Yanlong Yin, Xian-He Sun, Rajeev Thakur, Samuel Lang
Article No.: 17
doi>10.1145/2063384.2063407
Full text: PDF
Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems …
QoS support for end users of I/O-intensive applications using shared storage systems
Xuechen Zhang, Kei Davis, Song Jiang
Article No.: 18
doi>10.1145/2063384.2063408
Full text: PDF
While the performance of compute-bound applications can be effectively guaranteed with techniques such as space sharing or QoS-aware process scheduling, it remains a challenge to meet QoS requirements for end users of I/O-intensive applications using …
Topology-aware data movement and staging for I/O acceleration on Blue Gene/P supercomputing systems
Venkatram Vishwanath, Mark Hereld, Vitali Morozov, Michael E. Papka
Article No.: 19
doi>10.1145/2063384.2063409
Full text: PDF
There is growing concern that I/O systems will be hard pressed to satisfy the requirements of future leadership-class machines. Even current machines are found to be I/O bound for some applications. In this paper, we identify existing performance bottlenecks …
SESSION: Power optimization
GreenSlot: scheduling energy consumption in green datacenters
Íñigo Goiri, Ryan Beauchea, Kien Le, Thu D. Nguyen, Md. E. Haque, Jordi Guitart, Jordi Torres, Ricardo Bianchini
Article No.: 20
doi>10.1145/2063384.2063411
Full text: PDF
In this paper, we propose GreenSlot, a parallel batch job scheduler for a datacenter powered by a photovoltaic solar array and the electrical grid (as a backup). GreenSlot predicts the amount of solar energy that will be available in the near future, …
A ‘cool’ load balancer for parallel applications
Osman Sarood, Laxmikant V. Kale
Article No.: 21
doi>10.1145/2063384.2063412
Full text: PDF
Meeting power requirements of huge exascale machines of the future will be a major challenge. Our focus in this paper is to minimize cooling power and we propose a technique that uses a combination of DVFS and temperature aware load balancing to constrain …
Reducing electricity cost through virtual machine placement in high performance computing clouds
Kien Le, Ricardo Bianchini, Jingru Zhang, Yogesh Jaluria, Jiandong Meng, Thu D. Nguyen
Article No.: 22
doi>10.1145/2063384.2063413
Full text: PDF
In this paper, we first study the impact of load placement policies on cooling and maximum data center temperatures in cloud service providers that operate multiple geographically distributed data centers. Based on this study, we then propose dynamic …
SESSION: Applications
Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems
Kamesh Madduri, Khaled Z. Ibrahim, Samuel Williams, Eun-Jin Im, Stephane Ethier, John Shalf, Leonid Oliker
Article No.: 23
doi>10.1145/2063384.2063415
Full text: PDF
The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production …
Unitary qubit lattice simulations of multiscale phenomena in quantum turbulence
George Vahala, Min Soe, Bo Zhang, Jeffrey Yepez, Linda Vahala, Jonathan Carter, Sean Ziegeler
Article No.: 24
doi>10.1145/2063384.2063416
Full text: PDF
A unitary qubit lattice algorithm, which scales almost perfectly to the full number of cores available (e.g., 216000 cores on a CRAY XT5), is used to examine quantum turbulence and its interrelationship to classical turbulence with production …
An image compositing solution at scale
Kenneth Moreland, Wesley Kendall, Tom Peterka, Jian Huang
Article No.: 25
doi>10.1145/2063384.2063417
Full text: PDF
The only proven method for performing distributed-memory parallel rendering at large scales, tens of thousands of nodes, is a class of algorithms called sort last. The fundamental operation of sort-last parallel rendering is an image composite, which …
SESSION: Large scale systems
The IBM Blue Gene/Q interconnection network and message unit
Dong Chen, Noel A. Eisley, Philip Heidelberger, Robert M. Senger, Yutaka Sugawara, Sameer Kumar, Valentina Salapura, David L. Satterfield, Burkhard Steinmacher-Burow, Jeffrey J. Parker
Article No.: 26
doi>10.1145/2063384.2063419
Full text: PDF
This is the first paper describing the IBM Blue Gene/Q interconnection network and message unit. The Blue Gene/Q system is the third generation in the IBM Blue Gene line of massively parallel supercomputers. The Blue Gene/Q architecture can be scaled …
High-efficiency server design
Eitan Frachtenberg, Ali Heydari, Harry Li, Amir Michael, Jacob Na, Avery Nisbet, Pierluigi Sarti
Article No.: 27
doi>10.1145/2063384.2063420
Full text: PDF
Large-scale data centers consume megawatts in power and cost hundreds of millions of dollars to equip. Reducing the energy and cost footprint of servers can therefore have substantial impact. Web, Grid, and cloud servers in particular can be hard to …
Using the TOP500 to trace and project technology and architecture trends
Peter M. Kogge, Timothy J. Dysart
Article No.: 28
doi>10.1145/2063384.2063421
Full text: PDF
The TOP500 is a treasure trove of information on the leading edge of high performance computing. It was used in the 2008 DARPA Exascale technology report to isolate out the effects of architecture and technology on high performance computing, and lay …
SESSION: Querying large scale data
I/O streaming evaluation of batch queries for data-intensive computational turbulence
Kalin Kanov, Eric Perlman, Randal Burns, Yanif Ahmad, Alexander Szalay
Article No.: 29
doi>10.1145/2063384.2063423
Full text: PDF
We describe a method for evaluating computational turbulence queries, including Lagrange Polynomial interpolation, based on partial sums that allows the underlying data to be accessed in any order and in parts. We exploit these properties to stream data …
Parallel index and query for large scale data analysis
Jerry Chou, Mark Howison, Brian Austin, Kesheng Wu, Ji Qiang, E. Wes Bethel, Arie Shoshani, Oliver Rübel, Prabhat, Rob D. Ryne
Article No.: 30
doi>10.1145/2063384.2063424
Full text: PDF
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing …
ISABELA-QA: query-driven analytics with ISABELA-compressed extreme-scale scientific data
Sriram Lakshminarasimhan, John Jenkins, Isha Arkatkar, Zhenhuan Gong, Hemanth Kolla, Seung-Hoe Ku, Stephane Ethier, Jackie Chen, C. S. Chang, Scott Klasky, Robert Latham, Robert Ross, Nagiza F. Samatova
Article No.: 31
doi>10.1145/2063384.2063425
Full text: PDF
Efficient analytics of scientific data from extreme-scale simulations is quickly becoming a top-notch priority. The increasing simulation output data sizes demand for a paradigm shift in how analytics is conducted. In this paper, we argue that query-driven …
SESSION: Checkpointing optimization
FTI: high performance fault tolerance interface for hybrid systems
Leonardo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cappello, Naoya Maruyama, Satoshi Matsuoka
Article No.: 32
doi>10.1145/2063384.2063427
Full text: PDF
Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. …
Checkpointing strategies for parallel jobs
Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, Frédéric Vivien
Article No.: 33
doi>10.1145/2063384.2063428
Full text: PDF
This work provides an analysis of checkpointing strategies for minimizing expected job execution times in an environment that is subject to processor failures. In the case of both sequential and parallel jobs, we give the optimal solution for exponentially …
BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots
Bogdan Nicolae, Franck Cappello
Article No.: 34
doi>10.1145/2063384.2063429
Full text: PDF
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization …
SESSION: GPU applications
Fast implementation of DGEMM on Fermi GPU
Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips, Yungang Bao, Ninghui Sun
Article No.: 35
doi>10.1145/2063384.2063431
Full text: PDF
In this paper we present a thorough experience on tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints …
Scalable fast multipole methods on distributed heterogeneous architectures
Qi Hu, Nail A. Gumerov, Ramani Duraiswami
Article No.: 36
doi>10.1145/2063384.2063432
Full text: PDF
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. …
Multi-science applications with single codebase – GAMER – for massively parallel architectures
Hemant Shukla, Hsi-Yu Schive, Tak-Pong Woo, Tzihong Chiueh
Article No.: 37
doi>10.1145/2063384.2063433
Full text: PDF
The growing need for power efficient extreme-scale high-performance computing (HPC) coupled with plateauing clock-speeds is driving the emergence of massively parallel compute architectures. Tens to many hundreds of cores are increasingly made available …
SESSION: Storage and memory
Virtual I/O caching: dynamic storage cache management for concurrent workloads
Michael Frasca, Ramya Prabhakar, Padma Raghavan, Mahmut Kandemir
Article No.: 38
doi>10.1145/2063384.2063435
Full text: PDF
A leading cause of reduced or unpredictable application performance in distributed systems is contention at the storage layer, where resources are multiplexed among many concurrent data intensive workloads. We target the shared storage cache, used to …
SCMFS: a file system for storage class memory
Xiaojian Wu, A. L. Narasimha Reddy
Article No.: 39
doi>10.1145/2063384.2063436
Full text: PDF
This paper considers the problem of how to implement a file system on Storage Class Memory (SCM), that is directly connected to the memory bus, byte addressable and is also non-volatile. In this paper, we propose a new file system, called SCMFS, which …
Optimized pre-copy live migration for memory intensive applications
Khaled Z. Ibrahim, Steven Hofmeyr, Costin Iancu, Eric Roman
Article No.: 40
doi>10.1145/2063384.2063437
Full text: PDF
Live migration is a widely used technique for resource consolidation and fault tolerance. KVM and Xen use iterative pre-copy approaches which work well in practice for commercial applications. In this paper, we study pre-copy live migration of MPI and …
SESSION: Performance evaluation and analysis
Scalable hashing for shared memory supercomputers
Eric Goodman, M. Nicole Lemaster, Edward Jimenez
Article No.: 41
doi>10.1145/2063384.2063439
Full text: PDF
Hashing is a fundamental technique in computer science to allow O(1) insert and lookups of items in an associative array. Here we present several thread coordination and hashing strategies and compare and contrast their performance on large, shared …
An early performance analysis of POWER7-IH HPC systems
Kevin J. Barker, Adolfy Hoisie, Darren J. Kerbyson
Article No.: 42
doi>10.1145/2063384.2063440
Full text: PDF
In this work we present a performance evaluation of the POWER7-IH processor and of integrated systems built from it. We describe the architecture of P7-IH with an emphasis on those characteristics that have a direct impact on the performance for large-scale …
A similarity measure for time, frequency, and dependencies in large-scale workloads
Mario Lassnig, Thomas Fahringer, Vincent Garonne, Angelos Molfetas, Martin Barisits
Article No.: 43
doi>10.1145/2063384.2063441
Full text: PDF
Performance evaluations of large-scale systems require the use of representative workloads with certifiable similar or dissimilar characteristics. To quantify the similarity of the characteristics, we describe a novel measure comprising two efficient …
SESSION: Reliability
Evaluating the viability of process replication reliability for exascale systems
Kurt Ferreira, Jon Stearley, James H. Laros, III, Ron Oldfield, Kevin Pedretti, Ron Brightwell, Rolf Riesen, Patrick G. Bridges, Dorian Arnold
Article No.: 44
doi>10.1145/2063384.2063443
Full text: PDF
As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these …
Modeling and tolerating heterogeneous failures in large parallel systems
Eric Heien, Derrick Kondo, Ana Gainaru, Dan LaPine, Bill Kramer, Franck Cappello
Article No.: 45
doi>10.1145/2063384.2063444
Full text: PDF
As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect …
System implications of memory reliability in exascale computing
Sheng Li, Ke Chen, Ming-Yu Hsieh, Naveen Muralimanohar, Chad D. Kersey, Jay B. Brockman, Arun F. Rodrigues, Norman P. Jouppi
Article No.: 46
doi>10.1145/2063384.2063445
Full text: PDF
Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. The use of error correction codes (ECC) …
SESSION: Scheduling and resource allocation
TRACON: interference-aware scheduling for data-intensive applications in virtualized environments
Ron C. Chiang, H. Howie Huang
Article No.: 47
doi>10.1145/2063384.2063447
Full text: PDF
Large-scale data centers leverage virtualization technology to achieve excellent resource utilization, scalability, and high availability. Ideally, the performance of an application running inside a virtual machine (VM) shall be independent of co-located …
Flexible resource allocation for reliable virtual cluster computing systems
Thomas J. Hacker, Kanak Mahadik
Article No.: 48
doi>10.1145/2063384.2063448
Full text: PDF
Virtualization and cloud computing technologies now make it possible to create scalable and reliable virtual high performance computing clusters. Integrating these technologies, however, is complicated by fundamental and inherent differences in …
Auto-scaling to minimize cost and meet application deadlines in cloud workflows
Ming Mao, Marty Humphrey
Article No.: 49
doi>10.1145/2063384.2063449
Full text: PDF
A goal in cloud computing is to allocate (and thus pay for) only those cloud resources that are truly needed. To date, cloud practitioners have pursued schedule-based (e.g., time-of-day) and rule-based mechanisms to attempt to automate this matching …
SESSION: Debugging
Large scale debugging of parallel tasks with AutomaDeD
Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Anh, Martin Schulz, Barry Rountree
Article No.: 50
doi>10.1145/2063384.2063451
Full text: PDF
Developing correct HPC applications continues to be a challenge as the number of cores increases in today’s largest systems. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application …
Efficient data race detection for distributed memory parallel programs
Chang-Seo Park, Koushik Sen, Paul Hargrove, Costin Iancu
Article No.: 51
doi>10.1145/2063384.2063452
Full text: PDF
In this paper we present a precise data race detection technique for distributed memory parallel programs. Our technique, which we call Active Testing, builds on our previous work on race detection for shared memory Java and C programs and it handles …
SESSION: Multicore architectural tools
Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation
Trevor E. Carlson, Wim Heirman, Lieven Eeckhout
Article No.: 52
doi>10.1145/2063384.2063454
Full text: PDF
Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations …
MAximum Multicore POwer (MAMPO): an automatic multithreaded synthetic power virus generation framework for multicore systems
Karthik Ganesan, Lizy K. John
Article No.: 53
doi>10.1145/2063384.2063455
Full text: PDF
The practically attainable worst case power consumption for a computer system is a significant design parameter and it is a very tedious process to determine it by manually writing high power consuming code snippets called power viruses. Previous research …
Multithreaded Global Address Space Communication Techniques for Gyrokinetic Fusion Applications on Ultra-Scale Platforms
Robert Preissl, Nathan Wichmann, Bill Long, John Shalf, Stephane Ethier, Alice Koniges
Article No.: 78
doi>10.1145/2063384.2071033
Full text: PDF
SESSION: Application performance
Performance of the community earth system model
Patrick H. Worley, Arthur A. Mirin, Anthony P. Craig, Mark A. Taylor, John M. Dennis, Mariana Vertenstein
Article No.: 54
doi>10.1145/2063384.2063457
Full text: PDF
The Community Earth System Model (CESM), released in June 2010, incorporates new physical process and new numerical algorithm options, significantly enhancing simulation capabilities over its predecessor, the June 2004 release of the Community Climate …
Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning
Samuel Williams, Leonid Oliker, Jonathan Carter, John Shalf
Article No.: 55
doi>10.1145/2063384.2063458
Full text: PDF
We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting …
Highly scalable ab initio genomic motif identification
Benoît Marchand, Vladimir B. Bajic, Dinesh K. Kaushik
Article No.: 56
doi>10.1145/2063384.2063459
Full text: PDF
We present results of scaling an ab initio motif family identification system, Dragon Motif Finder (DMF), to 65,536 processor cores of IBM Blue Gene/P. DMF seeks groups of mutually similar polynucleotide patterns within a …
SESSION: MapReduce
Hadoop acceleration through network levitated merge
Yandong Wang, Xinyu Que, Weikuan Yu, Dror Goldenberg, Dhiraj Sehgal
Article No.: 57
doi>10.1145/2063384.2063461
Full text: PDF
Hadoop is a popular open-source implementation of the MapReduce programming model for cloud computing. However, it faces a number of issues to achieve the best performance from the underlying system. These include a serialization barrier that delays …
Purlieus: locality-aware resource allocation for MapReduce in a cloud
Balaji Palanisamy, Aameek Singh, Ling Liu, Bhushan Jain
Article No.: 58
doi>10.1145/2063384.2063462
Full text: PDF
We present Purlieus, a MapReduce resource allocation system aimed at enhancing the performance of MapReduce jobs in the cloud. Purlieus provisions virtual MapReduce clusters in a locality-aware manner enabling MapReduce virtual machines (VMs) access …
A distributed look-up architecture for text mining applications using MapReduce
Atilla Soner Balkir, Ian Foster, Andrey Rzhetsky
Article No.: 59
doi>10.1145/2063384.2063463
Full text: PDF
We study text analysis algorithms that use global optimization methods to compute local characteristics that are consistent with properties of the entire corpus rather than computed locally based on exogenous parameters. In the iterative implementations …
SESSION: Molecular dynamics and computational physics
Copernicus: a new paradigm for parallel adaptive molecular dynamics
Sander Pronk, Per Larsson, Iman Pouya, Gregory R. Bowman, Imran S. Haque, Kyle Beauchamp, Berk Hess, Vijay S. Pande, Peter M. Kasson, Erik Lindahl
Article No.: 60
doi>10.1145/2063384.2063465
Full text: PDF
Biomolecular simulation is a core application on supercomputers, but it is exceptionally difficult to achieve the strong scaling necessary to reach biologically relevant timescales. Here, we present a new paradigm for parallel adaptive molecular dynamics …
Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime
Chao Mei, Yanhua Sun, Gengbin Zheng, Eric J. Bohm, Laxmikant V. Kale, James C. Phillips, Chris Harrison
Article No.: 61
doi>10.1145/2063384.2063466
Full text: PDF
A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large …
Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D quantum strongly-correlated systems
Susumu Yamada, Toshiyuki Imamura, Masahiko Machida
Article No.: 62
doi>10.1145/2063384.2063467
Full text: PDF
One of the most fascinating issues in modern condensed matter physics is to understand highly-correlated electronic structures and propose their novel device designs toward the reduced carbon-dioxide future. Among various developed numerical approaches …
SESSION: Applications
A scalable eigensolver for large scale-free graphs using 2D graph partitioning
Andy Yoo, Allison H. Baker, Roger Pearce, Van Emden Henson
Article No.: 63
doi>10.1145/2063384.2063469
Full text: PDF
Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs …
Scalable stochastic optimization of complex energy systems
Miles Lubin, Cosmin G. Petra, Mihai Anitescu, Victor Zavala
Article No.: 64
doi>10.1145/2063384.2063470
Full text: PDF
We present a scalable approach and implementation for solving stochastic programming problems, with application to the optimization of complex energy systems under uncertainty. Stochastic programming is used to make decisions in the present while incorporating …
Parallel breadth-first search on distributed memory systems
Aydin Buluç, Kamesh Madduri
Article No.: 65
doi>10.1145/2063384.2063471
Full text: PDF
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First …
SESSION: MapReduce and network QoS
SciHadoop: array-based query processing in Hadoop
Joe B. Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, Scott Brandt
Article No.: 66
doi>10.1145/2063384.2063473
Full text: PDF
Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop’s byte stream data model causes inefficiencies when used to process scientific data that …
On the duality of data-intensive file system design: reconciling HDFS and PVFS
Wittawat Tantisiriroj, Seung Woo Son, Swapnil Patil, Samuel J. Lang, Garth Gibson, Robert B. Ross
Article No.: 67
doi>10.1145/2063384.2063474
Full text: PDF
Data-intensive applications fall into two computing styles: Internet services (cloud computing) or high-performance computing (HPC). In both categories, the underlying file system is a key component for scalable application performance. In this paper, …
End-to-end network QoS via scheduling of flexible resource reservation requests
Sushant Sharma, Dimitrios Katramatos, Dantong Yu
Article No.: 68
doi>10.1145/2063384.2063475
Full text: PDF
Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. …
SESSION: QCD and DFT
High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach
Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Jee Choi, Bálint Joó, Jatin Chhugani, Michael A. Clark, Pradeep Dubey
Article No.: 69
doi>10.1145/2063384.2063477
Full text: PDF
Lattice Quantum Chromo-dynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed …
Scaling lattice QCD beyond 100 GPUs
R. Babich, M. A. Clark, B. Joó, G. Shi, R. C. Brower, S. Gottlieb
Article No.: 70
doi>10.1145/2063384.2063478
Full text: PDF
Over the past five years, graphics processing units (GPUs) have had a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations in nuclear and particle physics. While GPUs have been applied with great success to the post-Monte …
Large scale plane wave pseudopotential density functional theory calculations on GPU clusters
Long Wang, Yue Wu, Weile Jia, Weiguo Gao, Xuebin Chi, Lin-Wang Wang
Article No.: 71
doi>10.1145/2063384.2063479
Full text: PDF
In this work, we present our implementation of the density functional theory (DFT) plane wave pseudopotential (PWP) calculations on GPU clusters. This GPU version is developed based on a CPU DFT-PWP code: PEtot, which can calculate up to a thousand atoms …
SESSION: Applications
Scalable implementations of accurate excited-state coupled cluster theories: application of high-level methods to porphyrin-based systems
Karol Kowalski, Sriram Krishnamoorthy, Ryan M. Olson, Vinod Tipparaju, E. Aprà
Article No.: 72
doi>10.1145/2063384.2063481
Full text: PDF
The development of reliable tools for excited-state simulations is very important for understanding complex processes in the broad class of light harvesting systems and optoelectronic devices. Over the last few years we have been developing equation of motion …
Hardware/software co-design for energy-efficient seismic modeling
Jens Krueger, David Donofrio, John Shalf, Marghoob Mohiyuddin, Samuel Williams, Leonid Oliker, Franz-Josef Pfreund
Article No.: 73
doi>10.1145/2063384.2063482
Full text: PDF
Reverse Time Migration (RTM) has become the standard for high-quality imaging in the seismic industry. RTM relies on PDE solutions using stencils that are 8th order or larger, which require large-scale HPC clusters to meet the …
A fast solver for modeling the evolution of virus populations
Gerhard Niederbrucker, Wilfried N. Gansterer
Article No.: 74
doi>10.1145/2063384.2063483
Full text: PDF
Solving Eigen’s quasispecies model for the evolution of virus populations involves the computation of the dominant eigenvector of a matrix whose size N grows exponentially with the chain length of the virus to be modeled. Most biologically interesting …
SESSION: Optimizing communication performance
Optimizing the Barnes-Hut algorithm in UPC
Junchao Zhang, Babak Behzad, Marc Snir
Article No.: 75
doi>10.1145/2063384.2063485
Full text: PDF
PGAS languages’ support of a global name space facilitates the expression of parallel algorithms, since communication is implicit. This is especially convenient when writing irregular applications with data-dependent, dynamically changing communication …
Avoiding hot-spots on two-level direct networks
Abhinav Bhatele, Nikhil Jain, William D. Gropp, Laxmikant V. Kale
Article No.: 76
doi>10.1145/2063384.2063486
Full text: PDF
A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM’s PERCS topology and the dragonfly …
Improving communication performance in dense linear algebra via topology aware collectives
Edgar Solomonik, Abhinav Bhatele, James Demmel
Article No.: 77
doi>10.1145/2063384.2063487
Full text: PDF
Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient …

Category: Articles, Computer Science, Life Science, Physical Science, SC11
