Feb 22, 2008 (02:02 PM EST)
'Exaflop' Supercomputer Planning Begins
Read the Original Article at InformationWeek
Researchers at Sandia and Oak Ridge National Laboratories are preparing for the challenges of developing an exascale computer at the new Institute for Advanced Architectures.
Through the IAA, scientists plan to conduct the basic research required to create a computer capable of performing a million trillion calculations per second, otherwise known as an exaflop. That's a million times faster than today's teraflop computers and a thousand times faster than a petaflop, a barrier broken in 2006.
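The scale factors quoted above follow directly from the prefixes: tera is 10^12, peta 10^15, and exa 10^18. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the scale factors quoted in the article.
TERAFLOP = 10**12  # floating-point operations per second
PETAFLOP = 10**15
EXAFLOP = 10**18   # "a million trillion calculations per second"

print(EXAFLOP // TERAFLOP)  # 1,000,000: a million times a teraflop machine
print(EXAFLOP // PETAFLOP)  # 1,000: a thousand times a petaflop
```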
Sandia's ASCI Red became the world's first teraflop computer in late 1996.
Backed by $7.4 million in funding, the institute's computer scientists aim to design new architectures that narrow the gap between a machine's theoretical peak performance and the performance applications actually achieve.
"We're actually not building an exaflop supercomputer," said Sandia project lead Sudip Dosanjh. Rather, he said, the U.S. Department of Energy and the National Security Agency have made it clear that they expect to need exaflop computing around 2018. The anticipated applications, he said, include large-scale simulation and prediction, such as global climate modeling, materials science analysis, fusion research, and national security problems that he could not discuss.
To meet those requirements, "there are a number of research challenges we need to get to work on," said Dosanjh. "We really need to do that in collaboration with industry and academia. We want to do R&D that will impact real systems in the next decade."
One such challenge is power consumption. "An exaflop supercomputer might need 100 megawatts of power, which is a significant portion of a power plant," said Dosanjh. "We need to do some research to get that down. Otherwise no one will be able to power one."
Then there's the issue of reliability, which tends to decline as the part count increases. Given that an exascale computer might contain a million hundred-core processors, Dosanjh speculated that such a machine might run for only 10 minutes before suffering a failure somewhere. Managing a machine with so many parts will require new fault-tolerance schemes.
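Dosanjh's 10-minute figure follows from the standard serial-reliability model, in which the whole machine fails whenever any part fails, so system MTBF (mean time between failures) shrinks in proportion to the part count. A minimal sketch; the per-processor MTBF below is an illustrative assumption, not a figure from the article:

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """Serial-reliability model: assuming independent failures, the
    system fails when any single component fails, so the system MTBF
    is the component MTBF divided by the number of components."""
    return component_mtbf_hours / n_components

# Illustrative assumption: each processor averages ~20 years between failures.
per_cpu_mtbf_hours = 20 * 365 * 24   # 175,200 hours
n_cpus = 1_000_000                   # a million processors, as in the article

mtbf = system_mtbf_hours(per_cpu_mtbf_hours, n_cpus)
print(f"System MTBF: {mtbf * 60:.1f} minutes")  # ~10.5 minutes
```

Even with very reliable individual parts, a million of them in series yields a system that fails every few minutes, which is why checkpointing and other fault-tolerance schemes become central at this scale.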
Data movement is also a critical concern, said Dosanjh. "The rate of memory access has not kept up with the ability of these processors to do floating point operations," he said.
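The imbalance Dosanjh describes is often expressed as machine balance: the bytes of memory bandwidth available per floating-point operation. A rough sketch with illustrative circa-2008 numbers (assumptions for this example, not figures from the article):

```python
def machine_balance(mem_bw_bytes_per_s, flops_per_s):
    """Bytes of memory traffic the machine can sustain per flop.
    Values well below 1 byte/flop mean most kernels spend their time
    waiting on memory rather than doing arithmetic."""
    return mem_bw_bytes_per_s / flops_per_s

# Illustrative assumptions for a circa-2008 compute node:
balance = machine_balance(mem_bw_bytes_per_s=10e9,  # ~10 GB/s per socket
                          flops_per_s=40e9)         # ~40 Gflop/s peak
print(f"{balance:.2f} bytes/flop")  # 0.25 bytes/flop
```

As peak flop rates grow faster than memory bandwidth, this ratio keeps falling, which is the trend the IAA researchers want new architectures to counter.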
And in addition to the hardware engineering challenges, programmers have to be educated to write code for such massively parallel systems. "As far as the industry is concerned, there needs to be an education effort as well to get people trained to write software at this scale," said Dosanjh.
Just such an effort is already under way. Last October, Google and IBM launched an educational initiative to teach programmers at several universities how to code for large-scale distributed computing systems.
The IAA had its initial meeting in January, attended by almost 50 representatives from government, academia, and industry. The topic of discussion was memory in high-performance computing. At the organization's next meeting, Dosanjh said researchers will discuss interconnects, the networks inside supercomputers.