The Future of Computing
Moore’s Law is a techno-economic model that has enabled the information technology
industry to double the performance and functionality of digital electronics roughly every 2
years within a fixed cost, power and area. Advances in silicon lithography have enabled
this exponential miniaturization of electronics, but, as transistors reach atomic scale and
fabrication costs continue to rise, the classical technological driver that has underpinned
Moore’s Law for 50 years is failing and is anticipated to flatten by 2025. This article
provides an updated view of what a post-exascale system will look like and the challenges
ahead, based on our most recent understanding of technology roadmaps. It also discusses
the tapering of historical improvements, and how it affects options available to continue
scaling of successors to the first exascale machine. Lastly, this article covers the many
different opportunities and strategies available to continue computing performance
improvements in the absence of historical technology drivers.
This article is part of a discussion meeting issue ‘Numerical algorithms for
high-performance computational science’.
1. Introduction
Society has come to depend on the rapid, predictable and affordable scaling of computing
performance for consumer electronics, the rise of ‘big data’ and data centres (Google,
Facebook), scientific discovery and national security. There are many other parts of the
economy and economic development that are intimately linked with these dramatic
improvements in information technology (IT) and computing, such as avionics systems for
aircraft, the automotive industry (e.g. self-driving cars) and smart grid technologies. The
approaching end of lithographic scaling threatens to hinder the continued health of the $4 trillion electronics industry, impacting many related fields that depend on computing and electronics.
Figure 1. The most recent ITRS report predicts transistor scaling will end in 2021 (a decade sooner than was predicted in 2013). Figure from ITRS. (Online version in colour.)
Moore’s Law [1] is a techno-economic model that has enabled the IT industry to double the
performance and functionality of digital electronics roughly every 2 years within a fixed cost,
power and area. This expectation has led to a relatively stable ecosystem (e.g. electronic design
automation tools, compilers, simulators and emulators) built around general-purpose processor
technologies, such as the ×86, ARM and Power instruction set architectures. However, within a
decade, the technological underpinnings for the process that Gordon Moore described will come
to an end, as lithography approaches the atomic scale. At that point, lithographically produced devices will have dimensions nearing atomic scale, with a dozen or fewer silicon atoms across critical device features, which represents a practical limit for implementing logic gates for digital computing [2]. Indeed, the ITRS (International
Technology Roadmap for Semiconductors), which has tracked the historical improvements over
the past 30 years, has projected no improvements beyond 2021, as shown in figure 1, and
subsequently disbanded, having no further purpose. The classical technological driver that has
underpinned Moore’s Law for the past 50 years is failing [3] and is anticipated to flatten by
2025, as shown in figure 2. Evolving technology in the absence of Moore’s Law will require an
investment now in computer architecture and the basic sciences (including materials science),
to study candidate replacement materials and alternative device physics to foster continued
technology scaling.
Figure 2. Trends in performance, clock frequency, power (watts) and number of cores, 1975–2030. Sources of computing performance have been challenged by the end of Dennard scaling in 2004. All additional approaches to further performance improvements end in approximately 2025 due to the end of the roadmap for improvements to semiconductor lithography. Figure from Kunle Olukotun, Lance Hammond, Herb Sutter and Mark Horowitz, extended by John Shalf. (Online version in colour.)
In principle, the energy per operation of future devices could be six orders of magnitude smaller than today's devices. As we approach the longer term,
we will require ground-breaking advances in device technology going beyond CMOS (arising
from fundamentally new knowledge of control pathways), system architecture and programming
models to allow the energy benefits of scaling to be realized. The history of the silicon fin field-effect transistor (FinFET) suggests that it takes about 10 years for an advance in basic device physics to reach mainstream use. Therefore, any new technology will require a long lead-time and sustained
R&D of one to two decades. Options abound, the race outcome is undecided, and the prize
is invaluable. The winner not only will influence chip technology, but also will define a new
direction for the entire computing industry and many other industries that have come to depend
heavily on computing technology.
There are numerous paths forward to continue performance scaling in the absence of
lithographic scaling, as shown in figure 3. These three axes represent different technology scaling
paths that could be used to extract additional performance beyond the end of lithographic
scaling. The near-term focus will be on development of ever more specialized architectures and
advanced packaging technologies that arrange existing building blocks (the horizontal axis of
figure 3). In the mid-term, emphasis will likely be on developing CMOS-based devices that
extend into the third, or vertical, dimension and on improving materials and transistors that
will enhance performance by creating more efficient underlying logic devices. The third axis
represents opportunities to develop new models of computation such as neuro-inspired or
quantum computing, which solve problems that are not well addressed by digital computing.
Digital computing remains unmatched for operations that require precise, reproducible calculations that are accurate within the precision limit of the digital representation. Brain-inspired computational methods such as machine learning have substantially improved our
ability to recognize patterns in ‘big data’ and automate data mining processes over traditional
pattern recognition algorithms, but they are less reliable for handling operations that require
precise response and reproducibility (even ‘explainability’ for that matter). Quantum computing
will expand our ability to solve combinatorially complex problems in polynomial time, but it will not be much good for word processing or graphics rendering, for example. It is quite exciting
and gratifying to see computing expand into new spaces, but equally important to know the
complementary role that digital computing plays in our society that is not and cannot be replaced
by these emerging modes of computation.
Quantum and brain-inspired technologies have garnered much attention recently due to
their rapid pace of recent improvements. Much of the advanced architecture development, and many of the new startup companies in the digital computing space, are targeting the artificial intelligence/machine learning (AI/ML) market because of its explosive growth rate. Growth markets are far
more appealing business opportunities for companies and venture capital, as they offer a path
to profit growth, whereas a large market that is static invites competition that slowly erodes
profits over time. As a result, there is far more attention paid to technologies that are seeing a
rapid rate of expansion, even in cases where the market is still comparatively small. So interest
in quantum computing and AI/ML is currently superheated due to market opportunities, but it
is still urgent to advance digital computing even as we pursue these new computing directions.
Neither quantum nor brain-inspired architectures are replacement technologies for functionality
that digital technologies are good at. Indeed, current AI/ML solutions are deeply dependent
upon digital computing technology, and if there is any lesson to be learned from the diversity
of AI/ML hardware solutions, it is that architecture specialization and custom hardware are very effective—the topic of the next section.
3. Architectural specialization
In the near term, the most practical path to continued performance growth will be architectural
specialization in the form of many different kinds of accelerators. We believe this to be
true because historically it has taken approximately 10 years for a new transistor concept
demonstrated in the laboratory to become incorporated into a commercial fabrication process.
Our US Office of Science and Technology Policy (OSTP) report with Robert Leland surveyed the
landscape of potential CMOS-replacement technologies and found many potential candidates [4],
but no obvious replacement has yet been demonstrated in the laboratory. Therefore, we are already a decade too late to resolve this crisis by finding a scalable post-CMOS path forward.
The only hardware option for the coming decade will be architectural specialization and advanced packaging, for lack of a credible alternative. In the past, hardware specialization struggled to compete against an exponentially improving general-purpose computing ecosystem, and its long lead-times and high development costs made the path unproductive to pursue. However, as Thompson & Spanuth's [6]
article on the evaluation of the economics of Moore’s Law points out, the tapering of Moore’s Law
improvements makes architecture specialization a credible and economically viable alternative
to fully general-purpose computing, but such a path will have a profound effect on algorithm
development and on the programming environment.
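To see why slowing general-purpose improvement changes this calculus, consider the following back-of-the-envelope sketch (the rates and speedup are illustrative assumptions, not figures from [6]): a specialized chip delivering a fixed one-time speedup stays ahead of general-purpose hardware improving at annual rate r for roughly log(speedup)/log(1 + r) years, so a slower r lengthens the useful lifetime over which the development cost can be recouped.

```python
# Illustrative sketch (assumed numbers, not from [6]): how long a fixed one-time
# speedup from a specialized chip stays ahead of general-purpose hardware that
# improves at an annual rate r.
import math

def useful_lifetime_years(speedup, annual_improvement):
    return math.log(speedup) / math.log(1.0 + annual_improvement)

for r in (0.40, 0.10):   # Moore's-Law-era vs post-Moore improvement rates (assumed)
    years = useful_lifetime_years(speedup=10, annual_improvement=r)
    print(f"general-purpose improving {r:.0%}/year: "
          f"10x accelerator stays ahead ~{years:.1f} years")
```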
Therefore, in the absence of any miraculous new transistor or other device to enable continued technology scaling, the only tool left to a computer architect for extracting continued performance improvements is to use transistors more efficiently by specializing the architecture to the target scientific problem(s). Overall, there is strong consensus that the tapering of Moore's
Law will lead to a broader range of accelerators or specialization technologies than we have
seen in the past three decades. Examples of this trend exist in smartphone technologies, which
contain dozens of specialized accelerators co-located on the same chip; in hardware deployed
in massive data centres, such as Google’s Tensor Processing Unit (TPU), which accelerates the
Tensorflow programming framework for ML tasks; in field-programmable gate arrays (FPGAs)
in the Microsoft Cloud used for Bing search and other applications; and a vast array of other deep
learning accelerators. The industry is already moving forward with production implementation of
diverse acceleration in the AI and ML markets (e.g. Google TPU [7], Nervana’s AI architecture [8],
Facebook’s Big Sur [9]) and other forms of compute-in-network acceleration for mega-data
centres (e.g. Microsoft’s FPGA Configurable Cloud and Project Catapult for FPGA-accelerated
search [10]). Even before the explosive growth in the AI/ML market, system-on-chip (SoC)
vendors for embedded, Internet of things (IoT) and smartphone applications were already
pursuing specialization to good effect. Shao et al. [11] from Harvard University tracked the number of specialized accelerators in iPhone chips and found steady growth in discrete hardware accelerator units, from around 22 accelerators in Apple's 6th-generation iPhone SoC to well over 40 in its 11th-generation chip. Companies engage in this practice of developing such diverse heterogeneous accelerators because the strategy works!
There have also been demonstrated successes in creating science-targeted accelerators such
as D.E. Shaw’s Anton, which accelerates molecular dynamics (MD) simulations nearly 180×
over contemporary high-performance computing (HPC) systems [12], and the GRAPE series
of specialized accelerators for cosmology and MD [13]. A recent International Symposium on
Computer Architecture workshop on the future of computing research beyond 2030 (http://
arch2030.cs.washington.edu/) concluded that heterogeneity and diversity of architecture are
nearly inevitable given current architecture trends. This trend toward co-packaging of diverse
‘extremely heterogeneous’ accelerators is already well under way, as shown in figure 4.
Therefore, specialization is the most promising technique for continuing to provide the
year-on-year performance increases required by all users of scientific computing systems, but
specialization needs to have a well-defined application target to specialize for. This creates
a particular need for the sciences to focus on the unique aspects of scientific computing
for both analysis and simulation. Recent communications with computing industry leaders suggest that post-exascale HPC platforms will become increasingly heterogeneous environments.
Figure 4. Architectural specialization and extreme heterogeneity are anticipated to be the near-term response to the end of classical technology scaling. The panels progress from past homogeneous architectures, through present CPU+GPU and heterogeneous architectures, to future post-CMOS extreme heterogeneity in architecture, devices and memory. Figure courtesy of Dilip Vasudevan from LBNL. (Online version in colour.)
Heterogeneous processor accelerators—whether they are commercial designs (evolutions of
GPU or CPU technologies), emerging reconfigurable hardware or bespoke architectures that
are customized for specific science applications—optimize hardware and software for particular
tasks or algorithms and enable performance and/or energy efficiency gains that would not be
realized using general-purpose approaches. These long-term trends in the underlying hardware
technology (driven by the physics) are creating daunting challenges for maintaining the
productivity and continued performance scaling of HPC codes on future systems.
As a means to organize the universe of options available, we subdivide the solution space into
three different strategies:
(i) Hardware-driven algorithm design: where we evaluate emerging accelerators in the context
of workload, and modify algorithms to take full advantage of new accelerators.
(ii) Algorithm-driven hardware design: where we design largely fixed-function accelerators
based on algorithm or application requirements.
(iii) Co-develop hardware and algorithms: this represents a cooperative design with a selected
industry partner or partnership to design algorithms and hardware together.
For hardware-driven algorithm design, we recognize that the industry will continue to
produce accelerators that are targeted at other markets such as ML applications. In the near
future, GPUs, accelerators (NVIDIA, AMD/ATI, Intel) and multi-core processors with wide-
vector extensions (such as ARM SVE and Intel's AVX512) will continue to dominate. However, the boost offered by GPUs and wide-vector extensions to CPUs is a one-time jump in performance; it does not offer a new exponential growth path. A number of emerging extensions target the burgeoning AI workloads, such as NVIDIA's tensor extensions in the V100 GPU. Such extensions implement very specific tensor operations at much lower (16-bit and 8-bit) precision, which may limit their usefulness unless algorithms are completely redesigned to exploit these features (where possible). While this puts
the primary burden upon the algorithm and application developers, to some extent this is the
strategy that has more or less been common practice since the ‘attack of the killer micros’
transformed the HPC landscape from purpose-built vector machines to clusters of commercial
off-the-shelf (COTS) nodes nearly 3 decades ago.
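As one concrete illustration of such algorithm redesign (a hypothetical sketch, not an example from the article), mixed-precision iterative refinement performs the expensive solve in low precision—standing in here for 16-bit tensor-unit arithmetic—and recovers double-precision accuracy with a cheap residual-correction loop.

```python
# Minimal sketch of mixed-precision iterative refinement (illustrative only).
# float32 stands in for low-precision tensor-unit arithmetic; a real implementation
# would factor the low-precision matrix once and reuse that factorization.
import numpy as np

def mixed_precision_solve(A, b, iters=10):
    A_lo = A.astype(np.float32)
    x = np.linalg.solve(A_lo, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in double precision
        d = np.linalg.solve(A_lo, r.astype(np.float32))  # cheap low-precision correction
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500)) + 500 * np.eye(500)  # well-conditioned test matrix
b = rng.standard_normal(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # residual close to double-precision accuracy
```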
Algorithm-driven hardware design would mark a return to past practices of designing
purpose-built machines for targeted high-value workloads. As mentioned earlier, the rapid
growth and diversity in specialized AI architectures (Google TPU and others) as well as isolated
examples in the sciences (D.E. Shaw's Anton, SpiNNaker, etc.) demonstrate that this approach
can offer a path to performance growth. However, the development costs are high (tens to
hundreds of millions of dollars per system in today’s technology market), it requires long
development lead times, and it risks having the application requirements shift so as to make
the hardware obsolete. This concern has caused an increased interest in reconfigurable hardware
such as FPGAs and coarse-grained reconfigurable arrays (CGRAs). These devices allow the logic
and specializations within the chip to be reconfigured rather than having to build a new chip. The
challenge with FPGAs is that the extreme flexibility to enable hardware reconfiguration comes at
a cost of 5× slower clock rates (typical designs run at 200 MHz rather than at the gigahertz clock
rates expected of custom logic) and a reduction of effective logic density (number of usable gates
per chip) by a similar factor. CGRAs, such as Stanford’s Plasticine [14], mitigate these problems
by offering a coarser granularity of reconfiguration where the building blocks are full floating-
point adders and multipliers rather than individual wires and gates offered by the FPGAs. The
biggest challenge to making these devices useful is that their tools and programming models are extraordinarily difficult to use, and it requires a lot of effort to get even simple algorithms to perform well. There is a lot of work going into developing more agile hardware design methodologies, such as higher-level hardware development languages (e.g. Chisel, PyMTL) and more design automation, to reduce the human effort and make production of custom chips more affordable.
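The following back-of-the-envelope sketch (with assumed, illustrative numbers) shows why the roughly 5× clock and logic-density penalty of reconfigurable logic can still pay off: a deeply specialized datapath can keep thousands of operations in flight per cycle, far more than a general-purpose core.

```python
# Back-of-the-envelope comparison (assumed numbers) of a wide-vector CPU core
# against a slower-clocked but deeply specialized reconfigurable datapath.
def throughput(clock_hz, ops_per_cycle):
    return clock_hz * ops_per_cycle

cpu  = throughput(3.0e9, 16)     # 3 GHz core, 16 ops/cycle (wide vectors)
fpga = throughput(0.2e9, 2000)   # 200 MHz, but a pipeline spanning thousands of ops/cycle
print(f"specialized datapath vs CPU core: {fpga / cpu:.1f}x")
```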
The third option of deeper co-design is less of a technological option than it is a new
economic model for interacting with the industry that produces computer systems and the
potential customers of said technologies. The era of general-purpose computing led to a more
or less hands-off relationship between technology suppliers and their customers, as documented
by Thompson & Spanuth [6], where a general-purpose processor could serve many different
applications. In an era where specializing hardware to the application is the only means of
performance improvement, the economic model for the design of future systems is going to
need to change dramatically to lower design and verification costs for the development of new
hardware. Otherwise, the future predicted by economists such as Thompson is one where high-
value markets such as AI for Google and Facebook will be able to afford to create custom
hardware (the fast lane) and the rest of the market will receive no such boosts (remaining in
the slow lane). To prevent this kind of future from happening, the industry is adopting more
agile hardware production methods such as chiplets. Rather than fabricating a single large piece of silicon that integrates all of the diverse accelerators comprising the customized hardware, the chiplet approach breaks each piece of functionality into a very small tile. These chiplets/tiles are then stitched together into a mosaic by bonding them to a common silicon substrate, enabling manufacturers to rapidly assemble combinations of chiplets that serve diverse specialized applications at much lower cost and with much faster turn-around. However, this
approach falls down if the desired functionality does not already exist in the available chiplets.
Perhaps in the future the ‘algorithm-driven hardware design’ and this chiplets approach might
be able to meet in the middle to bring forth a new economic model that can enable productive
architecture specialization for small markets, such as Dr Sophia Shao’s [11] vision for her Aladdin
integrated hardware specialization/design environment.
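A toy sketch of chiplet-style composition (all chiplet names and parameters below are hypothetical) shows the basic workflow—and the failure mode noted above, where a requested function has no matching chiplet in the catalogue and would still require a costly custom design.

```python
# Hypothetical chiplet catalogue and package composition (illustrative only).
catalogue = {
    "cpu":      {"area_mm2": 20, "interface": "die-to-die"},
    "gpu":      {"area_mm2": 60, "interface": "die-to-die"},
    "hbm":      {"area_mm2": 40, "interface": "die-to-die"},
    "ml-accel": {"area_mm2": 25, "interface": "die-to-die"},
}

def compose_package(requested):
    tiles   = [fn for fn in requested if fn in catalogue]
    missing = [fn for fn in requested if fn not in catalogue]
    return tiles, missing

tiles, missing = compose_package(["cpu", "hbm", "ml-accel", "genomics-accel"])
print("tiles bonded to the common substrate:", tiles)
print("no existing chiplet (custom design needed):", missing)
```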
Figure 5. The energy consumption of compute and data movement operations at different levels of the compute hierarchy—from the arithmetic logic unit on the left to system-scale data movement across the interconnect on the right—compared for the 2008 (45 nm) and 2018 (11 nm) technology generations. As lithography has improved, the energy efficiency of wires has not improved as fast as the efficiency of transistors. Consequently, moving two operands just 2 mm across a silicon chip consumes more energy than the floating-point operation performed upon them. (Online version in colour.)
Figure 6. The three primary enabling optical technologies for system-wide disaggregation for data centres—efficient comb
laser sources, photonic MCMs, and optical circuit switches for bandwidth steering to reconfigure the MCMs. System-wide
resource disaggregation offers a path to co-integrating diverse technologies to support diverse workload requirements; this
is being developed by industry/academic collaborations such as the ARPAe ENLITENED PINE efficient data centres project, led by
Keren Bergman of Columbia University. (Online version in colour.)
Whereas past photonic development has emphasized energy efficiency (e.g. lower picojoules per bit), these emerging workloads and technology trends will shift the emphasis to other metrics such as bandwidth density (as opposed to bandwidth alone),
reduced latency and performance consistency. For example, copper-based signalling technologies currently top out at 54 gigabits/second per wire and are struggling to double that figure—with the roadmap slipping by nearly 2 years at this point. By contrast, a single optical fibre can carry 1–10 terabits/second of bandwidth by carrying many non-interfering channels down the same path using a different colour of light for each channel—roughly two orders of magnitude more carrying capacity per channel than a copper wire.
However, such metrics cannot be accomplished with device improvements alone, but require a
systems view of photonics in computing platforms.
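A quick arithmetic check of the figures quoted above (54 gigabits/second per copper wire versus 1–10 terabits/second per optical fibre) confirms that the gap in per-channel carrying capacity is roughly two orders of magnitude:

```python
# Quick check using the numbers stated above.
copper_gbps = 54                        # gigabits/second per copper wire
fibre_tbps = (1, 10)                    # terabits/second per optical fibre
for tbps in fibre_tbps:
    ratio = tbps * 1000 / copper_gbps   # convert Tb/s to Gb/s, then compare
    print(f"{tbps} Tb/s fibre vs 54 Gb/s copper: {ratio:.0f}x")
# prints roughly 19x and 185x, i.e. about two orders of magnitude
```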
Data centres support diverse workloads by purchasing from a limited menu of application-area-tailored node designs (e.g. big compute node, big DRAM node and big NVRAM node) and allocating resources based on instantaneous workload requirements. However, this can lead
to marooned resources when the system runs out of one of those node types and is under-
using other node types due to the ephemeral requirements of the workload. The ‘disaggregated
rack’ involves purchasing the individual components and allocating the resources dynamically
from these different node types on an as-needed basis across the rack [23,24]. Data centres are
motivated to support this kind of disaggregation because it enables more flexible sharing of
hardware resources. However, a conventional Ethernet fabric is a severe inhibitor to efficient
resource sharing. Substantial increases in bandwidth density will be required.
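A toy model (with made-up node counts and job sizes) illustrates the marooned-resource effect described above: with a fixed menu of node types, a memory-hungry job mix exhausts the big-DRAM nodes while compute nodes sit idle, whereas a disaggregated pool that allocates cores and memory independently places more of the same jobs.

```python
# Toy model (made-up node counts and job sizes) of marooned resources with a fixed
# menu of node types versus a disaggregated pool allocating cores and DRAM independently.
compute_nodes = 10    # each: 64 cores, 256 GB DRAM
memory_nodes  = 10    # each: 16 cores, 2048 GB DRAM
n_jobs, job_cores, job_mem_gb = 12, 16, 1500   # memory-hungry job mix

# Fixed menu: each job only fits on a "big DRAM" node, so compute nodes sit idle.
placeable_fixed = min(memory_nodes, n_jobs)

# Disaggregated rack: draw cores and memory from shared pools across the rack.
pool_cores = compute_nodes * 64 + memory_nodes * 16
pool_mem   = compute_nodes * 256 + memory_nodes * 2048
placeable_disagg = min(pool_cores // job_cores, pool_mem // job_mem_gb, n_jobs)

print(placeable_fixed, placeable_disagg)   # 10 vs 12 jobs placed
```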
Numerous projects have been working on using high-bandwidth-density photonics to
enable this kind of system-wide resource disaggregation by pumping up the off-package data
bandwidths [25]. For example, PINE (Photonic Integrated Networked Energy efficient data
centres) is an ARPAe ENLITENED project led by Keren Bergman of Columbia University and
involving numerous industry and university partners, including NVIDIA, Microsoft, Cisco,
University of California–Santa Barbara (UCSB), Lawrence Berkeley National Laboratory (LBNL)
and Freedom Photonics [26,27]. The three principal elements of the project, shown in figure 6, are efficient comb laser sources delivering ultra-high bandwidth density (multiple terabits/second of bandwidth per fibre), photonic MCMs that co-package these links with compute accelerators and memory, and optical circuit switches for bandwidth steering.
This approach could revolutionize the use of resource disaggregation within the data centre to
overcome the challenges of co-integrating extremely heterogeneous accelerators. These efforts
will likely coevolve with new architectural approaches that better tailor computing capability
to specific problems, driven principally by large economic forces associated with the global IT
market.
Figure 7. LBNL’s prototype deep codesign framework to accelerate the discovery of CMOS replacement technologies. (Online
version in colour.)
The long-term solution requires fundamental advances in our knowledge of materials and pathways
to control and manipulate information elements at the limits of energy flow. As we approach
the longer term, we will require ground-breaking advances in device technology going beyond
CMOS (arising from fundamentally new knowledge of control pathways), system architecture
and programming models to allow the energy benefits of scaling to be realized. A complete
workflow will be constructed, linking device models and materials to circuits and then evaluating
these circuits through efficient generation of specialized hardware architectural models such that
advances can be compared for their benefits to ultimate system performance. The architectural
simulations that result from this work will yield better understanding of the performance impact
of these emerging approaches on target applications and enable early exploration of new software
systems that would make these new architectures useful and programmable.
In the longer term, we will expand the modelling framework to include non-traditional
computing models and accelerators, such as neuro-inspired and quantum accelerators, as
components in our simulation infrastructure. We will also develop the technology to automate
aspects of the algorithm/architecture/software environment system co-design process so
developers can evaluate their ideas early in future hardware. Ultimately, we will close the
feedback loop from the software all the way down to the device to make software an integrated
part of this infrastructure.
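The skeleton below sketches what such a closed-loop, device-to-application codesign workflow might look like; the structure and every parameter are illustrative assumptions rather than a description of LBNL's actual framework.

```python
# Schematic sketch (hypothetical structure and numbers) of a device-to-application
# codesign loop: device parameters feed circuit energy/delay estimates, which feed
# an architectural model scored against an application workload.
from dataclasses import dataclass

@dataclass
class Device:
    switch_energy_j: float      # energy per logic transition
    switch_delay_s: float       # intrinsic device delay

def circuit_model(dev: Device, gates_per_op: int):
    # crude linear scaling from device level to circuit level
    return dev.switch_energy_j * gates_per_op, dev.switch_delay_s * gates_per_op

def architecture_model(op_energy, op_delay, ops, parallelism):
    time = ops * op_delay / parallelism
    energy = ops * op_energy
    return time, energy

def evaluate(dev: Device, ops=1e12, parallelism=1e4):
    e, d = circuit_model(dev, gates_per_op=100)
    return architecture_model(e, d, ops, parallelism)

cmos      = Device(switch_energy_j=1e-15, switch_delay_s=1e-11)
candidate = Device(switch_energy_j=1e-17, switch_delay_s=5e-11)   # lower energy, slower switch
for name, dev in [("CMOS", cmos), ("candidate", candidate)]:
    t, e = evaluate(dev)
    print(f"{name}: time={t:.3e} s, energy={e:.3e} J")
```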
5. Conclusion
Semiconductor technology has a pervasive role to play in future energy, economic and technology
security. To effectively meet societal needs and expectations in a broad context, these new
devices and computing paradigms must be economically manufacturable at scale and provide an
exponential improvement path. Such requirements could necessitate a substantial technological
shift analogous to the transition from vacuum tubes to semiconductors. This transition will
require not years, but decades, so whether the semiconductor roadmap has 10 or 20 years of
remaining vitality, researchers must begin now to lay a strategic foundation for change.
Data accessibility. This article has no additional data.
Competing interests. The author declares that he has no competing interests.
Funding. LBNL is supported by the Office of Advanced Scientific Computing Research in the Department of
Energy Office of Science under contract no. DE-AC02-05CH11231.
Acknowledgements. I would like to acknowledge Ramamoorthy Ramesh (LBNL/Berkeley), Dan Armbrust (former CEO of SEMATECH), Shekhar Borkar (Qualcomm), Bill Dally and Larry Dennison (NVIDIA Research) and Keren Bergman of Columbia University for productive discussions about the vision for the future of computing. I would also like to acknowledge the US Office of Science and Technology Policy (OSTP) and John Holdren for commissioning me and Robert Leland (on loan to OSTP from Sandia National Labs) to research and write a report on 'Computing beyond Moore's Law' in 2013, which introduced me to many of the key technology challenges involved.
Disclaimer. The opinions expressed by the author are his own and not necessarily reflective of the official policy
or opinions of the DOE or of LBNL.
References
1. Moore GE. 1965 Cramming more components onto integrated circuits. Electronics 38, 33–35.
(doi:10.1109/N-SSC.2006.4785860)
2. Mack C. 2015 The multiple lives of Moore’s law. IEEE Spectrum 52, 31–31. (doi:10.1109/
MSPEC.2015.7065415)
3. Markov IL. 2014 Limits on fundamental limits to computation. Nature 512, 147–154.
(doi:10.1038/nature13570)
4. Shalf JM, Leland R. 2015 Computing beyond Moore’s law. IEEE Computer 48, 14–23.
(doi:10.1109/MC.2015.374)
5. Colwell RC. 2013 The chip design game at the end of Moore's Law. Hot Chips Symposium keynote, pp. 1–16. See https://www.hotchips.org/wp-content/uploads/hc_
archives/hc25/HC25.15-keynote1-Chipdesign-epub/HC25.26.190-Keynote1-ChipDesignGa
me-Colwell-DARPA.pdf.
6. Thompson N, Spanuth S. 2018 The decline of computers as a general purpose technology:
why deep learning and the end of Moore’s Law are fragmenting computing. SSRN
abstract 3287769. (doi:10.2139/ssrn.3287769)
7. Jouppi NP et al. 2017 In-datacenter performance analysis of a tensor processing unit. In Proc. 44th Annu. Int. Symp. on Computer Architecture (ISCA'17), Toronto, Canada, June, ACM SIGARCH Computer Architecture News 45(2), pp. 1–12. New York, NY: ACM. (doi:10.1145/3079856.3080246)
8. Hsu J. 2016 Nervana systems: turning neural networks into a service. IEEE Spectrum 53, 19.
(doi:10.1109/MSPEC.2016.7473141)
9. Facebook Inc. 2017 Introducing Big Basin: our next-generation AI hardware. See
https://code.facebook.com/posts/1835166200089399/introducing-big-basin-our-next-gener
ation-ai-hardware/.
10. Caulfield A. 2016 A cloud-scale acceleration architecture. In 2016 49th Annu. IEEE/ACM Int.
Symp. on Microarchitecture (MICRO-49), Taipei, Taiwan, 15–19 October, 13pp. New York, NY:
IEEE. (doi:10.1109/MICRO.2016.7783710)
11. Shao YS, Xi SL, Srinivisan V, Wei GY, Brooks D. 2016 Co-designing accelerators
and SoC interfaces using gem5-Aladdin. In 2016 49th Annu. IEEE/ACM Int. Symp. on
Microarchitecture (MICRO-49), Taipei, Taiwan, 15–19 October, 12pp. New York, NY: IEEE.
(doi:10.1109/MICRO.2016.7783751)
12. Shaw DE et al. 2014 Anton 2: raising the bar for performance and programmability in a special-
purpose molecular dynamics supercomputer. In SC’14: Proc. Int. Conf. for High Performance
Computing, Networking, Storage and Analysis, New Orleans, LA, 16–21 November, pp. 41–53.
New York, NY: IEEE. (doi:10.1109/SC.2014.9)
13. Ohmura I, Morimoto G, Ohno Y, Hasegawa A, Taiji M. 2014 MDGRAPE-4: a special-purpose
computer system for molecular dynamics simulations. Phil. Trans. R. Soc. A 372, 20130387.
(doi:10.1098/rsta.2013.0387)
14. Prabhakar R, Zhang Y, Koeplinger D, Feldman M, Zhao T, Hadjis S, Pedram A, Kozyrakis C,
Olukotun K. 2018 Plasticine: a reconfigurable accelerator for parallel patterns. IEEE Micro 38,
20–31. (doi:10.1109/MM.2018.032271058)
15. Johansen H et al. 2014 Software productivity for extreme-scale science. Report on
DOE Workshop. See http://www.orau.gov/swproductivity2014/SoftwareProductivity
WorkshopReport2014.pdf.
16. Asanovic K et al. 2006 The landscape of parallel computing research: a view from Berkeley.
EECS Department, UC Berkeley. Technical Report No. UCB/EECS-2006-183. See http://
www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf.
17. Miller DAB, Ozaktas HM. 1997 Limit to the bit-rate capacity of electrical interconnects
from the aspect ratio of the system architecture. J. Parallel Distrib. Comput. 41, 42–52.
(doi:10.1006/jpdc.1996.1285)
18. Miller DAB. 2000 Rationale and challenges for optical interconnects to electronic chips. Proc.
IEEE 88, 728–749. (doi:10.1109/5.867687)
19. Horowitz M, Yang CKK, Sidiropoulos S. 1998 High-speed electrical signaling: overview and
limitations. IEEE Micro 18, 12–24. (doi:10.1109/40.653013)
20. Kogge P, Shalf J. 2013 Exascale computing trends: adjusting to the ‘new normal’ for computer
architecture. Comput. Sci. Eng. 15, 16–26. (doi:10.1109/MCSE.2013.95)
21. Unat D et al. 2017 Trends in data locality abstractions for HPC systems. IEEE Trans. Parallel
Distrib. Syst. 28, 3007–3020. (doi:10.1109/TPDS.2017.2703149)
22. Unat D, Shalf J, Hoefler T, Dubey A, Schulthess T. 2014 PADAL: Programming Abstractions
for Data Locality Workshop Series. See http://www.padalworkshop.org/.
23. Meyer H, Sancho JC, Quiroga JV, Zyulkyarov F, Roca D, Nemirovsky M. 2017 Disaggregated
computing. An evaluation of current trends for datacentres. Procedia Comput. Sci. 108, 685–694.
(doi:10.1016/j.procs.2017.05.129)
24. Taylor J. 2015 Facebook’s data center infrastructure: open compute, disaggregated rack, and
beyond. In 2015 Optical Fiber Communications Conf. and Exhibition (OFC), Los Angeles, CA,
22–26 March, 1p. New York, NY: IEEE. See https://ieeexplore.ieee.org/abstract/document/
7121902.
25. Tokunari M, Hsu HH, Toriyama K, Noma H, Nakagawa S. 2014 High-bandwidth density and
low-power optical MCM using waveguide-integrated organic substrate. J. Lightwave Technol.
32, 1207–1212. (doi:10.1109/JLT.2013.2292703)
26. Bergman K. 2018 Empowering flexible and scalable high performance architectures
with embedded photonics. In 2018 IEEE Int. Parallel and Distributed Processing Symp.
(IPDPS), Vancouver, BC, Canada, 21–25 May, p. 378. New York, NY: IEEE. (doi:10.1109/
IPDPS.2018.00047)
27. Michelogiannakis G, Wilke J, Teh MY, Glick M, Shalf J, Bergman K. 2019 Challenges and
opportunities in system-level evaluation of photonics. In Metro and Data Center Optical
Networks and Short-Reach Links II, SPIE OPTO, San Francisco, CA, 2–7 February, Proc. SPIE
10946. (doi:10.1117/12.2510443)
28. Nikonov DE, Young IA. 2013 Overview of beyond-CMOS devices and a uniform methodology
for their benchmarking. Proc. IEEE 101, 2498–2533. (doi:10.1109/JPROC.2013.2252317)