LITERATURE REVIEW - INTELLIGENT MEMORY
Sinclair Yeh
Writer’s comment:
Writing has always challenged me. Everyone I knew told me I was no good
at composition. In fact, when my parents first heard this paper was
selected for Prized Writing, their reaction was, “This is an
English paper? We didn’t know you could write!” Besides, why would an
engineer need to know how to write anyway? A drastic change in my
attitude toward writing occurred spring quarter, 1999, the quarter I
took English 104E (Scientific Writing) with Pam Demory. I came to
realize that being able to communicate technical knowledge is as
important as possessing such technical knowledge. And through Pam’s
encouragement, I gained more confidence in my writing ability. After
many helpful suggestions from Pam Demory and Larry Greer, I finally
gathered the courage to submit my review to the Prized Writing contest.
I am grateful for their assistance along the way; this review would not
have come out so successfully without them.
- Sinclair Yeh
Instructor’s comment:
The review paper I assign in English 104E: Scientific Writing, is a
very difficult one to write. It requires searching scientific journals
for recent articles on a specific topic, reading and assessing the
articles, and finally writing a paper that not only reports and
synthesizes the research but helps readers understand the significance
of the material in some broader context. Sinclair succeeded admirably
with his paper on intelligent memory models. Not only does the paper
reflect an obvious understanding of the complexities of the technology
under review, it does so in remarkably clear prose. Sinclair obviously
took to heart one of the central tenets of this course, that technical
material aimed at a technical audience can be clearly written.
- Pamela Demory, English Department
Abstract
The growing processor-memory performance gap creates a
bottleneck in the system; the memory system cannot supply enough data
to keep the processor busy. Until this bottleneck is resolved, faster
processors can do little to improve overall system performance.
Intelligent memory is a new memory/system architecture that aims to
resolve this bottleneck.
There are four intelligent memory models with published
results: Active Pages, CRAM, PPRAM, and IRAM. Despite their
architectural differences, they all put processing elements
physically closer to the memory, lifting the bottleneck by increasing
processor-memory data bandwidth.
Initial studies of these four models have shown promising
results. However, in order for these academic ideas to become a
reality, intelligent memory researchers must study how their models can
be cost-effectively integrated into commercial computer systems.
Introduction
Microprocessor and DRAM (Dynamic Random Access Memory)
technology are headed in different directions: the former increases in
speed while the latter increases in capacity. This technological
difference has led to what is known as the Processor-Memory Performance
Gap. This performance gap, which is growing at about 50% per year,
creates a serious bottleneck to the overall system performance [Pat97].
The problem boils down to efficient transfer of data between
the microprocessor and the memory system. This is a two-fold problem.
First, the CPU is running several orders of magnitude faster than the
memory system and therefore the memory system cannot supply data fast
enough. Second, the data bus is the main data transfer medium and is
shared among many devices. In other words, the memory system has to
compete with other devices to get data to the CPU, making the transfer
time even worse.
This performance gap results in a data-starved CPU that
wastes valuable computing power in idle cycles. From smarter bus
controllers to smarter memories, a variety of research has been done to
try to lessen this ever-growing gap. Here, I will cover intelligent
memory, a new architecture that uses smarter memories to maximize the
CPU and data transfer media efficiency.
This paper considers four intelligent memory models: Active
Pages, Computational RAM (CRAM), Parallel Processing RAM (PPRAM), and
Intelligent RAM (IRAM). Some of these models have a number of possible
prototype implementations and only those with published results will be
discussed here.
The next section gives architectural descriptions of the
intelligent memory models. Each of the four models has a working
prototype. Performance measurements of these prototypes are evaluated
in Section 3. Finally, the conclusion discusses the feasibility of
integrating these models into a real-world system.
Architectural Description
Active Pages
Improvements in DRAM technology make it possible to
produce denser chips. For example, 64 megabytes of RAM that occupy
50 mm² on a silicon die now may take up only 10 mm² in two years.
Instead of packing more memory into the 40 mm² of freed silicon,
researchers from UC Davis propose to put logic blocks into this area
[Osk98].
The Active Pages model divides up all the memory in a DRAM
chip into equally sized pages and assigns a logic block to each page.
This logic block can be built in a number of ways—as a simple
processor, as a piece of reconfigurable fabric, etc. Regardless of how
this logic block is built, it has only enough die area for simple
circuitry. Therefore, more complex operations such as floating-point
arithmetic have to be done at the CPU. Since a typical application
contains both simple and complex operations, task breakdown, or “task
partitioning,” is necessary and is done at the software level using a
number of specific functions of Active Pages.
Under the Active Pages model, the CPU becomes a dispatcher
that rapidly assigns tasks to all available Active Pages in the system.
These pages then do computations in parallel and report the results
back to the CPU if necessary. Since only complex operations are sent to
the CPU, the data bus is no longer a bottleneck. As a result, the CPU
wastes less time in idle cycles.
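The division of labor described above can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the Active Pages programming interface from [Osk98]: the names ActivePage, run_simple, and dispatch are hypothetical, and a thread pool merely stands in for the per-page logic blocks working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class ActivePage:
    """Toy stand-in for one memory page plus its logic block (hypothetical)."""

    def __init__(self, data):
        self.data = list(data)

    def run_simple(self, threshold):
        # Simple integer work done "in memory": count values above a threshold.
        return sum(1 for value in self.data if value > threshold)


def dispatch(pages, threshold):
    """CPU as dispatcher: give the simple task to every page, then combine."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(lambda p: p.run_simple(threshold), pages))
    # The floating-point step stays on the CPU, as the model requires.
    total_values = sum(len(p.data) for p in pages)
    fraction = sum(partial_counts) / total_values
    return partial_counts, fraction


pages = [ActivePage(range(0, 8)), ActivePage(range(8, 16))]
counts, fraction = dispatch(pages, threshold=9)
print(counts, fraction)
```

Only the small per-page counts cross the bus in this sketch; the page contents never leave the memory side, which is the point of the partitioning.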
CRAM
Computational RAM is architecturally very similar to
Active Pages in that it also puts processing blocks into DRAM chips.
The difference is that the processing blocks in CRAM are simpler and
more numerous than those in Active Pages.
The CRAM model grew out of an observation
made by D. Elliott at the University of Toronto: memory chips have an
enormous internal bandwidth, in some cases as high as 1.1
terabytes per second. However, the connection pins on the memory chips
allow for a maximum of 270 megabytes per second, making the external
bandwidth roughly 4000 times lower than the internal bandwidth. CRAM is
a model that uses the internal bandwidth to speed up computation while
making as little modification to the existing DRAM chips as possible
[Ell92].
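The ratio quoted above can be checked with a short calculation. The sketch assumes decimal units (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes); the variable names are mine.

```python
# Figures as quoted in the text; decimal units are an assumption.
internal_bw = 1.1e12  # bytes per second inside the chip
external_bw = 270e6   # bytes per second through the pins

ratio = internal_bw / external_bw
print(f"internal/external bandwidth ratio: about {ratio:.0f}x")
```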
Currently, DRAM stores data in rows and columns of small
capacitors. To get data on and off the memory chip, a set of wires is
connected to the capacitors in a way that allows an entire row to be
accessed at once. The problem is that these small capacitors store so
little charge that the internal wiring will absorb the signals before
they get out of the chip. To solve this problem, DRAM designers place
“sense amplifiers” on internal wires to boost the output signals.
In order to take advantage of DRAM’s high internal bandwidth,
CRAM researchers assign every sense amplifier to a processing element
(PE). At a high level, this PE consists of a Single Instruction
Multiple Data (SIMD) processor and a set of buses for data transfer.
The processor has two 1-bit registers, X and Y, and an arbitrary
function Arithmetic Logic Unit (ALU). This ALU accepts two inputs and
places the 1-bit output onto the result bus. One of the two ALU input
ports (call it ‘A’ for ease of reference) has three bits, coming from
register X, Y, and the sense amplifier. The other input port, ‘B’,
contains an 8-bit instruction from the global bus. Essentially, the ALU
is a multiplexor with ‘A’ being the 3-bit selector that controls which
one of the 8 bits from ‘B’ gets placed onto the result bus.
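The multiplexor view of the ALU can be made concrete with a small simulation. This is a sketch under stated assumptions rather than the CRAM design itself: the function names are hypothetical, and the ordering of the selector bits (X as the most significant bit, the sense amplifier value m as the least) is my own choice. The 8-bit instruction acts as a truth table, so any function of the three inputs can be encoded in it.

```python
def alu(instruction, x, y, m):
    """8-to-1 multiplexor: (x, y, m) selects one bit of the 8-bit instruction.

    x and y are the PE registers and m is the sense amplifier bit.
    The selector bit ordering is an assumption made for this sketch.
    """
    selector = (x << 2) | (y << 1) | m
    return (instruction >> selector) & 1


def truth_table(fn):
    """Build the 8-bit instruction that makes the ALU compute fn(x, y, m)."""
    instruction = 0
    for selector in range(8):
        x, y, m = (selector >> 2) & 1, (selector >> 1) & 1, selector & 1
        instruction |= (fn(x, y, m) & 1) << selector
    return instruction


# Two example instructions: the sum and carry bits of a 1-bit full adder.
SUM = truth_table(lambda x, y, m: x ^ y ^ m)
CARRY = truth_table(lambda x, y, m: (x & y) | (x & m) | (y & m))

# SIMD behavior: the same instruction is broadcast to every PE in the row.
xs = [0, 1, 1, 0]
ys = [0, 0, 1, 1]
ms = [1, 1, 1, 0]
sums = [alu(SUM, x, y, m) for x, y, m in zip(xs, ys, ms)]
carries = [alu(CARRY, x, y, m) for x, y, m in zip(xs, ys, ms)]
print(hex(SUM), hex(CARRY), sums, carries)
```

Because each PE handles one bit at a time, a multi-bit addition would take several such steps, with one register holding the running carry.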
Like Active Pages, CRAM does computations in the memory
chips to bypass the data bus bottleneck. A majority of the computing
power that would have been wasted in idle cycles in a typical system is
now spent delegating work to the CRAMs.
PPRAM
Similar to Active Pages and CRAM, Parallel Processing RAM
assigns a logic block to a memory block. However, while the previous
two models make minimal modifications to the traditional system and
memory hierarchy, PPRAM proposes radical alterations [Mur97].
A PPRAM system is made up of multiple PPRAM nodes, each of
which has three components: a logic block, a memory block, and a
communication block. The logic block can be anything from a general
purpose processor to an I/O controller. Each logic block is accompanied
by a memory block made up of conventional memory, for example DRAM,
SRAM, etc. The communication block takes care of inter-node
communication with a special protocol.
PPRAM’s communication scheme is not the only modification
that sets it apart from a traditional system. A PPRAM system is highly
customizable; in theory, it is possible to build a system using
multiple CPU nodes, I/O nodes, and Graphics nodes to suit different
applications. Ideally, these components can work on an application in
parallel; for example, a 100-CPU-node system could run roughly 100
times faster than a 1-CPU-node system.
Because of the alterations in the fundamental system
architecture, the data bus bottleneck is no longer present in a PPRAM
system. As in the previous two models, each processing block is
physically located next to its memory. Therefore the overall
inter-component data traffic is significantly reduced.
IRAM
A less aggressive modification to the traditional system
architecture than PPRAM, UC Berkeley’s Intelligent RAM model merges the
processor, cache, and main memory into one chip [Pat97].
A high speed bus connects the CPU to the level 1 cache. The
on-chip main memory offers performance comparable to that of a level
2 cache. Therefore, one level of cache is sufficient. Between the level
1 cache and memory (DRAM) is the Memory Interface Unit, which is what
IRAM uses to tap into the enormous DRAM internal bandwidth.
The IRAM researchers realized, as D. Elliott did, that the
DRAM internal bandwidth can provide a large amount of data. Therefore,
not only does the Memory Interface Unit eliminate the data bus
bottleneck, it also reduces the chance of a data-starved CPU.
In theory, all four intelligent memory models will lessen the
processor-memory performance gap. I will examine the prototype results
in the next section.
Performance
Active Pages
The first prototype built under the Active Pages model is the
Reconfigurable Architecture DRAM (RADram), which implements the logic
block in Active Pages with a piece of reconfigurable fabric made up of
Field Programmable Gate Arrays (FPGAs). This piece of FPGA fabric can be
programmed at run time to execute a small portion of an application
[Osk98].
The RADram results are obtained as follows. First, a
simulator for the RADram system is written. Then, several applications
are run on the simulator and their execution times recorded. After
that, the same set of applications is run on a conventional
memory system. Finally, the execution times on the conventional memory
system are divided by those on the RADram system to get speedups.
Speedups ranging from 1 to 1000X were recorded. Most applications
tended to perform better as the data size increased.
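The speedup figure used throughout this section is just a ratio of execution times. A minimal sketch follows; the application names and timings are placeholders I made up for illustration, not measurements from [Osk98].

```python
def speedup(conventional_time, prototype_time):
    """Speedup = time on the conventional system / time on the prototype."""
    return conventional_time / prototype_time


# Placeholder timings in seconds (conventional, prototype); not real data.
timings = {
    "app_small_data": (2.0, 2.0),
    "app_large_data": (500.0, 0.5),
}

speedups = {name: speedup(conv, proto) for name, (conv, proto) in timings.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.0f}X")
```

A speedup of 1X means the prototype was no faster than the conventional system; values below 1X would mean it was slower.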
CRAM
While RADram can achieve up to 1000X speedups over a
conventional system on some applications, the CRAM researchers report
speedups of over 1000X on all of their selected applications [EllWeb]. However,
these applications all have very high levels of parallelism, allowing
CRAM to work on different portions of the applications simultaneously.
CRAM results are obtained through a simulator. On this
simulator a set of ten applications is run and their execution times
recorded. Timing data for the same set of applications are also
obtained from a 75 MHz Sun SPARCstation 5. All of the selected
applications showed significant speedup, ranging from 1252X to 41391X
[EllWeb]. Compared with CRAM, PPRAM’s speedup is insignificant.
PPRAM
Just as RADram is an Active Pages prototype, PPRAMR is a
PPRAM prototype. It consists of multiple processing elements
that use 32-bit RISC processors to implement the logic blocks. These
RISC processors are less powerful than conventional superscalar
processors. However, the idea here is to achieve better performance by
exploiting parallelism within applications using multiple simple
processors.
PPRAMR performance measurements are taken in the same
fashion as those for the previous two models, that is, from a
simulator. However, instead of being compared with a conventional
system, PPRAMR is compared with
two types of “future” systems, namely, a Multiple-Powerful-processors
with Cache only system (MPC), and a Single-Powerful-processor with Main
memory system (SPM). The results show that PPRAMR has a maximum
speedup of 1.41X over MPC and 2.22X over SPM on five
selected applications. The goal here is to show that the PPRAM model is
a viable alternative for future systems.
IRAM
IRAM uses a different method to obtain performance
measurements. Unlike the previous three prototypes where data are
recorded from simulators, IRAM predicts performance measurements from
mathematical formulas. There are two steps to this process: The first
is to get application execution information, for example clock cycles,
from a commercial processor. The second step is to use mathematical
formulas to determine how IRAM would affect the execution information
gathered [Bow97].
The Alpha and Pentium Pro both have built-in hardware
counters, and they are the two processors used for the performance
predictions. From these performance predictions the IRAM researchers
determined that IRAM performs worse than a conventional system on
computation-intensive applications, and performs 1.5 to 2 times better
than a conventional system on memory-intensive applications.
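To show why an analytic prediction of this kind can come out below 1X for one workload and above 1X for another, here is a deliberately simplified model. It is my own illustration, not the formulas used in [Bow97]: the function names, the split of measured cycles into compute and memory-stall parts, and both scaling factors are assumptions chosen only to demonstrate the idea.

```python
def predicted_time(compute_cycles, stall_cycles, cpu_slowdown, stall_reduction):
    """Scale measured cycle counts to a hypothetical IRAM (illustrative only).

    cpu_slowdown    > 1 models slower logic on a merged memory/processor chip.
    stall_reduction > 1 models shorter memory stalls thanks to on-chip DRAM.
    """
    return compute_cycles * cpu_slowdown + stall_cycles / stall_reduction


def iram_speedup(compute_cycles, stall_cycles, cpu_slowdown=1.3, stall_reduction=5.0):
    """Ratio of measured time to predicted IRAM time (default factors are made up)."""
    baseline = compute_cycles + stall_cycles
    return baseline / predicted_time(
        compute_cycles, stall_cycles, cpu_slowdown, stall_reduction
    )


# Placeholder cycle counts standing in for hardware counter readings.
compute_bound = iram_speedup(compute_cycles=900, stall_cycles=100)
memory_bound = iram_speedup(compute_cycles=300, stall_cycles=700)
print(f"compute-intensive: {compute_bound:.2f}X, memory-intensive: {memory_bound:.2f}X")
```

With these made-up inputs the compute-intensive case falls below 1X and the memory-intensive case rises above it, which mirrors the qualitative pattern reported above.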
Summary and Conclusion
The growing processor-memory performance gap has spawned a
number of intelligent memory projects that attempt to improve memory
performance. Those that are discussed in this paper are Active Pages,
CRAM, PPRAM and IRAM.
The Active Pages model assigns one logic block to each page
of memory. Therefore, more memory means more processing power. This
explains why RADram tends to get better speedups on larger data sizes.
Similar to Active Pages, CRAM makes memory smarter by adding
processing elements to it. This model taps the sense amplifier to
exploit the large memory internal bandwidth. Initial results have
demonstrated tremendous speedups. However, these nearly impossible
speedups are only achievable on highly parallelizable applications.
While Active Pages and CRAM stay somewhat with the
traditional architecture, PPRAM calls for a drastic shift in the system
paradigm; instead of having CPU, cache, and memory, a PPRAM system
consists of a large number of different PPRAM chips communicating
through specialized interfaces. Although PPRAM does not match the more
than 41000X speedup of the CRAM model, the results have shown that it is
at least as feasible an alternative for future systems as any other
approach.
Compared to the PPRAM model, the IRAM model requires a less
drastic change to the traditional system architecture; IRAM merges
memory and processor into one chip and uses a special memory interface
unit to maximize memory to processor bandwidth. As expected, this
maximization of bandwidth has helped IRAM’s performance on
memory-intensive applications.
All four models have demonstrated performance gains over a
range of applications. An interesting project for future research would
be to run a common set of benchmark programs on the prototypes
of these models and see how well they perform against each other.
Researchers also need to address the commercial feasibility
of these four intelligent memory projects. There are two common
problems with these models. The first is that DRAM and processors are
produced on different fabrication lines; one focuses on density and
the other focuses on speed. The intelligent memory models discussed here
all put logic and DRAM on the same chip. Inevitably, this
means merging two technologically different fabrication lines into one,
an extremely costly proposal. The second problem is that these models
require changes to the current system architecture. This means existing
hardware and software will be abandoned, something that the commercial
world is not ready to do. I expect the researchers to address these two
problems. Until they are resolved, intelligent memory will
remain an interesting idea in the academic community.
References
[Bow97] Bowman, N.; Cardwell, N.; Kozyrakis, C.; Romer, C.; Wang, H. “Evaluation of Existing Architectures in IRAM Systems,” Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA ’97, Denver, CO, 1 June 1997.
[Ell92] Elliott, D. “Computational RAM: A Memory-SIMD Hybrid and its Application to DSP,” The Proceedings of the Custom Integrated Circuits Conference, Boston, MA, 3 May 1992.
[EllWeb] Elliott, D. “Computational RAM,” http://www.eecg.toronto.edu/~dunc/cram
[Mur97] Murakami, K.; Inoue, K.; Miyajima, H. “Parallel Processing RAM (PPRAM) (in English),” Japan-Germany Forum on Information Technology, Nov. 1997.
[Osk98] Oskin, M.; Chong, F.; Sherwood, T. “Active Pages: A Computation Model for Intelligent Memory,” International Symposium on Computer Architecture, Barcelona, 1998.
[Pat97] Patterson, D.; Anderson, T.; Cardwell, N.; Fromm, R.; et al. “A Case for Intelligent RAM: IRAM,” IEEE Micro, April 1997.