Literature Review - Intelligent Memory

Sinclair Yeh

Writer’s comment: Writing has always challenged me. Everyone I knew told me I was no good at composition. In fact, when my parents first heard this paper was selected for Prized Writing, their reaction was, “This is an English paper? We didn’t know you could write!” Besides, why would an engineer need to know how to write anyway? A drastic change in my attitude toward writing occurred spring quarter, 1999, the quarter I took English 104E (Scientific Writing) with Pam Demory. I came to realize that being able to communicate technical knowledge is as important as possessing such technical knowledge. And through Pam’s encouragement, I gained more confidence in my writing ability. After many helpful suggestions from Pam Demory and Larry Greer, I finally gathered the courage to submit my review to the Prized Writing contest. I am grateful for their assistance along the way; this review would not have come out so successfully without them.
- Sinclair Yeh

Instructor’s comment: The review paper I assign in English 104E: Scientific Writing, is a very difficult one to write. It requires searching scientific journals for recent articles on a specific topic, reading and assessing the articles, and finally writing a paper that not only reports and synthesizes the research but helps readers understand the significance of the material in some broader context. Sinclair succeeded admirably with his paper on intelligent memory models. Not only does the paper reflect an obvious understanding of the complexities of the technology under review, it does so in remarkably clear prose. Sinclair obviously took to heart one of the central tenets of this course, that technical material aimed at a technical audience can be clearly written.
- Pamela Demory, English Department

Abstract

     The growing processor-memory performance gap creates a bottleneck in the system: the memory system cannot supply enough data to keep the processor busy. Until this bottleneck is resolved, faster processors can do little to improve overall system performance. Intelligent memory is a new memory/system architecture that aims to resolve this bottleneck.
     There are four intelligent memory models with published results: Active Pages, CRAM, PPRAM, and IRAM. Despite their architectural differences, they all agree to put processing elements physically closer to the memory, lifting the bottleneck by increasing processor-memory data bandwidth.
     Initial studies of these four models have shown promising results. However, in order for these academic ideas to become a reality, intelligent memory researchers must study how their models can be cost-effectively integrated into commercial computer systems.

Introduction

     Microprocessor and DRAM (Dynamic Random Access Memory) technology are headed in different directions: the former increases in speed while the latter increases in capacity. This technological difference has led to what is known as the Processor-Memory Performance Gap. This performance gap, which is growing at about 50% per year, creates a serious bottleneck to the overall system performance [Pat97].
     The problem boils down to efficient transfer of data between the microprocessor and the memory system. This is a two-fold problem. First, the CPU is running several orders of magnitude faster than the memory system and therefore the memory system cannot supply data fast enough. Second, the data bus is the main data transfer medium and is shared among many devices. In other words, the memory system has to compete with other devices to get data to the CPU, making the transfer time even worse.
     This performance gap results in a data-starved CPU that wastes valuable computing power in idle cycles. Researchers have explored many ways to narrow this ever-growing gap, from smarter bus controllers to smarter memories. Here, I will cover intelligent memory, a new architecture that uses smarter memories to maximize the efficiency of the CPU and the data transfer media.
     This paper considers four intelligent memory models: Active Pages, Computational RAM (CRAM), Parallel Processing RAM (PPRAM), and Intelligent RAM (IRAM). Some of these models have a number of possible prototype implementations and only those with published results will be discussed here.
     The next section gives architectural descriptions of the intelligent memory models. Each of the four models has a working prototype. Performance measurements of these prototypes are evaluated in the Performance section. Finally, the conclusion discusses the feasibility of integrating these models into a real-world system.

Architectural Description

Active Pages
     Improvements in DRAM technology make it possible to produce denser chips. In other words, 64 megabytes of RAM that occupy 50 mm² on a silicon die now may take up only 10 mm² in 2 years. Instead of packing more memory into the 40 mm² of freed silicon, researchers from UC Davis propose to put logic blocks into this area [Osk98].
     The Active Pages model divides all the memory in a DRAM chip into equally sized pages and assigns a logic block to each page. This logic block can be built in a number of ways: as a simple processor, as a piece of reconfigurable fabric, etc. Regardless of how this logic block is built, it has only enough die area for simple circuitry. Therefore, more complex operations such as floating-point arithmetic have to be done at the CPU. Since a typical application contains both simple and complex operations, task breakdown, or “task partitioning,” is necessary and is done at the software level using a number of specific functions of Active Pages.
     Under the Active Pages model, the CPU becomes a dispatcher that rapidly assigns tasks to all available Active Pages in the system. These pages then do computations in parallel and report the results back to the CPU if necessary. Since only complex operations are sent to the CPU, the data bus is no longer a bottleneck. As a result, the CPU wastes less time in idle cycles.
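     The dispatcher role described above can be illustrated with a minimal Python sketch. It is not the Active Pages programming interface from [Osk98]: the class `ActivePage`, the functions `partition` and `cpu_dispatch`, and the count-matching task are all hypothetical names chosen for illustration, and a thread pool merely stands in for the page-level logic blocks working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


class ActivePage:
    """Hypothetical model of one memory page plus its small logic block."""

    def __init__(self, data):
        self.data = list(data)

    def count_matches(self, key):
        # A "simple" operation that page-local logic could handle:
        # compare each stored value with the key and count the hits.
        return sum(1 for value in self.data if value == key)


def partition(memory, page_size):
    """Split memory into equally sized pages, one logic block per page."""
    return [ActivePage(memory[i:i + page_size])
            for i in range(0, len(memory), page_size)]


def cpu_dispatch(pages, key):
    """The CPU hands the same task to every page, then combines the results."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(lambda page: page.count_matches(key), pages))
    # Only the small per-page results travel back to the CPU,
    # not the contents of the pages themselves.
    return sum(partial_counts)


memory = [3, 7, 3, 1, 3, 9, 7, 3]
pages = partition(memory, page_size=2)
total = cpu_dispatch(pages, key=3)
```

     In this sketch the data never leaves its page; only one integer per page crosses back to the dispatcher, which is the property the model relies on to relieve the data bus.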

CRAM
     Computational RAM is architecturally very similar to Active Pages in that it also puts processing blocks into DRAM chips. The difference is that the processing blocks in CRAM are simpler and more numerous than those in Active Pages.
     The CRAM model was spawned from an interesting observation made by D. Elliott at the University of Toronto: memory chips have an enormous internal bandwidth, which can be as high as 1.1 terabytes per second. However, the connection pins on the memory chips allow for a maximum of 270 megabytes per second, making the external bandwidth roughly 4000 times lower than the internal bandwidth. CRAM is a model that uses the internal bandwidth to speed up computation while making as little modification to the existing DRAM chips as possible [Ell92].
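     The factor of roughly 4000 follows directly from the two bandwidth figures quoted above. The short calculation below reproduces it, assuming decimal prefixes (1 terabyte = 10^12 bytes, 1 megabyte = 10^6 bytes); the variable names are chosen only for this illustration.

```python
# Figures quoted in the text, in bytes per second (decimal prefixes assumed).
internal_bandwidth = 1.1e12   # 1.1 terabytes per second inside the chip
external_bandwidth = 270e6    # 270 megabytes per second through the pins

ratio = internal_bandwidth / external_bandwidth
print(f"internal/external bandwidth ratio: {ratio:.0f}")
```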
     Currently, DRAM stores data in rows and columns of small capacitors. To get data on and off the memory chip, a set of wires is connected to the capacitors in a way that allows an entire row to be accessed at once. The problem is that these small capacitors store so little charge that the internal wiring will absorb the signals before they get out of the chip. To solve this problem, DRAM designers place “sense amplifiers” on internal wires to boost the output signals.
     In order to take advantage of DRAM’s high internal bandwidth, CRAM researchers assign a processing element (PE) to every sense amplifier. At a high level, this PE consists of a Single Instruction Multiple Data (SIMD) processor and a set of buses for data transfer. The processor has two 1-bit registers, X and Y, and an arbitrary-function Arithmetic Logic Unit (ALU). This ALU accepts two inputs and places a 1-bit output onto the result bus. One of the two ALU input ports (call it ‘A’ for ease of reference) carries three bits, one each from register X, register Y, and the sense amplifier. The other input port, ‘B’, carries an 8-bit instruction from the global bus. Essentially, the ALU is a multiplexer in which ‘A’ is the 3-bit selector that controls which one of the 8 bits from ‘B’ is placed onto the result bus.
     Like Active Pages, CRAM does computations in the memory chips to bypass the data bus bottleneck. Most of the CPU power that a typical system would waste in idle cycles is instead spent dispatching work to the CRAMs.
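     The multiplexer view of the ALU can be made concrete with a small Python sketch. It treats the 8-bit instruction as a truth table indexed by the three selector bits, which is why a single ALU can compute any function of its three inputs. The function names (`pe_alu`, `make_opcode`, `simd_step`) and the selector bit order (X as the high bit, sense amplifier as the low bit) are assumptions of this sketch, not details taken from [Ell92].

```python
def pe_alu(x, y, m, opcode):
    """Model one CRAM processing element's ALU as an 8-to-1 multiplexer.

    x, y   : the two 1-bit registers
    m      : the 1-bit value from the sense amplifier
    opcode : the 8-bit instruction from the global bus, read as a truth table
    The selector bit order (x high, m low) is an assumption of this sketch.
    """
    selector = (x << 2) | (y << 1) | m   # 3-bit input port 'A'
    return (opcode >> selector) & 1      # pick one of the 8 bits on port 'B'


def make_opcode(func):
    """Build the 8-bit truth table for any 3-input boolean function."""
    opcode = 0
    for selector in range(8):
        x, y, m = (selector >> 2) & 1, (selector >> 1) & 1, selector & 1
        opcode |= (func(x, y, m) & 1) << selector
    return opcode


def simd_step(pes, opcode):
    """SIMD: every PE applies the same broadcast opcode to its own (x, y, m)."""
    return [pe_alu(x, y, m, opcode) for (x, y, m) in pes]


AND_XY = make_opcode(lambda x, y, m: x & y)    # ignores the memory bit
SUM = make_opcode(lambda x, y, m: x ^ y ^ m)   # sum bit of a 1-bit full adder

pes = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
and_results = simd_step(pes, AND_XY)
sum_results = simd_step(pes, SUM)
```

     Because the opcode is simply a lookup table, changing the instruction on the global bus changes the function computed by every PE at once, with no change to the hardware.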

PPRAM
     Similar to Active Pages and CRAM, Parallel Processing RAM assigns a logic block to a memory block. However, while the previous two models make minimal modifications to the traditional system and memory hierarchy, PPRAM proposes radical alterations [Mur97].
     A PPRAM system is made up of multiple PPRAM nodes and these nodes have 3 components: a logic block, a memory block, and a communication block. The logic block can be anything from a general purpose processor to an I/O controller. Each logic block is accompanied by a memory block made up of conventional memory, for example DRAM, SRAM, etc. The communication block takes care of inter-node communication with a special protocol.
     PPRAM’s communication scheme is not the only modification that sets it apart from a traditional system. A PPRAM system is highly customizable; in theory, it is possible to build a system using multiple CPU nodes, I/O nodes, and graphics nodes to suit different applications. Ideally, these components can work on an application in parallel; for example, a system with 100 CPU nodes could run roughly 100 times faster than a system with one CPU node.
     Because of the alterations in the fundamental system architecture, the data bus bottleneck is no longer present in a PPRAM system. As in the previous two models, each processing block is physically located next to its memory. Therefore the overall inter-component data traffic is significantly reduced.

IRAM
     A less aggressive modification to the traditional system architecture than PPRAM, UC Berkeley’s Intelligent RAM model merges the processor, cache, and main memory into one chip [Pat97].
     A high-speed bus connects the CPU to the level 1 cache. The on-chip main memory offers performance comparable to that of a level 2 cache, so one level of cache is sufficient. Between the level 1 cache and main memory (DRAM) is the Memory Interface Unit, which IRAM uses to tap into the enormous internal bandwidth of the DRAM.
     The IRAM researchers realized, as D. Elliott did, that the DRAM internal bandwidth can provide a large amount of data. Therefore, not only does the Memory Interface Unit eliminate the data bus bottleneck, it also reduces the chance of a data-starved CPU.
     In theory, all four intelligent memory models will lessen the processor-memory performance gap. I will examine the prototype results in the next section.

Performance

Active Pages
     The first prototype built under the Active Pages model is the Reconfigurable Architecture DRAM (RADram) which implements the logic block in Active Pages with a piece of reconfigurable fabric made up of Field Programmable Gate Arrays (FPGA). This piece of FPGA fabric can be programmed at run time to execute a small portion of an application [Osk98].
     The RADram results were obtained as follows. First, a simulator for the RADram system was written. Then, several applications were run on the simulator and their execution times recorded. After that, the same set of applications was run on a conventional memory system. Finally, the execution times on the conventional memory system were divided by those on the RADram system to get speedups. Speedups ranging from 1X to 1000X were recorded. Most applications tended to perform better as the data size increased.
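     The speedup calculation in the last step is a simple ratio, sketched below in Python. The application names and timings are placeholders invented for illustration; they are not measurements from the RADram study, and were chosen only so that the two results fall at the ends of the reported 1X to 1000X range.

```python
def speedup(conventional_time, radram_time):
    """Speedup = execution time on the conventional system / time on RADram."""
    if radram_time <= 0:
        raise ValueError("execution time must be positive")
    return conventional_time / radram_time


# Placeholder timings in seconds (conventional, RADram), invented for
# illustration only; they are not results from [Osk98].
timings = {
    "app_small_data": (2.0, 2.0),
    "app_large_data": (500.0, 0.5),
}

speedups = {name: speedup(conv, rad) for name, (conv, rad) in timings.items()}
```

     A speedup of 1 means the RADram system ran no faster than the conventional one; values above 1 mean the RADram system finished sooner.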

CRAM
     While RADram can achieve up to 1000X speedups over a conventional system on some applications, the CRAM researchers report speedups of over 1000X on all of their selected applications [EllWeb]. However, these applications all have very high levels of parallelism, allowing CRAM to work on different portions of the applications simultaneously.
     CRAM results are obtained through a simulator. A set of ten applications is run on this simulator and their execution times recorded. Timing data for the same set of applications are also obtained from a 75 MHz SUN SparcStation-5. All of the selected applications showed significant speedup, ranging from 1252X to 41391X [EllWeb]. Compared with CRAM’s, PPRAM’s speedups are far smaller.

PPRAM
     As RADram is an Active Pages prototype, PPRAMR is a PPRAM prototype which consists of multiple processing elements that use 32-bit RISC processors to implement the logic blocks. These RISC processors are less powerful than the conventional superscalar processors. However, the idea here is to achieve better performance by exploiting parallelism within applications using multiple simple processors.
     PPRAMR performance measurements are taken in the same fashion as those taken in the previous two models, that is, from a simulator. However, instead of comparing PPRAM’s performance to a conventional system, its performance is compared to two types of “future” systems, namely, a Multiple-Powerful-processors with Cache only system (MPC), and a Single-Powerful-processor with Main memory system (SPM). The results show that PPRAMR has a maximum speedup of 1.41X over MPC and 2.22X over SPM on five selected applications. The goal here is to show that the PPRAM model is a viable alternative for future systems.

IRAM
     IRAM uses a different method to obtain performance measurements. Unlike the previous three prototypes where data are recorded from simulators, IRAM predicts performance measurements from mathematical formulas. There are two steps to this process: The first is to get application execution information, for example clock cycles, from a commercial processor. The second step is to use mathematical formulas to determine how IRAM would affect the execution information gathered [Bow97].
     The Alpha and Pentium Pro are two processors with built-in hardware counters and they are the two processors used for performance predictions. From these performance predictions the IRAM researchers determined that IRAM performs worse than a conventional system on computation-intensive applications, and performs 1.5 to 2 times better than a conventional system on memory-intensive applications.
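     The two-step prediction process can be sketched in Python under loudly stated assumptions. The formula below is not taken from [Bow97]; it is a hypothetical model in which measured execution cycles are split into compute cycles and memory-stall cycles, the compute part is scaled by an assumed `logic_slowdown`, and the stall part is divided by an assumed `memory_speedup`. The cycle counts and both factors are invented for illustration.

```python
def predicted_speedup(compute_cycles, memory_stall_cycles,
                      logic_slowdown, memory_speedup):
    """Predict IRAM speedup from counter data gathered on a conventional CPU.

    logic_slowdown : >1 if the processor logic runs slower in IRAM (assumed)
    memory_speedup : >1 if on-chip DRAM shortens memory stalls (assumed)
    This formula is a hypothetical stand-in, not the one used in [Bow97].
    """
    conventional = compute_cycles + memory_stall_cycles
    iram = compute_cycles * logic_slowdown + memory_stall_cycles / memory_speedup
    return conventional / iram


# Hypothetical cycle counts and scaling factors, not taken from [Bow97].
compute_bound = predicted_speedup(9_000, 1_000,
                                  logic_slowdown=1.25, memory_speedup=4.0)
memory_bound = predicted_speedup(3_000, 7_000,
                                 logic_slowdown=1.25, memory_speedup=4.0)
```

     With these made-up inputs the computation-intensive case comes out below 1 (slower than the conventional system) and the memory-intensive case comes out between 1.5 and 2, which matches the qualitative pattern reported above. The sketch shows only why such a pattern can arise, not how the published numbers were derived.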

Summary and Conclusion

     The growing processor-memory performance gap has spawned a number of intelligent memory projects that attempt to improve memory performance. Those that are discussed in this paper are Active Pages, CRAM, PPRAM and IRAM.
     The Active Pages model assigns one logic block to each page of memory. Therefore, more memory means more processing power. This explains why RADram tends to get better speedups on larger data sizes.
     Similar to Active Pages, CRAM makes memory smarter by adding processing elements to it. This model taps the sense amplifier to exploit the large memory internal bandwidth. Initial results have demonstrated tremendous speedups. However, these nearly impossible speedups are only achievable on highly parallelizable applications.
     While Active Pages and CRAM stay somewhat with the traditional architecture, PPRAM calls for a drastic shift in the system paradigm; instead of having CPU, cache, and memory, a PPRAM system consists of a large number of different PPRAM chips communicating through specialized interfaces. Although PPRAM does not approach the more than 41000X speedup of the CRAM model, the results suggest that it is at least as feasible for future systems as any other approach, if not more so.
     Compared to the PPRAM model, the IRAM model requires a less drastic change to the traditional system architecture; IRAM merges memory and processor into one chip and uses a special memory interface unit to maximize memory to processor bandwidth. As expected, this maximization of bandwidth has helped IRAM’s performance on memory-intensive applications.
     All four models have demonstrated performance gain over a range of applications. An interesting project for future research would be to try to run a common set of benchmark programs on the prototypes of these models and see how well they perform against each other.
     Researchers also need to address the commercial feasibility of these four intelligent memory projects. The models share two common problems. The first is that DRAMs and processors are produced on different fabrication lines; one focuses on density and the other focuses on speed. The intelligent memory models discussed here all agree on putting logic and DRAM on the same chip. Invariably, this means merging two technologically different fabrication lines into one, an extremely costly proposal. The second problem is that these models require changes to the current system architecture. This means existing hardware and software will be abandoned, something that the commercial world is not ready to do. I expect the researchers to address these two problems. Until the problems are resolved, intelligent memory will remain an interesting idea in the academic community.

Literature Cited

Bowman N.; Cardwell N.; Kozyrakis C.; Romer C.; Wang H.; “Evaluation of Existing Architectures in IRAM Systems,” Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA ’97, Denver, CO, 1, June 1997.

Elliott D.; “Computational RAM: A Memory-SIMD Hybrid and its Application to DSP,” The Proceedings of the Custom Integrated Circuits Conference, Boston, MA, 3, May 1992.

Elliott D.; “Computational RAM,” http://www.eecg.toronto.edu/~dunc/cram

Murakami K.; Inoue K.; Miyajima H.; “Parallel Processing RAM (PPRAM) (in English),” Japan-Germany Forum on Information Technology, Nov. 1997.

Oskin M.; Chong F.; Sherwood T.; “Active Pages: A Computation Model for Intelligent Memory,” International Symposium on Computer Architecture, Barcelona, 1998.

Patterson D.; Anderson T.; Cardwell N.; Fromm R.; et al.; “A Case for Intelligent DRAM: IRAM,” IEEE Micro, April 1997.