New Advances in ONFI (Open NAND Flash Interface)
Knut Grimsrud
Intel Fellow, Director of Storage Architecture MEMS002
© 2007 Intel Corporation
Agenda
PART 1
- Flash opportunities in IA
- Further improvement options
- NAND interface performance
PART 2
- ONFI 2.0 overview
- High Speed NAND
- Block Abstracted (BA) NAND
- Connector and module
Flash Opportunities in IA
Flash provides substantial performance, responsiveness, and power savings benefits
[Demo timing: elapsed time = 2.26 s (NVM-enhanced system) vs. 6.79 s (HDD baseline system)]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Dispelling Performance Myths
Raw single-component flash transfer rates are not faster than HDD today
- Mainstream flash: about 40 MB/s reads
- Mainstream HDD: about 75 MB/s reads
Raw flash access time is much faster than HDD
- Mainstream flash: <100 µs
- Mainstream HDD: several milliseconds
[Chart: IOPS performance breakdown for an HDD running a sample workload (75 MB/s media rate, 3 Gbps interface), 16 KB transfers. Total = 4.7 ms: mechanical latency 4.4 ms (~94%), media transfer 220 µs, interface transfer 55 µs.]
Transfer time accounts for insignificant fraction of actual disk service time. Latency is dominant factor.
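As a quick sanity check on the chart's figures (my arithmetic, not from the slide), the three components do sum to the quoted total, and mechanical latency alone is about 94% of it:

$$T_{\text{service}} \approx 4.4\,\mathrm{ms} + 220\,\mu\mathrm{s} + 55\,\mu\mathrm{s} \approx 4.7\,\mathrm{ms}, \qquad \frac{4.4\,\mathrm{ms}}{4.7\,\mathrm{ms}} \approx 94\%$$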
Latency vs Transfer Rate Comparison
[Chart: read service time (ms) vs. transfer size (sectors, 0-64).
- 3.5" desktop drive (250 GB, 7200 RPM): transfer rate = 51.0 MB/s, total latency = 13.6 ms
- Platform NVM, low-level performance: transfer rate = 42.4 MB/s, total latency = 0.055 ms
- Crossover is 3405 sectors (1.74 MB)]
Although streaming rate is slightly lower than HDD, realized performance is much better for modest sizes
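A first-order service-time model makes the crossover concrete (the symbols $s$ for transfer size, $L$ for total latency, and $R$ for media transfer rate are mine, not from the slide): each device's service time is latency plus transfer time, so below the crossover size the low-latency device wins despite its lower streaming rate.

$$T(s) = L + \frac{s}{R}, \qquad s^{*} = \frac{L_{\mathrm{HDD}} - L_{\mathrm{NVM}}}{\frac{1}{R_{\mathrm{NVM}}} - \frac{1}{R_{\mathrm{HDD}}}}$$

The chart's quoted crossover reflects the drive's measured profile rather than this simple linear model.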
Resulting Flash Platform Impact
[Chart: HDD baseline system vs. NVM-enhanced system performance]
Effective data rate for the flash solution is 7X higher than HDD
Further Improvement Option 1
Algorithm improvements and cache size increases can improve performance by further reducing disk accesses
Further Improvement Option 2
For high hit-rates, improve performance further by decreasing cache hit time
Sample Workload Breakdown
[Chart: sample workload I/O time breakdown (0-100% of I/O time), split into disk time and cache time. Option 1 improves the disk component; Option 2 improves the cache component.]
Best approach for further performance improvement is improving cache hit times
Current Flash Performance
[Diagram: NAND read timing, with tR ≈ 20 µs array access time and RE# frequency ≈ 40 MHz for data output]
Flash performance consists of two primary elements:
- Time to transfer data between the array and the page register (tR)
- Time to transfer data between the page register and the host (RE# cycle time)
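A rough worked figure (my arithmetic, assuming a 2 KB page and one byte per 25 ns RE# cycle at the quoted 40 MHz) shows why the interface transfer, not the array access, dominates:

$$t_{\mathrm{interface}} = 2048 \times 25\,\mathrm{ns} \approx 51\,\mu\mathrm{s} \gg t_R \approx 20\,\mu\mathrm{s}$$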
Sample Workload Additional Breakdown
[Chart: sample workload I/O time breakdown (0-100% of I/O time), split into flash interface transfer time (NAND Xfer), flash array access time (NAND Acc), and remaining disk access time (Disk). The flash interface transfer is the largest component.]
Largest performance improvement potential from NAND interface improvements
Summary
NAND Flash shows tremendous promise for accelerating compute applications
NAND access latency is its strong suit
Performance can be further improved with two approaches
- Increase cache hit rate to further reduce disk accesses
- Increase NAND performance to reduce cache hit times
The largest improvement potential comes from increasing NAND Flash interface performance
Flash interface is the largest remaining component of the I/O time breakdown
New Advances in ONFI (Open NAND Flash Interface)
Amber Huffman
Principal Engineer MEMS002
© 2007 Intel Corporation
NAND Interoperability Before ONFI
Prior to the formation of ONFI in 2006, NAND was the only commodity memory with no standard interface. Basic NAND commands are similar amongst vendors:
- Read, Program, Erase, Reset, Read Status
- Supporting a few vendors is often easy; supporting all is more difficult
- Timings vary from vendor to vendor
- Enhanced NAND commands vary widely
[Timing diagrams: Read Cache protocol variants on IOx / R/B#. Variant 1: 00h command, addresses C1 C2 R1 R2 R3, then 31h; busy for tR; data D0..Dn. Variant 2: 00h command, addresses C1 C2 R1 R2 R3, then 30h; busy for tR; then 31h with a shorter tRCBSY busy before data D0..Dn.]
ONFI 1.0 Overview
ONFI 1.0 Defines
- Uniform NAND electrical and protocol interface: a raw NAND component interface for embedded use, including timings, electricals, and protocol
- Standardized base command set
- Uniform mechanism for the device to report its capabilities to the host
ONFI 1.0 Status
- Ratified specification in December 2006: delivered the spec in less than 8 months!
ONFI 1.0 establishes a standard interface for NAND
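As an illustration of the capability-reporting mechanism, here is a minimal sketch of reading the ONFI parameter page. The ECh opcode and the ASCII "ONFI" signature come from the ONFI 1.0 specification; the nand_* bus accessors are hypothetical placeholders for a platform's raw NAND bus driver.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

/* Hypothetical low-level bus accessors (platform-specific) */
void    nand_cmd(uint8_t op);       /* drive one command (CLE) cycle */
void    nand_addr(uint8_t byte);    /* drive one address (ALE) cycle */
void    nand_wait_ready(void);      /* poll R/B# until the part is ready */
uint8_t nand_read_byte(void);       /* one RE# data cycle */

/* Read the parameter page and verify the ONFI signature. */
bool onfi_read_parameter_page(uint8_t *buf, size_t len)
{
    nand_cmd(0xEC);                 /* Read Parameter Page (ONFI 1.0) */
    nand_addr(0x00);
    nand_wait_ready();              /* wait tR */
    for (size_t i = 0; i < len; i++)
        buf[i] = nand_read_byte();
    /* The parameter page begins with the ASCII signature "ONFI" */
    return len >= 4 && memcmp(buf, "ONFI", 4) == 0;
}
```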
Continually Growing Coalition
Members: A-Data, Alcor Micro, Aleph One, Arasan Chip Systems, ATI, Avid Electronics, BitMicro, Biwin Technology, Cypress, DataFab Systems, DataIO, Denali, FCI, Foxconn, Fusion Media Tech, Genesys Logic, Hagiwara Sys-Com, InComm, Intelliprop, ITE Tech, Jinvani Systech, Kingston Technology, Marvell, Molex, NVidia, Orient Semiconductor, Powerchip Semi., PQI, Qimonda, Seagate, Shenzhen Netcom, Sigmatel, Silicon Motion, Silicon Storage Tech, SimpleTech, Skymedi, Smart Modular Tech., Solid State System, Spansion, Super Talent Elec., Telechips, Testmetrix, Tyco, UCA Technology, WinBond
ONFI Delivering Advanced Features
ONFI is building on the foundation established by the 1.0 specification with significant new features:
- High speed NAND definition to dramatically improve the interface transfer rate
- Block abstracted NAND interface to simplify integration of NAND into host platforms
- Connector definition for insertion of raw NAND modules into build-to-order systems
Join ONFI to participate in new feature definition.
Why is the Legacy Interface Stalling?
Issue 1: The legacy interface requires that the NAND process commands in a single cycle, directly impacting the write cycle time
- Example: reads require the NAND to process two commands and five addresses within seven cycles, followed by assertion of busy within 100 ns (a sketch of this sequence follows below)
Issue 2: NAND timing is not source synchronous, making it difficult for the host to know when the data is valid at higher speeds
- Supporting different configurations (e.g., single-die vs. quad-die packages) makes it difficult to latch data cleanly at higher speeds
These issues are reflected in the slowdown of NAND timing improvements: 50 ns I/O timings, then 30 ns (40% faster), then 25 ns (only 17% faster)
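For reference, a minimal sketch of the legacy asynchronous page read sequence described in Issue 1. The 00h/30h opcodes are the standard NAND page read command pair; the nand_* helpers are the same hypothetical bus accessors as in the earlier parameter-page sketch.

```c
#include <stdint.h>

void nand_cmd(uint8_t op);           /* hypothetical: command (CLE) cycle */
void nand_addr(uint8_t byte);        /* hypothetical: address (ALE) cycle */
void nand_wait_ready(void);          /* hypothetical: poll R/B# */

/* Two command cycles plus five address cycles, each of which the NAND
 * must process at bus speed, before the part goes busy for tR. */
void legacy_page_read(uint16_t col, uint32_t row)
{
    nand_cmd(0x00);                  /* read setup */
    nand_addr(col & 0xFF);           /* C1: column address low */
    nand_addr((col >> 8) & 0xFF);    /* C2: column address high */
    nand_addr(row & 0xFF);           /* R1: row address low */
    nand_addr((row >> 8) & 0xFF);    /* R2 */
    nand_addr((row >> 16) & 0xFF);   /* R3 */
    nand_cmd(0x30);                  /* read confirm: array access begins */
    nand_wait_ready();               /* R/B# asserted busy for tR */
}
```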
Delivering Higher Speed
The Path to Higher Speed
- Step 1: Go source synchronous: add source-synchronous data strobes
- Step 2: Learn from DRAM: the lessons of DDR can take us far
- Step 3: Easy transition: break apart the command phase and data phase

ONFI Interface Rate Roadmap: Legacy 40 MB/s, Gen1 ~133 MB/s, Gen2 ~266 MB/s, Gen3 400 MB/s+
Going Source Synchronous
I/O[7:0]: Data/address bus
- Renamed DQ to align with DRAM DDR naming conventions
DQS: Data strobe
- The only new signal for the first generation of high speed
- The strobe indicates where data should be latched
WE#: Write enable becomes a source synchronous clock, CLK
- CLK is used for all interface transfers
RE#: Read enable becomes a direction signal, W/R#
- No longer used to latch read data
- Indicates the owner of the DQ bus and the DQS signal

Traditional   Source synchronous   Type    Description
I/O[7:0]      DQ[7:0]              I/O     Data inputs/outputs
(new)         DQS                  I/O     Data strobe
WE#           CLK                  Input   Write enable => Clock
RE#           W/R#                 Input   Read enable => Write / Read# direction
Adopting DDR Protocol
High speed NAND uses a DDR protocol
- DQS identifies the start of a data byte on the DQ bus
- Data is latched on each edge of DQS (rising and falling)
Value of having a data strobe:
- Eliminates the uncertainty of the clock insertion delay across vendors
- Makes the design more robust to noise, since the strobe and the data are impacted by noise events together
- Easier to deal with different loading (single-die vs. quad-die)
VccQ and Lower Power
With increased interface speed comes increased power consumption
- NAND is targeted at low power applications, so it is important to optimize for power
Solution: scale the I/O voltage (VccQ) lower
- For a CMOS-based I/O buffer, most of the power consumption comes from the driver swinging the output from 0 V to VccQ
- The power consumption per data lane is governed by P = C × VccQ² × f
- For an 8-bit data bus, lowering VccQ to 1.8 V can save over 600 mW for worst-case I/O patterns (see the table and scaling check below)
- Recommendation: scale NAND VccQ along the lines of DRAM (DDR2 = 1.8 V VccQ, DDR3 = 1.5 V VccQ)
8-bit I/O Power

             VccQ = 3.3V   VccQ = 1.8V   VccQ = 1.5V
50 MHz       218 mW        64 mW         45 mW
100 MHz      435 mW        129 mW        90 mW
150 MHz      653 mW        194 mW        135 mW
200 MHz      871 mW        259 mW        180 mW
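A quick check of the table's quadratic voltage scaling (my arithmetic) at 200 MHz:

$$871\,\mathrm{mW} \times \left(\frac{1.8\,\mathrm{V}}{3.3\,\mathrm{V}}\right)^{2} \approx 259\,\mathrm{mW}$$

a saving of roughly 612 mW, consistent with the "over 600 mW" figure above.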
Block Abstracted NAND to enable Broader NAND Use
NAND may have ECC or other management requirements that are beyond the host's capabilities. Block Abstracted (BA) NAND allows a controller to be inserted in the middle that abstracts some of the complexities of NAND.
[Diagram: Without abstraction, a host ONFI controller (3-bit ECC) connects directly to ONFI NAND (2-bit ECC). With Block Abstracted NAND, a BA controller (8-bit ECC) sits between the host's ONFI controller (3-bit ECC) and ONFI NAND that requires stronger correction (4-bit ECC).]
Block Abstracted Details
Block abstracted NAND uses the same physical interface as raw NAND
- May also use the high speed interface
The command set abstracts the NAND to look more like a hard drive
- Uses LBAs rather than NAND pages
- BA NAND command set: LBA Read (Continue), LBA Write (Continue), LBA Deallocate, LBA Flush, Read Status, Read ID, Read Parameter Page, Get Features, Set Features, Reset
The block abstracted NAND controller manages bad blocks, performs wear leveling and ECC, etc.
- All the vagaries of NAND management may be avoided by the host (a sketch of the host-side contrast follows)
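A hedged sketch of that host-side contrast. All function names here are hypothetical illustrations of the interface shape, not the BA NAND wire protocol: raw NAND forces the host to think in device geometry, while BA NAND looks like a disk.

```c
#include <stdint.h>

/* Raw NAND: the host must map data onto (block, page) geometry, skip
 * bad blocks, wear-level, and run ECC itself. */
int raw_nand_read_page(uint32_t block, uint32_t page, uint8_t *buf);

/* BA NAND: the host addresses logical sectors, as with a hard drive;
 * the module's controller hides bad blocks, wear leveling, and ECC. */
int ba_nand_lba_read(uint64_t lba, uint32_t sector_count, uint8_t *buf);

/* Example: read 4 KB (eight 512-byte sectors) starting at LBA 1000:
 *     ba_nand_lba_read(1000, 8, buf);
 * No page size, block size, or spare-area knowledge is required. */
```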
NAND in the Platform
NAND in the platform started with modules plugged in over PCIe (e.g., Intel Turbo Memory attached to the chipset). As NAND becomes more prevalent, the controller will be integrated into the platform
- Down on the motherboard, or at higher levels of integration
OEMs want to offer customers capacity/feature choice, so NAND will remain on a module. Issue: how do you plug a NAND-only module into a PC platform?
- NAND does not talk PCIe*
Connector for NAND-only Modules
To offer capacity choice, ONFI is defining a standard connector
- Enables OEMs to sell NAND on a module, like an unbuffered, unregistered DIMM
The ONFI connector effort is leveraging existing DRAM standards:
- Avoids major connector tooling costs
- Re-uses electrical verification
- Ensures low cost with quick time to market
Both right-angle and vertical entry form factors are being delivered
Summary
ONFI 1.0 has established a standard interface for NAND; ONFI 2.0 is adding significant new features on this foundation:
- High speed NAND definition to dramatically improve the interface transfer rate
- Block abstracted NAND interface to simplify integration of NAND into host platforms
- Connector definition for insertion of raw NAND modules into systems for late-binding configurations
Join the ONFI Workgroup to get involved in these exciting new development activities!
Additional sources of information on this topic:
More web based info: [Link]
This session's presentation (PDF) is available from the [Link]/idf web site under Technical Training. Some sessions will also provide audio-enabled presentations after the event.
Please fill out the Session Evaluation Form
Thank you for your input; we use it to improve future Intel Developer Forum events. Save the date for IDF this fall:
- San Francisco, USA: September 18-20, 2007
- Taipei, Taiwan: October 15-16, 2007