HDFS RAID (HUG Nov 2010)
Dhruba Borthakur (dhruba@fb.com)
Rodrigo Schmidt (rschmidt@fb.com)
Ramkumar Vadali (rvadali@fb.com)
Scott Chen (schen@fb.com)
Patrick Kling (pkling@fb.com)
Agenda
- What is RAID
- RAID at Facebook
- Anatomy of RAID
- How to Deploy
- Questions
What Is RAID
- Contrib project in MAPREDUCE
- Default HDFS replication is 3; too much at petabyte scale
- RAID helps save space in HDFS
- Reduces replication of "source" data
- Preserves data safety using "parity" data
Reed-Solomon Erasure Codes
[Diagram] A 10-block source file stored with 3-way replication: tolerates 2 missing blocks, storage cost 3x.
[Diagram] The same 10-block source file at replication 1 plus a 4-block parity file (P1-P4): tolerates 4 missing blocks, storage cost 1.4x.
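The storage costs follow directly from the block counts (assuming, as the 1.4x figure implies, that both the source and the parity file sit at replication 1 after raiding): plain replication stores 10 x 3 = 30 blocks for 10 blocks of data (30 / 10 = 3x), while Reed-Solomon RAID stores 10 + 4 = 14 blocks (14 / 10 = 1.4x).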
RAID at Facebook
- Reduces disk usage in the warehouse
- Currently saving about 5PB with XOR RAID
- Gradual deployment: started with a few tables, now used with all tables
- Reed-Solomon RAID under way
Saving 5PB at Facebook
Anatomy of RAID
- Server-side: RaidNode, BlockFixer, block placement policy
- Client-side: DistributedRaidFileSystem, RaidShell
Anatomy of RAID
[Architecture diagram] The RaidNode gets files to raid and obtains missing blocks from the NameNode, and uses the JobTracker to create parity files and fix missing blocks on the DataNodes; the Raid File System on the client recovers files while reading.
RaidNode
- Daemon that scans the filesystem
- Policy file used to provide file patterns
- Generates parity files (single thread or a Map-Reduce job; sketched below)
- Reduces replication of the source file
- One thread to purge outdated parity files (if the source gets deleted)
- One thread to HAR parity files (to reduce inode count)
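As a rough sketch of the raiding step itself (this is not the actual RaidNode code; the ParityWriter interface and class names here are made up for illustration), the loop below walks the files under a source directory, writes a parity file for each one, and then lowers the source replication through the standard Hadoop FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaidPolicySketch {
      // Hypothetical stand-in for parity generation (single thread or a Map-Reduce job).
      interface ParityWriter {
        void writeParity(Path source, Path parity) throws Exception;
      }

      // Raid every file under srcDir: create its parity file, then drop the
      // source replication to the target (e.g. 1 for Reed-Solomon RAID).
      public static void raidDirectory(Configuration conf, Path srcDir, Path parityRoot,
                                       short targetReplication, ParityWriter writer)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus stat : fs.listStatus(srcDir)) {
          if (stat.isDir()) {
            continue;                                        // this sketch skips subdirectories
          }
          Path src = stat.getPath();
          Path parity = new Path(parityRoot, src.getName()); // parity path mirrors the source name
          writer.writeParity(src, parity);                   // encode stripes into the parity file
          fs.setReplication(src, targetReplication);         // parity now covers the dropped replicas
        }
      }
    }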
Block Fixer
- Reconstructs missing/corrupt blocks
- Retrieves a list of corrupt files from the NameNode
- Source blocks are reconstructed by "decoding"
- Parity blocks are reconstructed by "encoding"
Block Fixer
- Bonus: parity HARs
- One HAR block => multiple parity blocks
- Reconstructs all necessary blocks
Block Fixer Stats
Erasure Code
- ErasureCode: abstraction for erasure code implementations
    public void encode(int[] message, int[] parity);
    public void decode(int[] data, int[] erasedLocations, int[] erasedValues);
- Current implementations: XOR code, Reed-Solomon code
- Encoder/Decoder: uses ErasureCode to integrate with the RAID framework
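To make the abstraction concrete, here is a minimal XOR-style implementation of the two methods above. It is a sketch, not the contrib XOR code: it assumes a single parity symbol and that decode's data array holds the full codeword (message symbols followed by parity), which may not match the exact layout used in HDFS RAID:

    // Minimal XOR erasure code sketch matching the interface on this slide.
    // One parity symbol; can recover any single erased symbol of the codeword.
    public class XorCodeSketch {

      // parity[0] = XOR of all message symbols.
      public void encode(int[] message, int[] parity) {
        int p = 0;
        for (int m : message) {
          p ^= m;
        }
        parity[0] = p;
      }

      // data = full codeword (message followed by parity); since the XOR of the
      // whole codeword is 0, an erased symbol equals the XOR of all the others.
      public void decode(int[] data, int[] erasedLocations, int[] erasedValues) {
        for (int i = 0; i < erasedLocations.length; i++) {
          int value = 0;
          for (int j = 0; j < data.length; j++) {
            if (j != erasedLocations[i]) {
              value ^= data[j];
            }
          }
          erasedValues[i] = value;
          data[erasedLocations[i]] = value; // also patch the codeword in place
        }
      }
    }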
Block Placement
[Diagram] Replication = 3: the dependent blocks are the three replicas of each block of a 10-block file; tolerates any 2 errors.
[Diagram] Replication = 1, parity length = 4: the dependent blocks are all 10 source blocks plus the parity blocks P1-P4 of a stripe; tolerates any 4 errors.
Block Placement
- RAID introduces a new dependency between blocks in source and parity files
- Default block placement is bad for RAID:
  - Source/parity blocks can end up on a single node/rack
  - Parity blocks could co-locate with source blocks
- Raid block placement policy:
  - Source files: after RAIDing, disperse the blocks
  - Parity files: control placement of parity blocks to avoid source blocks and other parity blocks
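The core idea can be sketched without the real BlockPlacementPolicy API (which is more involved); the hypothetical helper below simply refuses to place a new block of a stripe on any node that already holds a block of the same stripe, source or parity:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    public class StripePlacementSketch {
      private static final Random RANDOM = new Random();

      // Pick a DataNode for a new block of a stripe so that no two blocks of the
      // same stripe (source or parity) end up on the same node.
      public static String chooseTarget(List<String> candidateNodes,
                                        List<List<String>> stripeBlockLocations) {
        Set<String> excluded = new HashSet<String>();
        for (List<String> locations : stripeBlockLocations) {
          excluded.addAll(locations);        // nodes already holding a block of this stripe
        }
        List<String> allowed = new ArrayList<String>();
        for (String node : candidateNodes) {
          if (!excluded.contains(node)) {
            allowed.add(node);
          }
        }
        if (allowed.isEmpty()) {
          return null;                       // caller falls back to default placement
        }
        return allowed.get(RANDOM.nextInt(allowed.size())); // spread load among the rest
      }
    }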
DistributedRaidFileSystem
- A filter file system implementation
- Allows clients to read "corrupt" source files
- Catches BlockMissingException, ChecksumException
- Recreates missing blocks on the fly using parity
- Does not fix the missing blocks; only allows the reads to succeed
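Client-side usage looks roughly like the sketch below. The two configuration keys follow the HDFS-RAID setup documented on the wiki linked at the end of this deck, but treat the exact key names as an assumption rather than a guarantee:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaidReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed keys: route hdfs:// through the RAID filter file system,
        // which wraps the regular DistributedFileSystem underneath.
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedRaidFileSystem");
        conf.set("fs.raid.underlyingfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
          byte[] buffer = new byte[64 * 1024];
          // If a block is missing, the filter file system catches the exception and
          // serves bytes reconstructed from parity; the file itself is not repaired.
          while (in.read(buffer) != -1) {
            // consume the data
          }
        } finally {
          in.close();
        }
      }
    }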
RaidShell
- Administrator tool
- Recover blocks: reconstructs missing blocks and sends the reconstructed block to a DataNode
- Raid fsck: reports corrupt files that cannot be fixed by RAID
- Handy tool as a last resort to fix blocks
Deployment
- Single configuration file "raid.xml": specifies file patterns to RAID (see the sketch below)
- In the HDFS config file:
  - Specify the raid.xml location
  - Specify the location of parity files (default: /raid)
  - Specify the FileSystem and BlockPlacementPolicy
- Starting RaidNode: start-raidnode.sh, stop-raidnode.sh
- http://wiki.apache.org/hadoop/HDFS-RAID
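For reference, a raid.xml policy might look roughly like this; the structure and property names follow the HDFS-RAID wiki page above, but the exact values and paths here are made up for illustration:

    <configuration>
      <srcPath prefix="hdfs://namenode:9000/warehouse/tables">
        <policy name="warehouse-tables">
          <property>
            <name>srcReplication</name>
            <value>3</value>        <!-- only raid files currently at replication 3 -->
          </property>
          <property>
            <name>targetReplication</name>
            <value>2</value>        <!-- replication of the source file after raiding -->
          </property>
          <property>
            <name>metaReplication</name>
            <value>2</value>        <!-- replication of the parity file -->
          </property>
          <property>
            <name>modTimePeriod</name>
            <value>86400000</value> <!-- only raid files unmodified for a day (ms) -->
          </property>
        </policy>
      </srcPath>
    </configuration>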
