Slide 1: Beyond the File System
Designing Large Scale File Storage and Serving
Cal Henderson
Slide 2: Hello!
Web 2.0 Expo, 17 April 2007
Slide 3: Big file systems?
• Too vague!
• What is a file system?
• What constitutes big?
• Some requirements would be nice
Slide 4: 1
Scalable
Looking at storage and serving infrastructures
Slide 5: 2
Reliable
Looking at redundancy, failure rates, on the fly changes
Slide 6: 3
Cheap
Looking at upfront costs, TCO and lifetimes
Slide 7: Four buckets
• Storage
• Serving
• BCP
• Cost
Slide 8: Storage
Slide 9: The storage stack
File protocol: NFS, CIFS, SMB
File system: ext, reiserFS, NTFS
Block protocol: SCSI, SATA, FC
RAID: Mirrors, Stripes
Hardware: Disks and stuff
Slide 10: Hardware overview
The storage scale
Lower → Higher: Internal, DAS, SAN, NAS
Slide 11: Internal storage
• A disk in a computer
– SCSI, IDE, SATA
• 4 disks in 1U is common
• 8 for half depth boxes
Slide 12: DAS
Direct attached storage
Disk shelf, connected by SCSI/SATA
HP MSA30 – 14 disks in 3U
Slide 13: SAN
• Storage Area Network
• Dumb disk shelves
• Clients connect via a ‘fabric’
• Fibre Channel, iSCSI, InfiniBand
– Low level protocols
Slide 14: NAS
• Network Attached Storage
• Intelligent disk shelf
• Clients connect via a network
• NFS, SMB, CIFS
– High level protocols
Slide 15: Of course, it’s more confusing than that
Slide 16: Meet the LUN
• Logical Unit Number
• A slice of storage space
• Originally for addressing a single drive:
– c1t2d3
– Controller, Target, Disk (Slice)
• Now means a virtual partition/volume
– LVM, Logical Volume Management
Slide 17: NAS vs SAN
With a SAN, a single host (initiator) owns a single LUN/volume
With NAS, multiple hosts own a single LUN/volume
NAS head – NAS access to a SAN
Slide 18: SAN Advantages
Virtualization within a SAN offers some nice features:
• Real-time LUN replication
• Transparent backup
• SAN booting for host replacement
Slide 19: Some Practical Examples
• There are a lot of vendors
• Configurations vary
• Prices vary wildly
• Let’s look at a couple
– Ones I happen to have experience with
– Not an endorsement ;)
Slide 20: NetApp Filers
Heads and shelves, up to 500TB in 6 Cabs
FC SAN with 1 or 2 NAS heads
Slide 21: Isilon IQ
• 2U Nodes, 3-96 nodes/cluster, 6-600 TB
• FC/InfiniBand SAN with NAS head on each node
Slide 22: Scaling
Vertical vs Horizontal
Slide 23: Vertical scaling
• Get a bigger box
• Bigger disk(s)
• More disks
• Limited by current tech
– Size of each disk and total number in appliance
Slide 24: Horizontal scaling
• Buy more boxes
• Add more servers/appliances
• Scales forever*
*sort of
Slide 25: Storage scaling approaches
• Four common models:
• Huge FS
• Physical nodes
• Virtual nodes
• Chunked space
Slide 26: Huge FS
• Create one giant volume with growing space
– Sun’s ZFS
– Isilon IQ
• Expandable on-the-fly?
• Upper limits
– Always limited somewhere
Slide 27: Huge FS
• Pluses
– Simple from the application side
– Logically simple
– Low administrative overhead
• Minuses
– All your eggs in one basket
– Hard to expand
– Has an upper limit
Slide 28: Physical nodes
• Application handles distribution to multiple physical nodes
– Disks, Boxes, Appliances, whatever
• One ‘volume’ per node
• Each node acts by itself
• Expandable on-the-fly – add more nodes
• Scales forever
Slide 29: Physical Nodes
• Pluses
– Limitless expansion
– Easy to expand
– Unlikely to all fail at once
• Minuses
– Many ‘mounts’ to manage
– More administration
Slide 30: Virtual nodes
• Application handles distribution to multiple virtual volumes, contained on multiple physical nodes
• Multiple volumes per node
• Flexible
• Expandable on-the-fly – add more nodes
• Scales forever
Slide 31: Virtual Nodes
• Pluses
– Limitless expansion
– Easy to expand
– Unlikely to all fail at once
– Addressing is logical, not physical
– Flexible volume sizing, consolidation
• Minuses
– Many ‘mounts’ to manage
– More administration
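A minimal sketch of the virtual node idea, assuming a simple in-memory lookup table. The volume numbers, host names and helper functions are hypothetical and not from the talk. The application stores only a virtual volume number, and the table decides which physical node currently holds that volume.

```python
# Hypothetical mapping of virtual volumes to physical nodes.
VOLUME_TO_NODE = {
    1: "storage-a",
    2: "storage-a",
    3: "storage-b",
    4: "storage-b",
}


def path_for(volume: int, filename: str) -> str:
    """Resolve a (virtual volume, filename) pair to a physical location."""
    node = VOLUME_TO_NODE[volume]
    return f"//{node}/vol{volume}/{filename}"


def move_volume(volume: int, new_node: str) -> None:
    """Rebalance by repointing a volume; metadata stored by the app is unchanged."""
    VOLUME_TO_NODE[volume] = new_node


before = path_for(3, "photo.jpg")
move_volume(3, "storage-c")
after = path_for(3, "photo.jpg")
```

Because the address is logical, moving volume 3 to a new box changes only the table, not anything the application has recorded about the file.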
Slide 32: Chunked space
• Storage layer writes parts of files to different physical nodes
• A higher-level RAID striping
• High performance for large files
– Read multiple parts simultaneously
Slide 33: Chunked space
• Pluses
– High performance
– Limitless size
• Minuses
– Conceptually complex
– Can be hard to expand on the fly
– Can’t manually poke it
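A rough sketch of chunked storage, assuming round-robin placement across nodes. The function names and the placement rule are illustrative assumptions; real systems choose placement in more sophisticated ways.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Cut a file into fixed-size parts (the last part may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks: list, nodes: list) -> dict:
    """Assign chunk i to node i % len(nodes), like striping at a higher level."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append((index, chunk))
    return placement


def reassemble(placement: dict) -> bytes:
    """Gather parts from every node and join them in index order."""
    indexed = [pair for stored in placement.values() for pair in stored]
    return b"".join(chunk for _, chunk in sorted(indexed))


data = bytes(range(10))
placement = place_chunks(split_into_chunks(data, 4), ["n1", "n2"])
```

In a real deployment each node's parts could be fetched in parallel, which is where the read performance for large files comes from.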
Slide 34: Real Life Case Studies
Slide 35: GFS – Google File System
• Developed by … Google
• Proprietary
• Everything we know about it is based on talks they’ve given
• Designed to store huge files for fast access
Slide 36: GFS – Google File System
• Single ‘Master’ node holds metadata
– Single point of failure (SPOF)
– Shadow master allows warm swap
• Grid of ‘chunkservers’
– 64-bit filenames
– 64 MB file chunks
Slide 37: GFS – Google File System
[Diagram: Master node and chunkservers holding chunks 1(a), 2(a), 1(b)]
Slide 38: GFS – Google File System
• Client reads metadata from master then file parts from multiple chunkservers
• Designed for big files (>100MB)
• Master server allocates access leases
• Replication is automatic and self repairing
– Synchronously for atomicity
Slide 39: GFS – Google File System
• Reading is fast (parallelizable)
– But requires a lease
• Master server is required for all reads and writes
Slide 40: MogileFS – OMG Files
• Developed by Danga / SixApart
• Open source
• Designed for scalable web app storage
Slide 41: MogileFS – OMG Files
• Single metadata store (MySQL)
– MySQL Cluster avoids a single point of failure
• Multiple ‘tracker’ nodes locate files
• Multiple ‘storage’ nodes store files
Slide 42: MogileFS – OMG Files
[Diagram: Tracker, Tracker, MySQL]
Slide 43: MogileFS – OMG Files
• Replication of file ‘classes’ happens transparently
• Storage nodes are not mirrored – replication is piecemeal
• Reading and writing go through trackers, but are performed directly upon storage nodes
Slide 44: Flickr File System
• Developed by Flickr
• Proprietary
• Designed for very large scalable web app storage
Slide 45: Flickr File System
• No metadata store
– Deal with it yourself
• Multiple ‘StorageMaster’ nodes
• Multiple storage nodes with virtual volumes
Slide 46: Flickr File System
[Diagram: three StorageMaster (SM) nodes]
Slide 47: Flickr File System
• Metadata stored by app
– Just a virtual volume number
– App chooses a path
• Virtual nodes are mirrored
– Locally and remotely
• Reading is done directly from nodes
Slide 48: Flickr File System
• StorageMaster nodes only used for write operations
• Reading and writing can scale separately
Slide 49: Amazon S3
• A big disk in the sky
• Multiple ‘buckets’
• Files have user-defined keys
• Data + metadata
Slide 50: Amazon S3
[Diagram: Servers, Amazon]
Slide 51: Amazon S3
[Diagram: Servers, Amazon, Users]
Slide 52: The cost
• Fixed price, by the GB
• Store: $0.15 per GB per month
• Serve: $0.20 per GB
Slide 53: The cost
[Chart: S3]
Slide 54: The cost
[Chart: S3, Regular Bandwidth]
Slide 55: End costs
• ~$2k to store 1TB for a year
• ~$63 a month to serve a sustained 1Mb/s
• ~$65k a month to serve a sustained 1Gb/s
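The end costs can be roughly reproduced from the per-GB prices on the earlier cost slide. This is a sketch under stated assumptions: the bandwidth figures are read as sustained rates, a month is taken as 30 days, and binary units are used (1 TB = 1024 GB, 1 Gb/s = 1024 Mb/s). None of these assumptions is spelled out in the slides.

```python
STORE_PER_GB_MONTH = 0.15  # USD, from the slide
SERVE_PER_GB = 0.20        # USD, from the slide


def yearly_storage_cost(gigabytes: float) -> float:
    """Cost of keeping this much data stored for twelve months."""
    return gigabytes * STORE_PER_GB_MONTH * 12


def monthly_serving_cost(megabits_per_second: float, days: int = 30) -> float:
    """Cost of serving at a sustained rate for one (assumed 30-day) month."""
    seconds = days * 24 * 60 * 60
    megabytes = megabits_per_second / 8 * seconds
    gigabytes = megabytes / 1024
    return gigabytes * SERVE_PER_GB


storage_1tb = yearly_storage_cost(1024)      # about 1843 USD
serve_1mbps = monthly_serving_cost(1)        # about 63 USD
serve_1gbps = monthly_serving_cost(1024)     # about 64800 USD
```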
Slide 56: Serving
Slide 57: Serving files
Serving files is easy!
[Diagram: Disk, Apache]
Slide 58: Serving files
Scaling is harder
[Diagram: three Disk and Apache pairs]
Slide 59: Serving files
• This doesn’t scale well
• Primary storage is expensive
– And takes a lot of space
• In many systems, we only access a small number of files most of the time
Slide 60: Caching
• Insert caches between the storage and serving nodes
• Cache frequently accessed content to reduce reads on the storage nodes
• Software (Squid, mod_cache)
• Hardware (Netcache, Cacheflow)
Slide 61: Why it works
• Keep a smaller working set
• Use faster hardware
– Lots of RAM
– SCSI
– Outer edge of disks (ZCAV)
• Use more duplicates
– Cheaper, since they’re smaller
Slide 62: Two models
• Layer 4
– ‘Simple’ balanced cache
– Objects in multiple caches
– Good for few objects requested many times
• Layer 7
– URL-balanced cache
– Objects in a single cache
– Good for many objects requested a few times
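The difference between the two models can be sketched as two ways of picking a cache. The cache names, round-robin for layer 4 and a SHA-256 hash of the URL for layer 7 are assumptions chosen for illustration; real load balancers offer several balancing methods.

```python
import hashlib
from itertools import count

CACHES = ["cache-1", "cache-2", "cache-3"]  # hypothetical cache hosts
_counter = count()


def pick_layer4() -> str:
    """Round-robin: ignores the URL, so any object may land in any cache."""
    return CACHES[next(_counter) % len(CACHES)]


def pick_layer7(url: str) -> str:
    """Hash the URL: a given object is always sent to the same single cache."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return CACHES[int.from_bytes(digest[:4], "big") % len(CACHES)]
```

With layer 4 balancing a popular object ends up duplicated in every cache, which spreads its load. With layer 7 balancing each object occupies space in one cache only, so the combined caches hold more distinct objects.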
Slide 63: Replacement policies
• LRU – Least recently used
• GDSF – Greedy dual size frequency
• LFUDA – Least frequently used with dynamic aging
• All have advantages and disadvantages
• Performance varies greatly with each
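As an illustration of the simplest of these policies, here is a minimal LRU cache sketch built on Python's `OrderedDict`. The class name and the capacity counted in objects (rather than bytes) are simplifying assumptions; GDSF and LFUDA additionally weigh object size and request frequency and are not shown.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used object once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None  # cache miss
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value) -> None:
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now more recent than "b"
cache.put("c", 3)  # evicts "b"
```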
Slide 64: Cache Churn
• How long do objects typically stay in cache?
• If it gets too short, we’re doing badly
– But it depends on your traffic profile
• Make the cached object store larger
Slide 65: Problems
• Caching has some problems:
– Invalidation is hard
– Replacement is dumb (even LFUDA)
• Avoiding caching makes your life (somewhat) easier
Slide 66: CDN – Content Delivery Network
• Akamai, Savvis, Mirror Image Internet, etc
• Caches operated by other people
– Already in-place
– In lots of places
• GSLB/DNS balancing
Slide 67: Edge networks
[Diagram: Origin]
Slide 68: Edge networks
[Diagram: Origin surrounded by eight caches]
Slide 69: CDN Models
• Simple model
– You push content to them, they serve it
• Reverse proxy model
– You publish content on an origin, they proxy and cache it
Slide 70: CDN Invalidation
• You don’t control the caches
– Just like those awful ISP ones
• Once something is cached by a CDN, assume it can never change
– Nothing can be deleted
– Nothing can be modified
Slide 71: Versioning
• When you start to cache things, you need to care about versioning
– Invalidation & Expiry
– Naming & Sync
Slide 72: Cache Invalidation
• If you control the caches, invalidation is possible
• But remember ISP and client caches
• Remove deleted content explicitly
– Avoid users finding old content
– Save cache space
Slide 73: Cache versioning
• Simple rule of thumb:
– If an item is modified, change its name (URL)
• This can be independent of the file system!
Slide 74: Virtual versioning
• Database indicates version 3 of file
• Web app writes version number into URL: example.com/foo_3.jpg
• Request comes through cache and is cached with the versioned URL: foo_3.jpg
• mod_rewrite converts versioned URL to path: foo_3.jpg -> foo.jpg
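The two halves of virtual versioning can be sketched in Python. This stands in for the web app and for the mod_rewrite step; it is not Apache configuration. The function names, the `_N` suffix pattern and the assumption that every file has an extension are illustrative.

```python
import re

# Matches "<stem>_<digits><.ext>", e.g. "/foo_3.jpg".
VERSIONED = re.compile(r"^(?P<stem>.+)_(?P<version>\d+)(?P<ext>\.[A-Za-z0-9]+)$")


def versioned_url(path: str, version: int) -> str:
    """Web app side: write the version number into the URL."""
    stem, _, ext = path.rpartition(".")
    return f"{stem}_{version}.{ext}"


def strip_version(url_path: str) -> str:
    """Origin side: map the versioned URL back to the real file path."""
    match = VERSIONED.match(url_path)
    if match is None:
        return url_path
    return match.group("stem") + match.group("ext")


url = versioned_url("/foo.jpg", 3)
path = strip_version(url)
```

The cache only ever sees the versioned name, so bumping the version in the database produces a new cache key without renaming anything on disk.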
Slide 75: Authentication
• Authentication inline layer
– Apache / perlbal
• Authentication sideline
– ICP (CARP/HTCP)
• Authentication by URL
– FlickrFS
Slide 76: Auth layer
• Authenticator sits between client and storage
• Typically built into the cache software
[Diagram: Authenticator, Cache, Origin]
Slide 77: Auth sideline
• Authenticator sits beside the cache
• Lightweight protocol used for authenticator
[Diagram: Cache, Origin, Authenticator]
Slide 78: Auth by URL
• Someone else performs authentication and gives URLs to client (typically the web app)
• URLs hold the ‘keys’ for accessing files
[Diagram: Web Server, Cache, Origin]
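One common way to put a ‘key’ in a URL is an HMAC signature over the path. The sketch below assumes that approach; the talk does not say how FlickrFS builds its keys, and the secret, the `key` parameter name and both functions are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret shared by the web app and the file-serving layer only.
SECRET = b"hypothetical-shared-secret"


def sign_path(path: str) -> str:
    """Web app side: hand the client a URL that carries its own key."""
    signature = hmac.new(SECRET, path.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{path}?key={signature}"


def is_authorized(url: str) -> bool:
    """Serving side: recompute the signature and compare in constant time."""
    path, separator, key = url.partition("?key=")
    if not separator:
        return False
    expected = hmac.new(SECRET, path.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(key, expected)


signed = sign_path("/photos/1.jpg")
```

The serving layer needs no session lookup or call back to the web app. It only checks that the key matches the path being requested.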
Slide 79: BCP
Slide 80: Business Continuity Planning
• How can I deal with the unexpected?
– The core of BCP
• Redundancy
• Replication
Slide 81: Reality
• On a long enough timescale, anything that can fail, will fail
• Of course, everything can fail
• True reliability comes only through redundancy
Slide 82: Reality
• Define your own SLAs
• How long can you afford to be down?
• How manual is the recovery process?
• How far can you roll back?
• How many $node boxes can fail at once?
Slide 83: Failure scenarios
• Disk failure
• Storage array failure
• Storage head failure
• Fabric failure
• Metadata node failure
• Power outage
• Routing outage
Slide 84: Reliable by design
• RAID avoids disk failures, but not head or fabric failures
• Duplicated nodes avoid host and fabric failures, but not routing or power failures
• Dual-colo avoids routing and power failures, but may need duplication too
Slide 85: Tend to all points in the stack
• Going dual-colo: great
• Taking a whole colo offline because of a single failed disk: bad
• We need a combination of these
Slide 86: Recovery times
• BCP is not just about continuing when things fail
• How can we restore after they come back?
• Host and colo level syncing
– Replication queuing
• Host and colo level rebuilding
Slide 87: Reliable Reads & Writes
• Reliable reads are easy
– 2 or more copies of files
• Reliable writes are harder
– Write 2 copies at once
– But what do we do when we can’t write to one?
Slide 88: Dual writes
• Queue up data to be written
– Where?
– Needs itself to be reliable
• Queue up journal of changes
– And then read data from the disk whose write succeeded
• Duplicate whole volume after failure
– Slow!
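The journal option can be sketched as follows. The two stores are modelled as plain dicts and the failure is simulated with a flag. Both are assumptions for illustration, as are the function names. Only the key is journaled; the data itself is later read back from the copy whose write succeeded.

```python
def dual_write(key, data, primary: dict, secondary: dict,
               journal: list, secondary_up: bool = True) -> None:
    """Write to both stores; if the second is down, journal the change."""
    primary[key] = data
    if secondary_up:
        secondary[key] = data
    else:
        journal.append(key)  # remember what the secondary missed


def replay(journal: list, primary: dict, secondary: dict) -> None:
    """Once the secondary is back, copy each journaled key from the primary."""
    while journal:
        key = journal.pop(0)
        secondary[key] = primary[key]


primary, secondary, journal = {}, {}, []
dual_write("a", b"1", primary, secondary, journal)
dual_write("b", b"2", primary, secondary, journal, secondary_up=False)
pending = list(journal)
replay(journal, primary, secondary)
```

The journal itself still has to live somewhere reliable, which is the same question the slide raises for the data queue.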
Slide 89: Cost
Slide 90: Judging cost
• Per GB?
• Per GB upfront and per year
• Not as simple as you’d hope
– How about an example
Slide 91: Hardware costs
Single Cost
Cost of hardware / Usable GB
Slide 92: Power costs
Recurring Cost
Cost of power per year / Usable GB
Slide 93: Power costs
Single Cost
Power installation cost / Usable GB
Slide 94: Space costs
Recurring Cost
[Cost per U x U’s needed (inc network)] / Usable GB
Slide 95: Network costs
Single Cost
Cost of network gear / Usable GB
Slide 96: Misc costs
Single & Recurring Costs
[Support contracts + spare disks + bus adaptors + cables] / Usable GB
Slide 97: Human costs
Recurring Cost
[Admin cost per node x Node count] / Usable GB
Slide 98: TCO
• Total cost of ownership in two parts
– Upfront
– Ongoing
• Architecture plays a huge part in costing
– Don’t get tied to hardware
– Allow heterogeneity
– Move with the market
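The per-GB cost slides can be combined into one total-cost figure. This is a sketch only: the function and every number in the example are made up for illustration and do not come from the talk.

```python
def cost_per_gb(upfront: dict, recurring_per_year: dict,
                usable_gb: float, years: float) -> float:
    """Total cost of ownership per usable GB over the given lifetime."""
    total = sum(upfront.values()) + years * sum(recurring_per_year.values())
    return total / usable_gb


# Hypothetical figures in USD, chosen only to show the calculation.
upfront = {"hardware": 10000, "power_install": 500, "network": 1500}
recurring = {"power": 1000, "space": 2000, "admin": 3000}

tco = cost_per_gb(upfront, recurring, usable_gb=10000, years=3)
```

Keeping upfront and ongoing costs in separate inputs makes it easy to see how the balance shifts as the assumed lifetime changes.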
Slide 99: (fin)
Slide 100: Photo credits
• flickr.com/photos/ebright/260823954/
• flickr.com/photos/thomashawk/243477905/
• flickr.com/photos/tom-carden/116315962/
• flickr.com/photos/sillydog/287354869/
• flickr.com/photos/foreversouls/131972916/
• flickr.com/photos/julianb/324897/
• flickr.com/photos/primejunta/140957047/
• flickr.com/photos/whatknot/28973703/
• flickr.com/photos/dcjohn/85504455/
Slide 101: You can find these slides online:
iamcal.com/talks/