Slide 1: The SmugMug Tale
Slide 2: Who are we?
Premium photo & video sharing. Bootstrapped in ’02. $10M+ as of ’07. Profitable. No debt. Top 400 website. Doubling yearly.
Slide 3: Our challenge
Premium means “more” and “better”. Unlimited storage. Unlimited bandwidth. Big photos (48Mpix). 500M+ of them. Big video (1920x1080p). Lots of photos per page. Super fast.
Slide 4: Architecture overview
LAMP(hp). x86 (mostly AMD) on Linux (~300 4+ core hosts?). 4 datacenters: 2 x SV, 1 x VA, 1 x SEA. 2 Ops guys. :) Majority of boxes are diskless. Consume lots of cloud services (S3, EC2, etc).
Slide 5: Storage
Binary data (photos, video, etc): Stored in Amazon’s S3. PBs. Akamai fronts for caching and acceleration. Structured data (Database, etc): MySQL (InnoDB mostly). 4+ cores, 64GB, >2TB storage. Memcached fronts for caching.
Slide 6: Compute
Photo & video processing / encoding: Handled in Amazon EC2. Totally autonomous scaling (SkyNet). Customer facing: Diskless web boxes (PXE boot). Scaled up *and* out. MySQL. Memcached, ~1TB.
Slide 7: Secret Weapon: Akamai
Super-fast CDN: Reads often already close to customer. More than just a CDN: HTML/AJAX/etc inspection for pre-fetch. Anticipate requests and get data to within low ms. Optimal data path to SmugMug. DNS latency reduction. $$$ but worth it. Get what you pay for.
Slide 8: Secret Weapon: memcached
Screaming fast. ~1TB of data stored. >96% hit rate. Contains MySQL row data, avoid SELECTs. Misc other data cached, but MySQL biggest win. Fall back on MySQL for cold data.
Slide 9: Secret Weapon: MySQL
Most important technology at SmugMug. Super dependent on replication: performance, reliability / high availability. No MySQL data loss in >7 years. No JOINs. (Or lots of 4.x+ features, either.) Vertically partitioned, not horizontally (no shards).
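A minimal sketch of what "vertically partitioned, not horizontally" can look like in application code. The table and cluster names here are hypothetical, not SmugMug's. Each table lives whole on one cluster, so routing needs only the table name; a sharded setup would also need the row key. One plausible consequence of this layout is that related tables may sit on different servers, which fits a no-JOINs rule.

```python
# Hypothetical table-to-cluster map; every name here is illustrative only.
TABLE_TO_CLUSTER = {
    "users": "db-accounts",
    "albums": "db-galleries",
    "photos": "db-photos",
    "comments": "db-community",
}


def cluster_for(table: str) -> str:
    """Vertical partitioning: the table name alone picks the cluster."""
    try:
        return TABLE_TO_CLUSTER[table]
    except KeyError:
        raise ValueError(f"no cluster assigned for table {table!r}")


def shard_for(table: str, pkey: int, shards: int = 4) -> str:
    """Contrast only: horizontal sharding would also need the row key."""
    return f"{cluster_for(table)}-shard{pkey % shards}"
```

With this map, `cluster_for("photos")` always resolves to the same cluster no matter which row is requested, which keeps the routing layer trivial.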
Slide 10: Secret Weapon: InnoDB
Most important technology at SmugMug. Huge thanks to Heikki, Oracle, Percona and Google! Running 1.0.3+patches in production. Big performance gains with recent releases.
Slide 11: Secret Weapon: Percona
Crazy concentration of talent under one roof. Best MySQL dollars we’ve ever spent. Helped us out of a major bind. Have you heard of the ‘back_log’ mysqld setting? Me neither. Hope you never do. Percona had. Helped build, integrate, and test InnoDB patches.
Slide 12: MySQL details
We care about write latency above all. Well, ok, maybe data integrity. ;) Scaling reads “easy”: replication and memcached. Replication needs to stay current (<1 sec). MySQL concurrency problems. (Much improved!) Parallel I/O - lots of cores. Large storage (TBs). Big RAM (64GB+) to keep indexes hot.
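A rough sketch of how an application might enforce the "<1 sec" freshness rule when choosing where to read. It assumes each slave exposes the timestamp of the newest master heartbeat it has applied; the `Replica` type, the heartbeat mechanism, and the fallback to the master are all assumptions for illustration, not a description of SmugMug's code.

```python
from dataclasses import dataclass
from typing import List

MAX_LAG_SECONDS = 1.0  # the slide's target: replication stays under 1 sec


@dataclass
class Replica:
    name: str
    last_applied_ts: float  # hypothetical: newest master heartbeat applied


def replication_lag(master_now: float, replica: Replica) -> float:
    """Seconds the replica is behind the master's clock (never negative)."""
    return max(0.0, master_now - replica.last_applied_ts)


def pick_read_hosts(master_now: float, replicas: List[Replica],
                    max_lag: float = MAX_LAG_SECONDS) -> List[str]:
    """Return replicas fresh enough to read from, else fall back to master."""
    fresh = [r.name for r in replicas
             if replication_lag(master_now, r) < max_lag]
    return fresh or ["master"]
```

Falling back to the master trades read capacity for freshness; a different policy (serve slightly stale data, or wait) would be equally valid depending on the workload.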
Slide 13: MySQL query details
Mostly SELECT pkey FROM table WHERE index; On cache miss, SELECT * FROM table WHERE pkey; UPDATEs/DELETEs mostly on single rows by pkey. Easy memcached expiration. Easy slave-delay tracking. Very denormalized. No JOINs or complex SELECTs. OLTP benchmark imperfect. Time for sysbench-web?
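The query pattern above can be sketched end to end. This is an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL, a plain dict stands in for memcached, and the `photos` schema and key format are made up. The index query returns only primary keys, each row is then fetched by pkey through the cache, and a single-row UPDATE expires exactly one cache key.

```python
import sqlite3

# Stand-ins: SQLite for MySQL, a dict for memcached. Schema is hypothetical.
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT)")
db.execute("CREATE INDEX idx_album ON photos (album_id)")
db.executemany("INSERT INTO photos VALUES (?, ?, ?)",
               [(1, 10, "sunset"), (2, 10, "harbor"), (3, 11, "dog")])
db.commit()

cache = {}
stats = {"hits": 0, "misses": 0}


def cache_key(pkey):
    return f"photos:{pkey}"


def get_photo(pkey):
    """Row lookup by pkey: cache first, SELECT * only on a miss."""
    key = cache_key(pkey)
    row = cache.get(key)
    if row is not None:
        stats["hits"] += 1
        return row
    stats["misses"] += 1
    found = db.execute("SELECT * FROM photos WHERE id = ?", (pkey,)).fetchone()
    if found is None:
        return None
    row = dict(found)
    cache[key] = row
    return row


def photos_in_album(album_id):
    """Index query returns pkeys only; row data comes through the cache."""
    ids = [r["id"] for r in db.execute(
        "SELECT id FROM photos WHERE album_id = ? ORDER BY id", (album_id,))]
    return [get_photo(i) for i in ids]


def rename_photo(pkey, title):
    """Single-row UPDATE by pkey, so expiring one cache key is enough."""
    db.execute("UPDATE photos SET title = ? WHERE id = ?", (title, pkey))
    db.commit()
    cache.pop(cache_key(pkey), None)
```

Because every write touches one row by primary key, the set of cache keys to expire is known without scanning anything, which is presumably what makes expiration "easy" here.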
Slide 14: MySQL Issues: Filesystems
Better filesystem: CentOS Linux shop (lots of expertise). MySQL is storage intensive (iops, size, etc). ext3 old and busted. fsck, well, sucks. ext4 already old and busted. :( Want good volume management. Serialized writes (non-parallel). Ugh.
Slide 15: Filesystem Solution - ZFS!
Transactional. Copy-on-write. End-to-end data integrity. On-the-fly corruption detection & repair. Integrated volume management. Snapshots & clones. Open source software.
Slide 16: The REAL Issue
We run Linux. ZFS doesn’t run on Linux. Crap.
Slide 17: MySQL Issues: Replication
Unknown state on crash: Did *.info get written at commit? Or is it *2 months* out of date? Bringing TB+ slaves online quickly. Backups using LVM/ZFS a pain. Keeping up with master. Single thread for replication SQL. Master promotion kludgy.
Slide 18: Replication solutions
Transactional replication patches: Slave always in known state. Either ok to bring back up or CHANGE MASTER. Safe to take snapshots anytime, no effort. Safe to use innodb_flush_log_at_trx_commit=2. InnoDB only. Stopgap. Global trx IDs better. Using in pre-production. Production next week?
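The idea behind "slave always in known state" can be illustrated with a toy model: if the replication position is committed in the same transaction as the replicated change, a crash can never leave the two out of step. The sketch below uses SQLite and a made-up `repl_position` table purely to show the principle; it is not how the actual MySQL/InnoDB patches are implemented.

```python
import sqlite3


def make_slave():
    """Toy slave: one data table plus a one-row position table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE repl_position (id INTEGER PRIMARY KEY, pos INTEGER)")
    db.execute("INSERT INTO repl_position VALUES (1, 0)")
    db.commit()
    return db


def apply_event(db, pos, sql, params, crash_before_commit=False):
    """Apply one replicated statement and record its position atomically."""
    try:
        db.execute(sql, params)
        db.execute("UPDATE repl_position SET pos = ? WHERE id = 1", (pos,))
        if crash_before_commit:
            raise RuntimeError("simulated crash")
        db.commit()  # data and position become durable together
    except Exception:
        db.rollback()  # neither the data nor the position changes
        raise


def position(db):
    return db.execute("SELECT pos FROM repl_position WHERE id = 1").fetchone()[0]
```

After a simulated crash the recorded position still matches the data, so the slave can resume from `position(db)` without guessing, in contrast to a separate *.info file that may or may not have been flushed.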
Slide 19: Secret Weapon: Sushi
Toro aka S7410. NAS storage with a few twists. 2 x Quad-Core Opteron + 64GB RAM. 100GB Readzilla SSD. 2 x 18GB Writezilla SSD. 20K write iops. 22 x 1TB 7200rpm HDD. Clustered HA configuration.
Slide 20: Mmm, Toro tastes good.
ZFS on Linux! SSD is here! SSD performance is cheap! Consume via NFS, iSCSI, CIFS, HTTP, FTP, etc. Massive flexibility - no more DAS. Fishworks interface is a dream. Analytics is a game changer.
Slide 21: Sushi’s quite reasonable
Initial sticker shock - $80K?! $142K clustered?! No one pays list price. Whew. Startup Essentials. Double-whew. Paradigm shift. Biggest whew! DAS -> NAS. So much IO, in theory, can “stack” lots of clients. In practice, can stack *lots* of clients. We now have 5 clustered configs. :)
Slide 22: Sushi served fast
Crazy fast. 9.6K iops, 4.5K under 43us, 8K under 166us
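To put the latency figures in proportion, here is the arithmetic as a small sketch. It assumes the three numbers are counts of operations per second taken from the same measurement window, which the slide does not state explicitly.

```python
# Figures from the slide, read as ops/sec in one window (an assumption).
total_iops = 9600
under_43us = 4500
under_166us = 8000


def share(part: float, whole: float) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


pct_under_43us = share(under_43us, total_iops)    # a bit under half
pct_under_166us = share(under_166us, total_iops)  # roughly five in six
```

Under that reading, a little under half of the operations finish within 43us and roughly five in six finish within 166us.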
Slide 23: Sushi served fast
Scalable. 15K 4k write iops w/16 threads. Low latency. ~250us @ 3K iops, ~700us @ 10K
[Chart: fio write benchmark. Y-axis: 4K write iops (0 to 20000). X-axis: threads (1, 2, 4, 8, 16, 32).]
Slide 24: Sushi today
So fast, we’re stacking like crazy. 5 different MySQL workloads on single clustered Toro. 8 slaves on single Toro. Each used to have 15K disks + write cache. Lots of excess io and space capacity still. Compression “for free” (no client CPU usage). Crazy fast. ~1.5X ratio across TBs of InnoDB.
Slide 25: Sushi today
Backups a breeze. Automatic snapshots every n minutes / hours / days. No need to LOCK / shutdown / STOP SLAVE / etc. Rollback anytime. Skip bad SQL statements. New slave? Click snapshot. Click clone. Done. Slaves share unchanged data on disk and in RAM. Future bright: clone + de-dupe = insanely efficient.
Slide 26: Analyzing sushi
DTrace on Linux! Never had analytics on storage before. Vendor used to say: “Um, we dunno. Buy more spindles?” Now I know all. Vendor now says: “What does Analytics say?” Drill down on everything. Correlate anything. God-like power.
Slide 27: MySQL on Toro so far
NFSv3 (rather than v4). 16KB record size in ZFS (InnoDB). Mirrored (RAID1+0) disks w/striped Logzilla. MySQL concurrency bound - can’t use all the I/O. If compressing, use LZJB. In theory, can optimize InnoDB: doublewrite = 0, checksums = 0. ZFS does these. In practice, no big gain with our workload.
Slide 28: MySQL on Toro problems
Replication *.info files not sync’d over NFS. Found a slave with *2 month old* info files. Transactional replication to the rescue! NFS locking and InnoDB: warnings on the Net. No hard data. Actively researching. What’s the problem?
Slide 29: Even faster?
10GbE for reduced latency? Actively testing this. Driver tuning required. Defaults for throughput. Cards (Intel) & switches (Arista) cheap & fast. Less than $500/port. Copper twinax SFP+ cables cheap. Optical XFP not. $50 vs $1000+. Toro doesn’t support SFP+ cards yet. :(
Slide 30: Kitchen sink on Toro
Everything runs better on Toro. :) Revision control. Stateless Linux mounts. Email. Developer home directories. Built-in, automatic replication for multi-site backups. Photo and video serving?
Slide 31: The future?
100% SSD. Still too $$ for TB+ installs. Even better InnoDB. Community on fire. Oracle/MySQL accepting patches! Multi-threaded replication. Preview release is out. Yes! New storage engines: PBXT, Falcon, Maria, oh my!
Slide 32: Oracle wishlist
MySQL is a crown jewel. Not a gateway drug to Oracle. Different customers. Kill btrfs. GPL ZFS. MySQL and InnoDB under one roof = opportunity. OpenStorage is game changer. Don’t kill it. Listen to your new communities. I’m busy. I’m up here because this is important.
Slide 33: Thanks!
Blog: http://blogs.smugmug.com/don Twitter: DonMacAskill Email: don@smugmug.com Percona Conference: Upstairs :)