Slide 1: From dedicated to cloud infrastructure
Gojko Adzic
Advanced Games Lab
http://gojko.net
gadzic@advancedgameslab.com
Slide 2: Why?
Less hardware = less hassle
Scale up on demand to handle peaks
Scale down afterwards to save money
Slide 3: What?
Anything not security-sensitive or required under regulation
Web sites
Message servers
Price feeds, screen scraping...
Public data
Slide 4: Challenge #1: no NAS
S3 is slow
EBS volumes attach to only one instance at a time
SimpleDB is a big hash table, reliable but slow
However:
New SQL service
Asynchronous persistence with data caches
Offload to SQS
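One way to read "asynchronous persistence with data caches" is a write-behind cache: reads and writes hit memory, and a background worker pushes changes to the slow store later. The sketch below is a minimal illustration under that assumption. `WriteBehindCache` and `slow_store` are hypothetical names, and a plain dict stands in for S3 or SimpleDB.

```python
import queue
import threading


class WriteBehindCache:
    """In-memory cache that persists writes asynchronously.

    `slow_store` is a stand-in for a slow backend (S3, SimpleDB);
    any dict-like object works here. This is an illustration only.
    """

    def __init__(self, slow_store):
        self._cache = {}
        self._store = slow_store
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        # Callers see the new value at once; the slow write is queued.
        self._cache[key] = value
        self._pending.put((key, value))

    def get(self, key):
        # Serve from memory, fall back to the slow store on a miss.
        if key in self._cache:
            return self._cache[key]
        value = self._store[key]
        self._cache[key] = value
        return value

    def flush(self):
        # Block until every queued write has reached the slow store.
        self._pending.join()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            try:
                self._store[key] = value
            finally:
                self._pending.task_done()
```

The trade-off is durability: anything still queued is lost if the instance dies, so this suits data that can be rebuilt or replayed.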
Slide 5: Challenge #2: undedicated network
No multicast
Machines can be locked out for 10-15 minutes
Occasionally unreliable networking between nodes
Slide 6: Challenge #3: load balancing
Can't count on any particular node being reliable
Basic TCP clustering available
No sticky sessions
However:
Easy IP reassignment, so DNS round-robin is workable
Automatic cluster up-scaling
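The idea of rotating across nodes while not trusting any single one can be sketched on the client side as follows. This is an illustration rather than how DNS round-robin itself is implemented. `RoundRobinPool` and its methods are hypothetical names, and how a node gets marked down (health check, failed request) is left out.

```python
import itertools


class RoundRobinPool:
    """Rotates over node addresses, skipping nodes marked as down."""

    def __init__(self, addresses):
        self._addresses = list(addresses)
        self._down = set()
        self._cycle = itertools.cycle(self._addresses)

    def mark_down(self, address):
        self._down.add(address)

    def mark_up(self, address):
        self._down.discard(address)

    def next_address(self):
        # Look at each address at most once per call.
        for _ in range(len(self._addresses)):
            candidate = next(self._cycle)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no healthy nodes available")
```

Because there are no sticky sessions, consecutive requests from one user may land on different nodes, which is another reason to keep the nodes stateless.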
Slide 7: Challenge #3: CDN
CloudFront has a 24-hour refresh cycle
1 CNAME per distribution
No SSL
However:
A new distribution takes ~10 minutes
S3 accessed directly supports HTTP and SSL
Slide 8: Challenge #4: shared knowledge
Machines go up and down, new ones get added
No NAS to store shared configuration
However:
Keep the /etc/hosts mapping on S3
Put config into SimpleDB, use cron tasks to refresh machines
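A cron task that refreshes a machine could look roughly like the sketch below. It assumes the node list has already been fetched from S3 or SimpleDB as a name-to-IP mapping (the fetch itself is omitted). `render_hosts` and `write_atomically` are hypothetical helper names.

```python
import os
import tempfile


def render_hosts(nodes):
    """Build hosts-file content from a {hostname: ip} mapping."""
    lines = ["127.0.0.1 localhost"]
    for name in sorted(nodes):
        lines.append(f"{nodes[name]} {name}")
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Replace `path` in one step so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        # mkstemp creates a private file; a hosts file must be world-readable.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A cron entry would then call something like `write_atomically("/etc/hosts", render_hosts(nodes))` every few minutes.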
Slide 9: Challenge #5: security
No cleanup guarantees
No SLAs
No real control over security
VPN to protect transport available
Slide 10: Preparing for the cloud
Split the data
Break into standalone stateless systems
Prefer horizontal scaling for stateful parts
Closely monitor single points of failure
Use HA resources
Slide 11: Splitting the data
Not all data is the same
Does it need transactions?
Does it need security?
Does it need querying?
The cloud is probably never right for accounts, transactions and key customer data
But it is a really good fit for profiles, reference data, possibly scrubbed tx logs/betting history...
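The three questions can be written down as a simple decision rule. The sketch below is one possible reading, not a rule stated on the slide. The `DataSet` type, the `placement` function and the three tier names are hypothetical, and mapping "needs querying" to a queryable cloud store is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    name: str
    needs_transactions: bool
    security_sensitive: bool
    needs_querying: bool


def placement(data_set):
    """Suggest where a data set could live (illustrative tiers only)."""
    if data_set.needs_transactions or data_set.security_sensitive:
        return "dedicated"            # e.g. accounts, key customer data
    if data_set.needs_querying:
        return "cloud-queryable"      # e.g. a SimpleDB-style store
    return "cloud-blob"               # e.g. S3 objects


accounts = DataSet("accounts", True, True, True)
profiles = DataSet("profiles", False, False, True)
reference = DataSet("reference data", False, False, False)
```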
Slide 12: Standalone stateless systems
Isolate blocks that can easily be replicated
Prepare AMIs with full software/security settings
Retrieve configuration from SimpleDB or S3 on start
Use TCP clustering or automated DNS round-robin to expose new servers
Optional caching at this level
Push state into HA resources
Slide 13: Horizontal scaling for state
Use data grids to off-load and cluster
Coherence, GigaSpaces, Terracotta
Automate packaging and configuration as much as possible (RPMs, S3, SimpleDB)...
Ensure that the configuration can grow dynamically
Use software that survives disconnects and unreliable networks
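Surviving disconnects usually means retrying instead of failing on the first network error. The sketch below shows retries with exponential backoff and random jitter, which is a common technique and not something the slides prescribe. `call_with_retries` and its parameters are hypothetical.

```python
import random
import time


def call_with_retries(operation, attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call `operation`, retrying on connection errors and timeouts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries, let the caller see the error
            # Double the ceiling each time, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many nodes from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

With lockouts of 10-15 minutes, as mentioned earlier in the deck, the retry budget has to be sized accordingly or paired with a switch to another node.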
Slide 14: If clustering is not possible, keep your eyes open
Monitor the system closely and prepare for a quick reaction
Ideally a full AMI that loads configuration from S3
If not, have RPMs ready
Internal IPs aren't recyclable, make sure other systems can switch to a different resource
S3 hosts file, (DNS?), cron to reload configuration
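Close monitoring and switching to a different resource can be sketched with a plain TCP probe. This is a minimal illustration and not a full monitoring setup. `is_reachable` and `first_reachable` are hypothetical helpers, and the endpoint list is assumed to come from the shared configuration.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def first_reachable(endpoints, timeout=2.0):
    """Return the first (host, port) that accepts connections, or None."""
    for host, port in endpoints:
        if is_reachable(host, port, timeout):
            return host, port
    return None
```

A successful connect only shows that the port is open. A real check would also exercise the service behind it.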
Slide 15: Use HA resources
SimpleDB
S3
Hash databases (NoSQL)
SQS (beware of the 8 KB message size limit)
CloudMQ
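One common way around a small message limit is to send large payloads by reference: store the body elsewhere and queue only a pointer. The sketch below takes the slide's "8k" to mean 8192 bytes, which is an assumption. `prepare_message`, `read_message` and the dict standing in for S3 are hypothetical.

```python
import json
import uuid

MAX_MESSAGE_BYTES = 8 * 1024  # assumption: the slide's "8k" read as 8192 bytes


def prepare_message(payload, blob_store):
    """Return a queue-sized message body for a JSON-serialisable payload."""
    body = json.dumps({"inline": payload})
    if len(body.encode("utf-8")) <= MAX_MESSAGE_BYTES:
        return body
    # Too big for the queue: park the payload and send a pointer instead.
    key = f"messages/{uuid.uuid4()}"
    blob_store[key] = json.dumps(payload)
    return json.dumps({"blob_key": key})


def read_message(body, blob_store):
    """Inverse of prepare_message: resolve the pointer if there is one."""
    envelope = json.loads(body)
    if "blob_key" in envelope:
        return json.loads(blob_store[envelope["blob_key"]])
    return envelope["inline"]
```

The consumer then pays the slower blob read only for oversized messages, and small ones stay on the fast path.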
Slide 16: Questions?
gadzic@advancedgameslab.com
http://gojko.net