Slide 1: AN INSIDER’S VIEW ON HOW TO BUILD & MAINTAIN HIGH TRAFFIC WEBSITES
25 AUG 2011
Thursday 25 August 2011
Slide 2: OVERVIEW
‣ Company history / Overview
‣ Hardware level
‣ Software - Technologies
‣ Cases: “2-way matching strategy” & “Statistics”
‣ Release workflow
‣ Remarks / Q&A
Slide 3: COMPANY HISTORY
Slide 4: OVERVIEW
‣ 2000 Launch of Asl.to
‣ 2002 Rebranding to Redbox
‣ 2005 Full-time focus of founders + first hires
‣ 2006 Rebrand to Netlog.com
‣ 2007 Pan-European rollout / Investment of Index Ventures
‣ 2009 Launch of the Arabic version of Netlog
‣ 2010 - 2011 Creation of parent holding Massive//Media, launch of new products / sites
Slide 5: WHAT’S NETLOG?
‣ An online community where you can meet new people, create your own profile, share photos, videos, blogs, ...
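Slide 6: GENDER SPLIT / AGE SPLIT
[Chart, gender split: Female and Male, labelled 49% and 51%]
[Chart, age split: brackets 12-16, 17-20, 21-24, 25-28, 29-32, 33-36, 37-40 and 40+; segment labels 20%, 15%, 24%, 4%, 5%, 6%, 12% and 14%, not matched to brackets in the slide text]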
Slide 7: 42 LANGUAGES
82 MILLION MEMBERS
30 MILLION MONTHLY UNIQUES
Bulgaria
Slide 8: MASSIVE//MEDIA
Slide 9: TWOO.COM
Dating
‣ Play games
‣ Search for dates
‣ Chat & date
Slide 10: EKKO.COM
Next generation share / chat platform
‣ Localized
‣ Friend focus
‣ Realtime experience
‣ Built upon public API
Slide 11: GATCHA.COM
Gaming
‣ Game platform
‣ Social & multiplayer games
‣ Support for multiple devices
Slide 12: KEZOO.COM
Daily deals
‣ Watch a trailer / answer a question & win prizes
Slide 13: ABOUT ME
‣ Core development / Lead developer Netlog.com
‣ Lead framework team at Massive//Media
‣ Technical lead at Twoo.com
Slide 14: HARDWARE LEVEL
Slide 15: SOME NUMBERS
‣ 620 Web servers
‣ 240 Database servers
‣ 150 Memcache servers
‣ 400 Servers for other purposes (cron, mail, deploy, Sphinx, load balancer, Hadoop, development)
‣ + #### “servers” around the world for loading of static content => CDN networks
Slide 16: SERVER SUPPORT
‣ Load balancing
  ‣ HAProxy, good support for failover
‣ Server management
  ‣ Dedicated system people in each team
  ‣ Configuration / setup through Puppet
  ‣ Monitoring: Zabbix / Prowl on mobile
  ‣ Standby persons for overnight support
  ‣ Backups through Bacula
Slide 17: EXAMPLE HAPROXY
‣ <insert example puppet file here>
Slide 18: EXAMPLE PUPPET
‣ Manifests & files
  ‣ Manifests are written in Puppet's own declarative language (Puppet itself is written in Ruby)
  ‣ They contain the locations where files need to be put / which services need to run etc.
  ‣ Inheritance can be used

File: services/s_sphinx/manifests

    # Child class: overrides the source of files declared in the parent class
    class s_sphinx::twoo::server inherits s_sphinx::server {
        File["/etc/sphinxsearch/sphinx.make.twoo.conf"] {
            source => "puppet:///s_sphinx/twoo/sphinx.make.twoo.conf"
        }
        File["/etc/sphinxsearch/sphinx.make.dev.conf"] {
            source => "puppet:///s_sphinx/twoo/sphinx.make.dev.conf"
        }
    }
Slide 19: CLOUD / AWS/ ..?
‣ Experience with cloud services:
  ‣ Works well up to a certain level
  ‣ We have dedicated people in house to do server management
  ‣ Costs were much higher
  ‣ No overview of the “real” hardware
  ‣ “Fixed” kernel builds triggered errors
  ‣ No easy point of contact during downtime
Slide 20: GENERAL REMARKS
‣ Split up hardware among products (if one goes down ...)
‣ Split up across data centers
‣ Invest in good hardware => hardware is cheap compared to programming hours
‣ Try to split up DB / web / cache machines at an early stage
‣ Avoid manual changes
‣ Buy hardware designed for the purpose (memcache => lots of RAM, DB => lots of CPU, fast disks, ...)
Slide 21: SOFTWARE & TECHNOLOGIES
Slide 22: SOFTWARE
‣ Languages: PHP / Python / C++
‣ Ubuntu
‣ Apache / Nginx with php-fpm
‣ Evolution of version control systems
  ‣ CVS => SVN => Git
  ‣ Main benefits of using Git vs. SVN:
    ‣ Commits are fast
    ‣ Easy to set up branches
    ‣ Merging is less of a hassle
    ‣ Easier to use in a release workflow
    ‣ Takes less space on the server (history is stored as compressed, delta-packed objects)
Slide 23: APPLICATION SETUP
‣ New products (Twoo, Ekko, Kezoo, ...) code:
  ‣ <framework library>
  ‣ <application code>: code compatible with <framework> & following defined rules
‣ Netlog code:
  ‣ <framework library> (fixed port)
  ‣ <application code> compatible with framework (some parts have grown historically => rewrite takes time)
Slide 24: FRAMEWORK SETUP
‣ In-house written framework
‣ MVC pattern
‣ PEAR standard (all classes prefixed by “Core_”)
‣ Well documented & unit tested
‣ Framework structure:
  ‣ Io
    ‣ DB
    ‣ Memcache
    ‣ Redis / File ...
  ‣ Services
    ‣ Search
    ‣ Image
    ‣ Upload
    ‣ Mail
    ‣ Geo
    ‣ ...
  ‣ Util
    ‣ HttpRequest
    ‣ Validator ...
  ‣ Bundle
    ‣ Permissions
    ‣ LocationSelector
    ‣ DebugBar
    ‣ ...
  ‣ Controller
  ‣ View
Slide 25: APPLICATION CODE
‣ Defined rules
  ‣ Codewise (syntax / documentation / ...)
  ‣ But also rules programming wise:
    ‣ No code freewheeling when new people start working on projects
    ‣ Avoids code duplication
    ‣ Makes maintaining code easier
    ‣ Easier to know where to start when switching between projects
    ‣ Less error prone
Slide 26: APPLICATION CODE
‣ (Web) Application code philosophy (1)
  ‣ Keep it simple
  ‣ Code should not represent how
    ‣ the database is structured
    ‣ a view is built up => templates are used for that
  ‣ Code does represent the logic
    ‣ What data is put into / pulled from the data layer (through queries, ...)
    ‣ Which data is assigned to a view (template)
    ‣ What operations are done on the data before assigning
Slide 27: APPLICATION CODE
‣ (Web) Application code philosophy (2)
  ‣ In web / HTTP-centric stateless applications a “model” is never rightfully used
  ‣ There is no sense in replicating a storage model as long as it is queryable
  ‣ Code should be readable in a very “pseudo code” way in a central place => the Controller. Here it is easy to anticipate logical errors while maintaining code
  ‣ Avoid deep inheritance
  ‣ Avoid dynamic queries (dynamic joins, ...)
  ‣ Functions need to be simple & clear; make them do 1 thing
  ‣ Put the (distribution to function) logic in the controller
  ‣ Use type hinting for function arguments; return types should be documented
Slide 28: APPLICATION CODE
‣ Application layout
  ‣ Object: contains everything representing real objects (UserLocation, Query, ...)
  ‣ Structure: classes without functions, used to pass data around & for type enforcing
  ‣ Library: toolkit functions that don’t invoke datastore operations. Methods are called statically and are project specific only => the rest should be on framework level (TextParsingLibrary, PopularityLibrary, ...)
  ‣ Procedure: classes that access datastores, called statically. Used to organize functions in classes, e.g. UserProcedure::register();. Make use of structures in params or return types
  ‣ Controller: contains the real controlling functions, representing the entrance paths and reflecting the overall logic (User, Settings, Profile, ...)
  ‣ Exception: representation of exception classes used within the application.
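To make the layering concrete, here is a minimal sketch of how a Structure, a Library, a Procedure and a Controller could fit together. It is written in Python for brevity, not in the PHP of the real framework. TextParsingLibrary and UserProcedure::register() are named on the slide; UserStructure, UserController, normalize_nickname and the in-memory dict standing in for the datastore are assumptions made for illustration only.

```python
from dataclasses import dataclass


# Structure: plain data holder without behaviour, used for type enforcing.
@dataclass(frozen=True)
class UserStructure:
    user_id: int
    nickname: str


# Library: stateless toolkit functions that never touch a datastore.
class TextParsingLibrary:
    @staticmethod
    def normalize_nickname(raw: str) -> str:
        return raw.strip().lower()


# Procedure: the only layer that talks to a datastore (a dict stands in here).
class UserProcedure:
    _users: dict = {}

    @staticmethod
    def register(nickname: str) -> UserStructure:
        user = UserStructure(len(UserProcedure._users) + 1, nickname)
        UserProcedure._users[user.user_id] = user
        return user


# Controller: the entrance path; reads like pseudo code of the overall logic.
class UserController:
    @staticmethod
    def register(raw_nickname: str) -> UserStructure:
        nickname = TextParsingLibrary.normalize_nickname(raw_nickname)
        return UserProcedure.register(nickname)
```

Calling UserController.register("  Jayme ") would normalize the nickname in the Library and store the user through the Procedure, so the controller body stays a short, readable list of steps.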
Slide 29: TECHNOLOGIES
Slide 30: MYSQL
‣ The inevitable “make my db scale” problem
‣ How we handled read load initially: fragmentation by language and clustering parts of the application (Messages, Friends, Blogs, Photos, ...)
Slide 31: MYSQL
[Diagram, numbered steps 1-5: from a master (r/w) with read slaves (r), to “fragment by language” (en, fr, ..., each with its own master r/w and read slaves), to “fragment by language & type of usage” (dedicated read slaves for messages, search and music), ending in “error 1040: too many connections”]
Slide 32: MYSQL
‣ Solution: partition data horizontally, through sharding
‣ Formula: <userid> % <number of shards> == shardid where we store the data of the user
‣ In-house developed
‣ Positive about the system:
  ‣ Scales out easily
  ‣ Alters / selects / inserts go fast (small datasets)
  ‣ A failure only affects x % of users
  ‣ High growth potential in size
  ‣ Helped us to grow in crucial periods
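The modulo formula can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the in-house sharding system: the function names and the SHARD_HOSTS map with its host names are hypothetical.

```python
def shard_id_for_user(user_id: int, number_of_shards: int) -> int:
    """Return the shard that stores the data of the given user."""
    if number_of_shards <= 0:
        raise ValueError("number_of_shards must be positive")
    return user_id % number_of_shards


# Hypothetical shard map: shard id -> database host name.
SHARD_HOSTS = {
    0: "shard0.db.example",
    1: "shard1.db.example",
    2: "shard2.db.example",
}


def host_for_user(user_id: int) -> str:
    """Pick the database host that holds all rows of this user."""
    return SHARD_HOSTS[shard_id_for_user(user_id, len(SHARD_HOSTS))]
```

With a plain modulo, changing the number of shards moves most users to a different shard, so adding shards means migrating data or keeping an explicit userid-to-shardid lookup.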
Slide 33: MYSQL
‣ Sharding drawbacks
  ‣ Lookup for shardid is a point of failure
  ‣ You lose the overview of all the data
  ‣ Adds complexity to the code
  ‣ Maintenance costs++
  ‣ Harder to join data
  ‣ Not always perfectly balanced (users with lots of data / activity)
Slide 34: MYSQL
‣ Looking towards the future
‣ Better / more mature alternatives are already available now:
  ‣ MySQL Cluster
    ‣ Project matured a lot; back when we created our sharding system:
      ‣ no user-defined partitioning was available
      ‣ the cluster needed to be taken offline for alters
  ‣ Clustrix
    ‣ Commercial appliance / runs with top-level hardware
    ‣ Can be accessed / used as a normal DB (e.g. master/master replication with normal MySQL)
    ‣ Horizontal scaling built in
    ‣ Custom-written optimizer (rewrites selects, ...)
    ‣ Very fast write speed (100K inserts/sec in benchmarks)
    ‣ Running in test phase on one of our sites, with good results so far
  ‣ MongoDB cluster (document oriented) / full switch to NoSQL solutions
    ‣ Requires a lot of conceptual changes in application code
Slide 35: MEMCACHE
‣ In-memory key/value store
‣ Important to know
  ‣ Always supply a TTL to your keys, no infinite timeouts (in the long run memcache will become slower => more calculation, ...)
  ‣ Cache wisely => single place in your application (no cache of a cache etc.)
  ‣ Don’t overuse caching => DBs are still pretty fast when they can use indexes
  ‣ Add caching at a late stage, only cache things where needed + try to look at the cache efficiency of certain keys (every cache adds a layer of complexity)
  ‣ If needed use “cache revision numbers” when creating caches
    ‣ e.g. a user deletes a photo => bump the revision number

        // a user deletes a photo => bump the revision number
        $cacherev = $cache->incr("photos<$userid>");
        ...
        // somewhere else in the code (with the current $cacherev read from the cache):
        $cache->get("photos<$userid><$cacherev>");
        $cache->get("publicphotos<$userid><$cacherev>");

  ‣ Use the memcached extension, not the memcache extension
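The cache revision idea can be sketched as follows. This is an illustration in Python, not the production PHP code: FakeCache is a hypothetical in-memory stand-in for a memcached client, and the key names (photosrev<...>, photos<...>) and helper functions are assumptions.

```python
class FakeCache:
    """Tiny in-memory stand-in for a memcached client (assumption, not a real API)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, ttl):
        # A real client would expire the key after `ttl` seconds; ignored here.
        self._data[key] = value

    def incr(self, key):
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


def photos_key(cache, user_id):
    # The revision number is part of the key, so bumping it orphans old entries.
    revision = cache.get(f"photosrev<{user_id}>") or 0
    return f"photos<{user_id}><{revision}>"


def get_photos(cache, user_id, load_from_db):
    key = photos_key(cache, user_id)
    photos = cache.get(key)
    if photos is None:
        photos = load_from_db(user_id)
        cache.set(key, photos, ttl=3600)  # always supply a TTL
    return photos


def on_photo_deleted(cache, user_id):
    # One increment invalidates every cache entry keyed on this revision.
    cache.incr(f"photosrev<{user_id}>")
```

Entries stored under the old revision are never deleted explicitly; they simply stop being read and age out through their TTL. Real memcached only increments keys that already exist, so a real implementation would have to initialise the revision key first.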
Slide 36: SEARCH - SPHINX
‣ Full-text search server
‣ Comparable to Solr
‣ Written in C++
  ‣ our hardware is a better fit to run Sphinx
  ‣ we have more in-house knowledge
  ‣ full support on the Sphinx side (custom requests, etc.)
‣ How does it work?
  ‣ Creates an inverted index of your data / words => searches / lookups are very fast
  ‣ e.g.
    document 1: “I work at Netlog.”
    document 2: “Netlog is a Belgian company”

    word       keywordid
    Netlog     1
    Company    2

    keywordid  documentid
    1          1, 2
    2          2

‣ Currently used at:
  ‣ Netlog people search
  ‣ Two-way matching searches on Twoo
  ‣ Realtime results on Ekko
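The inverted index from the two example documents can be reproduced with a short Python sketch. It only illustrates the data structure; Sphinx itself adds keyword ids, ranking, on-disk storage and more. The function names are assumptions.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


documents = {
    1: "I work at Netlog.",
    2: "Netlog is a belgian company",
}
index = build_inverted_index(documents)
```

A lookup is a dictionary access per query word followed by a set intersection, which is why searches stay fast even when the documents themselves are large.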
Slide 37: BACKGROUNDJOB
‣ ParallelProcessor
  ‣ In-house built
  ‣ Used for realtime tasks => on-the-fly fetching of friends of friends, ...
  ‣ Splits up an array in slices to divide them among x number of threads
    ‣ 1) opens connections to x number of sockets (hostname: “cloud” ~ load balanced)
    ‣ 2) sends data to the sockets
    ‣ 3) client uses stream_select (select() system call) to check which socket has data available
    ‣ 4) client combines results
      ‣ can close sockets earlier when the timeout is reached ~ some cases where not 100% of the data is needed
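The slice, dispatch and combine steps can be sketched as below. Note the swap: the in-house ParallelProcessor sends slices over sockets to remote servers and waits with stream_select, while this sketch uses a local thread pool from Python's standard library to show the same control flow. All function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def slice_items(items, number_of_slices):
    """Split a list into at most `number_of_slices` contiguous, near-equal slices."""
    size, extra = divmod(len(items), number_of_slices)
    slices, start = [], 0
    for i in range(number_of_slices):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            slices.append(items[start:end])
        start = end
    return slices


def process_in_parallel(items, task, number_of_slices, timeout):
    """Run `task` on each slice and combine whatever finished before `timeout`."""
    pool = ThreadPoolExecutor(max_workers=number_of_slices)
    futures = [pool.submit(task, batch) for batch in slice_items(items, number_of_slices)]
    done, _not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on slices that missed the deadline
    combined = []
    for future in futures:  # keep the original slice order
        if future in done and future.exception() is None:
            combined.extend(future.result())
    return combined
```

Slices that miss the deadline are simply left out of the combined result, which matches the cases where not 100% of the data is needed.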
Slide 38: BACKGROUNDJOB
‣ ParallelProcessor
[Diagram: the client sends batch1, batch2, batch3, batch4, ... to separate servers; each server does its tasks on the batch & returns results as soon as finished; the client processes the returned results when finished or when <timeout> is reached]
Slide 39: BACKGROUNDJOB
‣ Gearman
  ‣ Job scheduler used for all things that can happen at a later moment / in the background.
  ‣ 1 central job server handles all the queueing of jobs
  ‣ Other servers have workers running that are connected to the job server and are waiting for a potential job in a loop
    ‣ 1) client sends a job to the job server
    ‣ 2) job server checks if there is a worker free to do the job
      ‣ If so: dispatches the job to the worker
      ‣ If not: adds the job to a queue & pushes it to the next worker whenever a job is done.
  ‣ We also use this in our cron code
    ‣ for example tasks that need to run minutely / hourly etc. => all placed in separate functions, each function is queued to Gearman when a minute / hour has passed ==> all jobs run on time & no worries that some code can’t be executed due to a script that fails / dies in the middle
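The job server / worker pattern can be illustrated with Python's standard library alone. This is not the Gearman API: JobServer, run_worker and start_workers are hypothetical names, and an in-process queue with threads stands in for the central job server and the workers on other machines.

```python
import queue
import threading


class JobServer:
    """In-memory stand-in for the central job server: it only queues jobs."""

    def __init__(self):
        self._jobs = queue.Queue()

    def submit(self, function_name, payload):
        # Background submit: the client continues immediately.
        self._jobs.put((function_name, payload))

    def next_job(self):
        # Blocks until a job is available, so idle workers simply wait.
        return self._jobs.get()

    def job_done(self):
        self._jobs.task_done()

    def wait_until_idle(self):
        self._jobs.join()


def run_worker(server, functions):
    """Worker loop: take a job, run the registered function, repeat forever."""
    while True:
        function_name, payload = server.next_job()
        try:
            functions[function_name](payload)
        except Exception:
            pass  # one failing job must not kill the worker or block others
        finally:
            server.job_done()


def start_workers(server, functions, count):
    for _ in range(count):
        threading.Thread(target=run_worker, args=(server, functions), daemon=True).start()
```

Because each job runs in isolation, a cron task that raises only loses that one job; the remaining queued tasks still run, which is the property the cron usage relies on.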
Slide 40: BACKGROUNDJOB
‣ Gearman
[Diagram: a client, the central Gearmand server and four worker threads]
  ‣ client: sends a job to Gearman, can a) wait for the return value or b) continue immediately
  ‣ Gearmand: central role in scheduling the jobs, keeps a queue in memory & sends out tasks to worker threads
  ‣ worker thread: registers itself with the gearmand server and continuously polls it for new tasks / jobs + executes them if needed. Runs infinitely
Slide 41: IN THE PIPELINE
‣ HandlerSocket
  ‣ Can only be used with the MySQL InnoDB engine
  ‣ Low-level interface (hack) for doing updates / inserts (key/value) to MySQL
  ‣ Very fast writes / still able to do normal selects
  ‣ Waiting for a more stable version & full adoption in the MySQL package
‣ Membase
  ‣ A bit unstable when we tested it
  ‣ Promising product, will be the next generation memcached with focus on lists / sets in memory
‣ Redis
  ‣ Has a lot of potential, currently used for debugging remote pages (Pub/Sub)
  ‣ Merging lists / sets in memory has a lot of use cases > online friends / users etc.
  ‣ Waiting for Redis Cluster before we can make full use of this
Slide 42: CASES
Slide 43: STATISTICS
‣ Case:
  ‣ Deliver realtime statistics of what is happening on the website
  ‣ We want a system where we can easily monitor the evolution of # unique logins / # photo uploads / # registrations, ... + it should be possible to segment on age / gender / location / browser / language / ...
    ‣ easily generate graphs for all common statistics
    ‣ segment those graphs in various ways on-the-fly
    ‣ dive into the data and examine things more closely
Slide 44: STATISTICS
‣ Attempted solution #1
  ‣ Create a logging table in MySQL and add a row to it each time an event takes place.
  ‣ Issues with this approach:
    ‣ structure needs to be decided in advance, since altering the table is practically impossible once it gets very large
    ‣ generating (segmented) graphs is slow as it will basically correspond to running SQL queries on a very big table
    ‣ diving into the data is slow too -- all of the raw logging data is there, but querying it gets infeasible in practice
Slide 45: STATISTICS
‣ Attempted solution #2
  ‣ Keep aggregated stats in a MySQL table and update them on-the-fly when events happen.
  ‣ This makes generating graphs fast, but:
    ‣ diving into the data is impossible since only the high-level aggregated stats are stored, not the individual logs
Slide 46: STATISTICS
‣ Hadoop to the rescue
  ‣ Hadoop = platform for sequential data processing in batch
    ‣ can reliably store lots of big files on a cluster of computers
    ‣ is able to process the data very efficiently as well
    ‣ the data doesn't have to be structured and can evolve
  ‣ HBase = Hadoop-based DB for random and real-time queries
    ‣ allows very big data sets to be queried very quickly
    ‣ low maintenance due to auto partitioning and redundancy
    ‣ runs on top of Hadoop, so doesn't require extra servers
    ‣ it's very easy to write to HBase from Hadoop jobs
Slide 47: STATISTICS
‣ Proper solution
  ‣ Combination of Hadoop and HBase:
    ‣ 1. store detailed log files on Hadoop (even for events you're not particularly interested in yet)
    ‣ 2. run batch jobs that precompute all common stats (and their segmentations) and store the results in HBase
  ‣ With this approach, we can:
    ‣ quickly generate (segmented) graphs by querying HBase
    ‣ dive into the data by running custom batch jobs on Hadoop
    ‣ keep lots of data around without worrying about schemas
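The batch precomputation in step 2 can be sketched as a map and a reduce step over raw log events. This is a plain Python illustration, not a Hadoop or Dumbo job: the event fields, SEGMENT_FIELDS and the in-memory Counter standing in for HBase are assumptions. It counts events; counting unique logins would additionally require deduplicating per user.

```python
from collections import Counter
from itertools import combinations

SEGMENT_FIELDS = ("age", "gender", "language")  # hypothetical segmentation fields


def map_event(event):
    """Map step: emit one key per event type and per subset of segment fields."""
    for size in range(len(SEGMENT_FIELDS) + 1):
        for fields in combinations(SEGMENT_FIELDS, size):
            segment = tuple((field, event[field]) for field in fields)
            yield (event["type"], event["date"], segment), 1


def precompute_stats(events):
    """Reduce step: sum the emitted counts per key."""
    stats = Counter()
    for event in events:
        for key, count in map_event(event):
            stats[key] += count
    return stats
```

Every segmented graph then becomes a single key lookup in the precomputed table, while the raw events remain available for custom batch jobs.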
Slide 48: STATISTICS
‣ Our Hadoop setup (1)
  ‣ Logs are gathered by syslog-ng
    ‣ easy to set up and seems to work OK so far
    ‣ but looking at more reliable alternatives such as Flume
  ‣ LZO compression is applied before uploading to Hadoop
    ‣ storage space is our main bottleneck
    ‣ makes jobs run faster too, as they tend to be IO-bound
  ‣ Hadoop cluster itself currently consists of 9 nodes
    ‣ started off with 6 recycled database servers
    ‣ now moving to 9 custom-tailored nodes that each have:
      ‣ 12 hyperthreaded cores (2 virtual cores per core, so 24 virtual cores)
      ‣ 24GB RAM
      ‣ 12 disks (8 x 3TB and 4 x 2TB = 32TB)
    ‣ so in total we have > quarter of a PB raw disk space (9 x 32TB = 288TB)
Slide 49: STATISTICS
‣ Our Hadoop setup (2)
  ‣ Accessing / Processing
    ‣ Thrift API through Python
    ‣ On the fly from HBase with Java
    ‣ Dumbo jobs with Python
      ‣ Every hour / day / month
      ‣ Summaries are stored in HBase for every combination of age, language, gender, date
    ‣ Custom queries with Hive => SQL interface on top of Hadoop
      ‣ When doing deep analysis
    ‣ Map/Reduce jobs in Hadoop if we want to dive into the data
Slide 50: STATISTICS
[Screenshots: example dashboard, example filtering]
Slide 51: STATISTICS
‣ Future plans
  ‣ Having all this valuable data at our fingertips also enables us to go beyond traditional analytics:
    ‣ social network analysis
    ‣ natural language processing
    ‣ ranking models
    ‣ topic models
    ‣ collaborative filtering
    ‣ ...
  ‣ In this way, Hadoop will also be used to directly power and improve features of our products.
Slide 52: RELEASE WORKFLOW
Slide 53: RELEASE WORKFLOW
‣ Previous way (svn)
  ‣ Every developer has his own dev environment on a remote dev box
  ‣ Code changes get rsynced on save
  ‣ After enough testing on the dev environment, code gets committed
  ‣ Every developer can put code live immediately on all the webs
    ‣ 1st on test webs and, after confirming that there are no errors, on the other webs
  ‣ Full deploys of code happen on a weekly basis (code / templates / frontend)
Positive: flexible & fast way of developing / putting things live, works well for small teams
Negative: bugs can go live much more easily & issues are hard to track down if there are ## deploys per day
Slide 54: RELEASE WORKFLOW
‣ New deploy / coding strategy:
  ‣ Developer runs all the code locally & commits / pushes changes to the develop or feature branch, once stable
  ‣ Git branches:
    ‣ Develop: all main development tasks (bugfixes, small refactors)
    ‣ Feature branches: things that involve more than 1 day of work / can be shared (feature-payments, feature-advancedsearch, ...)
    ‣ Release-candidate: created every evening / tested overnight and in the morning & goes live every day at noon.
    ‣ Master: version that is live / no development is done on this branch / only merges from rc to master
  ‣ Dedicated release manager who has ownership over the deploys
Note: critical bugs can be cherry-picked onto master & rc and deployed afterwards => limit this usage (only for urgent issues)
Slide 55: RELEASE WORKFLOW
‣ Benefits of new strategy
  ‣ Dedicated people for code release => ownership & responsibility
  ‣ Low impact on site visitors, fewer errors / bugs can be put live
  ‣ More time to test
  ‣ Easier to see what has gone live (a mailing is sent to everyone with the commit messages)
  ‣ Can go back to a previous release if something is wrong
‣ Future plans:
  ‣ Increase the release-candidate <> deploy timeframe to x days (right now: code not mature enough & daily releases are needed)
Slide 56: TESTING
‣ Unit testing
  ‣ Tests are written as PHPUnit classes for new code pieces & bugs (almost 70% code coverage)
  ‣ Database (table structure) differences between the production <> development environments are checked on a daily basis
  ‣ We use Jenkins to do automated testing in the background (after each push)
  ‣ Selenium for replaying browser actions
‣ Manual testing
  ‣ We have a QA engineer who tests new features & important parts of the sites daily / weekly
  ‣ Responsibility of developer++
Slide 57: JENKINS
Slide 58: JENKINS
Slide 59: REMARKS/ Q&A
Slide 60: QUESTIONS
‣ That’s all folks!
‣ Contact
  ‣ http://nl.netlog.com/jayme
  ‣ http://twoo.com/648281476
  ‣ http://be.linkedin.com/in/jaymerotsaert
  ‣ http://twitter.com/_jayme
  ‣ or email: <firstname> @ <lastname> .be