What do I have to do to make 20k mysql inserts per second possible during peak hours (around 1k/sec during slower times)? I’ve been doing some research and I’ve seen the “INSERT DELAYED” suggestion, writing to a flat file with “fopen(file,’a’)” and then running a cron job to dump the “needed” data into mysql, etc. I’ve also heard you need multiple servers and “load balancers”, which I’ve never heard of, to make something like this work. I’ve also been looking at these “cloud server” thing-a-ma-jigs and their automatic scalability, but I’m not sure what’s actually scalable.
The application is just a tracker script, so if I have 100 websites that each get 3 million page loads a day, there will be around 300 million inserts a day. The data will be run through a script every 15-30 minutes which will normalize it and insert it into another mysql table.
How do the big dogs do it? How do the little dogs do it? I can’t afford a huge server anymore, so if there are multiple ways of going at it that you smart people can think of, please let me know 🙂
How do the big dogs do it?
Multiple servers. Load balancing.
How do the little dogs do it?
Multiple servers. Load balancing.
You really want to save up inserts and push them to the database in bulk. 20k individual inserts a second is a crapton of overhead, and simplifying that down to one big insert each second eliminates most of that.
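To make the batching idea concrete, here is a minimal sketch in Python (the thread is PHP/MySQL, but the pattern is identical). An in-memory SQLite database stands in for MySQL; with MySQL you would use a driver such as PyMySQL and the same `executemany()` call, which sends all buffered rows in one statement instead of 20k round trips.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the table and column names
# are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (site TEXT, url TEXT, ts INTEGER)")

# Buffer individual page-load events in memory...
buffer = [("site1", "/index", 1), ("site1", "/about", 2), ("site2", "/", 3)]

# ...then flush them to the database in a single bulk insert.
conn.executemany("INSERT INTO hits (site, url, ts) VALUES (?, ?, ?)", buffer)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(count)  # 3 rows inserted in one round trip
```

The win comes from paying the per-statement overhead (parsing, network round trip, transaction commit) once per batch rather than once per row.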
A couple of ways:
Firstly, you will reach a point where you need to partition or shard your data to split it across multiple servers. This could be as simple as A-C on server1, D-F on server2 and so on.
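A sketch of that A-C/D-F routing, with a hypothetical shard map (the server names and letter ranges are made up):

```python
# Hypothetical shard map: route each tracked site to a server by the
# first letter of its name (A-C -> server1, D-F -> server2, ...).
SHARDS = [("abc", "server1"), ("def", "server2"), ("ghi", "server3")]

def shard_for(site: str) -> str:
    first = site[0].lower()
    for letters, server in SHARDS:
        if first in letters:
            return server
    return "server-default"  # catch-all for everything else

print(shard_for("example.com"))  # server2 ('e' falls in d-f)
print(shard_for("acme.org"))     # server1 ('a' falls in a-c)
```

In practice you would hash the key instead of using first letters, so the load spreads evenly, but the routing idea is the same.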
Secondly, defer writing to the database. Instead write to a fast memory store using either beanstalkd or memcached directly. Have another process collect those states and write aggregated data to the database. Periodically amalgamate those records into summary data.
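A toy version of that deferral, sketched in Python with a plain in-process counter standing in for memcached/beanstalkd: the request path only increments a counter, and a separate collector drains everything periodically into aggregated rows.

```python
from collections import Counter

# Stand-in for memcached: each page load just bumps a counter in memory;
# no database work happens on the request path.
counters = Counter()

def track(site: str, url: str) -> None:
    counters[(site, url)] += 1

# Simulated traffic.
for _ in range(3):
    track("site1", "/index")
track("site2", "/about")

# The collector runs periodically, drains the counters, and produces one
# aggregated row per (site, url) instead of one row per hit.
def flush():
    rows = [(site, url, n) for (site, url), n in counters.items()]
    counters.clear()
    return rows  # these would go into MySQL in one bulk INSERT

summary = flush()
print(sorted(summary))
```

With memcached you would use its atomic increment operation for the same effect across many PHP workers; the counter names here are illustrative.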
That’s impressive. Most of my data has come from massive inserts done all at once. One thing that I find is that bulk inserts do a lot better than individual inserts. Also, the design of your tables, indexes, etc. has a lot to do with insert speed. The problem with using cron and bulk inserting is the edge cases (what happens while it’s doing the inserts).
Additionally, with flat files you can easily run into concurrency issues when writing the inserts to the file. If you are writing 1k+ inserts a second, you’ll quickly run into lots of conflicts and data loss when there are problems with the file writes.
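If you do go the flat-file route, you at least need an exclusive lock around each append so concurrent writers don’t interleave partial lines. A minimal Unix-only sketch in Python (PHP’s `flock()` works the same way); the path and line format are made up:

```python
import fcntl
import os
import tempfile

# Hypothetical spool file that many tracker workers append to concurrently.
path = os.path.join(tempfile.mkdtemp(), "hits.log")

def append_hit(line: str) -> None:
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # serialize writers: whole line or nothing
        try:
            f.write(line + "\n")
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

append_hit("site1 /index")
append_hit("site2 /about")

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```

Even with locking, you still need to handle the file rotation carefully when the cron job swaps it out for the bulk load.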
This is not a problem you can handle in PHP alone.
If you have 20,000 requests a second hitting your “low-budget” server (as I understood from the undertone of your question), then it will reach its limit before most of them even reach the PHP processor (and, eventually, MySQL).
If you have a traffic tracker script, you’ll very likely cause problems for all the sites you track too.
PHP is not well-suited to high-volume web traffic IMHO. However, the database will likely bog you down before PHP performance does – especially with PHP’s connection model (it opens a new connection for every request).
I have two suggestions for you:
- Give SQL Relay a look: http://sqlrelay.sourceforge.net/
- Check out some PHP accelerators: http://en.wikipedia.org/wiki/List_of_PHP_accelerators
SQL Relay effectively allows PHP to take advantage of connection pooling, and that will give much better performance for a high-volume database application.
PHP accelerators (generally speaking) cache the PHP opcodes, which saves the overhead of interpreting the PHP code with every request.
I’d recommend memcaching, too.
Write your data into a memcache and have a periodically running job aggregate it and do the inserts.
Writing to an actual file would probably DECREASE your performance, since file system access is mostly slower than talking to a database, which can handle write access much more efficiently.
Writing to a file is great, but you still need to synchronize your file writes which puts you back to square one.
- MQ system, though sometimes the DB can be faster,
- On the MQ idea: in-memory queue. I know you said PHP, but I’ve seen this done quite well in Java/Servlets,
- Depending on what it is you’re tracking, you can deploy a static file into a CDN (the cloud thing you talked about) and aggregate the access logs in batch. Allows you to rent scaling out,
- INSERT DELAYED is a good idea, but I don’t know what the backlog/queue size is for that in MySQL (anyone?)
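On the CDN idea above: the batch job boils down to parsing access logs and counting hits per resource, then inserting only the totals. A minimal sketch in Python, with made-up log lines of the form “client path status”:

```python
from collections import Counter

# Assumed minimal log format: "client path status" per line (illustrative).
log_lines = [
    "1.2.3.4 /pixel.gif 200",
    "5.6.7.8 /pixel.gif 200",
    "1.2.3.4 /pixel.gif 200",
    "9.9.9.9 /other.gif 404",
]

# Batch job: count successful hits per path, then insert the totals
# as a handful of summary rows instead of one row per page load.
hits = Counter()
for line in log_lines:
    client, path, status = line.split()
    if status == "200":
        hits[path] += 1

print(dict(hits))  # {'/pixel.gif': 3}
```

Real CDN logs have richer formats (timestamps, referrers, user agents), but the aggregate-then-insert shape is the same.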
Since you’re tracking impressions, what if you try only saving, say, one in every 5? Then you still have a reasonably “random” sample, and you can just apply the percentages to the bigger dataset.
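The sampling idea in miniature, as a Python sketch. A deterministic every-5th rule keeps the example reproducible; a random coin flip per hit works just as well and avoids any periodicity in the traffic lining up with the sample rate.

```python
# 1-in-5 sampling: keep every 5th hit and scale the stored counts by 5.
SAMPLE_RATE = 5

hits = range(1000)  # 1000 simulated page loads
sampled = [h for h in hits if h % SAMPLE_RATE == 0]

# Only len(sampled) rows ever reach the database; multiply back out
# when reporting.
estimated_total = len(sampled) * SAMPLE_RATE

print(len(sampled), estimated_total)  # 200 1000
```

This cuts the insert rate by 5x at the cost of a small sampling error on low-traffic pages.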