I see a lot of statements like: “Cassandra very fast on writes”, “Cassandra has reads really slower than writes, but much faster than Mysql”
On my windows7 system:
I installed Mysql of default configuration.
I installed PHP5 of default configuration.
I installed Casssandra of default configuration.
Making simple write test on mysql: “INSERT INTO
title) VALUES (‘id01′,’test’)” gives me result: 0.0002(s)
For 1000 inserts: 0.1106(s)
Making simple same write test on Cassandra: $column_faily->insert(‘id01’,array(‘title’=>’test’)) gives me result of: 0.005(s)
For 1000 inserts: 1.047(s)
For reads tests i also got that Cassandra is much slower than mysql.
So the question, does this sounds correct that i have 5ms for one write operation on Cassadra? Or something is wrong and should be at least 0.5ms.
When people say “Cassandra is faster than MySQL”, they mean when you are dealing with terabytes of data and many simultaneous users. Cassandra (and many distributed NoSQL databases) is optimized for hundreds of simultaneous readers and writers on many nodes, as opposed to MySQL (and other relational DBs) which are optimized to be really fast on a single node, but tend to fall to pieces when you try to scale them across multiple nodes. There is a generalization of this trade-off by the way- the absolute fastest disk I/O is plain old UNIX flat files, and many latency-sensitive financial applications use them for that reason.
If you are building the next Facebook, you want something like Cassandra because a single MySQL box is never going to stand up to the punishment of thousands of simultaneous reads and writes, whereas with Cassandra you can scale out to hundreds of data nodes and handle that load easily. See scaling up vs. scaling out.
Another use case is when you need to apply a lot of batch processing power to terabytes or petabytes of data. Cassandra or HBase are great because they are integrated with MapReduce, allowing you to run your processing on the data nodes. With MySQL, you’d need to extract the data and spray it out across a grid of processing nodes, which would consume a lot of network bandwidth and entail a lot of unneeded complication.
Cassandra benefits greatly from parallelisation and batching. Try doing 1 million inserts on each of 100 threads (each with their own connection & in batches of 100) and see which ones is faster.
Finally, Cassandra insert performance should be relatively stable (maintaining high throughput for a very long time). With MySQL, you will find that it tails off rather dramatically once the btrees used for the indexes grow too large memory.
It’s likely that the maturity of the MySQL drivers, especially the improved MySQL drivers in PHP 5.3, is having some impact on the tests. It’s also entirely possible that the simplicity of the data in your query is impacting the results – maybe on 100 value inserts, Cassandra becomes faster.
Try the same test from the command line and see what the timestamps are, then try with varying numbers of values. You can’t do a single test and base your decision on that.
Many user space factors can impact write performance. Such as:
- Dozens of settings in each of the database server’s configuration.
- The table structure and settings.
- The connection settings.
- The query settings.
Are you swallowing warnings or exceptions? The MySQL sample would on face value be expected to produce a duplicate key error. It could be failing while doing nothing at all. What Cassandra might do in the same case isn’t something I’m familiar with.
My limited experience of Cassandra tell me one thing about inserts, while performance of everything else degrades as data grows, inserts appear to maintain the same speed. How fast it is compared to MySQL however isn’t something I’ve tested.
It might not be so much that inserts are fast but rather tries to be never slow. If you want a more meaningful test you need to incorporate concurrency and more variations on scenario such as large data sets, various batch sizes, etc. More complex tests might test latency for availability of data post insert and read speed over time.
It would not surprise me if Cassandra’s first port of call for inserting data is to put it on a queue or to simply append. This is configurable if you look at consistency level. MySQL similarly allows you to balance performance and reliability/availability though each will have variations on what they allow and don’t allow.
Outside of that unless you get into the internals it may be hard to tell why one performs better than the other.
I did some benchmarks of a use case I had for Cassandra a while ago. For the benchmark it would insert tens of thousands of rows first. I had to make the script sleep for a few seconds because otherwise queries run after the fact would not see the data and the results would be inconsistent between implementations I was testing.
If you really want fast inserts, append to a file on ramdisk.