Given a site like StackOverflow, would it be better to create a num_comments column that stores how many comments a submission has and update it whenever a comment is made, or to just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant, but the former would be more efficient. What does SO think?
Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It’s slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but also a write lock on the row containing the comment count.
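To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (posts, comments, post_id) and the helper functions are assumptions for illustration; only num_comments comes from the question. Other databases will differ in locking behaviour and syntax.

```python
import sqlite3

# In-memory database with a hypothetical schema for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        title TEXT,
        num_comments INTEGER NOT NULL DEFAULT 0  -- the denormalized counter
    );
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL REFERENCES posts(id),
        body TEXT
    );
    -- An index on post_id keeps the COUNT query from scanning the whole table.
    CREATE INDEX idx_comments_post_id ON comments(post_id);
""")


def add_comment(conn, post_id, body):
    """Insert a comment and bump the stored counter in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO comments (post_id, body) VALUES (?, ?)",
            (post_id, body),
        )
        # This extra write is the cost the answer above is warning about.
        conn.execute(
            "UPDATE posts SET num_comments = num_comments + 1 WHERE id = ?",
            (post_id,),
        )


def count_comments(conn, post_id):
    """Normalized approach: count the rows at read time."""
    row = conn.execute(
        "SELECT COUNT(*) FROM comments WHERE post_id = ?", (post_id,)
    ).fetchone()
    return row[0]


def stored_count(conn, post_id):
    """Denormalized approach: read the maintained counter."""
    row = conn.execute(
        "SELECT num_comments FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return row[0]


conn.execute("INSERT INTO posts (id, title) VALUES (1, 'example')")
for text in ("first", "second", "third"):
    add_comment(conn, 1, text)
```

Both count_comments(conn, 1) and stored_count(conn, 1) should agree as long as every write goes through add_comment; any code path that inserts or deletes comments without touching the counter lets the two drift apart.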
The former is not normalized but will produce better performance (assuming many more reads than writes).
The latter is more normalized, but will require more resources and hence be less performant.
Which is better boils down to application requirements.
I would suggest counting comment records. Although the stored-count method would be faster, counting the records makes for a cleaner database. Adding a count column would be a form of data duplication, not to mention that it requires an additional code step and an extra write whenever a comment is inserted.
If you were to expect millions of comments, then you may want to pick the count column approach.
I agree with @Oded. It depends on the app requirements and also on how active the site is. That said, here are my two cents:
- I would try to avoid the extra writes, i.e. the UPDATEs to the post table (typically done by triggers) each time a new comment is added.
- If you are concerned about reporting the data then don’t do that on a transactional system. Create a reporting DB and update that periodically.
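For reference, the trigger-maintained counter mentioned in the first point could look roughly like the sketch below. It uses SQLite syntax through Python's sqlite3 module; the schema and the trigger names are hypothetical, and trigger syntax varies between database engines.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        num_comments INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL REFERENCES posts(id),
        body TEXT
    );

    -- Every comment insert now also writes to the posts row.
    CREATE TRIGGER comments_after_insert AFTER INSERT ON comments
    BEGIN
        UPDATE posts SET num_comments = num_comments + 1
        WHERE id = NEW.post_id;
    END;

    -- Deletes have to be handled too, or the counter drifts.
    CREATE TRIGGER comments_after_delete AFTER DELETE ON comments
    BEGIN
        UPDATE posts SET num_comments = num_comments - 1
        WHERE id = OLD.post_id;
    END;
""")


def num_comments(db, post_id):
    """Read the trigger-maintained counter for one post."""
    row = db.execute(
        "SELECT num_comments FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return row[0]


with db:
    db.execute("INSERT INTO posts (id) VALUES (1)")
    db.execute("INSERT INTO comments (post_id, body) VALUES (1, 'a')")
    db.execute("INSERT INTO comments (post_id, body) VALUES (1, 'b')")

with db:
    # Remove one comment; the delete trigger decrements the counter.
    db.execute("DELETE FROM comments WHERE body = 'a'")
```

The application code stays simple, but each comment write still touches two tables, which is exactly the overhead I would try to avoid on a busy transactional system.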
The “correct” way to design is to use another table, join it and COUNT. This is consistent with what database normalization teaches.
The problem with normalization is that it does not scale indefinitely. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, database performance will go through the floor, because the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are not stored in the same physical location, and a primary consequence of this is that you cannot JOIN anymore; how can you JOIN against half a table and receive meaningful results? And obviously, trying to JOIN against all partitions of a table and merge the results is going to be worse than the disease.
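As a rough illustration of what "query every partition and merge" means for the comment count, the sketch below simulates two shards with separate in-memory SQLite databases. The shard count, the choice of sharding by comment id, and all function names are assumptions made up for this example, not a description of any real deployment.

```python
import sqlite3

NUM_SHARDS = 2  # hypothetical; real systems use many more


def make_shard():
    """Create one simulated shard holding a slice of the comments table."""
    shard = sqlite3.connect(":memory:")
    shard.execute(
        "CREATE TABLE comments ("
        "id INTEGER PRIMARY KEY, post_id INTEGER NOT NULL, body TEXT)"
    )
    return shard


shards = [make_shard() for _ in range(NUM_SHARDS)]


def insert_comment(comment_id, post_id, body):
    """Route the row to a shard by comment id (an assumed sharding key)."""
    shard = shards[comment_id % NUM_SHARDS]
    with shard:
        shard.execute(
            "INSERT INTO comments (id, post_id, body) VALUES (?, ?, ?)",
            (comment_id, post_id, body),
        )


def count_comments_all_shards(post_id):
    """Scatter-gather: ask every shard, then merge the partial counts."""
    total = 0
    for shard in shards:
        (partial,) = shard.execute(
            "SELECT COUNT(*) FROM comments WHERE post_id = ?", (post_id,)
        ).fetchone()
        total += partial
    return total


for comment_id in range(1, 6):
    insert_comment(comment_id, post_id=7, body=f"comment {comment_id}")
insert_comment(6, post_id=8, body="other post")
```

Every read of the count now costs one query per shard, and a real JOIN would need the same fan-out plus merging of the joined rows. A stored num_comments value on the post row avoids that fan-out entirely.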
So you see that not only is the alternative you are considering used in practice to achieve high performance, but there are even more radical steps that engineers can and do take.
Of course, unless you do have performance issues, sharding or even de-normalizing is just making your life harder for no tangible benefit.