I was finally convinced to put my smaller tables into one large one, but exactly how big is too big for a MySQL table?
I have a table with 18 fields. Some are TEXT, some are short VARCHAR(16), others are longer VARCHARs.
Right now we get about 200,000 rows a day, which would be 6 million+ a month. How big is too big? Does it matter how many fields you have, or just rows?
There’s not a great general solution to the question “How big is too big” – such concerns are frequently dependent on what you’re doing with your data and what your performance considerations are.
There are some fundamental limits on table sizes. You can't have more than about 1000 columns per table, and your records can't be bigger than about 8 KB each. These limits vary depending on the storage engine. (The ones here are for InnoDB.)
It sounds like you’ve merged several different data sets into one table. You probably have some fields that tell you what data set this record pertains to, along with some data fields, and some timestamp information. That’s not a very wide record (unless you’re logging, say, all the input parameters of each request.) Your main problem will be with selectivity. Indexing this table in a meaningful way will be a challenge. If your common fields can be selective enough that you can use them to get to the records you want without consulting the table, that will be a huge plus. (Cf. table scan)
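As a sketch of the covering-index idea, assuming a merged logging table roughly like the one described (all table and column names here are hypothetical, not from the question):

```sql
-- Hypothetical merged logging table; names are illustrative only.
CREATE TABLE event_log (
    id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    source    VARCHAR(16) NOT NULL,   -- which data set this row came from
    logged_at DATETIME    NOT NULL,
    status    VARCHAR(16) NOT NULL,
    payload   TEXT
) ENGINE=InnoDB;

-- A composite index on the selective lookup columns lets queries such as
--   SELECT source, logged_at, status FROM event_log
--   WHERE source = 'api' AND logged_at >= '2009-06-01';
-- be answered from the index alone (a "covering" index),
-- avoiding both a table scan and per-row lookups into the table.
CREATE INDEX ix_source_time_status ON event_log (source, logged_at, status);
```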
For that many records per day (basically, two a second all day, and I’m presuming you have a peak-load period where it’s much higher), you’ll also want to make sure that you specifically look at optimizations on improving insertion speed. As a general rule, more indexes = slower insertions. If you can, consider archiving off outdated records to another table entirely. In prior workplaces, we’ve used an archival strategy of Last Month, Prior Three Months, Prior Six Months, each in separate tables. Another idea is to delete older records. Many environments simply don’t need information beyond a certain date. Hanging on to logging records from three months ago is often overly expensive.
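The archiving strategy above can be sketched like this (table names are invented; on InnoDB you would want this inside a transaction, and in any case test it off the live system first):

```sql
-- Move records older than 90 days into an archive table, then remove them.
-- Assumes event_log_archive has the same structure as event_log.
INSERT INTO event_log_archive
SELECT * FROM event_log
WHERE logged_at < NOW() - INTERVAL 90 DAY;

DELETE FROM event_log
WHERE logged_at < NOW() - INTERVAL 90 DAY;
```

For the pure-deletion variant, just run the DELETE on its own, ideally in batches during off-peak hours so the table isn't locked for long stretches.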
Finally, don’t neglect the physical storage of your table. The thinner your records are, the less physical IO needs to occur to read (or for that matter, to insert) a record. You can store your indexes on a separate physical hard drive. If there’s a lot of redundant data in your records storing the table compressed might actually be a speed increase. If you have a little cash to burn, consider the value of a good RAID array for striping your data.
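As one concrete example of compressed storage, InnoDB supports a compressed row format (on older MySQL versions this requires innodb_file_per_table and the Barracuda file format; the table name is hypothetical):

```sql
-- Store the table compressed; trades CPU for less physical IO.
ALTER TABLE event_log
    ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=8;  -- compressed page size in KB
```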
So, to answer your basic question: it’s a lot of records, but with a careful eye towards tuning, it won’t be a problem.
I think it depends, basically. Which version of MySQL are you using, which OS, and are you using MyISAM or InnoDB tables? It's different on 32-bit and 64-bit systems too, and it varies with your configuration settings. The MySQL manual says:
"The effective maximum table size for MySQL databases is usually determined by operating system constraints on file sizes, not by MySQL internal limits."

There's more detail on what those limits are on that page too.
I have a table with ~98M rows and inserts/deletes occur all day long. We keep records for 90 days… I expect this table to be ~100M rows this month. Personally, I would have designed the database schema differently, but it was purchased and we need to keep it intact so that we do not void any vendor support.
We're using MySQL replication (master-master), performing the inserts/deletes on one server and the queries on the other. This has really helped with performance, as the deletes would lock the table and block queries before we changed to using replication.
We’re not experiencing any performance issues using this implementation.
I also perform a table optimize once a week…
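The weekly maintenance mentioned above presumably amounts to something like this (table name is a placeholder):

```sql
-- Rebuilds the table and its indexes, reclaiming space left by deletes.
-- This can lock the table, so schedule it during low-traffic hours.
OPTIMIZE TABLE my_big_table;
```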
The choice of how many columns to put in a single table also depends on the type of data being represented and how much you care about normalization. Some relationships can easily be represented by one table; others need to be done in multiple smaller tables, especially when you have a mix of one-to-one, one-to-many, and many-to-many type relationships in your dataset.
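For instance, a many-to-many relationship can't be represented in one table without duplication; the usual fix is a junction table (a minimal sketch, all names hypothetical):

```sql
CREATE TABLE author (
    author_id INT NOT NULL PRIMARY KEY,
    name      VARCHAR(64) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE book (
    book_id INT NOT NULL PRIMARY KEY,
    title   VARCHAR(128) NOT NULL
) ENGINE=InnoDB;

-- One row per (author, book) pair; this is what makes the
-- many-to-many relationship representable without duplication.
CREATE TABLE author_book (
    author_id INT NOT NULL,
    book_id   INT NOT NULL,
    PRIMARY KEY (author_id, book_id),
    FOREIGN KEY (author_id) REFERENCES author (author_id),
    FOREIGN KEY (book_id)   REFERENCES book (book_id)
) ENGINE=InnoDB;
```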
Not an answer to the exact question…
Why were you convinced to put your smaller tables into one large one?
What you were doing is called “Vertical Partitioning” and can actually be very useful, depending on your situation. With many large TEXT or BLOB fields, a vertical partition can keep your more queried data physically together and faster to access.
Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns. Normalization also involves splitting columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized. Different physical storage might be used to realize vertical partitioning as well; storing infrequently used or very wide columns on a different device, for example, is a method of vertical partitioning. Done explicitly or implicitly, this type of partitioning is called "row splitting" (the row is split by its columns).

A common form of vertical partitioning is to split (slow-to-find) dynamic data from (fast-to-find) static data in a table where the dynamic data is not used as often as the static data. Creating a view across the two newly created tables restores the original table with a performance penalty; however, performance will increase when accessing the static data, e.g. for statistical analysis.
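A minimal sketch of that row-splitting pattern, with a view restoring the original shape (all names are invented for illustration):

```sql
-- Static, frequently queried columns stay in a narrow table...
CREATE TABLE product_static (
    product_id INT NOT NULL PRIMARY KEY,
    name       VARCHAR(64) NOT NULL,
    category   VARCHAR(32) NOT NULL
) ENGINE=InnoDB;

-- ...while wide or rarely used columns move to a companion table.
CREATE TABLE product_details (
    product_id  INT NOT NULL PRIMARY KEY,
    description TEXT,
    spec_sheet  BLOB,
    FOREIGN KEY (product_id) REFERENCES product_static (product_id)
) ENGINE=InnoDB;

-- A view joining the two restores the original table shape,
-- at the cost of a join on every access through the view.
CREATE VIEW product AS
SELECT s.product_id, s.name, s.category, d.description, d.spec_sheet
FROM product_static s
LEFT JOIN product_details d ON d.product_id = s.product_id;
```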
Consider what you need to do with the table. If the table is purely for archiving, you will probably never need to change its structure. If you need it for data mining, you should expect to change its structure at some point. Try, for example, doing an ALTER TABLE on a copy of it now. Expect that operation to drop in performance once you reach a level where temporary tables are too big to be stored in memory.
I have been in the same situation, where the amount of data made me unable to modify the structure of the database. What you should do RIGHT NOW is ask someone to create a database on a machine (e.g. an EC2 instance) with the amount of data you expect to have in two years. Just have them create bogus data in the same table format. Try working with this table and decide whether the performance is acceptable. If it is not, you need to change things as soon as possible.
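One cheap way to generate that bogus data is to seed a table and repeatedly double it with a self-insert (a rough sketch; the table is hypothetical, and you would substitute your real schema and more varied values):

```sql
-- Hypothetical load-test table.
CREATE TABLE load_test (
    id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    val VARCHAR(16) NOT NULL
) ENGINE=InnoDB;

INSERT INTO load_test (val) VALUES ('seed');

-- Each run inserts one copy of every existing row, doubling the table;
-- repeat until you reach the projected two-year volume.
INSERT INTO load_test (val)
SELECT val FROM load_test;
```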
If I were you, I would consider testing Greenplum, or GridSQL if you do not have the money to spend. Both are based on PostgreSQL and use many computers working together.