
PHP fastest way to register millions of records in MYSQL

Posted by: admin July 12, 2020

Questions:

I have to register millions of page views in my DB, and I’m looking for the best solution to decrease the server load.

1. Current solution: check whether the view is unique, then register it in a “raw” table and an “optimized” table

// script
// has this IP already been logged today?
$checkUnique = mysqli_query( $con, "SELECT id FROM rawTable
         WHERE datetime = '$today' AND ip = '$ip'
         ORDER BY datetime DESC LIMIT 1" );
// always log the raw view
mysqli_query( $con, "INSERT INTO rawTable ( id, datetime, url, ip, ua )
         VALUES ( NULL, '$now', '$url', '$ip', '$ua' )" );
if( mysqli_num_rows( $checkUnique ) == 0 ) {
    // ip not seen today: add a row to the daily counter
    mysqli_query( $con, "INSERT INTO optimizedTable ( id, day, total )
                         VALUES ( NULL, '$today', 1 )" );
}
else {
    // ip already seen today: bump today's counter
    mysqli_query( $con, "UPDATE optimizedTable SET total = total + 1
            WHERE day = '$today' ORDER BY day DESC LIMIT 1" );
}

2. Register the views only in the “raw” table, and then populate the “optimized” table with a cronjob

// script
mysqli_query( $con, "INSERT INTO rawTable ( id, datetime, url, ip, ua, alreadyOptimized )
         VALUES ( NULL, '$now', '$url', '$ip', '$ua', 0 )" );

// cronjob -> check if is unique, populate mysql tables +
//         change column alreadyOptimized from 0 to 1 in raw table

3. Register the raw views in a txt or csv file, and then populate the MySQL tables with a cronjob

// script
$file = fopen("file.txt", "a");   // append mode, so earlier views are not overwritten
fwrite($file, "$now,$url,$ip,$ua\n");
fclose($file);

// cronjob -> check if is unique, populate mysql tables + delete rows from txt/csv file

What is the best (lightest and fastest) way? Are there any better solutions?

PS: The server load is caused by the SELECT query that checks whether the views are unique

Answers:

Manually selecting to check whether a record exists is the worst thing you can do, because it can (and will) produce false results: between your SELECT and your INSERT there is a window in which another process can insert the same record. The only proper way is to place a UNIQUE constraint on the table and simply INSERT. That’s the only way to be 100% certain your DB won’t contain duplicates.

The reason this is interesting for your use case is that it cuts your code down by 50%. You don’t have to SELECT first, therefore you get rid of a huge bottleneck.

Use INSERT IGNORE or INSERT INTO .. ON DUPLICATE KEY UPDATE if you need to update the existing record.
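
For example, a minimal sketch of the logging step rewritten this way (it assumes rawTable has been given a UNIQUE key over whatever columns define a "unique" view):

// sketch: no SELECT needed, the UNIQUE key decides
mysqli_query( $con, "INSERT IGNORE INTO rawTable ( id, datetime, url, ip, ua )
         VALUES ( NULL, '$now', '$url', '$ip', '$ua' )" );
$isNew = ( mysqli_affected_rows( $con ) == 1 );   // 1 = inserted, 0 = duplicate was ignored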

Your unique constraint would be a compound index on the (datetime, ip) columns. To optimize this even further, you can create a BINARY(20) column in your table and have it contain a SHA1 hash of the datetime + ip combination. Using a trigger, you can compute the hash before inserting, making the whole process invisible to the person actually inserting into the table.
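
A sketch of that idea in DDL (the dedup_hash column name and the trigger name are placeholders, not from the question):

-- add a hash column and make it the unique key
ALTER TABLE rawTable
    ADD COLUMN dedup_hash BINARY(20),
    ADD UNIQUE KEY uq_dedup ( dedup_hash );

-- fill the hash automatically on every insert
CREATE TRIGGER rawTable_before_insert BEFORE INSERT ON rawTable
FOR EACH ROW
    SET NEW.dedup_hash = UNHEX( SHA1( CONCAT( NEW.datetime, '|', NEW.ip ) ) );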

If the insert fails, the record already exists. If the insert succeeds, you’ve done what you wanted. With no SELECT involved, performance should improve. After that, if it’s still slow, it’s simply the I/O limit of the server you use, and you need to look for optimizations at the hardware level.

Answer:

None of the answers given so far are anywhere close to “fastest”.

A single IODKU (INSERT .. ON DUPLICATE KEY UPDATE ..) replaces all of the steps given. However, it is unclear what the PRIMARY KEY should be. Some hint at a “date” + IP, some hint at a “datetime” + IP. But what if the user uses two different browsers ($ua) from the same IP? Or comes from a different page ($url)?
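
For the daily counter, for example, one statement of that shape could replace both the INSERT and the UPDATE branch from the question (a sketch; it assumes optimizedTable has a PRIMARY or UNIQUE key on day):

mysqli_query( $con, "INSERT INTO optimizedTable ( day, total )
         VALUES ( '$today', 1 )
         ON DUPLICATE KEY UPDATE total = total + 1" );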

Chunk the data as a way to avoid system impact. That is, do not process one row at a time. And do not throw a million rows at the table all at once. The former is sloooow — typically ten times as slow as some form of batching. The latter will have severe impact on the target table.

If you suddenly have a million-row batch of values to insert/increment, preprocess it. That is, boil it down to counts per unique key before updating the real data. This decreases the impact on the real table, though it possibly has some overall “system” impact. But, furthermore, chunk the data — say 1000 rows at a time — to copy into the real table. More on Chunking.
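
A sketch of the preprocessing step in PHP (it assumes the incoming batch is already in an array $batch and that day + ip is the chosen unique key):

// sketch: boil a large batch down to one count per unique key before touching the real table
$counts = [];
foreach ( $batch as $row ) {
    $key = $row['day'] . '|' . $row['ip'];
    $counts[$key] = ( $counts[$key] ?? 0 ) + 1;
}
// then apply $counts to the real table in chunks of ~1000 keys per statement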

If you have hundreds or thousands (but not millions) of ‘rows’ coming in every second, then there are a couple of options. First, do they all come from a single source? Or are they coming from multiple clients?

From a single source — Gather a thousand rows, combine them, then build a single IODKU to do them all. (Note how to use the VALUES pseudo-function.)
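
A sketch of such a combined statement, assuming the gathered rows have already been reduced to per-day counts in an array like $dailyCounts = [ '2020-07-12' => 830, ... ]:

// sketch: one IODKU for the whole gathered batch
$values = [];
foreach ( $dailyCounts as $day => $count ) {
    $values[] = "( '" . mysqli_real_escape_string( $con, $day ) . "', " . (int) $count . " )";
}
mysqli_query( $con, "INSERT INTO optimizedTable ( day, total )
         VALUES " . implode( ", ", $values ) . "
         ON DUPLICATE KEY UPDATE total = total + VALUES(total)" );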

From multiple sources — ping-pong a pair of tables. Collect raw info in one table from all the clients while another thread drains the other table, chunk by chunk, into the real table. Then that thread flips the tables with a single, atomic RENAME TABLE; the clients will be oblivious to it. More on high speed ingestion.
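
A sketch of the flip (the staging table names here are placeholders):

-- clients write into `staging`; the processing thread swaps in an empty copy, atomically
CREATE TABLE staging_new LIKE staging;
RENAME TABLE staging TO staging_old, staging_new TO staging;
-- now drain staging_old into the real table in chunks, then DROP TABLE staging_old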

Meanwhile, you should normalize at least the $ua values, since they are bulky and highly repetitious. That last link shows a 2-SQL method for efficient bulk normalization.
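
The general shape of that 2-SQL normalization, assuming a lookup table uaList ( id AUTO_INCREMENT PRIMARY KEY, ua ... UNIQUE ) and a staging table holding the raw rows (both names are assumptions):

-- 1) add any user agents not yet in the lookup table
INSERT IGNORE INTO uaList ( ua )
    SELECT DISTINCT ua FROM staging;

-- 2) replace the bulky string with its id in the staging rows
UPDATE staging
    JOIN uaList USING ( ua )
    SET staging.ua_id = uaList.id;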

Another note: The target table should have the “unique” key used by IODKU as its PRIMARY KEY. If you currently have an AUTO_INCREMENT, move it out of the way by turning it into an INDEX instead of the PRIMARY KEY. (Yes, that does work.) The rationale is to make the UPDATE part of IODKU faster by not going through a secondary key, and by not having a second UNIQUE key to check.
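
For example, if optimizedTable currently has id INT AUTO_INCREMENT as its PRIMARY KEY and day as the IODKU key, the swap could look like this (a sketch, assumed schema):

ALTER TABLE optimizedTable
    DROP PRIMARY KEY,
    ADD PRIMARY KEY ( day ),
    ADD INDEX ( id );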