The company I work for creates applications for the Blackberry platform.
We’ve been working on a proprietary “analytics system” that lets us embed code within our applications so that they report some stats back to our central servers every time they’re run. Currently the system works OK; however, it’s only in beta with 100–200 hits per hour. The “hits” are sent to the servers without a problem, and we’ve built a very solid API to handle their acceptance and storage (in a MySQL DB). We’ve load-tested the system and should be able to accommodate hundreds of thousands of hits per hour without trouble. So that part isn’t really the problem.
The problem is showing the stats. We’ve built a display panel similar to Mint’s (haveamint.com); it shows the hits over each hour, the past days, weeks, months, years, etc. The first version ran straight queries, pulling data from the hits table and interpreting it on the fly. That didn’t work for very long. Our current solution is that the hits are “queued” for processing, and a cron comes through every 5 minutes, taking the hits and sorting them into “caches” for each hour, day, week, month, year, etc. This works amazingly well and is incredibly scalable; however, it only works for one timezone. Since the entire company has access to this, we’re dealing with a few hundred users in various timezones. What I define as “today” in San Jose is MUCH different from what my colleague in London defines as today. Since the current solution is cached to only one timezone, it’s a nightmare for anyone checking the data outside of ours.
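For context, the 5-minute cron rollup described above looks roughly like this (a minimal sketch; the schema and function name are invented, and SQLite stands in for the MySQL DB):

```python
import sqlite3  # stand-in for the MySQL connection; schema below is invented

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hit_queue (id INTEGER PRIMARY KEY, hit_time TEXT);
    CREATE TABLE cache_hour (hour TEXT PRIMARY KEY, hits INTEGER);
""")

def process_queue(conn):
    """Drain queued hits into the hourly cache, as the 5-minute cron would."""
    cur = conn.cursor()
    for hit_id, hit_time in cur.execute(
        "SELECT id, hit_time FROM hit_queue"
    ).fetchall():
        hour = hit_time[:13]  # truncate 'YYYY-MM-DD HH:MM:SS' to the hour
        cur.execute(
            "INSERT INTO cache_hour (hour, hits) VALUES (?, 1) "
            "ON CONFLICT(hour) DO UPDATE SET hits = hits + 1",
            (hour,),
        )
        cur.execute("DELETE FROM hit_queue WHERE id = ?", (hit_id,))
    conn.commit()
```

In production this would run as the cron job against the real queue table; the daily, weekly, monthly, and yearly caches would be filled the same way with coarser truncation.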
Our current plan to fix this is to create caches for every timezone (40 in total); however, that would mean multiplying the amount of data by 40. That seems terrible to me, and given that the caches can be very large, multiplying them just sounds like a bad idea; plus, when we go to process the queue, it will take a lot more CPU time to sort hits into 40 different caches.
Does anyone have a better idea of how to solve this problem?
(Sorry for such a long question; it’s not exactly easy to explain. Thanks, all!)
The solution you are proposing has too much redundancy. I would suggest you store the data in 30-minute buckets instead of hourly ones, with timestamps normalized to UTC.
With 30-minute buckets, if a user in a UTC-4:30 timezone requests hourly data for 1–2 PM local time, you can fetch the buckets for 5:30–6:30 PM UTC and show those. If you store data in one-hour increments, you can’t serve users in timezones with N + 0.5 hour offsets.
For daily numbers you would aggregate 48 half-hour slots, with the slots to pick determined by the user’s timezone.
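To make the slot selection concrete, here is a minimal Python sketch (the function name is invented) that lists the UTC bucket-start times covering one local calendar day at a fixed offset:

```python
from datetime import date, datetime, timedelta, timezone

def day_buckets(local_day, utc_offset_minutes, bucket_minutes=30):
    """UTC start times of the half-hour buckets covering one local day.

    utc_offset_minutes: the user's offset, e.g. -270 for UTC-4:30.
    """
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    # Local midnight of the requested day, then converted to UTC.
    start_local = datetime(local_day.year, local_day.month, local_day.day, tzinfo=tz)
    start_utc = start_local.astimezone(timezone.utc)
    step = timedelta(minutes=bucket_minutes)
    return [start_utc + i * step for i in range(24 * 60 // bucket_minutes)]
```

For UTC-4:30, for example, the local day starts at 04:30 UTC, so the 48 buckets run from 04:30 UTC through 04:00 UTC the next day.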
It gets interesting when you get to annual data, because you end up aggregating 17,520 half-hour buckets. To ease that computation, I would suggest you take the pre-aggregated annual total for UTC, then subtract the aggregate for the first 4.5 hours of the year and add the aggregate for the first 4.5 hours of the next year. This essentially shifts the whole year by 4.5 hours, and the extra work is not that much. You can tweak the system further from here.
EDIT: It turns out Kathmandu is UTC+5:45, so you would actually need to store the data in 15-minute buckets instead of 30-minute buckets.
EDIT 2: Another easy improvement is to the annual aggregation, so that you don’t have to add 17,520 buckets each time and don’t need one aggregate per timezone. Pre-aggregate the annual data from Jan 02 – Dec 30. Since no timezone is more than a day away from UTC, you can take that core total (Jan 02 – Dec 30) and add a few buckets before and after as appropriate. For example, for a UTC-5 timezone you would add all buckets on Jan 01 after 05:00, all buckets on Dec 31, and the buckets on Jan 01 of the following year up to 05:00.
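A sketch of that edge computation (the helper name is invented; it assumes the pre-aggregated core covers Jan 02 00:00 – Dec 31 00:00 UTC):

```python
from datetime import datetime, timedelta, timezone

def local_year_edges(year, utc_offset_minutes):
    """Return the two UTC intervals [start, end) whose buckets must be
    added on top of the pre-aggregated Jan 02 - Dec 30 core to cover
    the user's local year."""
    off = timedelta(minutes=utc_offset_minutes)
    # Local Jan 1 00:00 of this year and the next, expressed in UTC.
    year_start_utc = datetime(year, 1, 1, tzinfo=timezone.utc) - off
    year_end_utc = datetime(year + 1, 1, 1, tzinfo=timezone.utc) - off
    core_start = datetime(year, 1, 2, tzinfo=timezone.utc)
    core_end = datetime(year, 12, 31, tzinfo=timezone.utc)
    return (year_start_utc, core_start), (core_end, year_end_utc)
```

For UTC-5 this reproduces the example above: add Jan 01 from 05:00 UTC onward, all of Dec 31, and Jan 01 of the next year up to 05:00 UTC.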
When designing software that touches multiple timezones, I’d say to always store your date/times in UTC, with another field for the original timezone, and have a function that converts the time to and from UTC. You’ll save yourself a lot of trouble handling the different cases of day boundaries, daylight saving time, people looking at stats from a country on the other side of the earth, and so on.
In your case, keeping the caches in UTC and just converting each request to UTC should help. Don’t store a stat as being for “today”; store it for 00:00:00 UTC to 23:59:59 UTC, and when someone asks for today’s stats in New York, do the conversion.
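As a sketch of that conversion (assuming Python 3.9+’s zoneinfo module; the function name is made up):

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def utc_range_for_local_day(day, tz_name):
    """Map 'this calendar day in tz_name' to the half-open UTC interval
    [start, end) to run against UTC-keyed caches."""
    tz = ZoneInfo(tz_name)
    start = datetime.combine(day, time.min, tzinfo=tz)  # local midnight
    end = start + timedelta(days=1)                     # next local midnight
    return start.astimezone(timezone.utc), end.astimezone(timezone.utc)
```

Because the arithmetic is done in local wall-clock time before converting, daylight saving transitions fall out correctly: a 23- or 25-hour local day maps to a 23- or 25-hour UTC interval.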
As far as I can see, what you are looking for here is the storage part of a data warehouse system (your reports would be the front end).
Actually, the way commercial systems do it is with the cache you described: pre-aggregate your tables and create caches of them. The only way to accelerate your queries is to make the database system do less work for them. That means less data, which in turn means less time spent iterating over the data, or less data in the indices.
That said, I would propose the “40-cache solution” (are there really more than 24 timezones?). You should be able to trivially parallelize the sorting queue by creating copies of the data.
Another way to do this would be to cache at hour granularity (or 30 minutes, if your timezones require it) and then aggregate the hours into days. That way you cache at a finer granularity than your daily cache, but at a coarser granularity than the original data.
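A minimal sketch of that rollup, assuming the hourly cache is a mapping from UTC hour to hit count (names invented):

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

def roll_up_days(hour_cache, utc_offset_hours):
    """Aggregate a UTC hourly cache into per-day totals for one timezone.

    hour_cache: {datetime (UTC hour start): hit count}
    """
    days = defaultdict(int)
    for hour_utc, hits in hour_cache.items():
        # Shift each hour into the user's local time, then bin by local date.
        local = hour_utc + timedelta(hours=utc_offset_hours)
        days[local.date()] += hits
    return dict(days)
```

The same loop, re-run with a different offset, produces that timezone’s daily view from the single shared hourly cache.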
This kind of data is usually stored using round-robin (circular) databases. Check http://www.shinguz.ch/MySQL/mysql_20070223.html and http://techblog.tilllate.com/2008/06/22/round-robin-data-storage-in-mysql/ to see how they work and how to implement one in MySQL.