I am looking for the best way to retrieve the next and previous records of a record without running a full query. I have a fully implemented solution in place, and would like to know whether there are any better approaches to do this out there.
Let’s say we are building a web site for a fictitious greengrocer. In addition to his HTML pages, every week, he wants to publish a list of special offers on his site. He wants those offers to reside in an actual database table, and users have to be able to sort the offers in three ways.
Every item also has to have a detail page with more, textual information on the offer and “previous” and “next” buttons. The “previous” and “next” buttons need to point to the neighboring entries depending on the sorting the user had chosen for the list.
Obviously, the “next” button for “Tomatoes, Class I” has to be “Apples, Class I” in the first example, “Pears, Class I” in the second, and none in the third.
The task in the detail view is to determine the next and previous items without running a query every time, with the sort order of the list as the only available information. (Let’s say we get that through a GET parameter `?sort=offeroftheweek_price`, and ignore the security implications.)
Simply passing the IDs of the next and previous elements as parameters is the first solution that comes to mind. After all, we already know the IDs at this point. But this is not an option here: it would work in this simplified example, but not in many of my real-world use cases.
My current approach in my CMS is using something I have named “sorting cache”. When a list is loaded, I store the item positions in records in a table named `sortingcache`:

| name (VARCHAR) | items (TEXT) |
|---|---|
| offeroftheweek_unsorted | Lettuce; Tomatoes; Apples I; Apples II; Pears |
| offeroftheweek_price | Tomatoes; Pears; Apples I; Apples II; Lettuce |
| offeroftheweek_class_asc | Apples II; Lettuce; Apples; Pears; Tomatoes |

(The `items` column is really populated with numeric IDs.)
In the detail page, I now access the appropriate `sortingcache` record, fetch the `items` column, explode it, search for the current item ID, and return the previous and next neighbour.
array("current" => "Tomatoes", "next" => "Pears", "previous" => null );
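For readers who want to see the lookup spelled out, here is a minimal sketch of the explode-and-search step in Python. The function name `neighbours`, the semicolon separator, and the numeric IDs are assumptions made for illustration; this is not the actual CMS code.

```python
from typing import Dict, Optional


def neighbours(items: str, current_id: int) -> Dict[str, Optional[int]]:
    """Return the previous and next IDs around current_id in a cached ID list."""
    # "Explode" the stored string; int() tolerates spaces around each ID.
    ids = [int(part) for part in items.split(";") if part.strip()]
    pos = ids.index(current_id)  # raises ValueError if the ID is not cached
    return {
        "current": current_id,
        "previous": ids[pos - 1] if pos > 0 else None,
        "next": ids[pos + 1] if pos + 1 < len(ids) else None,
    }


# Hypothetical IDs: 1=Lettuce, 2=Tomatoes, 3=Apples I, 4=Apples II, 5=Pears.
price_order = "2;5;3;4;1"
result = neighbours(price_order, 2)
# result == {"current": 2, "previous": None, "next": 5}
```

The linear search is the part that gets expensive as the cached list grows, which is the limitation described in the next paragraph.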
This is obviously expensive, works for a limited number of records only and creates redundant data, but let’s assume that in the real world, the query to create the lists is very expensive (it is), running it in every detail view is out of the question, and some caching is needed.
Do you think this is a good practice to find out the neighbouring records for varying query orders?
Do you know better practices in terms of performance and simplicity? Do you know something that makes this completely obsolete?
In programming theory, is there a name for this problem?
Is the name “Sorting cache” appropriate and understandable for this technique?
Are there any recognized, common patterns to solve this problem? What are they called?
Note: My question is not about building the list, or how to display the detail view. Those are just examples. My question is the basic functionality of determining the neighbors of a record when a re-query is impossible, and the fastest and cheapest way to get there.
If something is unclear, please leave a comment and I will clarify.
Starting a bounty – maybe there is some more info on this out there.
Here is an idea. You could offload the expensive operations to an update that runs when the grocer inserts or updates offers, rather than when the end user selects the data to view. This may seem like a non-dynamic way to handle the sort data, but it may increase speed. And, as we know, there is always a trade-off between performance and other coding factors.
Create a table to hold next and previous for each offer and each sort option. (Alternatively, you could store this in the offer table if you will always have three sort options — query speed is a good reason to denormalize your database)
So you would have these columns:
- Sort Type (Unsorted, Price, Class and Price Desc)
- Offer ID
- Prev ID
- Next ID
When the detail information for the offer detail page is queried from the database, the NextID and PrevID would be part of the results. So you would only need one query for each detail page.
Each time an offer is inserted, updated or deleted, you would need to run a process which validates the integrity/accuracy of the sorttype table.
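A rough sketch of this idea, using Python and SQLite only so that it runs as-is. The table names `offers` and `offer_neighbours`, the column names, and the three ORDER BY clauses are assumptions for illustration, not a prescribed schema.

```python
import sqlite3

# Hypothetical sort types mapped to ORDER BY clauses.
SORTS = {
    "unsorted": "id",
    "price": "price DESC",
    "class_price": "quality_class, price DESC",
}


def rebuild_neighbours(db):
    """Recompute prev/next for every offer and sort type (run after each write)."""
    db.execute("DELETE FROM offer_neighbours")
    for sort_type, order_by in SORTS.items():
        # order_by comes from the fixed SORTS mapping, never from user input.
        ids = [r[0] for r in db.execute(f"SELECT id FROM offers ORDER BY {order_by}")]
        for i, offer_id in enumerate(ids):
            prev_id = ids[i - 1] if i > 0 else None
            next_id = ids[i + 1] if i + 1 < len(ids) else None
            db.execute(
                "INSERT INTO offer_neighbours VALUES (?, ?, ?, ?)",
                (sort_type, offer_id, prev_id, next_id),
            )


def offer_detail(db, offer_id, sort_type):
    """Fetch the offer and its neighbours in a single query."""
    return db.execute(
        "SELECT o.name, n.prev_id, n.next_id FROM offers o "
        "JOIN offer_neighbours n ON n.offer_id = o.id "
        "WHERE o.id = ? AND n.sort_type = ?",
        (offer_id, sort_type),
    ).fetchone()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE offers (id INTEGER PRIMARY KEY, name TEXT, "
    "quality_class INTEGER, price REAL)"
)
db.execute(
    "CREATE TABLE offer_neighbours (sort_type TEXT, offer_id INTEGER, "
    "prev_id INTEGER, next_id INTEGER, PRIMARY KEY (sort_type, offer_id))"
)
db.executemany(
    "INSERT INTO offers VALUES (?, ?, ?, ?)",
    [(1, "Lettuce", 2, 0.89), (2, "Tomatoes", 1, 1.50), (3, "Apples", 1, 1.10),
     (4, "Apples", 2, 0.95), (5, "Pears", 1, 1.25)],
)
rebuild_neighbours(db)
```

Here `rebuild_neighbours` rewrites the whole table for simplicity; a real process could limit itself to the rows around the changed offer.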
I have an idea somewhat similar to Jessica’s. However, instead of storing links to the next and previous sort items, you store the sort order for each sort type. To find the previous or next record, just get the row with SortX = currentSort - 1 or SortX = currentSort + 1.
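| Type | Class | Price | Sort1 | Sort2 | Sort3 |
|---|---|---|---|---|---|
| Lettuce | 2 | 0.89 | 0 | 4 | 0 |
| Tomatoes | 1 | 1.50 | 1 | 0 | 4 |
| Apples | 1 | 1.10 | 2 | 2 | 2 |
| Apples | 2 | 0.95 | 3 | 3 | 1 |
| Pears | 1 | 1.25 | 4 | 1 | 3 |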
This solution would yield very short query times, and would take up less disk space than Jessica’s idea. However, as I’m sure you realize, the cost of updating one row of data is notably higher, since you have to recalculate and store all sort orders. But still, depending on your situation, if data updates are rare and especially if they always happen in bulk, then this solution might be the best.
    once_per_day
        add/delete/update all records
        recalculate sort orders
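To make the sort-position idea concrete, here is a small sketch in Python with SQLite. The column names `sort_price` and `sort_name` and the two orderings are assumptions chosen for the example; the real table would have one column per sort type.

```python
import sqlite3

# Hypothetical mapping: sort column -> the ORDER BY it materializes.
SORT_COLUMNS = {"sort_price": "price DESC", "sort_name": "name"}


def recalculate_sort_orders(db):
    """Renumber every sort column from 0; meant to run after a bulk update."""
    for column, order_by in SORT_COLUMNS.items():
        ids = [r[0] for r in db.execute(f"SELECT id FROM offers ORDER BY {order_by}")]
        db.executemany(
            f"UPDATE offers SET {column} = ? WHERE id = ?",
            [(pos, offer_id) for pos, offer_id in enumerate(ids)],
        )


def neighbour(db, column, offer_id, step):
    """Return the name at position current + step (step is +1 or -1), or None."""
    if column not in SORT_COLUMNS:  # whitelist, since the name is interpolated
        raise ValueError(column)
    row = db.execute(
        f"SELECT n.name FROM offers c JOIN offers n ON n.{column} = c.{column} + ? "
        "WHERE c.id = ?",
        (step, offer_id),
    ).fetchone()
    return row[0] if row else None


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE offers (id INTEGER PRIMARY KEY, name TEXT, price REAL, "
    "sort_price INTEGER, sort_name INTEGER)"
)
db.executemany(
    "INSERT INTO offers (id, name, price) VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)
recalculate_sort_orders(db)
```

With an index on each sort column, the neighbour lookup is a single indexed read, which is where the short query times come from.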
Hope this is useful.
I’ve had nightmares with this one as well. Your current approach seems to be the best solution even for lists of 10k items: cache the IDs of the list view in the HTTP session and then use that list to display the previous/next links (personalized to the current user). This works well, especially when there are too many ways to filter and sort the initial list of items instead of just 3.
Also, by storing the whole ID list you get to display a "you are at X out of Y" usability-enhancing text.
By the way, this is what JIRA does as well.
To directly answer your questions:
- Yes, it’s good practice because it scales without any added code complexity when your filter/sorting and item types grow more complex. I’m using it in a production system with 250k articles with “infinite” filter/sort variations. Trimming the cacheable IDs to 1000 is also a possibility, since the user will most probably never click on prev or next more than 500 times (he’ll most probably go back and refine the search or paginate).
- I don’t know of a better way. But if the sorts were limited and this was a public site (with no HTTP session), then I’d most probably denormalize.
- Yes, sorting cache sounds good. In my project I call it “previous/next on search results” or “navigation on search results”.
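As a sketch of what the session-based variant might look like: the plain `dict` below stands in for whatever session store the application uses, and the key `result_ids` and both function names are made up for this example.

```python
def remember_list(session, ids, limit=1000):
    """Store the ordered result IDs of the list view in the user's session."""
    session["result_ids"] = list(ids[:limit])  # trim to keep the session small


def navigation(session, current_id):
    """Return prev/next IDs plus the data for a 'you are at X out of Y' text."""
    ids = session.get("result_ids", [])
    if current_id not in ids:
        return {"previous": None, "next": None, "position": None, "total": len(ids)}
    pos = ids.index(current_id)
    return {
        "previous": ids[pos - 1] if pos > 0 else None,
        "next": ids[pos + 1] if pos + 1 < len(ids) else None,
        "position": pos + 1,  # 1-based, for display
        "total": len(ids),
    }


session = {}
remember_list(session, [7, 3, 9])
nav = navigation(session, 3)
# nav == {"previous": 7, "next": 9, "position": 2, "total": 3}
```

Because the list lives in the session, every user keeps the exact order and filters they last saw, at the cost of one stored list per user.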
I always cache the lists of results, and the results themselves separately. If anything affects the results of a list query, the cache of the list results is refreshed. If anything affects the results themselves, those particular results are refreshed. This allows me to update either one without having to regenerate everything, resulting in effective caching.
Since my lists of results rarely change, I generate all the lists at the same time. This may make the initial response slightly slower, but it simplifies cache refreshing (all the lists get stored in a single cache entry).
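Roughly, the separation could look like the sketch below. The class name `OfferCache` and the loader callables are hypothetical; in-process dictionaries stand in for a real cache backend.

```python
class OfferCache:
    """Caches ordered ID lists and item records separately."""

    def __init__(self, sorts, load_list, load_item):
        self.sorts = sorts          # all known sort keys
        self.load_list = load_list  # callable: sort key -> ordered list of IDs
        self.load_item = load_item  # callable: item ID -> item data
        self.lists = None           # every list lives in one cache entry
        self.items = {}

    def ids(self, sort):
        if self.lists is None:
            # Generate all lists together so one refresh covers every sort.
            self.lists = {s: self.load_list(s) for s in self.sorts}
        return self.lists[sort]

    def item(self, item_id):
        if item_id not in self.items:
            self.items[item_id] = self.load_item(item_id)
        return self.items[item_id]

    def list_changed(self):
        """Call when something affects list membership or ordering."""
        self.lists = None

    def item_changed(self, item_id):
        """Call when only one item's own data changed."""
        self.items.pop(item_id, None)


calls = {"list": 0, "item": 0}


def load_list(sort):
    calls["list"] += 1
    return [2, 5, 3] if sort == "price" else [1, 2, 3]


def load_item(item_id):
    calls["item"] += 1
    return {"id": item_id}


cache = OfferCache(["price", "unsorted"], load_list, load_item)
```

Editing one item's description then only calls `item_changed`, and the cached lists stay untouched.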
To answer your questions specifically:
- Yes, it’s a fantastic idea to find out the neighbours ahead of time, or whatever information the client is likely to access next, especially if the cost is low now and the cost to recalculate is high. Then it’s simply a trade-off of extra pre-calculation and storage versus speed.
- In terms of performance and simplicity, avoid tying together things that are logically different. Indexes and data are different, are likely to be changed at different times (e.g. adding a new datum will affect the indexes, but not the existing data), and thus should be accessed separately. This may be slightly less efficient from a single-threaded standpoint, but every time you tie something together, you lose caching effectiveness and asynchronicity (the key to scaling is asynchronicity).
- The term for getting data ahead of time is pre-fetching. Pre-fetching can happen at the time of access or in the background, but before the pre-fetched data is actually needed. Likewise with pre-calculation. It’s a trade-off of cost now, storage cost, and cost to get when needed.
- “Sorting cache” is an apt name.
- I don’t know.
Also, when you cache things, cache them at the most generic level possible. Some stuff might be user specific (such as results for a search query), where others might be user agnostic, such as browsing a catalog. Both can benefit from caching. The catalog query might be frequent and save a little each time, and the search query may be expensive and save a lot a few times.
I’m not sure whether I understood right, so if not, just tell me 😉
Let’s say that the givens are the query for the sorted list and the current offset in that list, i.e. we have a `$query` and an `$n`.
A very obvious solution to minimize the queries would be to fetch all the data at once:
list($prev, $current, $next) = DB::q($query . ' LIMIT ?i, 3', $n - 1)->fetchAll(PDO::FETCH_NUM);
That statement fetches the previous, the current and the next elements from the database in the current sorting order and puts the associated information into the corresponding variables.
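One thing the one-liner glosses over is the first row, where `$n - 1` would become a negative offset. Below is a hedged sketch of the same three-row window in Python with SQLite; the `window` helper and the sample `offers` table are assumptions for the example, and the query still runs on every detail view, so it only helps when the ordering itself is cheap.

```python
import sqlite3


def window(db, query, n):
    """Fetch (previous, current, next) rows around 0-based offset n of query."""
    offset = max(n - 1, 0)            # the first row has no predecessor
    count = 3 if n > 0 else 2
    rows = db.execute(f"{query} LIMIT ? OFFSET ?", (count, offset)).fetchall()
    if n == 0:
        rows.insert(0, None)          # pad the missing previous row
    rows += [None] * (3 - len(rows))  # pad a missing next (or current) row
    return tuple(rows)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE offers (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
db.executemany(
    "INSERT INTO offers VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)
query = "SELECT id, name FROM offers ORDER BY price DESC"
prev_row, current_row, next_row = window(db, query, 0)
```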
But as this solution is too simple, I assume I misunderstood something.
There are as many ways to do this as to skin the proverbial cat. So here are a couple of mine.
If your original query is expensive, which you say it is, then create another table, possibly a memory table, and populate it with the results of your expensive and seldom-run main query.
This second table could then be queried on every view and the sorting is as simple as setting the appropriate sort order.
As required, repopulate the second table with results from the first table, thus keeping the data fresh while minimising the use of the expensive query.
Alternatively, if you want to avoid even connecting to the DB, you could keep all the data in a PHP array and store it using memcached. This would be very fast and, provided your lists aren’t too huge, resource efficient, and the array can be easily sorted.
- Specials are weekly
- We can expect the site to change infrequently… probably daily?
- We can control updates to the database with either an API or triggers
If the site changes on a daily basis, I suggest that all the pages are statically generated overnight. One query for each sort-order iterates through and makes all the related pages. Even if there are dynamic elements, odds are that you can address them by including the static page elements. This would provide optimal page service and no database load. In fact, you could possibly generate separate pages and prev / next elements that are included into the pages. This may be crazier with 200 ways to sort, but with 3 I’m a big fan of it.
    ?sort=price
    include(/sorts/$sort/tomatoes_class_1)
    /* tomatoes_class_1 is probably a numeric id; sanitize your sort key... use numerics? */
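A small sketch of the overnight generation step, written in Python so it can be run directly. The directory layout under `sorts/`, the `/offer/<id>` URL scheme, and the whitelist contents are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

ALLOWED_SORTS = {"unsorted", "price"}  # whitelist for the ?sort= key


def generate_fragments(root, orders):
    """Write one prev/next fragment per sort order and item ID."""
    for sort, ids in orders.items():
        folder = root / "sorts" / sort
        folder.mkdir(parents=True, exist_ok=True)
        for i, item_id in enumerate(ids):
            links = []
            if i > 0:
                links.append(f'<a href="/offer/{ids[i - 1]}?sort={sort}">previous</a>')
            if i + 1 < len(ids):
                links.append(f'<a href="/offer/{ids[i + 1]}?sort={sort}">next</a>')
            (folder / f"{item_id}.html").write_text(" ".join(links))


def fragment(root, sort, item_id):
    """Return the pre-built fragment; reject sort keys that are not whitelisted."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unknown sort: {sort}")
    return (root / "sorts" / sort / f"{int(item_id)}.html").read_text()


root = Path(tempfile.mkdtemp())
generate_fragments(root, {"unsorted": [1, 2, 3, 4, 5], "price": [2, 5, 3, 4, 1]})
```

Serving a detail page then only needs `fragment(root, sort, item_id)`, with no database access for the navigation.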
If for some reason this isn’t feasible, I’d resort to memoization. Memcache is popular for this sort of thing (pun!). When something is pushed to the database, you can issue a trigger to update your cache with the correct values. Do this the same way you would if your updated item existed in 3 linked lists: relink as appropriate (this.next.prev = this.prev, etc). From that, as long as your cache doesn’t overfill, you’ll be pulling simple values from memory in a primary key fashion.
This method will take some extra coding on the select and update / insert methods, but it should be fairly minimal. In the end, you’ll be looking up `[id of tomatoes class 1].price.next`. If that key is in your cache, golden. If not, insert into cache and display.
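To illustrate the relinking, here is a sketch that uses a plain dictionary as a stand-in for a memcached client. The key format `<id>.<sort>.next` follows the lookup described above; the helper names are invented for the example.

```python
cache = {}  # stand-in for a memcached client: plain get/set by key


def link(sort, ids):
    """Store prev/next pointers for one sort order as flat cache keys."""
    for i, item_id in enumerate(ids):
        cache[f"{item_id}.{sort}.prev"] = ids[i - 1] if i > 0 else None
        cache[f"{item_id}.{sort}.next"] = ids[i + 1] if i + 1 < len(ids) else None


def unlink(sort, item_id):
    """Remove an item from the list by relinking its two neighbours."""
    prev_id = cache.pop(f"{item_id}.{sort}.prev")
    next_id = cache.pop(f"{item_id}.{sort}.next")
    if prev_id is not None:
        cache[f"{prev_id}.{sort}.next"] = next_id
    if next_id is not None:
        cache[f"{next_id}.{sort}.prev"] = prev_id


def insert_after(sort, prev_id, item_id):
    """Splice item_id into the list directly after prev_id."""
    next_id = cache[f"{prev_id}.{sort}.next"]
    cache[f"{item_id}.{sort}.prev"] = prev_id
    cache[f"{item_id}.{sort}.next"] = next_id
    cache[f"{prev_id}.{sort}.next"] = item_id
    if next_id is not None:
        cache[f"{next_id}.{sort}.prev"] = item_id


link("price", [2, 5, 3, 4, 1])  # hypothetical IDs in price order
unlink("price", 5)              # item 5 changed price: take it out...
insert_after("price", 3, 5)     # ...and put it back at its new position
```

A price change touches only the handful of keys around the old and new positions, rather than the whole list.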
- Do you think this is a good practice to find out the neighboring records for varying query orders? Yes. It is wise to perform look-aheads on expected upcoming requests.
- Do you know better practices in terms of performance and simplicity? Do you know something that makes this completely obsolete? Hopefully the above
- In programming theory, is there a name for this problem? Optimization?
- Is the name “Sorting cache” appropriate and understandable for this technique? I’m not sure of a specific appropriate name. It is caching, it is a cache of sorts, but I’m not sure that telling me you have a “sorting cache” would convey instant understanding.
- Are there any recognized, common patterns to solve this problem? What are they called? Caching?
Sorry, my trailing answers are kind of useless, but I think my narrative solutions should be quite useful.
The problem / data structure is called a bi-directional graph, or you could say you’ve got several linked lists.
If you think of it as a linked list, you could just add fields to the items table for every sorting and prev / next key. But the DB Person will kill you for that, it’s like GOTO.
If you think of it as a (bi-)directional graph, you go with Jessica’s answer. The main problem there is that order updates are expensive operations.
| Item | Next | Prev |
|---|---|---|
| A | B | - |
| B | C | A |
| C | D | B |
| ... | | |
If you change one item’s position so that the new order is A, C, B, D, you will have to update 4 rows.
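The row count can be checked with a few lines of Python. This sketch assumes a list of just the four items A to D and compares the Next/Prev pairs before and after the move.

```python
def links(order):
    """Map each item to its (next, prev) pair for the given order."""
    return {
        item: (
            order[i + 1] if i + 1 < len(order) else None,
            order[i - 1] if i > 0 else None,
        )
        for i, item in enumerate(order)
    }


def changed_rows(old_order, new_order):
    """List the items whose Next or Prev value differs between the two orders."""
    old, new = links(old_order), links(new_order)
    return sorted(item for item in new if old.get(item) != new[item])


rows = changed_rows(["A", "B", "C", "D"], ["A", "C", "B", "D"])
# rows == ['A', 'B', 'C', 'D']: every one of the four rows needs an update
```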
Apologies if I have misunderstood, but I think you want to retain the ordered list between user accesses to the server. If so, your answer may well lie in your caching strategy and technologies rather than in database query/ schema optimization.
My approach would be to serialize() the array once it’s first retrieved, and then cache that in a separate storage area, whether that’s memcached, APC, the hard drive, MongoDB, etc., and retain its cache location details for each user individually through their session data. The actual storage backend would naturally depend on the size of the array, which you don’t go into much detail about, but memcached scales great over multiple servers, and Mongo even further at a slightly greater latency cost.
You also don’t indicate how many sort permutations there are in the real world; e.g. do you need to cache separate lists per user, or can you globally cache per sort permutation and then filter out what you don’t need via PHP? In the example you give, I’d simply cache both permutations and store which of the two I needed to unserialize() in the session data.
When the user returns to the site, check the Time To Live value of the cached data and re-use it if still valid. I’d also have a trigger running on INSERT/ UPDATE/ DELETE for the special offers that simply sets a timestamp field in a separate table. This would immediately indicate whether the cache was stale and the query needed to be re-run for a very low query cost. The great thing about only using the trigger to set a single field is that there’s no need to worry about pruning old/ redundant values out of that table.
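Here is a sketch of the staleness check, in Python with SQLite so that the triggers actually run. One deliberate substitution: the triggers bump an integer version counter instead of writing a timestamp, which avoids clock-resolution issues in a short example; the table `offers_meta`, the trigger names, and the in-process `cache` dictionary are all assumptions. The Time To Live check is left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE offers (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE offers_meta (version INTEGER NOT NULL);
INSERT INTO offers_meta VALUES (0);
CREATE TRIGGER offers_ins AFTER INSERT ON offers
  BEGIN UPDATE offers_meta SET version = version + 1; END;
CREATE TRIGGER offers_upd AFTER UPDATE ON offers
  BEGIN UPDATE offers_meta SET version = version + 1; END;
CREATE TRIGGER offers_del AFTER DELETE ON offers
  BEGIN UPDATE offers_meta SET version = version + 1; END;
""")

SORTS = {"price": "price DESC", "unsorted": "id"}  # hypothetical sort keys
cache = {}  # stand-in for memcached/APC: sort key -> (version, ordered IDs)


def current_version():
    """One cheap single-row query that tells us whether anything changed."""
    return db.execute("SELECT version FROM offers_meta").fetchone()[0]


def sorted_ids(sort):
    """Return the cached ID list, re-running the list query only when stale."""
    version = current_version()
    entry = cache.get(sort)
    if entry is None or entry[0] != version:
        ids = [r[0] for r in db.execute(f"SELECT id FROM offers ORDER BY {SORTS[sort]}")]
        cache[sort] = entry = (version, ids)
    return entry[1]


db.executemany(
    "INSERT INTO offers VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (5, "Pears", 1.25)],
)
first = sorted_ids("price")
db.execute("UPDATE offers SET price = 2.00 WHERE id = 1")
second = sorted_ids("price")
```

Since the triggers only ever overwrite a single field, the meta table never grows and needs no pruning.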
Whether this is suitable would depend upon the size of the data being returned, how frequently it was modified, and what caching technologies are available on your server.
So you have two tasks:
- build sorted list of items (SELECTs with different ORDER BY)
- show details about each item (SELECT details from database with possible caching).
What is the problem?
PS: If the ordered list may be too big, you just need pager functionality. There could be different implementations, e.g. you may wish to add “LIMIT 5” to the query and provide a “Show next 5” button. When this button is pressed, a condition like “WHERE price < 0.89 LIMIT 5” is added.
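The “Show next 5” idea could be sketched like this in Python with SQLite. It assumes the list is sorted by descending price (so that `price < ?` moves forward) and adds the row ID as a tie-breaker, which the original condition does not mention; the function and table names are hypothetical.

```python
import sqlite3


def next_page(db, last_price=None, last_id=None, size=5):
    """Return the next `size` offers in descending price order."""
    if last_price is None:  # first page: no lower bound yet
        return db.execute(
            "SELECT id, name, price FROM offers ORDER BY price DESC, id LIMIT ?",
            (size,),
        ).fetchall()
    # Continue strictly after the last row shown; id breaks ties on equal prices.
    return db.execute(
        "SELECT id, name, price FROM offers "
        "WHERE price < ? OR (price = ? AND id > ?) "
        "ORDER BY price DESC, id LIMIT ?",
        (last_price, last_price, last_id, size),
    ).fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE offers (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
db.executemany(
    "INSERT INTO offers VALUES (?, ?, ?)",
    [(1, "Lettuce", 0.89), (2, "Tomatoes", 1.50), (3, "Apples I", 1.10),
     (4, "Apples II", 0.95), (5, "Pears", 1.25)],
)
page1 = next_page(db, size=2)
last_id, _, last_price = page1[-1]
page2 = next_page(db, last_price, last_id, size=2)
```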