
Intelligent MySQL GROUP BY for Activity Streams

Posted by: admin, November 30, 2017

Questions:

I’m building an activity stream for our site, and have made some decent headway with something that works pretty well.

It’s powered by two tables:

stream:

  • id – Unique Stream Item ID
  • user_id – ID of the user who created the stream item
  • object_type – Type of object (currently ‘seller’ or ‘product’)
  • object_id – Internal ID of the object (currently either the seller ID or the product ID)
  • action_name – The action taken against the object (currently either ‘buy’ or ‘heart’)
  • stream_date – Timestamp that the action was created.
  • hidden – Boolean indicating whether the user has chosen to hide the item.

follows:

  • id – Unique Follow ID
  • user_id – The ID of the user initiating the ‘Follow’ action.
  • following_user – The ID of the user being followed.
  • followed – Timestamp that the follow action was executed.

Currently I’m using the following query to pull content from the database:

Query:

SELECT stream.*,
   COUNT(stream.id) AS rows_in_group,
   GROUP_CONCAT(stream.id) AS in_collection
FROM stream
INNER JOIN follows ON stream.user_id = follows.following_user
WHERE follows.user_id = '1'
  AND stream.hidden = '0'
GROUP BY stream.user_id,
     stream.action_name,
     stream.object_type,
     date(stream.stream_date)
ORDER BY stream.stream_date DESC;

This query actually works pretty well. With a little PHP to parse the data that MySQL returns, we can create a nice activity stream in which actions of the same type by the same user are grouped together if the time between the actions isn’t too great (the query groups by calendar day; see the example below).

[Image: Current Stream Output Example]

My question is: how do I make this smarter? Currently it groups along one axis, “user” activity. When there are multiple items by a particular user within a certain timeframe, MySQL knows to group them.

How can I make this even smarter and group along another axis, such as “object_id”, so that multiple actions on the same object in sequence are grouped too, while keeping the logic we currently have for grouping actions/objects by user? And can this be implemented without data duplication?

Example of multiple objects appearing in sequence:

[Image: Multiple Objects Appearing in Sequence]

I understand that solutions to problems like this can get very complex very quickly, but I’m wondering if there’s an elegant and fairly simple solution to this, hopefully in MySQL.

Answers:

My impression is you need to group by user, as you do, but also, after that grouping, by action.

It looks to me like you need a subquery like this:

SELECT grouped.*, -- or whatever columns
   SUM(grouped.actions_in_user_group) AS total_rows_in_group,
   GROUP_CONCAT(grouped.actions_in_user_collection) AS complete_collection
   FROM
     ( SELECT stream.*, -- or whatever columns
          COUNT(stream.id) AS actions_in_user_group,
          GROUP_CONCAT(stream.id) AS actions_in_user_collection
       FROM stream
       INNER JOIN follows
       ON stream.user_id = follows.following_user
       WHERE follows.user_id = '1'
         AND stream.hidden = '0'
       GROUP BY stream.user_id,
            date(stream.stream_date)
     ) AS grouped -- MySQL requires an alias on a derived table
   GROUP BY grouped.object_id,
            date(grouped.stream_date)
   ORDER BY grouped.stream_date DESC;

Your initial query (now the inner one) groups by user, but then the user groups are regrouped by identical actions – that is, identical products bought or sales from one seller would be put together.

Answers:

Some observations about your desired results:

Some of the items are aggregated (Jack Sprat hearted seven sellers) and others are itemized (Lord Nelson chartered the Golden Hind). You probably need to have a UNION in your query that pulls together these two classes of items from two separate subqueries.
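One possible reading of the UNION suggestion, sketched in Python with an in-memory SQLite database standing in for MySQL. The sample rows, the `kind` column and the `first_object_id` alias are assumptions made for illustration, and the join against follows is left out to keep the sketch short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stream (id INTEGER PRIMARY KEY, user_id INTEGER, object_type TEXT,
                         object_id INTEGER, action_name TEXT, stream_date TEXT);
    INSERT INTO stream (user_id, object_type, object_id, action_name, stream_date) VALUES
        (2, 'seller', 10, 'heart', '2017-11-30 09:00:00'),
        (2, 'seller', 11, 'heart', '2017-11-30 09:05:00'),
        (3, 'product', 7, 'buy',   '2017-11-30 10:00:00');
""")

# Shared grouping: one row per (user, action, object type, calendar day).
GROUPED = """
    SELECT user_id, action_name, object_type, date(stream_date) AS day,
           COUNT(*) AS rows_in_group, MIN(object_id) AS first_object_id
    FROM stream
    GROUP BY user_id, action_name, object_type, date(stream_date)
"""

# Two subqueries, one per class of item, pulled together with UNION ALL.
rows = conn.execute(f"""
    SELECT 'aggregated' AS kind, g.* FROM ({GROUPED}) AS g WHERE g.rows_in_group > 1
    UNION ALL
    SELECT 'itemized' AS kind, g.* FROM ({GROUPED}) AS g WHERE g.rows_in_group = 1
    ORDER BY kind
""").fetchall()

for row in rows:
    print(row)
```

The `kind` column tells the rendering code whether to print a summary line ("hearted 2 sellers") or a single itemized entry.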

You use a fairly crude timestamp-nearness function to group your items … DATE(). You may want to use a more sophisticated and tweakable scheme … like this, maybe:

  GROUP BY TIMESTAMPDIFF(HOUR, stream_date, NOW()) DIV hourchunk

This will let you group stuff by age chunks. For example, if you use 48 for hourchunk, you’ll group together everything from 0-48 hours ago. As you add traffic and action to your system, you may want to decrease the hourchunk value.
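To see how the age chunks fall out, here is a small Python sketch of the same arithmetic: the item's age in whole hours, integer-divided by the chunk size. The helper name `age_bucket` and the sample ages are made up for illustration; the real grouping would of course happen in the MySQL query.

```python
from datetime import datetime, timedelta

def age_bucket(stream_date: datetime, now: datetime, hour_chunk: int) -> int:
    """Age of an item in whole hours, integer-divided by the chunk size."""
    whole_hours = int((now - stream_date).total_seconds() // 3600)
    return whole_hours // hour_chunk

now = datetime(2017, 11, 30, 12, 0, 0)
ages_in_hours = [1, 30, 47, 48, 95, 96]
buckets = [age_bucket(now - timedelta(hours=h), now, 48) for h in ages_in_hours]
print(buckets)  # [0, 0, 0, 1, 1, 2]
```

Items up to 47 whole hours old share bucket 0, the next 48 hours share bucket 1, and so on. A smaller chunk size gives finer groups.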

Answers:

Over at Fashiolista we’ve open-sourced our approach to building feed systems.
https://github.com/tschellenbach/Feedly
It’s currently the largest open source library aimed at solving this problem (but it is written in Python).

The same team that built Feedly also offers a hosted API, which handles the complexity for you. Have a look at getstream.io. There are clients for PHP, Node, Ruby and Python.
https://github.com/tbarbugli/stream-php
It also supports custom-defined aggregations, which is what you are looking for.

In addition, have a look at this High Scalability post, where we explain some of the design decisions involved:
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html

This tutorial will help you set up a system like Pinterest’s feed using Redis. It’s quite easy to get started with.

To learn more about feed design I highly recommend reading some of the articles which we based Feedly on:

Answers:

We have resolved a similar issue using a ‘materialized view’ approach: a dedicated table that gets updated on every insert/update/delete event. All user activities are logged into this table, pre-prepared for simple selection and rendering.

The benefit is simple, fast selection; the drawback is slightly slower insert/update/delete, since the log table has to be updated as well.

If the system is well designed, it is a winning solution.

This is quite easy to implement if you are using an ORM with post-insert/update/delete events (like Doctrine).
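A minimal sketch of the idea, in Python with an in-memory SQLite database standing in for MySQL and a plain function standing in for the ORM's post-insert event. The table name `stream_feed`, its columns and the helper `record_action` are assumptions made for illustration, and only the insert path is shown (update and delete would need matching handlers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stream (
        id INTEGER PRIMARY KEY,
        user_id INTEGER, object_type TEXT, object_id INTEGER,
        action_name TEXT, stream_date TEXT
    );
    -- Hypothetical pre-aggregated table: one row per rendered feed entry.
    CREATE TABLE stream_feed (
        user_id INTEGER, action_name TEXT, object_type TEXT, day TEXT,
        item_count INTEGER, item_ids TEXT,
        PRIMARY KEY (user_id, action_name, object_type, day)
    );
""")

def record_action(user_id, object_type, object_id, action_name, stream_date):
    """Insert a stream row and update its feed entry in one transaction."""
    day = stream_date[:10]  # 'YYYY-MM-DD' prefix of the timestamp string
    key = (user_id, action_name, object_type, day)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO stream (user_id, object_type, object_id, action_name, stream_date)"
            " VALUES (?, ?, ?, ?, ?)",
            (user_id, object_type, object_id, action_name, stream_date),
        )
        new_id = str(cur.lastrowid)
        # Try to extend an existing feed entry first.
        updated = conn.execute(
            "UPDATE stream_feed SET item_count = item_count + 1,"
            " item_ids = item_ids || ',' || ?"
            " WHERE user_id = ? AND action_name = ? AND object_type = ? AND day = ?",
            (new_id, *key),
        )
        if updated.rowcount == 0:
            # No entry yet for this user/action/type/day: start a new one.
            conn.execute(
                "INSERT INTO stream_feed VALUES (?, ?, ?, ?, 1, ?)",
                (*key, new_id),
            )

record_action(2, "seller", 10, "heart", "2017-11-30 09:00:00")
record_action(2, "seller", 11, "heart", "2017-11-30 09:05:00")
record_action(3, "product", 7, "buy", "2017-11-30 10:00:00")

feed = conn.execute(
    "SELECT user_id, action_name, object_type, item_count, item_ids"
    " FROM stream_feed ORDER BY user_id"
).fetchall()
print(feed)
```

Reading the feed is then a plain SELECT with no GROUP BY; the grouping cost is paid once, at write time.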