I will make an application with a lot of similar items (millions), and I would like to store them in a MySQL database, because I would like to do a lot of statistics and search on specific values for specific columns.
But at the same time, I will store relations between all the items, that are related in many connected binary-tree-like structures (transitive closure), and relation databases are not good at that kind of structures, so I would like to store all relations in Neo4j which have good performance for this kind of data.
My plan is to have all data except the relations in the MySQL database and all relations with
item_id stored in the Neo4j database. When I want to lookup a tree, I first search the Neo4j for all the
item_id:s in the tree, then I search the MySQL-database for all the specified items in a query that would look like:
SELECT * FROM items WHERE item_id = 45 OR item_id = 345435 OR item_id = 343 OR item_id = 78 OR item_id = 4522 OR item_id = 676 OR item_id = 443 OR item_id = 4255 OR item_id = 4345
Is this a good idea, or am I very wrong? I haven’t used graph-databases before. Are there any better approaches to my problem? How would the MySQL-query perform in this case?
Few thoughts on this:
I would try modelling your Neo4j domain model to include the attributes of each node in the graph. By separating your data into two different data stores you might limit some operations that you might want to do.
I guess it comes down to what you will be doing with your graph? If for example want to find all the nodes connected to a specific node, whose attributes (ie name, age.. whatever) are certain values, would you first have to find the correct node ID in your MySQL database and then go into Neo4j. This just seems slow and overly complicated when you could do all this in Neo4j. So the question is will you need the attributes of a node when you traversing the graph?
Will your data change or is it static? By having two separate data stores it will complicate matters.
Whilst generating statistics using a MySQL database might be easier than doing everything in Neo4j, the code required to traverse a graph to find all the nodes that meet a defined criteria isn’t overly difficult. What these stats are should drive your solution.
I can’t comment on the performance of the MySQL query to select node ids. I guess that comes down to how many nodes you will need to select and your indexing strategy. I agree about the performance side of things when it comes to traversing a graph though.
This is a good article on just this: MySQL vs. Neo4j on a Large-Scale Graph Traversal and in this case, when they say large, they only mean a million vertices/nodes and four million edges. So it wasn’t even a particularly dense graph.
Relational databases can handle graph structures. Some of them can even handle them moderately elegantly (as elegantly as a relational database gets!).
The key to general graph handling in relational databases is the recursive common table expression (RCTE), which basically lets you iteratively (not recursively, despite the name) expand a query over a set of rows, by combining a query which selects a root set of rows and a query which defines the neighbours of rows selected so far. The syntax is a bit clunky, but it’s general and powerful.
RCTEs are supported in PostgreSQL, Firebird, SQL Server, and apparently in DB2. Oracle has a different but equivalent construct; i have read that recent versions support proper RCTEs. MySQL does not support RCTEs. If you aren’t wedded to MySQL, i would urge you to consider using PostgreSQL, which is basically a much better database all round.
However, it sounds like you don’t need to support general graphs, just trees. In that case, there are more specific options open to you.
One is the classic but rather mindbending nested sets.
A simpler one is to store a path with each row: this is a string which represents the row’s position in the tree, and has the property that the path for a node is a prefix of the path for any subnode, which lets you very efficiently do various queries about ancestry (“is node A a child of node B?”, “what is node A and node B’s lowest common ancestor?”, etc). For example, you could construct a path for a row by walking the tree from the root, and joining the IDs of the rows encountered on the way with slashes. This is simple to construct, but does take care to maintain if you rearrange the tree. With a path column, you can restrict a query to a given tree simply by adding
and path like '23/%', where
23 is the root’s ID.
So, although a graph database is probably the best way to store and query graph data, it is not the only option, and i would suggest you weigh the advantages of using one against the advantages of having all your data in a single database.
I’m mostly with Binary Nerd on this, but would like to add a variation. You could store the live data in Neo4j and then extract the data you need for statistics/reporting and put into MySQL. For searches I’d go with the Neo4j-Lucene integration if that fits your needs.
You can improve the query by using IN:
SELECT * FROM items WHERE item_id IN (45, 345435, 343, 78, 4522, 676, 443, 4255, 4345)
It is also not entirely true that relational databases are bad at storing tree structures. Certainly MySQL is missing some functionality which would make it easier, but most other databases support it well. Oracle has
CONNECT BY. Most of the mainstream RDBMS have some form of recursive queries – MySQL being a notable exception. Perhaps you could take a look at PostgreSQL and see if that meets your needs?