As the complexity and amount of data increase, new database structures that offer different solutions from relational databases are growing in popularity. One of these is the graph database. At Voxxed Days Bucharest, Michael Hackstein is talking about graph databases, and how to overcome the limits of scalability that they can suffer from. We asked him about them.
What are nodes, edges and properties in a graph database?
This is most easily explained in comparison to a relational database, because the actual storage of data is quite similar. The difference lies more in the queries you perform on the data.
In short, a node or vertex is comparable to one data-row in a relational table. A property is one data-field attached to this node. An edge allows us to store a relation between two nodes (many-to-many style).
In a bit more detail:
Each node consists of many arbitrary properties, just like a data-record has arbitrary values stored. The major difference is that a node is typically “schema-free”. This means the database does not enforce that certain properties exist on all nodes. It even allows us to use different data-types in the same property for different nodes.
In a relational table, by contrast, every row is guaranteed to have every defined attribute (possibly null), and each attribute always has the same datatype.
Furthermore, properties can be nested JSON objects, which can be arbitrarily complex.
An edge in a graph database is comparable to one record in a many-to-many relation. It connects two nodes to one another and also contains a “direction”. One node is always the source, the other is always the target.
How complex edges can be depends heavily on the database implementation. Some implementations only allow simple attributes on edges, others allow edges to contain arbitrarily complex properties.
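To make the three terms concrete, here is a minimal in-memory sketch in Python. It is not the storage format of any particular graph database; the class names `Node` and `Edge` and all property names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    """A node (vertex): roughly one data-row, but schema-free."""
    key: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relation from a source node to a target node."""
    source: str  # key of the source node
    target: str  # key of the target node
    properties: Dict[str, Any] = field(default_factory=dict)


# Nodes need not share the same properties, and the same property
# may hold different data types on different nodes.
alice = Node("alice", {"name": "Alice", "age": 32,
                       "address": {"city": "Bucharest"}})  # nested object
bob = Node("bob", {"name": "Bob", "age": "unknown"})       # age is a string here

# The edge has a direction (alice -> bob) and its own properties.
knows = Edge(source=alice.key, target=bob.key, properties={"since": 2015})
```

Nothing in this sketch enforces a schema: `bob` has no `address`, and his `age` is a string rather than a number, which is exactly the freedom described above.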
What are the limitations of modern graph databases when it comes to scale?
The speciality of graph databases comes into play when you want to store a large number of relations between your data-rows. For this use-case, they typically outperform relational databases, especially when it comes to queries where it is unclear how many relations need to be “joined”. My favourite example is: assuming Alice and Bob don’t have direct contact, how many people are there connecting Alice and Bob?
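The Alice and Bob question can be sketched as a breadth-first search over an adjacency list. This is only an illustration of why the number of "joins" is not known in advance; it is not how any specific graph database executes the query. The function name `people_between` and the sample contact data are assumptions made for this example.

```python
from collections import deque


def people_between(contacts, start, goal):
    """Count the intermediate people on a shortest chain from start to goal.

    Returns None if the two are not connected at all.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])  # (person, hops walked from start)
    while queue:
        person, hops = queue.popleft()
        for friend in contacts.get(person, ()):
            if friend == goal:
                # Everyone on the path except start and goal is an intermediary.
                return hops
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None


# Hypothetical contact graph: Alice - Carol - Dave - Bob
contacts = {
    "alice": ["carol"],
    "carol": ["alice", "dave"],
    "dave": ["carol", "bob"],
    "bob": ["dave"],
}

print(people_between(contacts, "alice", "bob"))  # 2 (Carol and Dave)
```

The loop keeps following relations until it reaches Bob, however deep that turns out to be. In a relational database each additional level would typically be one more self-join on the relation table.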
If we now scale a graph database, assuming that the dataset is too large to be stored on a single machine, I have to store one part of the data on one machine and another part on a second machine. If I now have relations crossing between those parts, my query has to switch servers several times. This drastically reduces performance, as each network switch is time-costly.
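The cost of crossing servers can be illustrated by counting how often a traversal path moves from one machine to another. The sketch below is a toy model under stated assumptions: the function `network_hops`, the server numbers and both placements are hypothetical, and real systems have many more factors that influence latency.

```python
def network_hops(path, server_of):
    """Count how many consecutive steps of a path cross between servers."""
    return sum(
        1
        for current, nxt in zip(path, path[1:])
        if server_of[current] != server_of[nxt]
    )


# The traversal visits these nodes in order.
path = ["alice", "carol", "dave", "bob"]

# Placement A: neighbouring nodes are scattered across two servers.
scattered = {"alice": 0, "carol": 1, "dave": 0, "bob": 1}

# Placement B: neighbouring nodes mostly live on the same server.
colocated = {"alice": 0, "carol": 0, "dave": 1, "bob": 1}

print(network_hops(path, scattered))  # 3 server switches
print(network_hops(path, colocated))  # 1 server switch
```

The same query over the same data needs three network switches in one placement and only one in the other, which is why the way data is distributed matters so much for a sharded graph.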
How can you overcome the limits?
There are certain tricks to overcome those limits: good indexing and data-locality. Visit my talk for the details 😉
Can you combine graph databases with relational databases?
Sure, in the sense of polyglot persistence, you should use the right tool for the job.
If you have some “graphy” part in your data, use a graph database to store it. It is much easier to query the graph database for graphs. This is the case even if you can store the data in the same way in a relational database.
If you have a classical, well-defined data model that is unlikely to change, store this part in a relational database.
And if you have non-relational data with a frequently changing schema, it is probably a good idea to store this part in a document store.
Then you will get the most out of your datastores, and the database will most likely not be your performance bottleneck any more.
For more, see Michael’s talk at Voxxed Days Bucharest.