lessCode.net
Links
AFFILIATES
« Disappearing Scrollbar in Edge | Main | Project.json Nuget Package Management »
Thursday
Jun222017

Graph Databases and Neo4j

My latest consulting project made heavy use of the graph database product Neo4j. I had not previously had an opportunity to look at graph databases, so this was a major selling point in accepting the gig. I had also suffered through some major pain with relational databases via object-relational mapping layers (ORMs), so I was keen to experience a different way of managing large-scale data sets in financial systems.

I won’t (can’t) for various reasons provide a lot of detail about the specific use cases on this project, but I’ll put out some thoughts on the general pros and cons of graph databases based on several months of experience with Neo4j. I think I’m going to be using (and recommending) it a lot more going forward.

What is a graph database?

A graph essentially consists of only two main types of data structure: nodes and relationships. Nodes typically represent domain objects (entities), and relationships represent how those entities, well, how they are related to each other. In graph databases, both nodes and relationships are first-class concepts. Compare this to relational databases, where rows (in tables) are the core construct and relationships must be modeled with foreign keys (row attributes) and JOIN-ed when querying, or to document-store databases, where whole aggregates model entities and contain references to other aggregates (which usually must be loaded with separate queries and in their entirety in order to “dereference” a related data point). Depending on your use cases these queries can be complex and/or expensive.

Why graphs?

Performance. If your data is highly connected (meaning that the relationships between entities can be very fluid, complex and are equally or more important than the actual entities and properties), a graph will be a more natural way to model that data. Typical examples of highly connected data sets can be found in recommendation engines, security/permission systems and social network applications. Since all relationship “lookups” are essentially constant-time operations (as opposed to index- or table-scans), even very complex multi-level queries perform extremely efficiently if you know which nodes to start with. Graph folks term this “index-free adjacency”. Anecdotally, over the course of the last few months, with a database of the order of a hundred million nodes/relationships, executing queries that sometimes traversed tens of thousands of nodes across up to a dozen levels of depth, I don’t think I saw any queries take more than about 100ms to complete. For large result sets the overhead of transferring and processing results client-side was by far biggest bottleneck in the application.

Query Complexity. Queries against graph data can be a lot more concise and expressive than an equivalent SQL query that joins across multiple tables (especially in hierarchical use-cases).

What are the downsides?

Few or no schema constraints. If you’re used to your database protecting you against storing data of the wrong type, or forcing proper referential integrity according to a strict data model definition, graph databases are unlikely to make you happy. Even a stupid mistake like storing the string version of a number in a property that should be numeric can lead to a lot of head-scratching, since the graph will happily allow you to do that (but won’t correctly match when querying). At first I thought this was a deal-breaker, but it’s not as big a problem as you’d think if you’re following test-driven disciplines and institute some automated tools for periodic sanity/consistency checks. You’re not waiting for a customer to find your query/ORM-mapping bugs at run-time anyway, right? Besides, a typical major pain point in RDMBS/ORM-based systems are schema upgrades as your rigid data model changes, especially when relations subtly change shape between versions.

Hardware requirements. Optimally, in production you’re going to want a dedicated multi-node cluster where each node has enough RAM to potentially keep your entire graph in memory, so budget accordingly.

Neo4j

Neo4j is one of many graph databases (this technology is very hot right now and new ones seem to pop up every week). It is widely used by many large corporations and they are well ahead of the alternatives in terms of market share. I don’t want to give a full detailed review of Neo4j in this post, but their main selling points are:

ACID compliance. Many NoSQL technologies sometimes can only offer “eventual consistency”.

High availability clustering. Neo4j recently introduced a new form of HA called “Causal Clustering”, based on the Raft Protocol, which I haven’t yet had a chance to evaluate. At first I read this as “Casual Clustering” and had visions of nodes deciding arbitrarily on a whim whether or not they wanted to join or pull out of the pool or respond to queries… The older form of clustering replicates the graph across multiple physical nodes, and nodes elect a “master” which will be the primary target of all write operations. Non-master nodes can be load-balanced for distributed read operations.

Integrated query language (Cypher). This is a very elegant and succinct (compared to SQL) pattern-matching language for graph queries. Statements match patterns in the graph and then act on the resulting sub-graph (returning, updating or adding nodes and/or relationships).

Web interface. The interface is very nice and offers great visualizations of graph results. You can issue and save Cypher queries, perform admin-level operations and inspect the “shape” of your graph (i.e. what node labels and relationship types you currently have).

Flexible APIs. There is an HTTP REST interface, but they also now offer a TCP binary protocol called BOLT which is much more efficient.

Native Graph Model. Neo4j touts the purity of the graph model over other alternatives which attempt to mix and match different paradigms (eg. documents, key/value store and graph). I don’t have any direct experience of such “multi-model” databases, but I plan to compare Neo4j with other graph database products on a project in the future, so stay tuned.

Extensibility. You can add your own procedures and extensions in Java (very similar to CLR stored procedures in SQL Server).

Support. I can only speak to the enterprise-level experience, but Neo4j are very quick to respond to support questions and really stick with you until the problem is resolved.

On the whole, I think graph databases in general, and Neo4j in particular, have a bright future ahead.

Anyone else using graphs yet?

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>