Graph Databases and Neo4j

My latest consulting project made heavy use of the graph database product Neo4j. I had not previously had an opportunity to look at graph databases, so this was a major selling point in accepting the gig. I had also suffered through some major pain with relational databases via object-relational mapping layers (ORMs), so I was keen to experience a different way of managing large-scale data sets in financial systems.

I won’t (can’t) for various reasons provide a lot of detail about the specific use cases on this project, but I’ll put out some thoughts on the general pros and cons of graph databases based on several months of experience with Neo4j. I think I’m going to be using (and recommending) it a lot more going forward.

What is a graph database?

A graph essentially consists of only two main types of data structure: nodes and relationships. Nodes typically represent domain objects (entities), and relationships represent how those entities, well, how they are related to each other. In graph databases, both nodes and relationships are first-class concepts. Compare this to relational databases, where rows (in tables) are the core construct and relationships must be modeled with foreign keys (row attributes) and JOIN-ed when querying, or to document-store databases, where whole aggregates model entities and contain references to other aggregates (which usually must be loaded with separate queries and in their entirety in order to “dereference” a related data point). Depending on your use cases these queries can be complex and/or expensive.

Why graphs?

Performance. If your data is highly connected (meaning that the relationships between entities can be very fluid, complex and are equally or more important than the actual entities and properties), a graph will be a more natural way to model that data. Typical examples of highly connected data sets can be found in recommendation engines, security/permission systems and social network applications. Since all relationship “lookups” are essentially constant-time operations (as opposed to index- or table-scans), even very complex multi-level queries perform extremely efficiently if you know which nodes to start with. Graph folks term this “index-free adjacency”. Anecdotally, over the course of the last few months, with a database of the order of a hundred million nodes/relationships, executing queries that sometimes traversed tens of thousands of nodes across up to a dozen levels of depth, I don’t think I saw any queries take more than about 100ms to complete. For large result sets the overhead of transferring and processing results client-side was by far biggest bottleneck in the application.

Query Complexity. Queries against graph data can be a lot more concise and expressive than an equivalent SQL query that joins across multiple tables (especially in hierarchical use-cases).

What are the downsides?

Few or no schema constraints. If you’re used to your database protecting you against storing data of the wrong type, or forcing proper referential integrity according to a strict data model definition, graph databases are unlikely to make you happy. Even a stupid mistake like storing the string version of a number in a property that should be numeric can lead to a lot of head-scratching, since the graph will happily allow you to do that (but won’t correctly match when querying). At first I thought this was a deal-breaker, but it’s not as big a problem as you’d think if you’re following test-driven disciplines and institute some automated tools for periodic sanity/consistency checks. You’re not waiting for a customer to find your query/ORM-mapping bugs at run-time anyway, right? Besides, a typical major pain point in RDMBS/ORM-based systems are schema upgrades as your rigid data model changes, especially when relations subtly change shape between versions.

Hardware requirements. Optimally, in production you’re going to want a dedicated multi-node cluster where each node has enough RAM to potentially keep your entire graph in memory, so budget accordingly.


Neo4j is one of many graph databases (this technology is very hot right now and new ones seem to pop up every week). It is widely used by many large corporations and they are well ahead of the alternatives in terms of market share. I don’t want to give a full detailed review of Neo4j in this post, but their main selling points are:

ACID compliance. Many NoSQL technologies sometimes can only offer “eventual consistency”.

High availability clustering. Neo4j recently introduced a new form of HA called “Causal Clustering”, based on the Raft Protocol, which I haven’t yet had a chance to evaluate. At first I read this as “Casual Clustering” and had visions of nodes deciding arbitrarily on a whim whether or not they wanted to join or pull out of the pool or respond to queries… The older form of clustering replicates the graph across multiple physical nodes, and nodes elect a “master” which will be the primary target of all write operations. Non-master nodes can be load-balanced for distributed read operations.

Integrated query language (Cypher). This is a very elegant and succinct (compared to SQL) pattern-matching language for graph queries. Statements match patterns in the graph and then act on the resulting sub-graph (returning, updating or adding nodes and/or relationships).

Web interface. The interface is very nice and offers great visualizations of graph results. You can issue and save Cypher queries, perform admin-level operations and inspect the “shape” of your graph (i.e. what node labels and relationship types you currently have).

Flexible APIs. There is an HTTP REST interface, but they also now offer a TCP binary protocol called BOLT which is much more efficient.

Native Graph Model. Neo4j touts the purity of the graph model over other alternatives which attempt to mix and match different paradigms (eg. documents, key/value store and graph). I don’t have any direct experience of such “multi-model” databases, but I plan to compare Neo4j with other graph database products on a project in the future, so stay tuned.

Extensibility. You can add your own procedures and extensions in Java (very similar to CLR stored procedures in SQL Server).

Support. I can only speak to the enterprise-level experience, but Neo4j are very quick to respond to support questions and really stick with you until the problem is resolved.

On the whole, I think graph databases in general, and Neo4j in particular, have a bright future ahead.

Anyone else using graphs yet?


Project.json Nuget Package Management

Recently we've been looking at the new approach to Nuget package management in .NET projects, dubbed Project JSON. This initiative started in the ASP.NET world, but seems to be becoming the preferred method for .NET projects to declare Nuget package dependencies. The new approach changes the way that packages are resolved and restored, and has some very useful advantages.


The larger context for looking at this was that we're starting to push several of our core framework assemblies into a separate source control repository, separate from the rest of the applications and modules that make use of those core components to create shipping products. The core components use Semantic Versioning ("product" releases are versioned according to Marketing!), are delivered as Nuget packages internally to our developers, and are no longer branched along with the Main codebase – they follow a “straight-line” of incremental changes.

We expect to gain a few benefits from this. First, developers will not have to spend compute cycles compiling the core components. Second, there will be a natural barrier to changing the core components. This may be viewed as an advantage or a disadvantage depending on your outlook, but the idea is that over time the core components should become and remain very stable (in terms of rate-of-change, not code quality), and "ship" on their own cadence. If core projects are all simple project references alongside the application code in the same solution, it's very easy and tempting to make small (sometimes breaking) changes to these components while working on something else, without necessarily fully considering the larger impact of those changes (or independently verifying the changes with new unit tests).

Perhaps a core architecture team will be responsible for the roadmap of core framework changes, and the resulting packages will be treated like any other third-party dependency by application and product developers. It's important that changes to the "stable" core components are carefully planned and tested, since many other components (and possibly customer-developed extensions) depend upon them. The applications and business modules that ultimately form shipping products are less "stable", in the sense that they can change quite frequently as customer requirements change.

However, there are also some fundamental implications to doing this. Developers still need to be able to step into core component code when debugging. Nuget packages support a .symbols.nupkg file alongside the actual binary package, and build servers like TeamCity can serve up those symbols to Visual Studio automatically when stepping into code from the package.

In addition, released software should be resistant to casually taking new (or updated) Nuget dependencies once shipped, since this can complicate "hotfixes" and customer upgrades. The Main branch, of course, representing the next major shipping version, could update to the latest dependencies more freely.

Nuget today

By default, Nuget manages package dependencies via a packages.config XML file that gets added to the project the first time you add a reference to a Nuget package. The name and version of the dependency (and, recursively, any dependencies of that package) is recorded in the packages.config file, and Nuget will use this information during package restore (which will usually take place just before compiling), to obtain the right binaries, targets and propeties to include in the compilation. In addition to the entries in packages.config, references are also added to your project file (e.g. the .csproj file for C# projects). Some packages are slightly more complex than simply adding references to one or more managed assemblies, however (for example those that may include and P/Invoke native libraries). In these cases, even more "cruft" is added to the project file, to manage the insertion of special targets and properties into the MSBuild process so that the project can compile correctly and leave you with something runnable in your bin folder.

Resolved packages are usually stored in the source tree in a packages folder next to the solution file, and you usually want to exclude these from source control to avoid bloating the codebase unnecessarily.

Another snag to watch out for is potential version conflicts between projects that call for different versions of the same dependency. In the packages.config file, it's possible to declare that you want an exact version of a dependency, but by default you're saying "I want any version on or after this one", so if multiple projects ask for different versions, Nuget is forced to add an app.config file to your project with the appropriate binding redirects, to allow for the earlier references to be resolved at runtime to the latest version.

When you update a dependency to a newer version, every packages.config file that refers to the old version, and every project file that contains references to assemblies from those packages, and every app.config that has binding redirects for those assemblies, needs to be updated. This produces a lot of churn in source control and actually can take a while on large solutions with many dependencies.

Clearly there are a lot of moving parts to getting this all to work, and Nuget actually does a great job in managing it all, but it can get quite fragile if you have many Nuget references - even more so if you also have manually-maintained targets and properties in a project file for other custom build steps, or if you have a combination of managed and native (e.g. .vcxproj) files that require Nuget references. Nuget doesn't yet do well with crossing the managed-native boundary with these references, so you're usually just using Nuget to install the right package, and then manually maintaining the references yourself in the project files. In this world it's easy to make a mistake when editing files that Nuget is also trying to keep straight during package updates.

Nuget future

With the new approach, things are a lot simpler and cleaner. There's really only one file to worry about – a JSON file called project.json that replaces your packages.config file, and contains the same package name and version specifications. The neat thing is that this one file is all that Nuget needs to resolve dependencies – in your project file you no longer even need assembly references to the individual libraries delivered by the Nuget package, nor any special .targets or .props file references. At package restore time, Nuget will build a whole dependency graph of the packages that will be used in the build process, and will inject the appropriate steps into the executed build commands as necessary. You'll get warnings or errors if there are potential version conflicts for any packages.

Package files are no longer downloaded to the packages folder next to your solution – instead by default they're cached under the user's profile folder, so packages now only ever really need to be downloaded one time, and you don't have to care about excluding the packages folder from source control, because it's no longer in your source tree anyway.

Another neat advantage is that your project.json file only really needs to contain top-level dependencies – any packages that those dependencies in turn depend upon do not need to be explicitly mentioned in the project.json file, but the restore process will still correctly resolve them.

But by far the biggest win for our scenario is project.json's Floating Versions capability. This allows us to define the "greater than or equal to" version dependency with a wildcard (e.g. 1.0.*), such that during package restore, Nuget will use the latest available (1.0.X) package in the repository (by default, a specific version specification calls for an exact match). Now, when our core components are updated by the core team (at least for non-breaking changes), literally nothing (not even the project.json file) needs to change in our application and product codebase – a recompile will pick up the new versions from TeamCity automatically. Of course, when we branch Main for a specific release, we will likely update the version specifications in those branches to remove the wildcard and "fix" on a specific "official" version of our core dependencies, so that when we work on release branches to fix bugs, we can't inadvertently introduce a new core package dependency.

The new approach is available in Visual Studio 2015 Update 1. Unfortunately there's no automated way yet (that I know of) to upgrade existing projects, so I've been doing the following, based on Oren Novotny's advice:

  1. Add an "empty" project.json file to the project in Visual Studio.

  2. Delete the packages.config file.

  3. Unload the project in Visual Studio and edit it. Remove any elements that refer to Nuget .targets or .props files, and remove any reference elements with HintPaths that point to assemblies under the old packages folder.

  4. Reload the project.

  5. Build the project. Watch a bunch of compilation errors fly by, but these will help to identify which Nuget packages to re-reference. On large codebases it's possible to end up with redundant references as code is refactored, so this step has actually helped us to cleanup some dependencies.

  6. Right-click the project and "Manage Nuget packages". Add a reference to one of the packages for which there was a compilation error. If you actually know the dependencies between Nuget packages you can be smart here and start with the top-level ones.

  7. Repeat from 5. until the build succeeds.

For a large codebase this is a lot of mechanical work, but so far the benefits have been quite positive. I may attempt to put together a VSIX context menu to automate the conversion if I get a chance, but I suspect that the Project JSON folks are already working on that and will beat me to it...


BUILD 2016: C# to get F#-like Pattern Matching

This is interesting, and should help to make C# more concise. Scroll to around the 24 minute mark.


Bookmarks in Windows Workflow Foundation

There’s an important fix in .NET 4.6 related to Workflow Foundation’s bookmarking features. The mechanisms by which workflow instances are bookmarked, suspended and resumed are fairly complex, and I don’t think the details are very well documented. Having fought through some issues myself, and received some help via MSDN forums (thanks Jim Carley) and bug reports to Microsoft, here’s how I understand bookmarks to work.

First, what is a bookmark? As its name suggests, it’s a way to remember where you were in the workflow, so that you can continue where you left off if your workflow idles (and is subsequently evicted from the runtime) for some reason.

As it transpires, there seem to be two types of bookmark. There are those that arise from the enforcement of the ordering of the operations that define the workflow (Jim calls these “protocol bookmarks”). The Receive activity is a classic example of this type of bookmark, and you can “resume” them by name (indeed, that’s how the runtime ultimately “invokes” the workflow operation due to a message coming into the service endpoint). There are also bookmarks that are not so obvious when looking at a workflow design – they’re introduced by the workflow runtime for its own housekeeping purposes (“internal bookmarks”). The Delay activity is an example of this – the runtime creates a bookmark so that the instance can resume automatically at the appropriate time. The Pick activity is another, less obvious example; here the runtime needs a bookmark that gets resumed once the Trigger of a PickBranch completes, so that it can subsequently schedule the Action for that branch. Even if the trigger also contains protocol bookmarks, or if the Action part does nothing, the internal bookmark is still required.

When a message for a protocol bookmark (i.e. Receive) comes in, if there’s no active protocol bookmark that matches it, ordinarily the caller will see an “out-of-order” FaultException. However, if there are also active internal bookmarks, the runtime will wait to see if that internal bookmark gets resumed (which could potentially cause the protocol bookmark that we’re actually trying to resume to become active, in which case we can just resume it, transparently to the caller). If the internal bookmark does not get resumed, eventually a TimeoutException will be returned to the caller. Consider a pre-4.6 workflow such as the following (WorkflowA):




The protocol for this workflow is simple – we want StepA to happen first, then DoSomething (which for our purposes is assumed to not create an internal bookmark at any point), and then StepB to happen only after all that. For a particular instance of this workflow, let’s consider some operation call sequences:

WorkflowA Happy Path

WorkflowA Premature

WorkflowA Competing Creators

I think these are all fairly intuitive and work as expected.

Now consider a workflow like this (WorkflowB):




Here we have a choice between two operations before the workflow instance completes. Here are our three test call sequences for this workflow:


WorkflowB Happy Path

WorkflowB Premature

WorkflowB Competing Creators

Our first two scenarios actually work the same way in both workflows, but WorkflowB times out for the third case.

This is not intuitive and certainly not ideal, since a caller would have to wait for the timeout to expire, and wouldn’t actually be told the truth about the failure even after that.

Unfortunately with content-based correlation, this scenario is likely to happen if two separate callers attempt to open a workflow for the same correlated identity at about the same time. The first caller will succeed and create the workflow instance, the second caller will timeout. Ideally you’d want the second caller to immediately get the “Out-of-Order” exception, but you can’t guarantee that in the presence of internal bookmarks.

So, don’t use activities that use internal bookmarks. Right?

Well, they’re pretty much unavoidable if you want to have an extensible workflow framework. If you want to give domain experts the full range of activities with which to develop their workflows, banning Delay, Pick, and State are serious hindrances. Pick is especially useful for a “four-eyes” class of workflows whereby a request for an action to be performed must be reviewed and approved (or rejected) by a different user.

So, to the solution.

With 4.6 – the above workflows behave the same way as they do pre-4.6, by default. However, you can now take control of what the runtime does when it encounters a message for a Receive for which there is no active bookmark, while internal bookmarks are present. The new setting can be placed in the app.config (or web.config) file under appSettings:

<add key="microsoft:WorkflowServices:FilterResumeTimeoutInSeconds" value="60"/>

With this setting, you can specify how long the timeout should be in our “Competing Creators” scenario above. If you set it to zero, you won’t actually get a TimeoutException at all – instead you’ll get the same “Out-of-Order” FaultException you’d get in the “Premature” scenario. This seems to solve the problem nicely.


64-bit Numerical Stability with Visual C++ 2013 on FMA3-capable Haswell Chips (_set_FMA3_enable)

An interesting diversion this week, when we noticed regression tests failing on our TeamCity build server. The differences were mostly tiny, but warranted a closer inspection since there were no significant code changes to explain them, and the tests were still passing on most developer PCs. The only thing that had really changed was that we’d moved the TeamCity build agents to newer hardware (with later-model Haswell chipsets).

We were able to boil the differences in behaviour down to a simple 3-line C++ program (essentially a call to std::exp), and, sure enough, the program gave a different result when compiled in VS2013 and executed on the new Haswell CPU. Any other combination of C++ runtime and chip (and any x86 build) gave us our “expected” answer. Obviously you can’t have your build server and your developer PCs disagree when it comes to floating-point calculations without chaos (or games of baseline whack-a-mole between developers when some of them are running newer PCs than others).

We tried forcing the rounding mode with _controlfp_s, but just ended up with different differences. We tried the Intel C++ compiler (which was slightly more stable but still off between Haswell and older CPUs).

As it turns out, the VC++ 2013 runtime uses FMA3 instructions for some transcendental functions (std::exp included), when available at runtime, and this was the difference for us. After disabling this behaviour in the runtime with _set_FMA3_enable(0), our tests started passing on both types of CPU, with no other code or baseline changes necessary.

Thanks to James McNellis for pointing us in the right direction on the FMA3 optimizations.

James also noted that the FMA3 optimizations are much faster, so at some point we will experiment with enabling those, and update our baselines, but for now we can move on with stable numbers between builds.