Bookmarks in Windows Workflow Foundation

There’s an important fix in .NET 4.6 related to Workflow Foundation’s bookmarking features. The mechanisms by which workflow instances are bookmarked, suspended and resumed are fairly complex, and I don’t think the details are very well documented. Having fought through some issues myself, and received some help via MSDN forums (thanks Jim Carley) and bug reports to Microsoft, here’s how I understand bookmarks to work.

First, what is a bookmark? As its name suggests, it’s a way to remember where you were in the workflow, so that you can continue where you left off if your workflow idles (and is subsequently evicted from the runtime) for some reason.

As it transpires, there seem to be two types of bookmark. There are those that arise from the enforcement of the ordering of the operations that define the workflow (Jim calls these “protocol bookmarks”). The Receive activity is a classic example of this type of bookmark, and you can “resume” them by name (indeed, that’s how the runtime ultimately “invokes” the workflow operation due to a message coming into the service endpoint). There are also bookmarks that are not so obvious when looking at a workflow design – they’re introduced by the workflow runtime for its own housekeeping purposes (“internal bookmarks”). The Delay activity is an example of this – the runtime creates a bookmark so that the instance can resume automatically at the appropriate time. The Pick activity is another, less obvious example; here the runtime needs a bookmark that gets resumed once the Trigger of a PickBranch completes, so that it can subsequently schedule the Action for that branch. Even if the trigger also contains protocol bookmarks, or if the Action part does nothing, the internal bookmark is still required.

When a message for a protocol bookmark (i.e. Receive) comes in, if there’s no active protocol bookmark that matches it, ordinarily the caller will see an “out-of-order” FaultException. However, if there are also active internal bookmarks, the runtime will wait to see if that internal bookmark gets resumed (which could potentially cause the protocol bookmark that we’re actually trying to resume to become active, in which case we can just resume it, transparently to the caller). If the internal bookmark does not get resumed, eventually a TimeoutException will be returned to the caller. Consider a pre-4.6 workflow such as the following (WorkflowA):




The protocol for this workflow is simple – we want StepA to happen first, then DoSomething (which for our purposes is assumed to not create an internal bookmark at any point), and then StepB to happen only after all that. For a particular instance of this workflow, let’s consider some operation call sequences:

WorkflowA Happy Path

WorkflowA Premature

WorkflowA Competing Creators

I think these are all fairly intuitive and work as expected.

Now consider a workflow like this (WorkflowB):




Here we have a choice between two operations before the workflow instance completes. Here are our three test call sequences for this workflow:


WorkflowB Happy Path

WorkflowB Premature

WorkflowB Competing Creators

Our first two scenarios actually work the same way in both workflows, but WorkflowB times out for the third case.

This is not intuitive and certainly not ideal, since a caller would have to wait for the timeout to expire, and wouldn’t actually be told the truth about the failure even after that.

Unfortunately, with content-based correlation, this scenario is likely to happen if two separate callers attempt to open a workflow for the same correlated identity at about the same time. The first caller will succeed and create the workflow instance; the second caller will time out. Ideally you’d want the second caller to immediately get the “Out-of-Order” exception, but you can’t guarantee that in the presence of internal bookmarks.

So, don’t use activities that use internal bookmarks. Right?

Well, they’re pretty much unavoidable if you want to have an extensible workflow framework. If you want to give domain experts the full range of activities with which to develop their workflows, banning Delay, Pick, and State is a serious hindrance. Pick is especially useful for a “four-eyes” class of workflows, whereby a request for an action to be performed must be reviewed and approved (or rejected) by a different user.

So, to the solution.

With 4.6, the above workflows behave the same way as they did pre-4.6 by default. However, you can now take control of what the runtime does when it encounters a message for a Receive for which there is no active bookmark, while internal bookmarks are present. The new setting can be placed in the app.config (or web.config) file under appSettings:

<add key="microsoft:WorkflowServices:FilterResumeTimeoutInSeconds" value="60"/>

With this setting, you can specify how long the timeout should be in our “Competing Creators” scenario above. If you set it to zero, you won’t actually get a TimeoutException at all – instead you’ll get the same “Out-of-Order” FaultException you’d get in the “Premature” scenario. This seems to solve the problem nicely.


64-bit Numerical Stability with Visual C++ 2013 on FMA3-capable Haswell Chips (_set_FMA3_enable)

An interesting diversion this week: we noticed regression tests failing on our TeamCity build server. The differences were mostly tiny, but warranted a closer inspection since there were no significant code changes to explain them, and the tests were still passing on most developer PCs. The only thing that had really changed was that we’d moved the TeamCity build agents to newer hardware (with later-model Haswell CPUs).

We were able to boil the differences in behaviour down to a simple 3-line C++ program (essentially a call to std::exp), and, sure enough, the program gave a different result when compiled in VS2013 and executed on the new Haswell CPU. Any other combination of C++ runtime and chip (and any x86 build) gave us our “expected” answer. Obviously you can’t have your build server and your developer PCs disagree when it comes to floating-point calculations without chaos (or games of baseline whack-a-mole between developers when some of them are running newer PCs than others).
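A repro of this kind is easiest to compare across machines when it reports the exact bit pattern of the result rather than a rounded decimal. The following is a minimal sketch of that idea, not our original program; the helper names (bits_of, exp_bits, report_exp) are made up for illustration, and you would supply whichever input exposes the difference.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Return the raw IEEE-754 bit pattern of a double, so two results can be
// compared exactly instead of through rounded decimal output.
inline std::uint64_t bits_of(double value) {
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "this sketch expects a 64-bit double");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Evaluate std::exp and return the bit pattern of the result.
inline std::uint64_t exp_bits(double x) {
    return bits_of(std::exp(x));
}

// Print one line per input that can be diffed between build machines.
// %a prints the value in hexadecimal floating-point, which loses nothing.
inline void report_exp(double x) {
    std::printf("exp(%a) = %a (0x%016llx)\n", x, std::exp(x),
                static_cast<unsigned long long>(exp_bits(x)));
}
```

Calling report_exp on the same inputs on the build agent and on a developer PC, then diffing the output, shows whether the two disagree in the last bits.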

We tried forcing the rounding mode with _controlfp_s, but just ended up with different differences. We tried the Intel C++ compiler (which was slightly more stable but still off between Haswell and older CPUs).

As it turns out, the VC++ 2013 runtime uses FMA3 instructions for some transcendental functions (std::exp included), when available at runtime, and this was the difference for us. After disabling this behaviour in the runtime with _set_FMA3_enable(0), our tests started passing on both types of CPU, with no other code or baseline changes necessary.
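To make this concrete, here is a hedged sketch of two things: a wrapper around _set_FMA3_enable(0), and a small demonstration of why a fused multiply-add can change the last bits of a result (it rounds once, where a separate multiply and add round twice). The function names are hypothetical, the preprocessor guard is my assumption so that the file still compiles on other toolchains, and the demonstration illustrates the general effect rather than what the runtime does inside std::exp.

```cpp
#include <cmath>

#if defined(_MSC_VER) && defined(_M_X64)
#include <math.h>  // declares _set_FMA3_enable for x64 MSVC builds
#endif

// Ask the C runtime to stop using FMA3 in its math routines.
// Returns true if the request was made, false on toolchains where the
// function does not exist (assumption: anything other than x64 MSVC).
inline bool disable_crt_fma3() {
#if defined(_MSC_VER) && defined(_M_X64)
    _set_FMA3_enable(0);
    return true;
#else
    return false;
#endif
}

// a*b + c with the product rounded to double before the add (two roundings).
inline double multiply_then_add(double a, double b, double c) {
    volatile double product = a * b;  // volatile stops the compiler fusing the steps
    return product + c;
}

// The same expression with a single rounding at the end, which is what an
// FMA instruction computes.
inline double fused_multiply_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. multiply_then_add therefore returns 0, while fused_multiply_add returns -2^-54. If you use the wrapper, calling disable_crt_fma3() once at the start of main, before any math is done, seems the natural place for it.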

Thanks to James McNellis for pointing us in the right direction on the FMA3 optimizations.

James also noted that the FMA3 optimizations are much faster, so at some point we will experiment with enabling those, and update our baselines, but for now we can move on with stable numbers between builds.


Bug? Content-Based Correlation in Windows Workflow Foundation

This one caused me quite a bit of grief this week – maybe I can save someone else some pain. When configuring content-based correlation in the Workflow (4.0/4.5) designer, I think there’s a subtle bug in the dialog that allows you to choose the correlation key and XPath expression used to establish correlation on a Receive or SendReply activity.


It manifests itself when the workflow engine attempts to apply the correlation query at runtime:

A correlation query yielded an empty result set. Please ensure correlation queries for the endpoint are correctly configured.


It only occurs under the following conditions:

  • The message or parameter that contains the correlation is a complex DataContract (i.e. not a primitive value type).
  • That DataContract type is derived from another base class DataContract.
  • The correlation property comes from the base class.
  • The base class DataContract and the derived class DataContract are in different namespaces.


Under these conditions, the Add Correlation Initializers dialog sets the namespace of the property as being defined in the namespace of the derived class, not the base class.

Fortunately the fix is easy: you can manually edit the XAML to refer to the namespace of the base class instead, e.g.:

<XPathMessageQuery x:Key="key1">
    <x:String x:Key="xg0"></x:String>
    <x:String x:Key="xgSc"></x:String>

Did Verisign revoke their own certificate?



Google is shutting down Reader

Wait, what? This is Google’s best product, and they’re shit-canning it?

This has to rank up there with Fox’s decision to cancel Firefly.