Wintellect  
 

I have vague memories of watching the Apollo moon mission in 1969 at age 6. I remember my father sitting with me on the sofa and telling me how history was being made as we watched Neil Armstrong take his first step on the moon. Later, we watched in high anticipation through the radio silence of re-entry, followed by the capsule's splashdown at sea. So here I am, a dad myself, sitting in front of my computer watching live streaming video on the internet with my two kids and telling them much the same thing. In case you just "tuned in"… the fun is just beginning at NASA as pictures and other data begin arriving from the Phoenix Mars Lander, which touched down this evening (May 25). Such an event can feel like background news, because it doesn't seem nearly as extraordinary by today's technological standards or next to cinematic simulations. Sometimes, though, we just have to slap ourselves a bit and remember that this is really another planet we just landed on… and it is truly anything but ordinary. The information gained from Mars' geology and atmosphere will almost certainly prove invaluable as we study and compare our own planet in hopes of preserving its hospitability for humans many millennia from now. The very first pictures are up at: http://www.nasa.gov/mission_pages/phoenix/main/index.html

Vista SP1 for the 64-bit version of the OS has fallen on my computer, and it can’t get up! I’ve been without my primary computer for several days now due to the release of SP1 for Vista. The symptom is an infinite loop of SP1 installations and failures (the specific details seem to vary among the people reporting the problem).

Nick White at Microsoft declares that the problem “affects a small number of customers in unique circumstances” on the Windows Vista blog here:

http://windowsvistablog.com/blogs/windowsvista/archive/2008/02/19/update-on-windows-vista-sp1-prerequisite-kb937287.aspx

but Google turns up nearly 36,000 hits for the fairly specific phrase “configuring updates: stage 3 of 3 vista ultimate 64 sp1”, so I have to believe there are a few more customers being negatively impacted by this than just moi.

This forum post was helpful in breaking the infinite patch installation loop, in case you happen to get caught by this nasty thing:

http://forums.microsoft.com/technet/showpost.aspx?postid=2873378&isthread=false&siteid=17&authhash=7f2056e7cf93af49d9d8704602bc97be8b4c437b&ticks=633390464558901682&sb=0&d=1&at=7&ft=11&tf=0&pageid=5

It doesn’t “solve” the problem, but it will at least make your computer usable again for a while so that you can back up your important files, etc. In my case, whenever I install a new IIS feature, the pattern of installing and failing SP1 starts all over as soon as I reboot my machine. Everything having to do with attempting to resolve this problem seems to take hours (SP1 itself takes nearly an hour to install, and then another hour after it aborts to roll itself back). KB937287 has been installed on my machine 5 times now! After SP1 fails to install, KB937287 shows up again in Windows Update. KB937287 is apparently a prerequisite patch to prepare your computer for the slaughter that is SP1.

I spent nearly two hours in an IM session with Microsoft tech support this evening, during which they took remote control of my computer and declared that the problem was my Norton 360 antivirus interfering with the service pack installation (even though my antivirus had already been disabled per the SP instructions). They asked me to remove my antivirus and install SP1 again, and then scheduled me for a callback tomorrow. Unfortunately, the result was an unsurprising repeat of the “configuring updates: stage 3 of 3” hell that was my weekend (it’s like Groundhog Day all over again)! At this point I’m thinking I would be better off simply burning my computer to the ground and reinstalling the OS with the automatic update feature disabled until MS can get this sorted out. In fact, this is exactly what many affected customers report having done on their blogs. It may only be affecting a small number of customers, but the impact on those customers seems severe.

In my last post (Caller Impersonation for WCF Services Hosted Under IIS Appears Broken), I laid out my rationale for why I felt that the security of services impersonating a caller when hosted under IIS was broken. To be responsible, I feel it necessary to follow up on that assertion by noting that such a configuration is not a best practice, even though many corporate staff developers may be tempted to secure their intranet services this way.

Allowing a service to impersonate the caller’s identity requires that the caller have a high degree of trust in the service that he or she is interacting with. Of course, users are rarely even aware of the services their applications interact with, so this is a completely unrealistic expectation. When a service is configured to impersonate its callers, it could potentially perform clandestine operations within a network using authorizations granted to the caller’s identity, without the caller ever knowing.

To further illuminate the security vulnerabilities of callers trusting services to impersonate their identities, it’s important to note that most services are focused on a specific area of business while application users typically have much broader and deeper sets of authorizations. A malicious service’s reach is magnified when the impersonated user’s identity happens to be trusted by other entities or business domains unrelated to the service.

Smaller organizations often take for granted that local intranet services are not doing malicious things with the caller’s identity; however, that is likely not an assumption security-conscious intranet service writers should make. Even if you have complete trust in your own internal services, consider the potential for a malicious third party to leverage them in ways you never intended. In my opinion, it’s much safer for a service to act with its own identity rather than to impersonate the caller’s. Services can run under an identity that has been granted only the least privilege necessary to perform their functions. The “less is more” approach to security is considered a best practice and should be followed whenever practical.
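To make the "magnified reach" point concrete, here is a small model written in Python purely for illustration. It is not WCF code, and the permission names are made up. It compares the rights available inside a service under caller impersonation with those of a least-privilege service account, and reports everything held beyond what the service actually needs.

```python
# Illustrative model only: these permission names are hypothetical and are not
# tied to any real WCF, IIS, or Active Directory API.

CALLER_RIGHTS = {"orders:read", "payroll:read", "payroll:write", "hr:read"}
SERVICE_ACCOUNT_RIGHTS = {"orders:read"}  # least privilege for an order-lookup service
RIGHTS_NEEDED = {"orders:read"}           # what the service's job actually requires


def effective_rights(impersonate_caller, caller_rights, service_rights):
    """Rights the service's code can exercise while handling one request."""
    return set(caller_rights) if impersonate_caller else set(service_rights)


def excess_reach(impersonate_caller, caller_rights, service_rights, needed):
    """Rights held beyond what the service needs: what a compromised or
    malicious service could misuse under this configuration."""
    held = effective_rights(impersonate_caller, caller_rights, service_rights)
    return held - set(needed)
```

With these sample sets, impersonating the caller exposes `hr:read`, `payroll:read`, and `payroll:write` to a service that only needs `orders:read`, while running under its own least-privilege account leaves nothing extra.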

To summarize, even though I still consider caller impersonation under IIS to be defective, some may consider the defect an important security feature! Although a bug is still a bug no matter how much lipstick we slap on it, I have to agree that this particular bug might actually force better security choices to be made.

There is a security feature of WCF services hosted under IIS that I find poorly implemented. In all honesty, it appears to be broken and non-compliant with its intended purpose. If you’re developing services for use in the intranet environment, then it’s quite reasonable for you to expect that a service can impersonate your Windows identity while it performs its work. After all, security personnel have no doubt already established your intranet authentication and authorization policies for corporate assets, and internal service security should be able to fit within this established paradigm.

Unfortunately, WCF throws an erroneously worded exception about your attempt to use anonymous access when an IIS-hosted service that performs caller impersonation is set to require a Windows identity (presumably a bug that treats “not exactly one identity” as “zero identities”). The reason you likely had to read the first sentence more than once is that the error message complains about your use of Anonymous when you turn Windows security on, yet turning Windows off and Anonymous on makes the error about using Anonymous go away! Confused by the apparent contradiction? Well, so was I when I first encountered it.
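My guess about the cause can be expressed in a few lines of Python. To be clear, this is a hypothetical reconstruction of the suspected logic, not WCF source code; the function names, the exception type, and the messages are invented. A check written as "anything other than exactly one identity means anonymous" produces exactly this kind of misleading message when IIS supplies a second identity, whereas distinguishing the zero and many cases would report the real problem.

```python
class SecurityCheckError(Exception):
    """Hypothetical error type standing in for the exception WCF throws."""


def suspected_check(identities):
    # Suspected (hypothetical) logic: "not exactly one" is treated as "zero".
    if len(identities) != 1:
        raise SecurityCheckError("Anonymous access is not allowed")
    return identities[0]


def accurate_check(identities):
    # Distinguish the two failure cases so the message matches the cause.
    if not identities:
        raise SecurityCheckError("Anonymous access is not allowed")
    if len(identities) > 1:
        raise SecurityCheckError(
            f"Ambiguous caller: {len(identities)} identities were supplied")
    return identities[0]
```

Given a transport identity plus a caller identity, `suspected_check` complains about anonymous access even though two identities are present, while `accurate_check` reports the ambiguity.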

In my opinion, the recommendations from Microsoft for working around this limitation are completely idealized and wholly unrealistic. Thanks very much to Wenlong Dong for explaining the problem in his blog entry “Impersonation with Double Identities”; unfortunately, his post never addresses the fact that the two identities should be used for different work in the transport and consumption of a message (or perhaps he simply disagrees with my assertions).

Here's the logic for my assertions:

When a postal employee delivers a letter, we can make some reasonable assumptions about the security of the letter’s transportation. It is safe to assume that the post office took measures to ensure that our letter was not stolen, snooped on, or tampered with while in its possession. This is transport security.

Now the postal person has his or her own identity. My postal delivery person’s name is Ralph (a pseudonym I’m using to protect Ralph’s true nefarious identity). I have my own identity, and my name is Paul (which is not a pseudonym, just in case you were wondering). Although Ralph seems like an okay person to me, I’m not ready to invite him into my home whenever he delivers letters addressed to me. I just want Ralph to leave my letters and then go on to his next delivery as quickly as he can. I would be extremely irritated (you might even say that I would “go postal”) if Ralph routinely handcuffed himself to my letters and insisted upon helping me open them.

My friend Russ (a character I borrowed from Justin Smith’s WCF book) waits in line at a very busy U.S. Post Office. It’s the holiday season, and he quickly tires of waiting, so he decides to send my letter via UPS instead. While driving to the UPS office, Russ discovers a Federal Express office on the way with no line! Russ changes his mind again and immediately decides to send it to me via FedEx. No matter which carrier Russ uses, he reasonably expects that his letter to me will be secured until it is delivered. To differentiate themselves from their competitors, some carriers may offer stronger security than others; however, they all perform the same generic service of transportation.

It turns out that Russ is somewhat paranoid that the Department of Homeland Security might be interested in reading his private messages about closely guarded family soup recipes, so he decides to send his letter to me using very strong public-key encryption so that only I can read it. Russ can now feel confident that if his letter is somehow intercepted, its instructions for cooking a very delicious mulligatawny will remain confidential. In short, Russ has applied message security to his communication with me, while the mail carrier has applied transport security.

If you agree with my analogies then you’re almost certainly hungry. You also likely agree that there are two separate and distinct identities that perform work during the lifetime of a message. There is the identity of the letter transportation carrier (e.g. Federal Express, U.P.S., or U.S.P.S.) and there is the identity of the message recipient. Both identities are responsible for their own distinct roles, and neither is responsible for accomplishing the work of the other.

Wenlong’s blogged solutions do not distinguish between the postal worker’s identity and the recipient’s identity: they are just two nondescript identities stuffed into a single container, and WCF cannot seem to figure out which one it should use for message security.

The number one recommendation in Wenlong’s blog is to forget the entire matter altogether, and change your IIS security settings to “Anonymous”.  In simple terms, Microsoft is asking us to turn transport security off and then depend entirely on message security (or no security). When making this decision, keep in mind that your network administrator cannot easily monitor or prove that message security actually exists or immediately prove its absence if it inadvertently gets turned off. Message security also does not protect the host from malicious anonymous traffic. Administrators likely won’t agree to this, and they may come along at some future time and lock your sweet little anonymous service down as a non-compliant and potentially hazardous risk to their corporate assets. As developers generally do not have carte blanche in networks to do whatever they want, and network administrators already have established security policies which work well for existing internal “asmx” web services, the “solution” of using anonymous is simply dead-on-arrival. Wenlong warns that Microsoft recommends against our use of multiple identities; however, I would argue that good security is generally layered and it certainly makes good sense to me that we would want to secure both our transports and our messages.

The second recommendation for working around this limitation is to have the service impersonate its own account instead of the caller’s account (as if the two could somehow be considered equivalent). Of course, impersonating the user of the service and impersonating the same account for all users of a service are entirely different security models, and it’s unlikely that such a solution would make the network administrators much happier than the “message security only” approach.

This leaves us with the last proposed solution of removing the transport identity from the evaluationContext.Properties["Identities"] collection. Ah… but there are two identities in the collection and you have a 50/50 chance of removing the right one! The problem here is that placing the two identities into the same collection without any way to determine the role that the identities are supposed to play is completely silly. Creating a production application that has to guess which identity to use seems like a very brittle and potentially hazardous approach to security—but it’s the only one we seem to have available when impersonation of a Windows caller is a requirement. In point of fact, why do we have a collection of identities anyway if we have no way of determining the purposes of the collection’s occupants?
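What I would have liked is for each identity to carry its role. The Python sketch below models that idea; `Role`, `TaggedIdentity`, and `identity_to_impersonate` are hypothetical names and are not part of WCF, where the collection simply holds identities with no role attached. Once the role travels with the identity, picking the caller's identity no longer depends on the order in which identities were added, and nobody has to guess which one to remove.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    TRANSPORT = "transport"  # established by the transport (the carrier in the postal analogy)
    MESSAGE = "message"      # established at the message level (the letter's author)


@dataclass(frozen=True)
class TaggedIdentity:
    name: str
    role: Role


def identity_to_impersonate(identities):
    """Return the single message-level identity, regardless of its position."""
    matches = [i for i in identities if i.role is Role.MESSAGE]
    if len(matches) != 1:
        raise LookupError(f"expected one message identity, found {len(matches)}")
    return matches[0]
```

For example, `[TaggedIdentity("carrier", Role.TRANSPORT), TaggedIdentity("russ", Role.MESSAGE)]` yields `russ` whichever order the two entries appear in.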

This “double identity” problem existed in the .NET 3.0 framework, and I’m sorry to report that it still remains in 3.5. If your network administrators aren't too concerned about an IIS endpoint being set to allow anonymous access, Sanjay Antony's blog entry may be useful.

The following are tips for testing Windows Workflow Foundation instances that contain delay activities (timers) when used in conjunction with a passivation store.  This list of tips is certainly not exhaustive, but I believe that I’ve accumulated enough useful techniques to warrant sharing with others.

1.       Purge all rows from your InstanceState table that may contain passivated workflows of the type you’re testing.

2.       In order to support your ability to conveniently blow away InstanceState rows, each developer should have his or her own private persistence database. If possible, avoid using a shared database for this type of testing.

3.       For testing (and perhaps even production), consider merging your passivation store with your application database to help facilitate referential integrity. You will undoubtedly have an associative table that maps workflow instance IDs to your application’s key field(s). This is not Microsoft's recommended production deployment strategy, but it is a useful configuration during development and debugging.

4.       It’s impractical to test timers whose intervals are set to days, weeks, or months, so design your application to allow these intervals to be modified in a data-driven fashion.

5.       Don’t test with time intervals that are too small. The default polling interval for finding expired timers in need of service is 2 minutes; see tip 10 below to adjust it.

6.       Don’t set the polling interval for expired timers too low, or you can create scenarios where you are polling faster than you can process workflows, or where you don’t have enough time to operate the debugger before the next timer cycle begins.

7.       Be aware of the multi-threaded nature of the code you are working with. Timers will cause passivated workflows to rehydrate and begin execution upon timer expiration while you may be busy debugging other workflow instances. If you haven’t removed rows from your InstanceState table you may find yourself simultaneously debugging multiple workflow instances—which is okay if that was your intention, but it generally is not.

8.       Add this.WorkflowInstanceId to your diagnostic output (Trace.WriteLine and friends) so that you know which workflow instance generated each diagnostic message. When multiple workflows begin executing at once, you’ll be glad you have this. Consider adding the managed thread ID as well.

9.       If your workflow operates on a shared resource, use a SynchronizationScope container activity to protect access to it. This is true even for workflows that do not use delay activities, but it is mentioned here for completeness.

10.   You can adjust the timer polling interval using the LoadIntervalSeconds attribute of the System.Workflow.Runtime.Hosting.SqlWorkflowPersistenceService configuration entry.
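Tips 4, 5, 6, and 10 come down to simple arithmetic, sketched below in Python as a model of the timing rather than the WF API. It assumes a hypothetical table of named production delays with data-driven test overrides, and an idealized persistence service that checks for expired timers every `load_interval_s` seconds starting at t = 0 (the interval that LoadIntervalSeconds controls). An expired timer is not serviced until the next poll, so a short test delay can appear to take as long as the polling interval.

```python
import math
from datetime import timedelta

# Hypothetical production delays; real workflows might wait days or weeks.
PRODUCTION_DELAYS = {
    "approval_reminder": timedelta(days=3),
    "escalation": timedelta(weeks=2),
}


def resolve_delay(name, overrides=None):
    """Tip 4: look up a named delay, preferring a data-driven test override."""
    if overrides and name in overrides:
        return overrides[name]
    return PRODUCTION_DELAYS[name]


def pickup_time(created_at_s, delay_s, load_interval_s):
    """Tips 5 and 10: the first poll (polls at 0, L, 2L, ...) at or after the
    moment the timer expires, i.e. when the workflow actually resumes."""
    expires_at_s = created_at_s + delay_s
    return math.ceil(expires_at_s / load_interval_s) * load_interval_s


def poller_overruns(load_interval_s, processing_s):
    """Tip 6: True when, in this model, handling one batch of expired timers
    takes longer than the gap between polls, so work piles up."""
    return processing_s > load_interval_s
```

With the 2-minute default, `pickup_time(0, 30, 120)` returns 120: a 30-second test delay behaves like a 2-minute one. Supplying `{"approval_reminder": timedelta(seconds=30)}` as an override shortens the delay without touching the production table, and `poller_overruns(1, 3)` flags an interval that is too aggressive for a 3-second processing pass.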

 

I was surprised to receive a "Property value is not valid" error dialog when assigning a RuleSet to a Policy activity. There was nothing particularly unusual about this action, other than the fact that the workflow itself seems to be pressing the limits of what my poor 1.7 GHz Centrino Duo CPU with 2 GB of RAM can handle (350+ activities, 1000+ rules). The dialog box wasn't my biggest surprise, though… upon clicking the Details button, I discovered that the reported source of the error was an OutOfMemoryException! Egads! This would seem to imply that the WF designer is catching and eating all exceptions from this property assignment, and treating any and all exceptions as a "Property value is not valid" error. Tsk, tsk, tsk! Bad code monkey! Bad!
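The designer behavior I ran into is the classic catch-everything pattern. The Python sketch below illustrates the difference; `MemoryError` stands in for .NET's `OutOfMemoryException`, and the function names are hypothetical, not the designer's actual code. Translating only genuine validation failures into "Property value is not valid" lets an out-of-memory condition surface as what it really is.

```python
class InvalidPropertyValue(Exception):
    """Hypothetical error shown to the user as "Property value is not valid"."""


def assign_catch_all(setter, value):
    # Anti-pattern: every failure, even running out of memory, is reported
    # as a bad property value.
    try:
        setter(value)
    except Exception:
        raise InvalidPropertyValue("Property value is not valid")


def assign_narrow(setter, value):
    # Only validation failures are translated; anything else propagates.
    try:
        setter(value)
    except ValueError as exc:
        raise InvalidPropertyValue("Property value is not valid") from exc


def exhausted_setter(_value):
    # Simulates the failure I saw: the assignment runs out of memory.
    raise MemoryError("simulated out-of-memory condition")
```

Calling `assign_catch_all(exhausted_setter, ruleset)` produces the misleading "Property value is not valid" error, while `assign_narrow` lets the `MemoryError` through.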

Workflow Foundation (WF) catches the unhandled exceptions of any workflow instance that it’s charged with running. Upon catching an instance's unhandled exception, WF terminates the instance and raises a WorkflowTerminated event, generously including the exception in the event arguments. At first glance this seemed like a reasonable approach. After all, one doesn’t want a sloppy and poorly crafted workflow taking down an entire service and jettisoning hundreds of other smoothly executing workflows. <tears> Unfortunately, my initial enthusiasm was soon dampened by a series of conversations with Jeffrey Richter. Jeffrey convinced me that my initial thoughts were dead wrong, and that the design chosen by the WF team was inherently deficient in this area. </tears>

 

The basic problem boils down to this: WF is no more impervious to the perils of unhandled exceptions than any other managed code we write. Placing the WF runtime in charge of managing multiple workflows doesn’t exempt it from this rule; to the contrary, it would seem to exacerbate the problem. The fact that the runtime runs all of its workflow instances in a single, shared AppDomain seems to seal the deal. At the end of the day, workflows are compiled and executed as machine instructions under the management and control of the CLR. An unhandled exception in a workflow is exactly the same as an unhandled exception in a C# program. The perils of catching all unhandled exceptions are well known and are described in Jeffrey’s book CLR via C# and many other sources, so they aren’t repeated here.

 

It would seem, then, that the only way the WF team could have written a more resilient workflow runtime would have been to create a separate AppDomain for each workflow instance, so that an AppDomain with an unhandled exception could be unloaded without impacting other workflows or the WF runtime. This leaves me feeling slightly nauseous, because the performance overhead of that extra isolation would undoubtedly have been (dare I say) very noticeable; however, I can think of no other option in managed code that deals with unhandled exceptions and still plays by the rules. While I was no fly on the wall when WF was being designed, I can well imagine that the workflow team was forced to choose performance over other, harder-to-measure and easier-to-sweep-under-the-carpet objectives. And that is the end of tonight’s tale of the unhandled workflow exception.
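Here is the argument in miniature, as a Python sketch rather than WF code; `Ledger`, `transfer`, `run_shared`, and `run_isolated` are hypothetical names. `run_shared` mimics the catch, terminate, and carry-on approach with every instance touching the same state, so the half-finished work of a failed instance stays behind and breaks the ledger's invariant while the "runtime" happily keeps going. `run_isolated` gives each instance a private copy that is discarded if it fails, which is roughly the property a per-instance AppDomain would buy, at the price of extra copying.

```python
import copy


class Ledger:
    """Shared state with one invariant: debits must always equal credits."""

    def __init__(self):
        self.debits = 0
        self.credits = 0

    def balanced(self):
        return self.debits == self.credits


def transfer(ledger, amount, fail_midway):
    """A two-step 'workflow'; an exception between the steps leaves it half done."""
    ledger.debits += amount
    if fail_midway:
        raise RuntimeError("unhandled exception inside a workflow")
    ledger.credits += amount


def run_shared(ledger, jobs):
    """Catch, terminate, and carry on, with every instance using the same state."""
    terminated = []
    for job_id, (amount, fail) in enumerate(jobs):
        try:
            transfer(ledger, amount, fail)
        except Exception as exc:  # report the termination, keep running the rest
            terminated.append((job_id, exc))
    return terminated


def run_isolated(ledger, jobs):
    """Each instance works on a private copy; a failed copy is simply discarded."""
    terminated = []
    for job_id, (amount, fail) in enumerate(jobs):
        scratch = copy.deepcopy(ledger)
        try:
            transfer(scratch, amount, fail)
        except Exception as exc:
            terminated.append((job_id, exc))
            continue  # the half-finished scratch state is thrown away
        ledger.debits, ledger.credits = scratch.debits, scratch.credits
    return terminated
```

Running the jobs `[(10, False), (5, True), (7, False)]` through `run_shared` leaves 22 in debits but only 17 in credits, even though only one instance "failed"; `run_isolated` ends balanced at 17 and 17.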