I have heard rumors of the production-level App Streaming service (radesvc.exe) dying at runtime.  In the reported failure, the administrator configured the service for automatic restart to work past the issue, and I have suggested that this only masks the problem.  Don't do that!  The streaming service, like most NT services, should never die, and I'd much rather cure the root cause than work around the issue.

The realities of "real users" and "production use" sometimes necessitate doing things that aren't ideal in a theoretical sense, so this advice cannot always be followed.  That brings us to this post, where I will shed light on the perils and values of configuring the streaming service for automatic restart.

Put your FSFD programmer hat on

You wear this hat when you're writing kernel-mode code.  You write the file system filter code for the App Streaming isolation system, and this code has two primary purposes: file system filtering and process monitoring for sandbox management.

As an FSFD writer, you are never allowed to die or the entire machine will turn blue.  Today's post is not about kernel-mode things dying; it's about application-level things dying.

Put your NT Service programmer hat on

When you wear this hat, you think you're powerful because you run with "higher privilege"; higher than mere apps.  You may even be considered part of the "system", but from the perspective of the kernel code, you're a mere app too, and as a class, all of you are untrustworthy.  When a service dies, the machine does not turn blue, but it is still bad!

What does the service do?

Among other things, it is responsible for launching all isolation sandboxes and placing applications into the sandbox for execution.  Here's a chart that brings some color to this description.  What isn't drawn below is that the service talks to the FSFD to define sandboxes and launch applications into sandboxes.

What does the File System Filter Driver do?

The FSFD hangs out and implements file system redirection – the layers of glass for the file system.  It is also responsible for managing which applications are in the isolation spaces; yes, that's plural on purpose.  On a given machine, especially a XenApp server, the FSFD can easily be tracking 500 isolation spaces.  Consider that there is state data for each of these.  It isn't large, but it exists, and the code that keeps track of it actually uses a balanced binary tree, which seems like overkill until you get a large number of isolation spaces.

In the service, you also have state data for each sandbox.  Here though the state data is allocated per-thread.  Put differently, each sandbox gets a thread and this thread and only this thread is used for communication with the kernel mode code.  In this way, a few things are achieved.

  1. The streaming service doesn’t have to have complicated logic to manage its sandbox state
  2. The kernel code can gate who it’s willing to talk to based on the thread of the creator
  3. When the FSFD has work for the service to do, the service “always” wakes up in the right state.

For computer science stuff, these are all positive actions.

The negative actions

The service isn't supposed to die without a graceful shutdown, and it should only shut down gracefully when it isn't managing any sandboxes.  In practice, "non-scheduled" termination happens all the time during development, and recent reports show it can also happen in production.

The FSFD tolerates service death.  Why?  Primarily it does this because it doesn’t have any other choice.

If the service dies, the kernel code, being all-powerful, isn't surprised by this action.  It "observes" that the service has died, but there isn't a whole bunch it can do about it.

Consider an example

You have isolated applications running.  Let's say you have 10 of them, from 5 different profiles.  This means that you have 10 applications running in 5 different sandboxes.

The service dies…

The applications are still running, but they have lost their support network.

Let's say that the application now issues a DIRECTORY ENUMERATION on stuff in the isolated space.  Normally, the FSFD gathers information from the service to satisfy this request.  This is how the FSFD "LIES" to the application, telling it that things are present that aren't really present.  In this case though, the service is "gone", so what does the FSFD do?  Answer: It does the best it can and "falls back" to AIE-style N-layer directory merge.  The directory enumeration is satisfied, but the files that were there via a lie will not be included in the results.  What effect does this have on the application?  Don't know; it depends on the app, but in general the results are bad.

If the application issues a file open, the FSFD satisfies it based on the things it can answer without the help of the streaming service.  This means that if the file is really present in the cache, the file open will succeed; if it isn't, it won't, or execution will drop down to a lower layer in the layers of glass to answer the file operation.

Will this work for the application?  Maybe.  Ideally, you’d like to terminate the applications, but terminating applications when users have stuff running and haven’t saved their work is considered bad form.

New sandboxes are launched

Recall that new sandboxes cannot be created without the help of the streaming service, so here it is a given that the service has been restarted.  When the service loads, it contacts the FSFD to register itself.  The kernel code says "nice to have you back", but there isn't a doggone thing it can do to help the orphaned sandboxes from the previous run of the service.  All the "app level" state data is gone, and there's no way to put it back together again.

New launches though can be handled.  When created, the FSFD notes who the service is and will communicate with this “new” instance of the streaming service to manage the “new” sandboxes.

During development this is cool!

When developing the code, if you are the NT service writer, this is really, really cool: you can write code, debug it, terminate the debugger (which unloads the service), change the code, compile it again, and run it (which loads the service), and the FSFD will just plain deal with all of this.  Very fast for development; no reboots needed, and you can even do all this from a visual development environment like MS Visual Studio.

During PRODUCTION this is not as cool!

Being willing to take on new sandboxes means that auto-restarting the service can seem like a good idea.  What this overlooks is that the orphaned sandboxes are, well, orphaned; they don't have their support network, and without the streaming service, directory enumerations and file opens are not going to occur correctly unless the streaming cache is completely full.

Put your ADMINISTRATOR hat on

What should you do?  Answer: Treat death of the streaming service with care.  It should be investigated and fixed.  The Citrix support team will love this: "Joe said we should report service death rather than restarting the service."  My response: the service should not DIE unless you kill it!  I'm pretty sure the service team already has the report, so I'm really writing for the next person, and hopefully by the time you read this, we'll already have it fixed….

How to work around it

That said, if you get into this situation, run one app from each profile with "-e" in RadeRunSwitches.  This will fully populate the streaming cache and will minimize the cases where the application fails a file open or directory enumeration.  Next, turn "-e" off, as it commands a full extract on EVERY app launch and you don't want that.  Next step: get the service fixed.  In the meantime, you can auto-restart the service to get new sandboxes created; just be sure you aren't using the auto-restart to hide a problem that really needs to be investigated.
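For reference, the auto-restart the administrator set up is configured through the Service Control Manager's failure actions.  A config sketch using the built-in sc.exe tool (the radesvc service name comes from above; the delay and reset values are illustrative, not a recommendation):

```
rem Restart radesvc 60 seconds after a failure; reset the failure
rem count after a day.  This is the setting that masks the problem.
sc failure radesvc reset= 86400 actions= restart/60000

rem Query the currently configured failure actions.
sc qfailure radesvc
```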

Before people ask, I already have feelers out to the people that have seen the service die.  I hate to have this happen with production code, but the correct answer is to research the problem and make the fix.  Hopefully readers of this post will appreciate the openness of acknowledging a bug that isn't widely seen.

Joe Nord

Citrix Systems Product Architect – Application Streaming.