Since 2011, when Thomas Berger first published his blog post on ports and threads for the PVS stream, the common understanding has been that a PVS server's concurrent streaming capacity should be calculated like this:
For best performance, “# of ports” x “# of threads/port” = “max clients”.
Dan Allen echoed that theory with a successful, real-world example.
At the end of his post, Thomas wrote:
“It doesn’t really matter if you achieve the sufficient amount by increasing the number of ports or the number of threads per port, but our lab testing has shown the best StreamProcess performance is attained when the threads per port is not greater than the number of cores available on the PVS server. Don’t worry if your PVS server doesn’t have enough cores. You’ll just see a higher CPU utilization, but CPU utilization has never been a bottleneck for PVS.”
Is that really the case here?
Recently, I had a chance to talk with our PVS Sr. Architect Jeff Pinter-Parsons, who led me to believe that we need to reevaluate the details surrounding this leading practice.
Per our discussion, PVS does not use a standard threading model in which each client gets its own port/thread, as a TFTP server does. Instead, PVS has a listener for each port that receives a request and hands it to a port-specific thread pool. The threads in the pool process the requests, one per thread. If there are more threads than cores, the leftover threads simply block. Adding more threads than there are logical CPU cores is not going to help performance.
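The dispatch pattern described above can be sketched in a few lines. This is only an illustration in Python, not PVS code: `serve_port`, the device names, and the use of `ThreadPoolExecutor` are assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def serve_port(port, requests, threads_per_port):
    """Hand every request received on one port to that port's own thread pool.

    Returns {request: name of the pool thread that processed it}.
    """
    def handle(request):
        # Each request is processed by exactly one pool thread.
        return request, threading.current_thread().name

    # The pool is port specific and fixed in size; requests beyond the
    # number of free threads simply wait their turn.
    with ThreadPoolExecutor(max_workers=threads_per_port,
                            thread_name_prefix=f"port{port}") as pool:
        return dict(pool.map(handle, requests))


handled = serve_port(6910, [f"device-{i}" for i in range(20)], threads_per_port=8)
print(len(handled), "requests handled by", len(set(handled.values())), "threads")
```

The sketch shows only the listener-plus-pool shape. It says nothing about how the real StreamProcess threads are scheduled onto cores.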
I decided to test the theory in a real production environment. The test booted 500 Windows 7 target devices within 5 minutes via the Delivery Controller (reconfigured; by default it boots 10 target devices per minute) against one PVS server (HA disabled). For each test, PVS server perfmon data was captured together with the boot times of the first 10 and the last 10 target devices. The PVS server (version 184.108.40.206) is a VM with 8 vCPUs and 32 GB RAM.
The baseline implemented the previous recommendation: the port range was widened to 36 ports (actually 33, since the first 3 are reserved) with 40 threads per port, for about 1400 target devices. 33 x 40 = 1320. Sounds like a good number?
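The arithmetic behind that baseline can be sketched as follows. The function name and the `reserved` parameter are illustrative only; the port range, the three reserved ports, and the 40 threads per port come from the test setup.

```python
def streaming_capacity(first_port, last_port, threads_per_port, reserved=3):
    """Old rule of thumb: usable ports x threads per port = max clients."""
    total_ports = last_port - first_port + 1  # the port range is inclusive
    usable_ports = total_ports - reserved     # the first 3 ports are reserved
    return usable_ports, usable_ports * threads_per_port


usable, max_clients = streaming_capacity(6910, 6945, threads_per_port=40)
print(usable, max_clients)  # 33 1320
```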
Here are the test results.
| Test | Port range | Threads per port | Average boot time, first 10 VDIs | Average boot time, last 10 VDIs |
|---|---|---|---|---|
| Baseline | 6910 to 6945 | 40 | 49.8 | 83.2 |
| Test 1 | 6910 to 6945 | 8 | 28.2 | 60.2 |
| Test 2 | 6910 to 6968 | 8 | 29.4 | 48.9 |
| Test 3* | 6910 to 6968 | 8 | 36.1 | 92.5 |
* Test 3 uses the same configuration as Test 2, except that the I/O limits are set to 0 for both local and remote, which is another widely repeated rumour.
The results clearly show that adding more threads per port does not help, while adding more ports does. The average boot time dropped by roughly 28% to 43% after the threads per port were changed to match the number of logical CPUs.
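As a quick check on the table, the change against the baseline can be computed directly from the reported averages. The helper name `percent_change` is just for this sketch; the numbers are copied from the results table.

```python
# Average boot times from the results table: (first 10 VDIs, last 10 VDIs).
results = {
    "Baseline": (49.8, 83.2),
    "Test 1": (28.2, 60.2),
    "Test 2": (29.4, 48.9),
    "Test 3": (36.1, 92.5),
}


def percent_change(baseline, value):
    """Negative means faster than the baseline, positive means slower."""
    return (value - baseline) / baseline * 100


base_first, base_last = results["Baseline"]
for name, (first, last) in results.items():
    if name == "Baseline":
        continue
    print(f"{name}: first 10 {percent_change(base_first, first):+.1f}%, "
          f"last 10 {percent_change(base_last, last):+.1f}%")
```

Run against the table, Test 3 is the only configuration whose last 10 devices boot slower than the baseline.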
For a typical situation of 1000 target devices per PVS server, the following configuration is recommended as a baseline to start with:
- 3-6 vCPUs (I will explain in more detail below) and 32 GB RAM (depending on the number of vDisks).
- Streaming ports reconfigured to 6910 to 6968 (default 6910 to 6930).
- Threads per port set to match the number of vCPUs.
- Leave the remaining advanced options unchanged.
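The baseline can be written down as a small structure to make the resulting numbers explicit. `PvsBaseline` and its field names are hypothetical and do not correspond to any PVS API or configuration file; only the port ranges and the threads-equals-vCPUs rule come from the recommendation.

```python
from dataclasses import dataclass


@dataclass
class PvsBaseline:
    """Hypothetical container for the recommended starting configuration."""
    vcpus: int
    first_port: int = 6910
    last_port: int = 6968

    @property
    def threads_per_port(self):
        # Threads per port are set to match the vCPU count.
        return self.vcpus

    @property
    def port_count(self):
        # The port range is inclusive.
        return self.last_port - self.first_port + 1


baseline = PvsBaseline(vcpus=6)
default_port_count = 6930 - 6910 + 1
print(baseline.port_count, "ports (default range has", default_port_count, "),",
      baseline.threads_per_port, "threads per port")
```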
We get a lot of feedback about the vCPU number from both internal and external sources. The simple answer is “It depends.” Here are some scenarios:
- If we’re working on a 12 core box that has 2 NUMA nodes, I’d much rather cap my PVS servers at 6 vCPUs.
- If we’re deploying PVS on an older (very common) 6 core box with 2 NUMA nodes, 3 vCPUs would be my preference.
For more information about the NUMA nodes, please check here.
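Both scenarios are consistent with capping the PVS VM at the number of cores in one NUMA node. That reading is an assumption drawn from the two examples, not a stated rule, and `vcpus_per_pvs_vm` is a name made up for this sketch.

```python
def vcpus_per_pvs_vm(physical_cores, numa_nodes):
    """Cap the VM at the cores of a single NUMA node (assumed rule)."""
    if numa_nodes < 1 or physical_cores < numa_nodes:
        raise ValueError("need at least one core per NUMA node")
    return physical_cores // numa_nodes


print(vcpus_per_pvs_vm(12, 2))  # 6
print(vcpus_per_pvs_vm(6, 2))   # 3
```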
Any feedback is highly appreciated!
P.S. I get a lot of questions regarding the setting of 10 boot actions per minute. Here are the steps to change it:
- Select Hosting from Citrix Studio.
- Right click the Host connection and choose Edit Connection.
- Select the Advanced tab on the left.
- You will find the option to change the “Maximum new actions per minute” there. The default value is 10, which means only 10 machines will boot per minute.
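To see why the default matters for a boot test of this size, here is a rough calculation. It assumes every action is a machine boot and that this throttle is the only limit; both function names are made up for the sketch.

```python
import math


def minutes_to_boot(devices, actions_per_minute):
    """Minutes needed to issue all boot actions at a given throttle."""
    return math.ceil(devices / actions_per_minute)


def actions_needed(devices, window_minutes):
    """Throttle value needed to issue all boot actions within the window."""
    return math.ceil(devices / window_minutes)


print(minutes_to_boot(500, 10))  # 50 minutes at the default of 10 per minute
print(actions_needed(500, 5))    # 100 actions per minute to fit 500 boots in 5 minutes
```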
Principal Consultant, Citrix Consulting