On a recent NetScaler project, a fairly simple task turned out to be a bit more complicated than we anticipated. After a bit of troubleshooting we discovered the reason, and eventually found a solution thanks to Kadda Toumi, a Citrix Escalation Engineer who very helpfully assisted me.
For virtually all NetScaler and/or Access Gateway projects, DNS is a crucial part of the deployment, so we started off by creating a virtual DNS server pointing to a number of backend DNS servers. Below is the example configuration, which most of you have probably configured many times already.
add server DNS01 10.10.10.1
add server DNS02 10.10.10.2
add service DNS01 DNS01 DNS 53
add service DNS02 DNS02 DNS 53
add lb vserver vServer_DNS DNS 10.10.10.3 53 -persistenceType SOURCEIP -lbMethod ROUNDROBIN -cltTimeout 120
bind lb vserver vServer_DNS DNS01
bind lb vserver vServer_DNS DNS02
add lb monitor mon_dns DNS -query domain.name.int -queryType Address -LRTM ENABLED
bind lb monitor mon_dns DNS02
bind lb monitor mon_dns DNS01
add dns nameServer vServer_DNS
At this point we have a fully functional virtual DNS server which serves the system and, in case you want to use the full layer 2 VPN through the Access Gateway plugin, also serves as a local DNS server for these connections. However… The first thing you always do is a functional test, right? So that’s what I did, in this case querying the Windows domain name of the environment. To my surprise, it didn’t work as expected. See the outcome below.
First, a test with the dig command without a server parameter. This uses the locally configured DNS server of your NetScaler system, in our case the virtual DNS server vServer_DNS we just created. The reply is apparently truncated, dig announces a retry in TCP mode, and then nothing more comes back.
root@ns# dig domain.name.int
;; Truncated, retrying in TCP mode.
The first thing we did then was to check whether a direct query to the backend DNS servers yielded a different outcome and, as you can see below, it did: dig again falls back to TCP, but this time the retry succeeds.
root@ns# dig @10.10.10.2 domain.name.int
;; Truncated, retrying in TCP mode.
; <<>> DiG 9.6.1-P3 <<>> @10.10.10.2 domain.name.int
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19831
;; flags: qr aa rd; QUERY: 1, ANSWER: 16, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;domain.name.int. IN A
;; ANSWER SECTION:
domain.name.int. 600 IN A 10.10.10.4
domain.name.int. 600 IN A 10.10.10.5
domain.name.int. 600 IN A 10.10.10.6
domain.name.int. 600 IN A 10.10.10.7
domain.name.int. 600 IN A 10.10.10.8
domain.name.int. 600 IN A 10.10.10.9
domain.name.int. 600 IN A 10.10.10.10
[.. many many more entries here ..]
;; Query time: 0 msec
;; SERVER: 10.10.10.2#53(10.10.10.2)
;; WHEN: Thu Jul 12 11:02:56 2012
;; MSG SIZE rcvd: 290
To find out what went wrong with our configuration we used the only troubleshooting resource you’ll ever need: Wireshark 😉 We started a trace and reproduced the problem we were seeing.
First thing to check is whether or not the communication flow looks as expected. We expected to see the following:
- NSIP -> DNS VIP (10.10.10.3)
- SNIP -> DNS Server (10.10.10.2)
- DNS Server (10.10.10.2) -> SNIP
- SNIP -> NSIP
Well, the communication flow itself looked fine, which meant we had to “dig” (no pun intended) a little deeper. In the response packet (steps 3 and 4), we could see the response being sent to the client with the ‘Message Truncated’ flag set. When the client receives this, it tries to connect over TCP so it can get the full response. It then sent a SYN to the NetScaler vServer IP, and the NetScaler reset the connection, as there was no TCP vServer listening on that IP and port. Fallback to TCP usually happens only for unusually large DNS responses and is a rare circumstance. In our case, the response was exceptionally large.
We could just add a second DNS virtual server with protocol type DNS_TCP here and be done. However, while this would work for a vServer for external use, internally it is not a solution.
NetScaler 9.3 allows you to configure a virtual DNS server and expects it to be of type UDP. Furthermore, only one virtual DNS server can be configured. While this makes sense, as the virtual server already takes care of high availability and performance when multiple services are bound, it can be a problem in scenarios like this one, when the DNS server sends a response so large that a TCP connection becomes necessary.
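The truncation flag can also be confirmed from the NetScaler shell without a packet trace. The sketch below assumes dig's usual output format, in which the header flags line lists tc for a truncated reply, and relies on dig's +ignore option so that the truncated UDP reply is printed instead of being retried over TCP. The function name has_tc_flag is made up for this example.

```shell
# has_tc_flag: read dig output on stdin; succeed if the header flags include "tc".
# Hypothetical helper; it only pattern-matches dig's ";; flags:" line.
has_tc_flag() {
  grep -Eq '^;; flags:[^;]* tc[ ;]'
}

# Usage (VIP and name taken from the example configuration):
#   dig +ignore @10.10.10.3 domain.name.int | has_tc_flag && echo "reply truncated"
#   dig +tcp @10.10.10.3 domain.name.int    # expected to fail if nothing listens on TCP port 53
```

Running the same two dig commands against a backend server (for example 10.10.10.2) should show whether the TCP path works there while failing on the virtual server.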
To change the default behavior of a NetScaler system we utilize the nsapimgr command to tell the NetScaler kernel to override kernel values we’d like to modify.
root@ns# nsapimgr enable_vpn_dnstruncate_fix
root@ns# nsapimgr enable_vpn_dns_override
After firing off these commands, everything worked as we needed. What’s left to do now? As always with nsapimgr commands, you should test them before using them in production environments, and once you’re satisfied with the results you should make them persistent. What does that mean? Since these commands set flags in the running NetScaler kernel, the settings are gone once you reboot the device. Hence, we need to tell the NetScaler boot process which nsapimgr commands to apply on a reboot (e.g. after a power outage or firmware upgrade). There are various ways to accomplish that; the most common is to write the commands to the rc.netscaler file on the flash drive of your device. This is how you do it (see CTX109261):
echo "/netscaler/nsapimgr enable_vpn_dnstruncate_fix" >> /nsconfig/rc.netscaler
echo "/netscaler/nsapimgr enable_vpn_dns_override" >> /nsconfig/rc.netscaler
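Because >> appends unconditionally, running the two echo commands a second time leaves duplicate lines in rc.netscaler. A minimal sketch of an idempotent variant, assuming a POSIX shell with grep available; the helper name append_once is hypothetical and not something NetScaler ships:

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line already exists.
# Hypothetical helper for illustration; not a NetScaler-provided command.
append_once() {
  line=$1
  file=$2
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Usage (same commands and path as above):
#   append_once "/netscaler/nsapimgr enable_vpn_dnstruncate_fix" /nsconfig/rc.netscaler
#   append_once "/netscaler/nsapimgr enable_vpn_dns_override" /nsconfig/rc.netscaler
```

The file is only written when the exact line is missing, so the helper can be rerun safely after a firmware upgrade or a configuration restore.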
There are other ways to make nsapimgr commands persistent, and you should choose among them depending on the nature of the task you have been given. Some more details are outlined in this blog article about the three different files you can place special commands in.
Note: Due to increased demand from our customers for TCP-enabled system DNS servers, this feature has been added to the NetScaler 10.0 code. At the system level we added the option to configure a DNS vServer of type UDP_TCP.