We automate for a reason.
At Citrix we seek to automate both the provisioning of complex product deployments and the execution of system and interoperability tests on those deployments. Automated provisioning of Citrix products preserves test engineer time for actual testing as opposed to routine set-up. Automated test execution allows routine regression testing to be done early and often. Both are valuable. Both help us achieve our goals of:
- increasing product quality (through early, frequent and cheap regression testing, which in turn frees scarce human effort for higher-value testing)
- increasing efficiency (by reducing the time and effort to get valuable quality feedback to engineers)
Over the years we have identified a number of critical success factors for this kind of test automation:
Automation capabilities or services must be self-service and available on-demand, via both GUI and API. This promotes take-up of automation services – we do not want to build expensive automation assets that nobody uses, or that only a handful of people in the test team use. Every engineer should have easy access to these productivity tools so that they can test their code or system integrations early and often.
Engineers will only use automation if they trust it to work. Engineers need repeatable test or repro deployments that are verified as working.
Engineers will only use automation if it is reliable, highly available and fault-tolerant.
If our automation assets are to remain useful then the ability for engineers to rapidly adapt them or extend them is key. The architecture of our automation and the APIs it offers are of paramount importance.
Case Study – Resiliency in Citrix Automated Hypervisor Provisioning
XenRT is a system widely used within Citrix for automated provisioning of hypervisors, VMs and CloudPlatform instances on internal hardware infrastructure. It has a GUI and rich APIs (accessibility and extensibility), it self-tests the deployments it makes (trustworthiness) and is architected for reliability.
The core of XenRT is a scheduler that maps job requests (think “Give me a XenServer 6.5 pool and a bunch of Windows VMs”) onto available lab hardware. It books that hardware out, provisions the requested deployment onto the bare metal and passes the access details back to the requestor. It also allows the user to browse a huge library of automated test cases for Citrix products, and to select and run them. These services, used by developers and testers alike, depend on the availability of physical machines. XenRT has hardware in three different geographies, most of it split between a lab on the US west coast and a lab in the UK. The user is abstracted from this hardware: they submit a job request and XenRT takes care of the rest. XenRT is very widely used – it recently ran its one millionth test job.
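To make the scheduling idea concrete, here is a minimal sketch in Python of mapping a job request onto free lab machines and booking them out. It is an illustration only, not XenRT's actual code or API: the names `Machine`, `JobRequest`, `Scheduler`, `schedule` and `release`, and the first-fit selection policy, are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    """A physical lab machine (hypothetical model, not XenRT's schema)."""
    name: str
    site: str
    booked_by: Optional[str] = None  # job id holding the machine, if any


@dataclass
class JobRequest:
    """A request for some number of machines, optionally at one site."""
    job_id: str
    machines_needed: int
    site: Optional[str] = None  # None means any site will do


class Scheduler:
    """Toy first-fit scheduler: book the first free machines that match."""

    def __init__(self, machines: List[Machine]) -> None:
        self.machines = list(machines)

    def schedule(self, job: JobRequest) -> Optional[List[str]]:
        # Collect free machines that satisfy the site constraint.
        free = [
            m for m in self.machines
            if m.booked_by is None and (job.site is None or m.site == job.site)
        ]
        if len(free) < job.machines_needed:
            return None  # not enough capacity right now; caller may retry
        chosen = free[:job.machines_needed]
        for m in chosen:
            m.booked_by = job.job_id  # book the hardware out to this job
        return [m.name for m in chosen]

    def release(self, job_id: str) -> None:
        # Return every machine held by the job to the free pool.
        for m in self.machines:
            if m.booked_by == job_id:
                m.booked_by = None
```

A caller would build a `Scheduler` from the lab inventory, call `schedule` with a `JobRequest`, and call `release` when the job finishes. A real scheduler would also need queuing, priorities and the provisioning step itself, all of which this sketch leaves out.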
A few months ago we put in a mechanism to make XenRT more resilient to failures in its central infrastructure.
Previously, we had a single database and scheduler in our lab in Santa Clara. We’d toyed with mirroring the database to the UK lab, and it was there as a last resort, but we didn’t have a decent failover plan.
With XenRT services becoming ever more business-critical, we needed to improve the situation, and took several steps to give us better failover capability:
- Synchronous replication of the database. This gave us the additional benefit of being able to run read-only queries outside of the US, improving performance and hence user experience.
- NetScaler GSLB (global server load balancing). With GSLB, NetScaler acts as a DNS server that monitors the health of the XenRT scheduler and database in both the UK and the US, and points users at their nearest available service.
- API calls on the XenRT schedulers to monitor their health, which NetScaler can use to mark hardware as available or offline as appropriate.
The failover is still a manual process (a database failover is expensive, so we don’t want to do it unnecessarily), but it’s now very quick.
NetScaler has proven to be an essential tool in this, solving many of the challenges we’ve faced. Many XenRT services, such as downloading XenServer builds, now have NetScaler GSLB in front of them to give out the nearest available provider.
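The "nearest available" behaviour can be pictured with a small Python sketch. This is not NetScaler configuration or any NetScaler API; it only models the decision a GSLB-style resolver makes once health checks have reported in. The `Site` type, the `PREFERENCE` table and the `pick_site` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Site:
    """A service location and the result of its latest health check."""
    name: str
    healthy: bool


# Assumed preference order per user region: nearest site first.
PREFERENCE: Dict[str, List[str]] = {
    "UK": ["UK", "US"],
    "US": ["US", "UK"],
}


def pick_site(user_region: str, sites: List[Site]) -> Optional[str]:
    """Return the nearest healthy site for a user, or None if all are down."""
    by_name = {s.name: s for s in sites}
    # Walk the region's preference list and take the first healthy site.
    for name in PREFERENCE.get(user_region, []):
        site = by_name.get(name)
        if site is not None and site.healthy:
            return name
    # Unknown region, or no preferred site is healthy: take any healthy site.
    for site in sites:
        if site.healthy:
            return site.name
    return None
```

With both sites healthy, a UK user is sent to the UK and a US user to the US. If the US site is marked unhealthy, US users are pointed at the UK instead.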
This architecture proved its worth in a recent system outage in the US – the failover worked as planned and, apart from reduced machine capacity, was entirely transparent to our users, thus meeting the key goal of high availability and fault tolerance.
(With thanks to John Dilley, XenRT Architect).