Quite some time has passed since I last shared XenServer tips with our Blog community. Oddly enough, I stumbled across a buried draft in my files. It was dated 30-JUN-2014 and titled, “A hat-tip”. It was written for a client who had asked me to explain the cause of, and the resolution to, an issue they had experienced.
Because this tip might be very helpful, I’ve wiped the dust off and cleaned up the content. In advance, I would like to say “Thank You.” With the release of XenServer 6.5 recently behind us, and since the title of this article references FSCK, please direct any inquiries, requests, or comments about non-EXT filesystems to:
http://discussions.citrix.com/forum/1289-feature-requests-and-suggestions/

If you wish to contribute to open source XenServer, take a look here:

http://xenserver.org/overview-xenserver-open-source-virtualization/source-code.html

Setting the Stage

XenServer uses the EXTended filesystem (EXT), which has been around since the early 1990s, for its Domain0. At installation time, you can also elect to use EXT for the local storage repository, instead of the default LVM scheme. In my humble opinion it is only natural, given the kernel, that EXT2/3/4 remains synonymous with Linux-based distributions. Like NTFS, EXT3 and EXT4 are journaling filesystems with some pretty cool perks. Their resiliency when it comes to data recovery and integrity is one of my two favorite features, and EXT isn’t as CPU intensive as alternative filesystems.
So, as a XenServer administrator, you may or may not know that an EXT filesystem is, by default, set to check itself every 180 days or after detecting an unclean unmount.
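
The check interval for a given filesystem can be inspected with tune2fs -l <device>, which prints superblock fields such as "Check interval" and "Maximum mount count". What follows is only a rough sketch: it parses text in that format and converts the interval to days. The function names are my own, and the sample text is invented for illustration rather than captured from a real host.

```python
import re

# Illustrative sample in the format printed by `tune2fs -l <device>`.
# These values are invented for the example, not taken from a real host.
SAMPLE_TUNE2FS_OUTPUT = """\
Filesystem state:         clean
Mount count:              12
Maximum mount count:      30
Last checked:             Mon Jun 30 09:15:02 2014
Check interval:           15552000 (6 months)
"""


def parse_superblock_fields(text):
    """Turn 'Key:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_interval_days(fields):
    """Return the time-based check interval in whole days (0 if unset)."""
    match = re.match(r"(\d+)", fields.get("Check interval", "0"))
    seconds = int(match.group(1)) if match else 0
    return seconds // 86400


if __name__ == "__main__":
    fields = parse_superblock_fields(SAMPLE_TUNE2FS_OUTPUT)
    print("Filesystem state:", fields["Filesystem state"])
    print("Check interval (days):", check_interval_days(fields))
```

An interval of 15552000 seconds works out to the 180 days mentioned above.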

Here’s What That Means to You

At some point, be it a power failure or finally getting around to running updates, a XenServer administrator might notice a longer than normal boot process. This lengthy boot is likely the result of an EXT filesystem check, triggered because the system has recognized that more than 180 days have passed since the last one. Or, in the prime example, power was lost to a facility and everything had to be powered on from the last “unclean” state.
At this point, we need to picture a 10TB, EXT-based storage repository coming online whose filesystem was not cleanly unmounted because of the power failure. XenServer’s Dom0 will detect the problem and determine that a file system check is required.

The Problem

Many administrators would never notice FSCK (file system check) verifying disk integrity early in the boot process. In the example I’ve provided, however, FSCK was unable to complete its job of checking the filesystem. As a result, FSCK left our system in a READ-ONLY state. This is hardly usable for XenServer purposes, right?
Here is the Cause of This Issue:
– The kernel, or dom0, has a default amount of memory allocated to it in /boot/extlinux.conf
– When you have command-line access to a XenServer, well, welcome to Dom0! You are now in user space!
– Any command that you or the system executes (ping, ssh, cat, ls, etc.) uses this memory and, if needed, the 512MiB of swap space.
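
To see how much memory and swap Dom0 actually has to work with, the standard Linux /proc/meminfo file can be read from the command line (free -m reports the same figures). Here is a minimal sketch of parsing it, assuming a Python interpreter is at hand; the helper names are hypothetical and the sample figures are invented to match the numbers in this article.

```python
# Illustrative sample in /proc/meminfo format (kB values are invented).
SAMPLE_MEMINFO = """\
MemTotal:         770048 kB
MemFree:          120000 kB
SwapTotal:        524288 kB
SwapFree:         524288 kB
"""


def read_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def read_meminfo_file(path="/proc/meminfo"):
    """Read and parse the live file on a Linux host."""
    with open(path) as handle:
        return read_meminfo(handle.read())


def summarize(values):
    """Return (total RAM, total swap) in MiB, from kB figures."""
    return values.get("MemTotal", 0) // 1024, values.get("SwapTotal", 0) // 1024


if __name__ == "__main__":
    ram_mib, swap_mib = summarize(read_meminfo(SAMPLE_MEMINFO))
    print("RAM: %d MiB, swap: %d MiB" % (ram_mib, swap_mib))
```

On a real host, summarize(read_meminfo_file()) would report the live values instead of the sample. Expect MemTotal to come in somewhat under the configured dom0_mem figure, since the kernel reserves a portion for itself.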

The Solution

Since the storage repository in question was a 10TB, EXT-based filesystem, FSCK required more memory to do its job. In /boot/extlinux.conf, we increased the dom0 memory for the xe and xe-serial boot entries from 752M to 2048M (see the dom0_mem= value in the listings below):

Before: /boot/extlinux.conf

label xe
# XenServer
kernel mboot.c32
append /boot/xen.gz mem=1024G dom0_max_vcpus=8 dom0_mem=752M,max:752M watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M cpuid_mask_xsave_eax=0 console=vga vga=mode-0x0311 --- /boot/vmlinuz-2.6-xen root=LABEL=root-qyaputvx ro xencons=hvc console=hvc0 console=tty0 quiet vga=785 splash --- /boot/initrd-2.6-xen.img

label xe-serial
# XenServer (Serial)
kernel mboot.c32
append /boot/xen.gz com1=115200,8n1 console=com1,vga mem=1024G dom0_max_vcpus=8 dom0_mem=752M,max:752M watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M cpuid_mask_xsave_eax=0 --- /boot/vmlinuz-2.6-xen root=LABEL=root-qyaputvx ro console=tty0 xencons=hvc console=hvc0 --- /boot/initrd-2.6-xen.img

After: /boot/extlinux.conf

label xe
# XenServer
kernel mboot.c32
append /boot/xen.gz mem=1024G dom0_max_vcpus=8 dom0_mem=2048M watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M cpuid_mask_xsave_eax=0 console=vga vga=mode-0x0311 --- /boot/vmlinuz-2.6-xen root=LABEL=root-qyaputvx ro xencons=hvc console=hvc0 console=tty0 quiet vga=785 splash --- /boot/initrd-2.6-xen.img

label xe-serial
# XenServer (Serial)
kernel mboot.c32
append /boot/xen.gz com1=115200,8n1 console=com1,vga mem=1024G dom0_max_vcpus=8 dom0_mem=2048M watchdog_timeout=300 lowmem_emergency_pool=1M crashkernel=64M@32M cpuid_mask_xsave_eax=0 --- /boot/vmlinuz-2.6-xen root=LABEL=root-qyaputvx ro console=tty0 xencons=hvc console=hvc0 --- /boot/initrd-2.6-xen.img
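
For anyone who would rather script this edit than make it by hand, here is a hedged sketch of the same change. It is not a Citrix-provided tool: the function name and its parameters are my own invention, and it only rewrites the dom0_mem= option on the append lines of the boot entries you name, leaving every other entry untouched.

```python
import re

# Matches the hypervisor's dom0_mem option and its value, e.g.
# "dom0_mem=752M,max:752M" or "dom0_mem=2048M".
DOM0_MEM = re.compile(r"\bdom0_mem=\S+")


def set_dom0_mem(config_text, new_value, labels=("xe", "xe-serial")):
    """Return config_text with dom0_mem replaced in the chosen boot entries.

    Only 'append' lines that follow a 'label <name>' line whose name is in
    `labels` are changed; entries under any other label are left as they are.
    """
    current_label = None
    result = []
    for line in config_text.splitlines(True):  # True keeps line endings
        words = line.split()
        if len(words) >= 2 and words[0] == "label":
            current_label = words[1]
        elif words and words[0] == "append" and current_label in labels:
            line = DOM0_MEM.sub("dom0_mem=" + new_value, line)
        result.append(line)
    return "".join(result)
```

A cautious way to use it would be to read /boot/extlinux.conf, pass the text through set_dom0_mem(text, "2048M"), write the result to a separate file, and compare the two before replacing anything. Keep a backup of the original, and remember that the new value only takes effect after a reboot.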

So, with the increased amount of memory allocated to dom0, which FSCK uses as a user space program, we rebooted and presto: FSCK completed, everything was intact, and XenServer was ready to be back in production.

Why Not Use the Fallback Kernel?

Normally, I would have, leaving the primary boot kernel alone. But in this case the system needed to be back online as soon as possible. Rebooting into the fallback kernel, changing its dom0 values, rebooting back into the fallback kernel, letting FSCK scan 10TB of data, undoing the changes, and rebooting into the primary kernel would not have been a speedy recovery.
More importantly, for this customer, Dom0’s memory allocation was already due to be increased during the next maintenance window, so yes: two support tickets resolved in one!

I hope this information is of benefit to our community if you find yourself in this situation.

–jkbs | @xenfomation | XenServer.org Blog