VMware Data Recovery: Consider the Blocksize of the VDR Datastore

I recently implemented VMware Data Recovery (VDR) for a quick and dirty backup of a few virtual machines. When I checked the backup, most virtual machines had been backed up properly, but one virtual machine failed with the following error message: "Unable to access file <unspecified filename> since it is locked", combined with an error message inside the VDR log: "Trouble reading files, error -3942 (delete snapshot failed)".

After some investigation I discovered that the backup snapshot was stuck on the VM (and was not visible in the Snapshot Manager). In addition, one disk of that VM was still attached to the VDR appliance! So I first tried to remove the snapshot with the well-known trick (create a new snapshot, then perform a "Delete All"), but I got the same message as above ("Unable to access file..."). Then it occurred to me that I had to remove the disk from the VDR appliance first. A simple reboot of the appliance didn't help, so I removed the disk in the properties of the VDR appliance VM, which worked! (Take care in this step not to delete the disk file ;) ). After another reboot of the VDR appliance the disk was no longer visible to it.

Next I investigated why the disk had not been removed from the VDR appliance, which in turn had caused the snapshot removal to fail. Then I remembered that the VDR appliance does a hot add of the virtual disk that has to be backed up (just as VCB does in hot-add mode with a helper VM).

In the past I had already run into the issue that snapshots of thin-provisioned disks cannot be created in datastores that do not have a sufficient block size. So I checked the disk size of the failed VM: it was 270 GB, while the datastore in which the VDR appliance resided had a block size of only 1 MB (256 GB maximum file size). So this seems to be the problem. A short search in the VMware communities confirmed my guess, so I moved the VDR appliance to another datastore with a larger block size.

Conclusion: Always deploy your VMware Data Recovery appliance or VCB helper VM into a datastore with an adequate block size!
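The check I did by hand can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the commonly documented VMFS-3 limits (1/2/4/8 MB block size for roughly 256/512/1024/2048 GB maximum file size), and the function names are made up for this example.

```python
# Maximum file size in GB per VMFS-3 block size in MB (commonly documented values).
# The real limit is slightly below each value, so a disk of exactly that size does not fit.
VMFS3_MAX_FILE_GB = {1: 256, 2: 512, 4: 1024, 8: 2048}


def disk_fits(disk_size_gb, block_size_mb):
    """Return True if a virtual disk of this size fits the datastore's file size limit."""
    return disk_size_gb < VMFS3_MAX_FILE_GB[block_size_mb]


def min_block_size_mb(disk_size_gb):
    """Return the smallest VMFS-3 block size (MB) that can hold the disk, or None."""
    for block_size_mb in sorted(VMFS3_MAX_FILE_GB):
        if disk_fits(disk_size_gb, block_size_mb):
            return block_size_mb
    return None


if __name__ == "__main__":
    # The case from this post: a 270 GB disk and a datastore with 1 MB block size.
    print(disk_fits(270, 1))
    print(min_block_size_mb(270))
```

For the 270 GB disk above, the datastore holding the VDR appliance would need a block size of at least 2 MB under these assumptions.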


Cannot register your CapacityIQ Appliance: Another CIQ is registered to VC

During my CapacityIQ hands-on I reinstalled my CapacityIQ VA. The problem was that I had forgotten to uninstall / unregister the appliance in vCenter Server first.

After reinstallation I wanted to register the CapacityIQ appliance again in vCenter, which failed with the following error: "Another CIQ is registered to VC. Unregister it or register with force flag from CLI."


Normally you can unregister the appliance from its CLI. To do so, log on as user "ciqadmin" and simply run the command "ciq-admin unregister". But without the original VA this is no longer possible :)

As stated in the message above, it is also possible to force the registration of the CIQ appliance:

"ciq-admin register --vc-server <vc_server_ip> --force --user 'company_domain\username'
--password 'password_with_special_characters' "

Note: If you have issues with the hostname you should use the "--use-ip" parameter:

"ciq-admin register --vc-server <vc_server_ip> --use-ip
--force --user <vc_username> --password <vc_password>"

Afterwards the new CapacityIQ appliance was registered again with my vCenter!


Remove Orphaned Extensions / Plugins in vCenter

During my lab tests with CapacityIQ, vShield Zones and AppSpeed I did not uninstall all extensions in vCenter before destroying the virtual appliances, so some orphaned entries are still shown in the plug-in tabs.

I looked around for a way to get rid of these entries and found the following article in the Malaysia VMware Community Blog: click here

They had the same issue with the Nexus 1000v plug-in. It worked for me too; I only had to select the respective plug-in name.

One addition: you have to restart your vCenter service afterwards for the changes to take effect.


ESX 4.x: Network Issues (Transmit Timed Out)

Yesterday I had a weird network issue at one of our customers:

An ESX host had lost its connection to the vCenter server and was marked as "disconnected". The host and its virtual machines were still running (checked via the console), but a reconnect to vCenter was not possible. Further investigation showed that some VMs had sporadic network issues as well.

After going through the log files of the ESX host, I found the following messages:

Mar  6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)BUG: warning at vmkdrivers/src26/vmklinux26/vmware/linux_net.c:3235/netdev_watchdog() (inside vmklinux)

Mar  6 17:31:17 ESX01 vmkernel: 15:07:33:11.285 cpu3:4235)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:18 ESX01 vmkernel: 15:07:33:12.287 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:19 ESX01 vmkernel: 15:07:33:13.287 cpu5:4231)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:20 ESX01 vmkernel: 15:07:33:14.289 cpu1:4234)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:21 ESX01 vmkernel: 15:07:33:15.291 cpu2:4245)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:22 ESX01 vmkernel: 15:07:33:16.293 cpu5:4244)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

Mar  6 17:31:23 ESX01 vmkernel: 15:07:33:17.294 cpu2:4244)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out

So it seemed that something was wrong with one NIC of the host. The date and time matched the disconnect message in the vCenter logs. The NIC is dedicated to virtual machines (along with three other NICs) and not to the Service Console, so at first this did not seem to be the reason for the disconnected ESX host.
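To see quickly which NIC is affected and how often, the watchdog messages can be counted per vmnic. The following Python sketch does this for log lines in the format shown above; the function name is made up, and the regular expression only covers this one message type.

```python
import re
from collections import Counter

# Matches the watchdog message shown in the log excerpt above.
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (vmnic\d+): transmit timed out")


def count_watchdog_timeouts(lines):
    """Count 'transmit timed out' watchdog messages per vmnic."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        "Mar  6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)WARNING: LinNet: "
        "netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out",
        "Mar  6 17:31:16 ESX01 vmkernel: 15:07:33:10.284 cpu2:4245)BUG: warning at "
        "vmkdrivers/src26/vmklinux26/vmware/linux_net.c:3235/netdev_watchdog() (inside vmklinux)",
        "Mar  6 17:31:17 ESX01 vmkernel: 15:07:33:11.285 cpu3:4235)WARNING: LinNet: "
        "netdev_watchdog: NETDEV WATCHDOG: vmnic3: transmit timed out",
    ]
    print(count_watchdog_timeouts(sample))
```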

After a short search in the VMware KB I found the following patch: http://kb.vmware.com/kb/1017458. It was released on 3 March 2010.

The description states:

On some systems under heavy networking and processor load (large number of virtual machines), some NIC drivers might randomly attempt to reset the device and fail.

Then I noticed that the virtualized vCenter server was running on this very ESX host, so vCenter itself had a network problem, too. This seems to be the reason why the ESX host lost its connection. The curious thing is that the other three ESX hosts still appeared to be connected, and even an RDP connection to the vCenter server was possible.

Conclusion: I assume that the NIC ran into the error described above because of high network I/O load. Some other people in the VMware communities reported the same issue, so I highly recommend installing the patch.

The reason for the sporadic behaviour is the following: load balancing on the respective vSwitch is set to "IP Hash" in conjunction with an EtherChannel configuration on the physical switches. This improves the overall network performance of a VM by using all NICs instead of the single NIC it would use in "Port ID" mode. So every time traffic went over the affected NIC, the packets got lost.

If you are now wondering why this failure was not detected by the ESX host, consider the following: with the failover detection setting "Link Status Only", such logical failures are not detected, only physical link-down failures. The other option, "Beacon Probing", would detect even such a logical failure, but it cannot be used in conjunction with IP hash load balancing and EtherChannel because of possible network flapping errors (see KB1012819). So you have to decide whether you want performance or better failure detection.
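Why only some connections failed can be illustrated with a simplified model of IP hash load balancing. The sketch below assumes that the uplink is chosen by XOR-ing the source and destination IPv4 addresses and taking the result modulo the number of uplinks; the real ESX implementation may differ in detail. The IP addresses and the index of the broken uplink are made up for illustration.

```python
import ipaddress


def uplink_index(src_ip, dst_ip, uplink_count):
    """Simplified IP hash model: XOR both IPv4 addresses, modulo the uplink count."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


def peers_on_uplink(vm_ip, peer_ips, uplink_count, broken_index):
    """Return the peers whose traffic from vm_ip would use the broken uplink."""
    return [
        peer for peer in peer_ips
        if uplink_index(vm_ip, peer, uplink_count) == broken_index
    ]


if __name__ == "__main__":
    # Hypothetical VM and peers; four uplinks as in the vSwitch described above.
    peers = ["10.0.0.10", "10.0.0.11", "10.0.0.12", "10.0.0.13"]
    print(peers_on_uplink("10.0.0.5", peers, uplink_count=4, broken_index=3))
```

In this model each source/destination pair always maps to the same uplink, so a VM can lose contact with one peer while all its other connections keep working, which would match the sporadic symptoms.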


Powershell: How to Connect to your vCenter in a Secure Way

PowerShell in combination with the VI Toolkit is a very nice way to automate tasks in your virtual infrastructure. Since most actions are executed via vCenter, it is very important not to keep a clear-text password anywhere in your scripts. If you schedule tasks or run scripts without user interaction, this may not be so easy. In the following I briefly describe how this can be done:

First we have to store our password in encrypted form. I did this with the help of the following function, which asks for a username and password and stores them in an XML file (I found the source on a website whose name I can't remember). The password is encrypted and can only be decrypted by the same user account. I also found that it has to be decrypted on the same computer on which the file was originally created (in a clustered vCenter environment my scripts never ran on the failover node, because the password couldn't be decrypted there).

function Export-PSCredential {
    param ( $Credential = (Get-Credential), $Path = "connect.xml" )

    # Look at the object type of the $Credential parameter
    switch ( $Credential.GetType().Name ) {
        # It is a credential, so continue
        PSCredential { continue }
        # It is a string, so use that as the username and prompt for the password
        String       { $Credential = Get-Credential -credential $Credential }
        # other cases
        default      { Throw "You must specify a credential object to export to disk." }
    }

    # Create temporary object to be serialized to disk
    $export = "" | Select-Object Username, EncryptedPassword

    # Give object a type name which can be identified later
    $export.PSObject.TypeNames.Insert(0, 'ExportedPSCredential')

    $export.Username = $Credential.Username

    # Encrypt SecureString password using Data Protection API
    # Only the current user account can decrypt this cipher
    $export.EncryptedPassword = $Credential.Password | ConvertFrom-SecureString

    # Export using the Export-Clixml cmdlet
    $export | Export-Clixml $Path
    Write-Host -foregroundcolor Green "Credentials saved to: " -noNewLine

    # Return FileInfo object referring to saved credentials
    Get-Item $Path
}

Export-PSCredential

Afterwards you get an XML file with the following content (example):

<Objs Version="1.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
<Obj RefId="RefId-0">
<MS>
<S N="Username">mydomain\vcuser</S>
<S N="EncryptedPassword">
01000000d08c9ddf0115d1118c7a00c04fc297eb01000000df2723e2eb2d0f4eb26a1eff9db804b20000000002000000000003660000a8000000100000008201280235881db0152bdcc0cb1e63290000000004800000a000000010000000156ae1be1fb12ede6abee57e1f30cdd2180000006cd8dd44b0178d020ab951125b692e4364d4043c466eaa79140000004dab9e7cf9de199ab0b998160c7686
</S>
</MS>
</Obj>
</Objs>

Store the XML file, for example, in the profile of the respective user.

The following shows how the connection and authentication are done in your script:

function Import-PSCredential {
    param ( $Path = "connect.xml" )

    # Import credential file
    $import = Import-Clixml $Path

    # Test for valid import
    if ( !$import.UserName -or !$import.EncryptedPassword ) {
        Throw "Input is not a valid ExportedPSCredential object, exiting."
    }
    $Username = $import.Username

    # Decrypt the password and store as a SecureString object for safekeeping
    $SecurePass = $import.EncryptedPassword | ConvertTo-SecureString

    # Build the new credential object
    $Credential = New-Object System.Management.Automation.PSCredential $Username, $SecurePass

    Write-Output $Credential
}

$Cred = Import-PSCredential 'c:\documents and settings\vcuser\connect.xml'
$Server = Connect-VIServer myvCenter.example.com -credential $Cred


Powershell: Collect ESX Logs and Summarize them in HTML

Recently I wrote a little PowerShell script to collect the vmkwarning log files from all ESX hosts in a vCenter. The output is written as a table to an HTML file, grouped by cluster. Within each cluster the log files are merged, so when troubleshooting it is much easier to compare and correlate the hosts with each other. This script shows the complete log file; it is also possible to reduce it to the last 10 entries per host.

The script can be modified for other log files such as vmkernel, hostd etc.

$outputFile = 'c:\test\host_log.html'
$output = @()
$output += '<html><head></head><body>'
$output += '<style>table{border-style:none;border-width:0px;font-size:8pt;background-color:#ccc;width:100%;}th{text-align:center;}td{background-color:#fff;width:20%;border-style:dotted;border-width:1px;}body{font-family:verdana;font-size:8pt;}h1{font-size:14pt;}h2{font-size:12pt;}</style>'
$output += '<h1>Hostlogs</h1>'
$output += '<p>List of all vmkwarning Logfiles:</p>'
$output += '<table>'
$Clusters = (Get-Folder 'ESX-Clusters')
ForEach ($Cluster in $Clusters)
{
    # Header row with the name of the group
    $output += '<tr>'
    $output += '<th>'
    $output += $Cluster.Name
    $output += '</th>'
    $output += '</tr>'

    # Collect the vmkwarning entries of all hosts in this group
    # (reset per group so that entries are not carried over)
    $entries = @()
    ForEach ($ESXHost in ($Cluster | Get-VMHost | sort Name))
    {
        $entries += (Get-Log -Key vmkwarning -VMHost $ESXHost).Entries
    }

    # One table cell per group with the merged and sorted entries
    $output += '<tr>'
    $output += '<td>'
    ForEach ($entry in ($entries | sort))
    {
        $output += $entry
        $output += '<br/>'
    }
    $output += '</td>'
    $output += '</tr>'
}
$output += '</table>'
$output += '</body></html>'
$output | Out-File $outputFile -Force
ii $outputFile

The result can be viewed here:


I think this environment has a little storage issue ;)


Don't mess with Volume Names (Keep it Unique!)

Over the last few days I ran into the same problem several times, so I decided to write a few lines about it:

After an upgrade from ESX 3.x to ESX 3.5 U5 or ESX 4.0 U1, some datastores were suddenly missing, even though the corresponding LUN IDs were shown by esxcfg-mpath or esxcfg-scsidevs.
The log files contained the following messages:

WARNING: SCSI: 2747: Failed to register target vmhba1:0:11 with PSA framework: Already exists
vmkernel: 0:01:00:39.500 cpu1:1041)WARNING: ScsiDevice: 4651: Can't add device
vml.02000b000060030d9047485747455358766d66733053414e6d656c'.

A device with that uid already exists.

The cause of this problem is the naming of the volumes on the storage side: if the first 12 characters are not unique, VMware ESX may generate the same vml ID for different volumes and hide them with the messages above (this seems to depend on the storage vendor, too). More precisely, ESX needs only the first 8 characters, but VCB needs the first 12, so to be safe keep the first 12 unique.
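Before presenting new volumes to the hosts, the names can be checked for this. The following Python sketch groups volume names by their first 12 characters and reports every prefix that is shared by more than one volume; the function name and the sample volume names are made up for illustration.

```python
from collections import defaultdict


def find_prefix_collisions(volume_names, prefix_len=12):
    """Return {prefix: [names]} for every prefix shared by more than one volume name."""
    groups = defaultdict(list)
    for name in volume_names:
        groups[name[:prefix_len]].append(name)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical storage-side volume names
    names = ["ESX_Cluster01_LUN01", "ESX_Cluster01_LUN02", "LUN03_ESX_Cluster01"]
    for prefix, clashing in find_prefix_collisions(names).items():
        print(prefix, clashing)
```

Putting the distinguishing part (for example the LUN number) at the beginning of the name avoids such collisions.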

This issue is not new and is already known from VCB (see the article I wrote about it at that time: click). VMware adopted this behaviour in ESX 4.0, and finally also in ESX 3.5 U5 with the latest patches from December!

There is no real workaround: you have to unmap the volumes, rename them and remap them afterwards (which means downtime for the VMs if you can't use Storage VMotion).

So take care of this in the future!


How to Force-Mount a Datastore in vSphere without Resignaturing

I have just discovered an interesting article in the VMware Knowledge Base, "Force mounting a VMFS datastore residing on a snapshot LUN results in the error: Cannot change the host configuration"

It describes how a datastore can be mounted and unmounted without changing its UUID. This has to be done, for example, if a LUN is detected as a snapshot LUN by the ESX host. The SAN guide published by VMware also covers this on page 74:

1. Log in to the vSphere Client and select the server from the inventory panel.

2. Click the Configuration tab and click Storage in the Hardware panel.

3. Click Add Storage.

4. Select the Disk/LUN storage type and click Next.

5. From the list of LUNs, select the LUN that has a datastore name displayed in the VMFS Label column and click Next.

The name present in the VMFS Label column indicates that the LUN is a copy that contains a copy of an existing VMFS datastore.

6. Under Mount Options, select Keep Existing Signature.

7. In the Ready to Complete page, review the datastore configuration information and click Finish.

But there are some pitfalls: if a datastore has already been mounted by another ESX host inside the cluster, vCenter is aware of this and hides the datastore in the "Add Storage" dialog. Reason:

"When one ESX 4.0 host force mounts a VMFS datastore residing on a LUN which has been detected as a snapshot, an object is added to the datacenter grouping in the vCenter database to represent that datastore. When a second ESX 4.0 host attempts to do the same operation on the same VMFS datastore, the operation fails because an object already exists within the same datacenter grouping in the vCenter database. Since an object already exists, vCenter Server does not allow mounting the datastore on any other ESX host residing in that same datacenter."

It may also happen that there is some free space left on the volume presented by the storage. In this case, after step 6 of the description above, you are asked either to create a new VMFS datastore (as a second partition) or to expand the existing one. There is no other option, and if you don't want to do either you can't complete the wizard.

So you are forced to use workaround B from the knowledge base article above (connecting directly to the ESX host service console):

1. Log in as root to the ESX host which cannot mount the datastore using an SSH client.

2. Run the command:

esxcfg-volume -l

The results appear similar to:

VMFS3 UUID/label: 4b057ec3-6bd10428-b37c-005056ab552a/ TestDS

Can mount: Yes

Can resignature: Yes

Extent name: naa.6000eb391530aa26000000000000130c:1 range: 0 - 1791 (MB)

Record the UUID portion of the output. In the above example the UUID is 4b057ec3-6bd10428-b37c-005056ab552a.

Note: The Can mount value must be Yes to proceed with this workaround.

3. Run the command:

esxcfg-volume -M <UUID>

Where <UUID> is the value recorded in step 2.

Note: If you do not want the volume mount to persist across a reboot, the -m switch can be used instead.
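If several snapshot volumes are listed, picking out the UUIDs by hand gets tedious. The following Python sketch parses output in the format shown in step 2 and returns the UUIDs of all volumes that can be mounted. It assumes the output always looks like the example above; the function names are made up, and the sketch only reads text and does not call esxcfg-volume itself.

```python
import re

# Example output of "esxcfg-volume -l", taken from step 2 above
SAMPLE_OUTPUT = """VMFS3 UUID/label: 4b057ec3-6bd10428-b37c-005056ab552a/ TestDS
Can mount: Yes
Can resignature: Yes
Extent name: naa.6000eb391530aa26000000000000130c:1 range: 0 - 1791 (MB)
"""

HEADER = re.compile(r"VMFS\d* UUID/label:\s*([0-9a-f-]+)/\s*(.*)")


def parse_volumes(text):
    """Parse 'esxcfg-volume -l' style output into a list of dicts."""
    volumes = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            volumes.append({"uuid": match.group(1), "label": match.group(2), "can_mount": None})
        elif line.startswith("Can mount:") and volumes:
            value = line.split(":", 1)[1].strip().lower()
            volumes[-1]["can_mount"] = value.startswith("yes")
    return volumes


def mountable_uuids(text):
    """Return the UUIDs of all volumes reported with 'Can mount: Yes'."""
    return [v["uuid"] for v in parse_volumes(text) if v["can_mount"]]


if __name__ == "__main__":
    for uuid in mountable_uuids(SAMPLE_OUTPUT):
        # Print the command from step 3 instead of running it
        print("esxcfg-volume -M " + uuid)
```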


Restart

After a long time I have updated my blog to a current WordPress version, and I have also changed the whole theme (it may change again, because I'm not so happy with some things at the moment).
I decided to switch the blog language completely to English in order to reach more people. The most interesting articles will be translated from time to time; the rest will be archived or perhaps removed.
I hope (again) that I will have more time in the future to publish more interesting articles.


XenServer 5.5: Issues during the Installation of new Windows VMs

In the current version of Citrix XenServer, a few problems may occur during the installation of Windows-based virtual machines.
During Windows setup the VM may freeze while the new hardware is being detected. The issue has already been discussed in a thread in the Citrix Knowledge Center. Many users had the same problem, and they found that the same hardware platform was involved every time: Intel Core i7 with its new Nehalem CPU architecture (the Intel Xeon 55xx series may be affected, too).
The exact cause is Intel's EPT feature (Extended Page Tables), which current hypervisors use to gain a considerable performance boost (depending on the workload). Details can be read here.

Citrix released a patch for this issue in the meantime: Hotfix XS55E004

There is also a workaround, if still needed:

On every XenServer the file

/boot/extlinux.conf

has to be edited. After the parameter

"dom0_mem=752M"

the following parameter has to be inserted:

"hap=0"

After saving the file, the XenServer has to be restarted.
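If the workaround has to be applied on many hosts, the edit can be scripted. The following Python sketch only shows the text transformation: it inserts "hap=0" directly after "dom0_mem=752M" on every line that contains it and leaves lines alone that already have the parameter. The function name and the sample configuration line are made up for illustration; the real file content will differ, so check the result and keep a backup before writing anything back.

```python
def add_hap0(config_text, anchor="dom0_mem=752M", param="hap=0"):
    """Insert param after anchor on each line containing anchor, unless already present."""
    result = []
    for line in config_text.splitlines():
        tokens = line.split()
        if anchor in tokens and param not in tokens:
            line = line.replace(anchor, anchor + " " + param, 1)
        result.append(line)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Hypothetical excerpt, not the real content of /boot/extlinux.conf
    sample = "label xe\n  append /boot/xen.gz dom0_mem=752M console=vga\n"
    print(add_hap0(sample), end="")
```

Running the function twice gives the same result as running it once, so it is safe to apply it again to a host that has already been changed.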

This patch / workaround also fixes a problem with XenMotion that may occur on Nehalem-based hosts: Link.