I spotted an issue in my vSphere infrastructure this weekend just past. I noticed that one of the main development boxes was showing the dreaded question, redo log out of space, retry or abort?
As it turned out VMware Data Recovery Manager had taken a snapshot as part of it’s back up routine and had failed when trying to remove it. This coupled with a scheduled SQL maintenance plan caused the delta files for the snapshot to grow to over 250GB in little over 12 hours.
I eventually overcame the issue by adding an extent to the out of space VMFS datastore, this gave me an extra 160GB with which to play the logs back in. I then used the very handy SnapVMX utility to tell me how much space was required to replay the delta files. Luckily for me it only required 20 GB as sometimes it can require as much as the size of the original disk. After the snapshot was merged I did a bit of Storage vMotion and reworked the datastore to get rid of the extent (I’m not a fan of using them)
This particular incident was unfortunately unavoidable, it happened at a weekend, was due to VMDR’s failure to remove a snapshot it had created and unfortunately clashed with a disk intensive operation. It did get me thinking though, although I am careful with snapshots and there usage who else in the organisation is not? how do we mitigate this potential risk?
Snapshots are a handy feature, I generally only use them for short periods of time, usually to provide a rollback when patching or changing configurations. Misuse or mismanagement of snapshots can quite quickly lead to problems, something that a recent blog article from VMware Support deals with quite effectively. Entitled ESX Anatomy 101 it’s a must read for anyone trying to gain a good basic understanding of how VMware snapshot work.
I myself have taken a two pronged attack to preventing snapshots causing problems. The first approach is to schedule a very basic PowerShell script that I found on the blog site of Axiom Dynamics. This simple little script queries your vCenter server for all current snapshots and then sends an email detailing them. A simple but effective means of keeping an eye on snapshots across the virtual infrastructure.
The second more proactive approach is to use a vCenter alarm at the data centre level to alert when a VM is running from a snapshot. This alarm simply involves emailing a warning when any snapshot is larger than 2GB. This handy video taken from VMware Knowledgebase article 1018029 describes in detail how to set this up, the KB itself also provides step by step instructions.
There are a number of alternatives available for reporting on snapshots.
Alan Renouf’s Snapshot Reminder – A Powershell script that integrates with AD to send the creator of a snapshot a little reminder when the snapshot is over 2 weeks old.
Alan Renouf’s vCheck Daily Report – Another Powershell script that reports on a large number of areas within the virtual infrastructure. One of those areas includes snapshots
RVTools – A very handy .Net application by Rob de Veij that can be used to query your virtual infrastructure for just about everything. You will notice in the screenshot below the vSnapshot tab which should help you identify those rogue snapshots.
In summary, everyone who works with snapshots should have an understanding of their usage and limitations. Obviously you can’t always rely on people to do things right, we are only human after all. As a safeguard ensure you have some level of reporting and alerting in place to help you prevent those annoying and time consuming out of space issues occurring.