Perl Toolkit: Check ESX(i) host time

I had an issue recently where a single ESXi host’s clock was incorrect. The administrator had never set the clock initially, so NTP never brought it into sync because it was too far off to begin with.
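For context, stock ntpd gives up rather than correct an offset beyond its panic threshold, which defaults to 1000 seconds; that ESXi's NTP client behaves the same way is an assumption here. The comparison a monitoring check has to make is small. The following is a minimal sketch of that comparison, not the ported script: `classify_offset`, its thresholds, and the choice of epoch seconds as input are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compare a host's clock (epoch seconds) against a
# reference clock and map the absolute offset to a Nagios-style status.
sub classify_offset {
    my ($host_epoch, $reference_epoch, $warn_secs, $crit_secs) = @_;
    my $offset = abs($host_epoch - $reference_epoch);
    return ('CRITICAL', $offset) if $offset >= $crit_secs;
    return ('WARNING',  $offset) if $offset >= $warn_secs;
    return ('OK',       $offset);
}

# Illustrative values only: a host running two hours behind the reference.
my $reference = time();
my ($status, $offset) = classify_offset($reference - 7200, $reference, 60, 1000);
print "$status - host clock is off by $offset seconds\n";
```

Setting the critical threshold at or below 1000 seconds means the check fires before the offset grows past the point where ntpd would stop correcting it.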

Since I’ve got a large number of hosts, and the idea of clicking through to each one in VI Client to check the configuration tab did not appeal, I immediately turned to PowerCLI. Naturally, one of Luc’s scripts was the top search result.

That solved my immediate need to check the hosts, but I also wanted to set up some general monitoring. Since my monitoring infrastructure is comprised primarily of a Linux Nagios host, PowerCLI couldn’t help. So, I did the next best thing and ported Luc’s script to Perl.

Below is the result of that porting. It can also be run from vMA for reporting via email or another mechanism.

Continue Reading »

ESX
ESXi
Perl


Cacti: Monitor protocol statistics for NetApp volumes

Update 2011-07-10: Due to a template export error with Cacti, the import was failing for a lot of people. I apologize for taking so long to fix the templates; they should be fixed now. Thank you to everyone who pointed out the errors and the fix in the comments.


I have made no secret that I use two applications daily to monitor my infrastructure: Nagios and Cacti. I have created a fair number of scripts (and hope to publish more soon) to help Nagios monitor the different parts of the infrastructure; however, I haven’t published many of my Cacti scripts previously.

One of the most useful is the config that I use to monitor the different protocol stats for volumes. I created an indexed query so that a single script, and its accompanying XML file, can monitor all the volumes, and I can select which graphs to create for each volume. The polling script is loosely based on the multi-protocol realtime volume statistics script that I created some time ago.

Download the updated template and script(s) here.

Some examples…

Total Operations, Latency: [graphs: Cacti Volume Total Operations, Cacti Volume Total Latency]
CIFS Operations, Latency: [graphs: Cacti Volume CIFS Operations, Cacti Volume CIFS Latency]
NFS Operations, Latency: [graphs: Cacti Volume NFS Operations, Cacti Volume NFS Latency]
iSCSI Operations, Latency: [graphs: Cacti Volume iSCSI Operations, Cacti Volume iSCSI Latency]

NetApp
Perl


Nagios: Checking for abnormally large NetApp snapshots

My philosophy with Nagios checks, especially with the NetApp, is that unless there are extenuating circumstances, I want all volumes (or whatever is being checked) to be checked equally and at the same time. This means I don’t want to have to constantly add and remove checks from Nagios as volumes are added, deleted and modified. I would much rather have one check that checks all of the volumes and reports on them en masse. That way I don’t have to think about the check itself, only about what it’s checking.

One of the many things that I regularly monitor on our multitude of NetApp systems is snapshots. We have had issues, especially with LUNs, where the snapshots have gotten out of control.

In order to prevent this (or at least hope that someone is watching the screen…), I wrote a quick script that checks whether the total size of snapshots on a volume exceeds the snap reserve. Since not all of our volumes have a snap reserve, I also put in the ability to check the size of the snaps against a percentage of the free space left in the volume.

This last measure is a little strange, but I think it works fairly well. Take, for example, a 100GB volume. If it is 50% full (50GB), there is no snap reserve and the alert percentage is left at the default of 40% of free space, then the alert will happen when snapshots exceed about 15GB. “But that’s not 40% of the free space”, I hear you saying. Ahhh, but it is…you see, as the snapshot(s) grow, there is less free space, so the snapshots account for a larger percentage of the free space that remains. So at 15GB of snapshots, there would be 35GB of free space, and 40% of 35GB is 14GB.

This causes the alerts to happen earlier than you may expect at first. You can adjust this number to be a percentage of the total space in the volume if you like…however, why not just set a snap reserve at that point? I chose to make the script this way in order to attempt to keep a little more free space in the volume, while not making a snap reserve mandatory.
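The arithmetic in the 100GB example can be checked with a few lines of Perl. This is a minimal sketch, not the check script itself: both sub names are hypothetical, sizes are in GB, and it assumes the 50GB of used space does not include snapshots, so free space is total minus used minus snapshots.

```perl
use strict;
use warnings;

# Alert when snapshots exceed $pct percent of the space that is still free.
sub snapshots_exceed_free_pct {
    my ($total, $used, $snap, $pct) = @_;
    my $free = $total - $used - $snap;
    return $snap > $free * $pct / 100 ? 1 : 0;
}

# Solving snap = pct/100 * (total - used - snap) for snap gives the
# snapshot size at which the alert starts to fire.
sub tipping_point {
    my ($total, $used, $pct) = @_;
    return ($total - $used) * $pct / (100 + $pct);
}

printf "15GB of snapshots alerts: %s\n",
    snapshots_exceed_free_pct(100, 50, 15, 40) ? 'yes' : 'no';
printf "Alert starts at about %.1fGB of snapshots\n", tipping_point(100, 50, 40);
```

Under these assumptions the tipping point for the example volume is 50 × 40 / 140, a little over 14GB, which is why 15GB of snapshots already trips the alert.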

One last word…please keep in mind this script does not check for a volume being filled; you should have other checks for that. This merely checks whether snapshots have exceeded a threshold of space in the volume, to prevent them from taking up too much of it.

Bring on the Perl…

Continue Reading »

NetApp
Perl


Monitoring for orphaned snapshots left by SMVI

NetApp’s SnapManager for Virtual Infrastructure (SMVI) is a great product, but it’s messy. If it encounters any error, it seemingly forgets to delete the virtual machine snapshots from the Virtual Infrastructure before dying.

To prevent many orphans (I’ve seen as many as 20 on a single virtual machine) from happening, I created a quick Nagios check that simply alerts when it sees them.

This script is very elementary. It simply uses a regex to check for any snapshots whose names match the default SMVI naming convention, incrementing a counter for each one it finds. If any are found, the script returns a WARNING status to Nagios, which causes an alert to be sent.

#!/usr/bin/perl -w
#
# check_vi_smvi_snapshots.pl - written by Andrew Sullivan, 2010-06-16
#
# Please report bugs and request improvements at http://get-admin.com/blog/?p=1059
#
# A simple script to look for snapshots that match the name pattern that SMVI uses.
# We pull a list of all snapshots and search for the string "smvi" in each name;
# if any are found, we return a warning condition.  This could lead to a
# "false" positive if it runs while a snapshot series is still ongoing, but since
# the SMVI snaps should be very short lived, the condition will not last unless
# the snap is left behind.
#
# Example:
#   ./check_vi_smvi_snapshots.pl --server your.esx.host --username you --password secret
#
 
use strict;
use warnings;
 
use FindBin;
use lib "$FindBin::Bin/../";
 
use VMware::VIRuntime;
 
# substitute the location of your nagios perl library
use lib "/usr/lib64/nagios/plugins";
use utils qw(%ERRORS);
 
Opts::parse();
Opts::validate();
 
Util::connect();
 
# main() disconnects and exits with the Nagios status code itself,
# so nothing after this call would ever run
main();
 
sub main {
 
	# the number of smvi snapshots
	my $smviSnaps = 0;
 
	# for setting the type of exit we want
	my $exitCondition = "";
 
	# we need views of every VM visible through this connection
	my $VMs = Vim::find_entity_views( view_type => 'VirtualMachine' );
 
	foreach my $vm (@$VMs) {
		if ($vm->snapshot) {			
			# $vm->snapshot is a VirtualMachineSnapshotInfo, which holds rootSnapshotList directly
			foreach my $childSnapshot (@{$vm->snapshot->rootSnapshotList}) {
				$smviSnaps += getSnaps($childSnapshot);
			}
 
		} else {
			#print $vm->name . " has no snapshots\n";
		}
	}
 
	if ($smviSnaps > 0) {
		print "WARNING - " . $smviSnaps . " SMVI snapshots exist.\n";
		$exitCondition = "WARNING";
 
	} else {
		print "OK - No SMVI snapshots exist.\n";
		$exitCondition = "OK";
 
	}
 
	Util::disconnect();
	exit $ERRORS{ $exitCondition };
}
 
sub getSnaps {
	my ($snapshotTree) = @_;
	my $snapcount = 0;
 
	# uncomment for debugging
	#print "Found snap: " . $snapshotTree->{name} . "\n";
 
	if ( $snapshotTree->{name} =~ /smvi/ ) {
		$snapcount++;
	}
 
	if ($snapshotTree->childSnapshotList) {
		foreach my $childSnapshot (@{$snapshotTree->childSnapshotList}) {
			$snapcount += getSnaps($childSnapshot);
		}
	}
 
	return $snapcount;
}

I’ve set the check to execute once an hour in my environment, as I don’t feel that granularity finer than that is needed…an hour’s worth of change is ok for an SMVI snapshot for me.

Nagios
NetApp
Perl
Scripting
Virtualization


Really NetApp?!? You didn’t use your own SDK?

So, this post irked me. Not because of the poster or his post (honest Andy, if you ever read this, I have nothing against you or your post! I’m happy to see another VMware/NetApp blogger!), but because of the script he referenced and the problem encountered. He has a good solution, but the problem shouldn’t exist.

You see, I hate RSH. I don’t know why (well, it is quite insecure, and it can require some configuration), but I hate it. SSH is only marginally better in this case…sure it’s secure, but you have to auth each time, and if you don’t (ssh keys), well, it’s only a little better than RSH (comms are encrypted, but compromise of a single account can lead to bad things on many hosts). The script that is referenced, one that NetApp recommends that admins use to verify that their aggregates have enough free space to hold the metadata for the volumes in OnTAP 7.3 (the metadata gets moved from the volumes to the aggregate in 7.3), uses RSH to execute commands that are then parsed in a somewhat rudimentary way to get information.

Sure, it’s effective, but it’s far from graceful…especially when you have a perfectly good and effective SDK at your disposal.

I was kind of bored, so I decided to rewrite the script using the SDK. This is the end result. It reports the same data, but uses the SDK to gather all of the necessary information to make a determination for the user. The new script is significantly shorter (10KB vs 25KB, 380 lines vs 980), and it requires only one login.

Thanks to NetApp for providing their SDK, and I hope that no one over there minds me refactoring…

Continue Reading »

NetApp
Perl


Perl OnTAP SDK: Realtime Multiprotocol Volume Latency

Update 2009-07-21: With some help from Steffen, a bug was found where the script wasn’t returning any values in the result hash when the toaster didn’t return values for certain queries. This caused Perl to print errors when it was trying to do math on non-existent values. Starting at line 273, the script has been updated so that the hash returned by the subroutine that does the ZAPI query has default values of zero, which should eliminate the errors seen by Steffen. Please let me know of any other problems encountered! (and thanks to Steffen for finding this bug!)
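The fix described in this update amounts to seeding the result hash with zeros before copying in whatever the filer returned. A minimal sketch of that pattern follows; the counter names and the `with_defaults` helper are hypothetical stand-ins, not the script's actual code.

```perl
use strict;
use warnings;

# Hypothetical counter names; the real script queries its own set via ZAPI.
my @COUNTERS = qw(read_ops write_ops other_ops
                  read_latency write_latency other_latency);

# Return a hash holding every expected counter, zero-filled wherever the
# filer's response left a value out.
sub with_defaults {
    my (%returned) = @_;
    my %result = map { $_ => 0 } @COUNTERS;
    for my $name (keys %returned) {
        $result{$name} = $returned{$name} if defined $returned{$name};
    }
    return %result;
}

# A response missing most counters no longer causes
# "Use of uninitialized value" warnings when the math runs.
my %stats = with_defaults(read_ops => 120, read_latency => 3400);
my $total_ops = $stats{read_ops} + $stats{write_ops} + $stats{other_ops};
print "total ops: $total_ops\n";
```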


My previous post only prints NFS latency for the NetApp as a whole; it doesn’t give any information about a specific volume. Some of my ESX hosts use iSCSI for their datastores, and because the NetApp has many iSCSI clients, looking at iSCSI stats for the filer as a whole didn’t help me very much.

The solution was this script. It is a significantly modified version of the previous script that is capable of showing the realtime latency for all protocols: NFS, CIFS, SAN (which I believe is all block level ops summarized), FCP and iSCSI. It also displays the three different types of operations for each protocol: read, write, and other.

The script, if invoked with nothing more than the connection information, will display the read, write, and “other” latency and operations for the total of all protocols. There is a fourth column as well, which shows the average latency and total operations across all operation types (r/w/o).
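One plausible way to derive that fourth column is to sum the operations and weight each type's latency by its operation count. The sketch below assumes exactly that; the `summarize` helper, the weighting choice, and the sample numbers are all hypothetical, and the real script may compute its average differently.

```perl
use strict;
use warnings;

# $ops and $lat are hash refs keyed by operation type (read, write, other).
# Returns (total operations, ops-weighted average latency).
sub summarize {
    my ($ops, $lat) = @_;
    my ($total_ops, $weighted) = (0, 0);
    for my $type (qw(read write other)) {
        my $n = $ops->{$type} // 0;
        $total_ops += $n;
        $weighted  += $n * ($lat->{$type} // 0);
    }
    # Avoid dividing by zero on an idle volume.
    my $avg = $total_ops ? $weighted / $total_ops : 0;
    return ($total_ops, $avg);
}

my ($total, $avg) = summarize(
    { read => 300, write => 100, other => 100 },   # ops/s, made-up values
    { read => 2.0, write => 8.0, other => 1.0 },   # ms, made-up values
);
printf "total: %d ops/s, average latency: %.2f ms\n", $total, $avg;
```

Weighting by operation count keeps a handful of slow "other" operations from skewing the average on a volume dominated by reads.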

This script has proven quite beneficial for me. By monitoring CIFS latency during peak hours on the volume that contains shares for profiles, I have proven that the reason logins can take a significant amount of time is the use of high capacity, but very slow, SATA disks and not the network or the desktops themselves. I’ve also been able to prove that one of our iSCSI volumes was “slow” due to bandwidth, and not spindle count (interestingly, the problem with this volume is the I/O request size…the app makes large requests, which choke bandwidth before the available IOPS run out).

The OnTAP SDK is quite powerful; Glenn and I are quickly discovering that anything possible in FilerView and/or DFM/OpsMgr is doable through the SDK.

Continue Reading »

NetApp
Perl


Perl OnTAP SDK: Realtime NFS Latency

Since most of my Virtual Infrastructure runs on NFS datastores, I like to keep a very close eye on what’s going on inside the NetApp. I generally use Cacti for long term monitoring of the status of the datastores and the NetApp as a whole.

However, when I want to see what’s going on at intervals shorter than five minutes, Cacti is pretty much useless. I wrote this script a while ago so that if I feel that latency is becoming a problem, I can check it right away and watch it at frequent intervals.

Most often I use this script when Nagios starts to chirp at me. I use a slightly modified version of this script with Nagios and have it alert when latency gets out of hand. I then use this script to get a good look at what’s going on.

The OnTAP SDK is almost as entertaining to work with as VMware’s…

Continue Reading »

NetApp
Perl


Perl Toolkit: pNIC to vSwitch information

Another itch to scratch: which vSwitch is a pNIC connected to? To solve this simple problem I created a quick Perl script…

This script also lets me see the driver in use, connection speed and duplex setting, and the MAC address of the pNIC.

# Sample output:
Adaptor (Driver)        Speed (Duplex)          MAC                     vSwitch
----------------        --------------          ---                     -------
vmnic1 (bnx2)           1000 (Full)             00:00:00:00:00:00       vSwitch0
vmnic0 (tg3)            1000 (Full)             00:00:00:00:00:00       vSwitch0
vmnic3 (tg3)            1000 (Full)             00:00:00:00:00:00       vSwitch1
vmnic2 (bnx2)           1000 (Full)             00:00:00:00:00:00       vSwitch1

Continue Reading »

ESX
Perl
Scripting
vCenter
VMware


Perl Toolkit: Portgroup type information

I wanted to get a list of port groups and their type (kernel, console, virtual machine) from a series of hosts; however, the only thing I could find that was even close was a POSH script in the VMTN forums that was posted by LucD.

Using that script for inspiration, I essentially duplicated the functionality using the Perl toolkit. This script gives me an easy to read (and parse…) list of portgroups, the vSwitch they belong to, and the type.

Continue Reading »

ESX
Perl
Scripting
vCenter
VMware


Perl Toolkit: NFS snapshot fix via rCLI

I dislike having to SSH into each host I am responsible for, and I detest having to enable SSH on ESXi (there should be NO reason for me to have to enable it). Because it’s difficult to script applying the NFS snapshot fix to a lot of hosts using the SSH method (and impossible if you don’t enable it on ESXi), I fooled around with the vifs.pl command that is provided with the rCLI.

I discovered that I can pull certain configuration files for the host using the command, modify them, then replace the configuration file…all without having to SSH to the host! vm-help.com has an excellent list of files available using this method.

All of the commands I use in the script below are available when the rCLI is installed (the rCLI also installs the Perl toolkit, so all those “sample” scripts are available to us).

My Windows scripting skills are non-existent, so I don’t know how to write a wrapper around the rCLI commands like I can with bash, but these same commands will work if you are using the rCLI installed on Windows.

Continue Reading »

ESX
Perl
Scripting
VIMA
VMware
