SCOM: What’s wrong with my Unix agents? Why are they greyed out?

 

Working with offline Unix agents is quite difficult, especially in a large monitoring environment. Although SCOM offers a native heartbeat monitor, it is hard to quickly determine whether the computer is actually down or something is wrong with the agent configuration.

A Unix agent may be down for various reasons: the SCX process is not running, a Run As account password was changed, the certificate was reset, or the computer itself is down. SCOM has the “UNIX/Linux Heartbeat Monitor”, “WS-Management Run As Account Health” and “WS-Management Certificate Health” monitors to watch each of these conditions and alert on offline agents. But handling multiple alerts for the same issue, and correlating them to fix the agent, is a tedious job for the support engineer and can cost considerable time.

Wouldn't it be easier to have only one alert on heartbeat failure, with the status of all the other monitors in the summary?

But wait, should we also include the ping status in the alert summary, so that the support engineer knows what to do first?

Yes, that’s what we are going to do now using PowerShell. The script below can be scheduled to run on any management server, and it logs an event in the “Operations Manager” event log for each offline agent.

You can create an event-based alert rule targeting the management server that looks for these events (event ID 1041 from the “Health Service Script” source) and raises an alert. The alert will indicate which agent is offline and give the state of the other monitors related to the issue.
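To give an idea of how those details can be read, here is a minimal sketch of a triage helper. Get-FirstCheck is a hypothetical function, not part of SCOM, and the order of the checks (ping first, then certificate, then Run As account, then the SCX process) is only my assumption about what is quickest to rule out.

```powershell
# Hypothetical helper (not part of SCOM): suggests what to check first,
# based on the statuses carried in the alert.
function Get-FirstCheck {
    param(
        [string]$PingStatus,         # 'Pingable' or 'Not Pingable'
        [string]$CertificateStatus,  # monitor health state, e.g. 'Success', 'Warning', 'Error'
        [string]$RunAsStatus         # monitor health state, same values
    )

    # Health states treated as unhealthy (assumption: 'Uninitialized' is ignored)
    $unhealthy = 'Warning', 'Error'

    if ($PingStatus -ne 'Pingable') {
        return 'Host or network: the computer does not respond to ping'
    }
    if ($unhealthy -contains $CertificateStatus) {
        return 'Certificate: check the WS-Management certificate of the agent'
    }
    if ($unhealthy -contains $RunAsStatus) {
        return 'Run As account: check the account and its password'
    }
    # Pingable and both configuration monitors look fine
    return 'SCX process: check whether the agent is running on the computer'
}

# Example call
Get-FirstCheck -PingStatus 'Pingable' -CertificateStatus 'Success' -RunAsStatus 'Error'
```

The point is simply that a computer which does not answer ping makes the other statuses irrelevant, so it is worth looking at that field before anything else.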

 

#Import SCOM Module
Import-Module OperationsManager

#Establish SCOM Management Group Connection
New-SCOMManagementGroupConnection

#Retrieve all Unix Computers which are grey in console (offline)
$mc = get-scclass -name Microsoft.Unix.Computer
$agents = get-scommonitoringobject -class $mc | where {$_.isavailable -ne 'True'}

#Process Offline Agents
foreach ($agent in $agents)
{
    $maintmode = $agent.InMaintenanceMode
    # Ignore Servers in Maintenance
    if ($maintmode -eq $false){
        #Get Agent Display Name
        $agentname = $agent.displayname
        #Get Ping Status
        $RespondsToPing = Test-Connection -ComputerName $agentname -quiet
        #Set Ping Status
        if ($RespondsToPing){$pingable = "Pingable"}
        else{$pingable = "Not Pingable"}
        #Get HeartBeat Monitor Status
        $sh = $agent.GetMonitoringStateHierarchy()
        $avail_mon = $sh.childnodes | where {$_.item.MonitorDisplayName -eq 'Availability'}
        $hb_mon = $avail_mon.childnodes | where {$_.item.MonitorDisplayName -eq 'Unix/Linux Heartbeat Monitor'}
        $hb_mon_state = $hb_mon.item.healthstate
        #Get Configuration Rollup Status
        if ($hb_mon_state -ne "Success" -and $hb_mon_state -ne "Uninitialized"){
            $config_mon = $sh.childnodes | where {$_.item.MonitorDisplayName -eq 'Configuration'}
            $cert_mon = $config_mon.childnodes | where {$_.item.MonitorDisplayName -eq 'WS-Management Certificate Health'}
            $runas_mon = $config_mon.childnodes | where {$_.item.MonitorDisplayName -eq 'WS-Management Run As Account Health'}
            $cert_mon_state = $cert_mon.item.healthstate
            $runas_mon_state = $runas_mon.item.healthstate
            $status = "PING_STATUS: $pingable HEARTBEAT_STATUS: $hb_mon_state CERTIFICATE_STATUS: $cert_mon_state, USER_ACCOUNT_STATUS: $runas_mon_state"
            write-eventlog -LogName 'Operations Manager' -source 'Health Service Script' -id 1041 -entrytype Error -Category 0 -Message "UNIX SCOM agent on $agentname is not sending a heartbeat $status"
        }
    }
}
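If you want to check what the script has logged before building the rule, the message can be split back into its fields. The following is only a sketch: ConvertFrom-HeartbeatMessage is a hypothetical helper name, the host name in the sample is made up, and the pattern assumes the message text is exactly what the script above writes and that the agent display name contains no spaces.

```powershell
# Hypothetical helper: splits the message written by the script into its fields.
function ConvertFrom-HeartbeatMessage {
    param([string]$Message)

    # Mirrors the message format built in the script above
    $pattern = 'agent on (?<Agent>\S+) is not sending a heartbeat ' +
               'PING_STATUS: (?<Ping>.+?) HEARTBEAT_STATUS: (?<Heartbeat>\S+) ' +
               'CERTIFICATE_STATUS: (?<Certificate>[^,\s]*),? USER_ACCOUNT_STATUS: (?<RunAs>\S*)'

    if ($Message -match $pattern) {
        New-Object PSObject -Property @{
            Agent       = $Matches['Agent']
            Ping        = $Matches['Ping']
            Heartbeat   = $Matches['Heartbeat']
            Certificate = $Matches['Certificate']
            RunAs       = $Matches['RunAs']
        }
    }
    # Returns nothing when the message does not match the expected format
}

# Sample message with a made-up host name
$sample = 'UNIX SCOM agent on web01.contoso.com is not sending a heartbeat ' +
          'PING_STATUS: Not Pingable HEARTBEAT_STATUS: Error ' +
          'CERTIFICATE_STATUS: Success, USER_ACCOUNT_STATUS: Warning'
ConvertFrom-HeartbeatMessage -Message $sample
```

On a management server the same function could be fed from Get-WinEvent with a filter on the “Operations Manager” log and event ID 1041, assuming the rendered event text contains the message unchanged.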

 

Happy SCOMing!!!