SCOM: What’s wrong with my Unix agents? Why are they greyed out?


 

Working with offline Unix agents is difficult, especially in a large monitoring environment. Although SCOM offers a native heartbeat monitor, it is hard to tell quickly whether the computer is actually down or something is wrong with the agent configuration.

A Unix agent may be down for several reasons: the SCX process is not running, a Run As account password was changed, the certificate was reset, or the computer itself is down. SCOM has the “UNIX/Linux Heartbeat Monitor”, “WS-Management Run As Account Health” and “WS-Management Certificate Health” monitors to cover each of these conditions and alert on offline agents. But it is tedious for a support engineer to handle multiple alerts for the same issue and correlate them before fixing the agent, which can cost considerable time.

Wouldn’t it be easier to have a single alert on heartbeat failure, with the status of all the other monitors in the summary?

But wait, should we also include the ping status in the alert summary so that the support engineer knows what to check first?

Yes, that’s what we are going to do now using PowerShell. The script below can be scheduled to run on any management server, and it logs an event in the “Operations Manager” event log.

You can create an event collection rule targeting the management server to look for these events and create an alert. The alert will indicate which agent is offline and give the details of the other monitors related to the issue.
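
Before building the rule, it can help to confirm that the events are actually arriving. The following is only a minimal sketch, assuming the script has already run at least once on this management server. The event ID (1041) and source ('Health Service Script') are the ones the script writes; the count of 10 is an arbitrary choice.

```powershell
# List the most recent offline-agent events written by the script.
# For classic event logs the provider name is the event source.
$filter = @{
    LogName      = 'Operations Manager'
    ProviderName = 'Health Service Script'
    Id           = 1041
}

# -ErrorAction SilentlyContinue suppresses the error Get-WinEvent
# raises when no matching events exist yet.
Get-WinEvent -FilterHashtable $filter -MaxEvents 10 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message |
    Format-List
```

If nothing is returned, either no Unix agent is currently offline or the script has not run on this server yet.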

 

# Import the SCOM module
Import-Module OperationsManager

# Establish the SCOM management group connection
New-SCOMManagementGroupConnection

# Retrieve all Unix computers that are grey in the console (offline)
$mc = Get-SCOMClass -Name Microsoft.Unix.Computer
$agents = Get-SCOMClassInstance -Class $mc | Where-Object { -not $_.IsAvailable }

# Process offline agents
foreach ($agent in $agents) {
    # Ignore servers in maintenance mode
    if (-not $agent.InMaintenanceMode) {
        # Get the agent display name
        $agentname = $agent.DisplayName

        # Get the ping status (-Quiet returns $true or $false)
        $RespondsToPing = Test-Connection -ComputerName $agentname -Quiet

        # Set the ping status text
        if ($RespondsToPing) { $pingable = "Pingable" }
        else { $pingable = "Not Pingable" }

        # Get the heartbeat monitor status from the Availability rollup
        $sh = $agent.GetMonitoringStateHierarchy()
        $avail_mon = $sh.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'Availability' }
        $hb_mon = $avail_mon.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'UNIX/Linux Heartbeat Monitor' }
        $hb_mon_state = $hb_mon.Item.HealthState

        # If the heartbeat is unhealthy, get the Configuration rollup status
        if ($hb_mon_state -ne "Success" -and $hb_mon_state -ne "Uninitialized") {
            $config_mon = $sh.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'Configuration' }
            $cert_mon = $config_mon.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'WS-Management Certificate Health' }
            $runas_mon = $config_mon.ChildNodes | Where-Object { $_.Item.MonitorDisplayName -eq 'WS-Management Run As Account Health' }
            $cert_mon_state = $cert_mon.Item.HealthState
            $runas_mon_state = $runas_mon.Item.HealthState

            # Build the one-line summary and log it as a single event
            $status = "PING_STATUS: $pingable, HEARTBEAT_STATUS: $hb_mon_state, CERTIFICATE_STATUS: $cert_mon_state, USER_ACCOUNT_STATUS: $runas_mon_state"
            Write-EventLog -LogName 'Operations Manager' -Source 'Health Service Script' -EventId 1041 -EntryType Error -Category 0 -Message "UNIX SCOM agent on $agentname is not sending a heartbeat. $status"
        }
    }
}
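
One way to schedule the script on a management server is the built-in Task Scheduler. This is only a sketch under stated assumptions: the script path, the task name and the 15-minute interval are all hypothetical, and the task runs as SYSTEM, which works only if that account is allowed to connect to the management group. Otherwise, substitute an account with SCOM access.

```powershell
# Hypothetical location of the saved script; the path must not contain spaces
# for this simple form of the command line to work.
$scriptPath = 'C:\Scripts\Check-UnixAgentHealth.ps1'

# Command line the task will run: Windows PowerShell, no profile, script file.
$command = "powershell.exe -NoProfile -ExecutionPolicy Bypass -File $scriptPath"

# Create a task that repeats every 15 minutes under the SYSTEM account.
# Run this from an elevated prompt.
schtasks.exe /Create /TN 'SCOM Unix Agent Health Check' /TR $command /SC MINUTE /MO 15 /RU SYSTEM
```

A shorter interval makes the alert appear sooner, but every run pings each offline agent, so very short intervals add load in large environments.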

 

Happy SCOMing!!!

Comments (1)

  1. Tommy says:

    I created an MP for this in case someone is interested.

    Features:
    – Script Rule that runs the PowerShell script from Gowdhaman on ‘a’ Management Server from the All Management Servers Resource Pool
    – Alert Rule that is enabled for the group "Operations Manager Management Server Computer Group" and generates an alert when it finds the created (Warning) events from the PS script on any of the Management Servers

    http://www.uploader.gamergun.com/files/1/Monitoring.Pack.SCX.Health.Check.xml

    Greetings
