Health policies
Health policies detect problems with hardware, software, and connectivity within or between Alluvio products. There are health policies for four types of problems: data source problems, hardware problems, module problems, and storage problems.
Health policies are defined on the Health tab of the Definitions > Policies page. A health policy violation is considered a high-severity event and always triggers a High alert.
Health policy violation events can be reported on the Reports > Events page and in Event Detail reports. If notifications are configured on the Definitions > Notifications page, alerts are sent by email or SNMP notification to specified addresses.
Alerts are listed in the Current Events display on the Dashboard and also reported in the status section of the Administration > System Information page.
When an event is ongoing, information about subsequent events of the same type is merged into the same report, so an Event Detail report includes all events of that type that are ongoing at the current moment.
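The merging behavior described above can be pictured with a minimal sketch. The `HealthEvent` record, its field names, and the `merge_ongoing` helper are hypothetical and chosen only for illustration; they do not reflect how NetProfiler stores or merges events internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthEvent:
    # Hypothetical record; field names are illustrative, not a NetProfiler schema.
    event_type: str  # for example "Storage Problem"
    detail: str      # for example "Partition is full"
    ongoing: bool


def merge_ongoing(events):
    """Group ongoing events by type so each type maps to one combined report."""
    reports = defaultdict(list)
    for event in events:
        if event.ongoing:
            reports[event.event_type].append(event.detail)
    return dict(reports)
```

Under these assumptions, two ongoing Storage Problem events would appear together under a single "Storage Problem" entry, while events that are no longer ongoing are left out.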
Data Source Problem
Data Source problems include:
- Disconnects - An Alluvio product that has been communicating with the NetProfiler is no longer reachable.
- Protocol Violations - The NetProfiler is attempting to communicate with another Alluvio product but is not receiving data in the expected format (for example, not time synchronized).
- Silent Interfaces - An interface on an Alluvio product that has been communicating with NetProfiler has stopped reporting traffic for longer than the number of minutes specified on the Advanced settings page.
- Sensor Application Matching Problems - A Cascade Sensor that has been reporting to NetProfiler has a problem with its application identification feature.
Click the Advanced settings link to display a page on which you can enable or disable detection of each of these conditions and specify the number of minutes after which an interface is considered silent if it is not sending traffic information to NetProfiler.
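As a rough illustration of the silent-interface threshold, the check amounts to comparing the time since an interface last reported traffic against the configured number of minutes. The function and parameter names below are assumptions made for this sketch, not part of any NetProfiler interface.

```python
from datetime import datetime, timedelta


def is_silent(last_report, now, silent_after_minutes):
    """Return True if the interface has been quiet for longer than the threshold.

    silent_after_minutes stands in for the value set on the Advanced settings
    page (hypothetical parameter name).
    """
    return now - last_report > timedelta(minutes=silent_after_minutes)
```

With a 15-minute threshold, an interface that last reported 20 minutes ago would be treated as silent, while one that reported 10 minutes ago would not.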
Hardware Problem
NetProfiler checks its hardware status every minute. Hardware problems include:
- Fan Failure - a fan has failed or is missing (applies to physical appliances only).
- Power Failure - a power supply has failed or is not plugged in (applies to physical appliances only). Alerting for this is disabled by default.
- High Temperature - the temperature inside the chassis is critically high (above 100 degrees C).
- CPU Issue - a CPU is configured incorrectly or has failed (physical appliances) or the provisioned virtual CPU does not meet minimum requirements (virtual or cloud appliances).
- Memory Issue - memory is configured incorrectly or has failed (physical appliances) or the provisioned virtual RAM does not meet minimum requirements (virtual or cloud appliances).
Click the Advanced settings link to display a page on which you can enable or disable detection of each of these conditions individually.
Module Problem
A Module Problem event is detected if NetProfiler experiences one or more of the following problems:
- Process Failure - a system process stops unexpectedly for five minutes or more
- Module Failure - a module in a multi-module system is unreachable for five minutes or more
- Time Unsynchronized - NetProfiler is not synchronized with the NTP timing source for five minutes or more
- System Process Crashed - a system process has crashed within the last ten minutes
- Flow Limit Breached - the licensed flow limit is exceeded; this is checked every minute
Click the Advanced settings link to display a page on which you can enable or disable detection of each of these conditions individually.
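The timing conditions listed above can be summarized in a short sketch. The function, its parameters, and the idea of passing durations in as plain minutes are all assumptions for illustration; only the five-minute and ten-minute thresholds and the flow-limit comparison come from the list itself.

```python
SUSTAINED_MINUTES = 5      # "five minutes or more"
CRASH_WINDOW_MINUTES = 10  # "within the last ten minutes"


def module_problems(process_down_min, module_unreachable_min, ntp_unsynced_min,
                    minutes_since_crash, flows, flow_limit):
    """Return the names of Module Problem conditions that currently hold.

    All arguments are hypothetical inputs: durations are in minutes, and
    minutes_since_crash is None when no crash has been recorded.
    """
    problems = []
    if process_down_min >= SUSTAINED_MINUTES:
        problems.append("Process Failure")
    if module_unreachable_min >= SUSTAINED_MINUTES:
        problems.append("Module Failure")
    if ntp_unsynced_min >= SUSTAINED_MINUTES:
        problems.append("Time Unsynchronized")
    if minutes_since_crash is not None and minutes_since_crash <= CRASH_WINDOW_MINUTES:
        problems.append("System Process Crashed")
    if flows > flow_limit:
        problems.append("Flow Limit Breached")
    return problems
```

In this sketch, each condition is evaluated independently, mirroring the fact that detection of each one can be enabled or disabled individually.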
Storage Problem
A Storage Problem event is detected if any of the following conditions or events occur:
- Disk failed
- RAID system is rebuilding
- RAID system is degraded
- Partition is full
- Partition is unmounted
- Partition failed
- Partition is mounted as read-only
The status of the storage system is displayed on the Administration > System Information page.