|Name:||lemon-sensor-exception||Metric Classes: (1)|
|Summary||LEMON sensor for generating exception metrics|
|Requires||edg-fabricMonitoring-agent >= 2.12.1-2|
|Supports||CFG CHK SOD VER|
|Copyright||Copyright (c) 2001 EU DataGrid|
lemon-sensor-exception - v1.2.1-2
Sensor exception is a C++ sensor developed in collaboration between CERN and BARC. The sensor is used for generating exception metrics and launching recovery actions (aka actuators) in response to detected problems. The built-in correlation engine allows alarms to be raised based on the information reported by multiple metrics, and supports mathematical expressions, string comparisons, regular expressions and generating alarms on behalf of other monitored entities. lemon-sensor-exception is an officially supported lemon sensor and is mandatory if you intend to use LAS, the Lemon Alarm System.
The protocol uses the following format:
|Exception State||Integer (1)||Status of the exception, one of: [ 0 = no exception, 1 = exception detected, -1 = error, 2 = disabled exception]|
|Code||Integer (3)||Code of the corresponding action/message. This may be an automated recovery action, general information or notification for another service.|
The following table outlines all the exception states and codes:
|Exception State||Code||Associated string|
|0||000||No Exception detected|
|1||010||FreeText update for actuators|
|1||015||FreeText update while in pre-alarm mode|
|1||100||Failed to Launch Actuator|
|1||110||Actuator Terminated (exit code == 0)|
|1||115||Actuator Terminated (exit code != 0)|
|1||130||Actuator died (an actuator may die and the reason why is unknown)|
|1||135||Actuator attempts exceeded|
|1||140||Exception present but hasn't reached the minimum number of occurrences for code 000 (pre-alarm)|
|-1||205||Correlation Syntax Error|
|-1||210||API Error (Not able to get the metric value from repository)|
|-1||215||Evaluation Error (Problem in evaluation of comparison)|
|-1||220||Resample error (Involved metric could not be re-sampled within 30s)|
|-1||225||No common entities found from correlation string|
|-1||230||Metric value is older than agent startup|
|-1||235||Index out of bound|
Note: When the exception state is set to 2 (disabled) all the sub codes still apply. However, in this mode the code alone cannot distinguish code 000 in state 1 from code 000 in state 0. If an alarm is present but disabled you will see 2 000 (correlation string). To properly determine whether an alarm is present when disabled, you must therefore also check that the free text field is not set to (null).
The following example demonstrates a simple transition of events from no alarm, through running actuators, to the final exception for a /tmp full occupancy problem. The correlation for the alarm in this case would be (9104:1 eq /tmp) && (9104:5 > 80). This reads: if the first field of the system.partitionInfo metric (9104) is equal to /tmp and the fifth field of the same metric is greater than 80%, then an alarm is present. The values enclosed in square brackets [,] are the actual values of the metrics at the time of sampling.
|0 000 (null)||No exception present|
|1 105 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5_>_80))||Exception detected running automatic recovery action e.g. tmp cleaning script.|
|1 010 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5_>_80))||A free text update during the period when an actuator is running. In this example it shows that the occupancy of /tmp has dropped from 90% to 85%.|
|1 110 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5_>_80))||Actuator terminated successfully. Just because the actuator returned successfully doesn't mean that the problem has been rectified.|
|1 135 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5_>_80))||The maximum number of actuator attempts has been exceeded. Raise operator alarm on the LAS console. If no actuators are defined the sensor skips the actuator transition states and goes straight to 1 000 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5_>_80))|
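The sample values above can be read mechanically. The following Python sketch is illustrative only and is not part of the sensor: it assumes the value arrives as a single string of the form "state code free-text" separated by single spaces, and the names parse_sample and exception_present are hypothetical. It applies the rule from the note above, that in state 2 only the free text reveals whether an exception is present.

```python
from typing import NamedTuple, Optional


class ExceptionSample(NamedTuple):
    state: int                 # 0, 1, -1 or 2 as in the state table
    code: str                  # three-digit code, kept as text ("000", "110", ...)
    free_text: Optional[str]   # None when the sensor reported (null)


def parse_sample(value: str) -> ExceptionSample:
    """Split '<state> <code> <free text>'; the free text may contain spaces."""
    state, code, free_text = value.split(" ", 2)
    return ExceptionSample(int(state), code,
                           None if free_text == "(null)" else free_text)


def exception_present(sample: ExceptionSample) -> bool:
    """State 1 always means a detected exception; in state 2 (disabled)
    the free text must be checked; states 0 and -1 report none."""
    if sample.state == 1:
        return True
    if sample.state == 2:
        return sample.free_text is not None
    return False


sample = parse_sample(
    "1 110 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5_>_80))")
print(sample.state, sample.code, exception_present(sample))
```

Keeping the code as text preserves the leading zeros used in the table.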
The lemon-sensor-exception binaries are primarily distributed in RPM format but may also be available as a .tar.gz file. The versioning convention used within the lemon project is a standard triplet of integers plus a patch level: MAJOR.MINOR.SUBMINOR-PATCH. A change in the MAJOR number represents a large-scale upgrade which is not backwards compatible with previous versions and may require additional work to integrate into the current framework. MINOR versions should retain binary compatibility with older versions, and changes in the PATCH level are both forwards and backwards compatible.
All RPMs, source tarballs and pkgs providing binary installations are named using the format 'lemon-sensor-<name>-<version>.<architecture>.<package format>', e.g. lemon-sensor-exception-1.2.1-2.arch.rpm. Before installing any software provided by the lemon project, please take care to read the documentation and license.
The lemon CVS repository is publicly available using anonymous (pserver) CVS access. To check out the latest version of lemon-sensor-exception from CVS you need to export the CVSROOT environment variable and check out the source code using a CVS client. For example:
> cvs checkout lemon/sensors/sensor-exception
Note: UNIX file and directory names are case sensitive. The path to the project CVSROOT must be specified using lowercase characters (i.e. /cvsroot/).
To build the sensor you need to issue the following command from within the sensor's directory, e.g.:
> make rpm
The build framework should now build an RPM in the build/rpm/RPMS/<architecture> directory.
In order to rebuild an RPM from source you need to download the latest source RPM from the lemon software repository. Once the SRPM has been obtained it can be rebuilt using:
> rpmbuild --rebuild lemon-sensor-exception-1.2.1-2.src.rpm
The RPM should be created in your RPMS directory (e.g. /usr/src/redhat/RPMS/noarch/).
Note: Only officially distributed and supported software is provided by the lemon software repository. Should the source RPM be missing, please contact the maintainer(s) listed above, who should be able to provide you with the file.
To install the sensor using the RPM you would issue the command:
By default this will install:
- binaries into /usr/libexec/sensors
- configuration files into /etc/lemon/agent/sensors
- documentation into /usr/share/doc/lemon-sensor-exception/
Configuration of the sensor is done through the lemon-agent using the tabulated configuration file format defined by the lemon-agent documentation. This distribution ships with the following configuration files:
In order to benefit from all the functionality of sensor exception the lemon-agent must:
- be newer than version 2.12.0-1.
- record its samples locally on the machine in its local cache (Configuration path: MSA/LocalCache/Path, default /var/spool/edg-fmon-agent)
- have a sample on demand pipe path defined for use by lemon-host-check and for the sensor to be able to refresh exceptions.
- have the CacheAll option enabled on any raw metrics (metrics which contribute towards a correlation) for which smoothing is configured.
The alarm.exception class, which is the only metric class provided by the sensor, is used to define new exceptions to be monitored. The metric class supports the following configuration options:
|Correlation||<string>||The correlation option is the power behind sensor exception's alarm-generating capabilities and is a mandatory option for all metric instances of the alarm.exception class. This option tells the sensor which metrics are involved in deciding whether an alarm is present and how these metrics should be interpreted.|
|Actuator||[cmdline]||The path to the actuator to run if the correlation string results in being true.|
|MaxRuns||<attempts> [window]||The attempts argument defines the maximum number of times to run an actuator consecutively before raising an alarm. The optional window argument changes the meaning of attempts slightly by restricting the maximum number of runs to a specific time window. For example, MaxRuns 3 86400 means that an actuator is allowed to run 3 times in 86400 seconds (1 day), regardless of whether the problem was corrected between actuator attempts.|
|Timeout||<seconds>||The maximum amount of time in seconds an actuator is allowed to run for before being killed.|
|ReSampleOffset||<seconds>||Once an actuator has successfully run, sensor exception will attempt to reschedule the metrics involved in the correlation and then reschedule the exception itself x seconds later. This option allows you to change the default delay between requesting the raw metrics and the exception. The default is: 30 seconds|
|MinOccurs||<occurrences>||The minimum occurrences option allows you to specify the number of times a problem must be detected consecutively on the machine before an alarm is raised. During the period when a problem is detected but has not reached its final alarm state, it is considered to be a pre-alarm. This option is useful if you are experiencing transient alarms.|
|An exception which is considered silent effectively sets the exception state defined in section 2 to the value 2 for all transitions. The exception is disabled, no actuators will run and no alarms will be displayed on the LAS console.|
|Local||[yes|no]||Note: Only for ncm-fmonagent (Quattor)
Technically this value has no effect within the sensor but is an instruction to the lemon-agent not to transmit data for this exception to the remote application servers. As remote transmission does not occur, the outcome of the exception can never appear on LAS (Lemon Alarm System) and is only visible locally on the machine using lemon-host-check.|
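The two forms of MaxRuns described in the table can be sketched as follows. This is a minimal illustration and not the sensor's implementation: the class name RunLimiter is hypothetical, timestamps are plain seconds, and the window is assumed to be a sliding window measured from the start of each run.

```python
from collections import deque


class RunLimiter:
    """Illustrative model of 'MaxRuns <attempts> [window]'."""

    def __init__(self, attempts, window=None):
        self.attempts = attempts
        self.window = window      # seconds, or None for "consecutive runs"
        self.runs = deque()       # start times of the runs still counted

    def may_run(self, now):
        """Return True if another actuator run is allowed at time 'now'."""
        if self.window is not None:
            # Forget runs that have fallen out of the time window.
            while self.runs and now - self.runs[0] >= self.window:
                self.runs.popleft()
        return len(self.runs) < self.attempts

    def record_run(self, now):
        self.runs.append(now)

    def problem_cleared(self):
        """Without a window only consecutive runs count, so a corrected
        problem resets the counter. With a window the runs keep counting
        regardless of whether the problem was corrected in between."""
        if self.window is None:
            self.runs.clear()


limiter = RunLimiter(attempts=3, window=900)   # MaxRuns 3 900
for t in (0, 300, 600):
    if limiter.may_run(t):
        limiter.record_run(t)
print(limiter.may_run(700))   # a fourth run inside the window is refused
```

When may_run returns False the sensor would report code 135 (actuator attempts exceeded) rather than launching the actuator again.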
The basic format of a correlation string is:
|entity_name||Is an optional parameter and should only be used for generating alarms on behalf of other machines/entities. (wildcards '*' are supported).|
|metric_id||Is the id of the metric you wish to check|
|field_position||Is the field within the metric you wish to check. A field_position of 0 represents the whole metric.|
|!=||not equal to|
|> or gt||greater than|
|>= or ge||greater than or equal to|
|< or lt||less than|
|<= or le||less than or equal to|
|range||value is in range A-B|
|!range||value is not in range A-B|
|equal or eq or ==||string equal to|
|nequal or ne or !=||string not equal to|
|regex or re||matches regular expression|
|!regex or !re||does not match regular expression|
|Note: multiple [entity_name]:<metric_id>:<field_position> combinations can be joined together using &&, || or | operators|
|reference_value||The reference value used within the comparison. If the reference_value is a string it must be enclosed in single quotes e.g. 'my reference value is'.
Note: The reference value itself may be another [nodename]:<metric_id>:<field_position> combination. This is useful if you want to join two metrics together on a common field.|
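To make the parts of a single comparison concrete, the sketch below splits one term into the pieces named in the table. It is an assumption-laden illustration and not the sensor's parser: it handles only one term of the form "[entity_name:]metric_id:field_position operator reference_value" without surrounding parentheses, and the names TERM and parse_term are hypothetical.

```python
import re

# One comparison term, e.g. "9104:1 eq '/tmp'" or "*:9501:5 != 200".
TERM = re.compile(r"""
    ^\s*
    (?:(?P<entity>[^:\s]+):)?        # optional entity name or wildcard
    (?P<metric>\d+):(?P<field>\d+)   # metric id and field position
    \s+(?P<op>\S+)\s+                # operator: >, eq, regex, ...
    (?P<ref>.+?)\s*$                 # reference value (rest of the term)
""", re.VERBOSE)


def parse_term(term):
    """Return the components of one comparison as a dictionary."""
    match = TERM.match(term)
    if match is None:
        raise ValueError(f"not a comparison term: {term!r}")
    ref = match.group("ref")
    # String references are enclosed in single quotes; strip them.
    if len(ref) >= 2 and ref[0] == ref[-1] == "'":
        ref = ref[1:-1]
    return {
        "entity": match.group("entity"),
        "metric_id": int(match.group("metric")),
        "field": int(match.group("field")),
        "operator": match.group("op"),
        "reference": ref,
    }


print(parse_term("9104:1 eq '/tmp'"))
print(parse_term("*:9501:5 != 200"))
```

A reference such as 9208:1 is returned unchanged, so a caller could feed it back through the same pattern to detect a join on a common field.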
The following example demonstrates how to run an actuator when the occupancy of the tmp partition is greater than 80%, using the lemon-agent configuration file format. The configuration should be defined in /etc/lemon/agent/metrics/exception.conf. If the actuator fails to correct the problem 3 times consecutively, or the actuator tries to run more than 3 times in 900 seconds, an alarm will be raised.
30010
    MetricName exception.tmp_full
    MetricClass alarm.exception
    Timing 300 30
    Parameters
        Correlation ((9104:1 eq '/tmp') && (9104:5 > 80))
        Actuator /usr/local/sbin/clean-tmp-partition -o 75
        MaxRuns 3 900
        Timeout 300
For Quattor users the above configuration in pan language would look like:
"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist(
        "execve", "/usr/local/sbin/clean-tmp-partition -o 75",
        "maxruns", 3,
        "timeout", 300,
        "window", 900,
        "active", true
    )
);
Note: the pan configuration above does not define the timing (sampling period) of the exception. For exceptions the period field is not mandatory, as the ncm-fmonagent component will automatically set a sampling period equal to that of the highest sampling frequency of the metrics involved in the correlation plus a 30 second offset.
The following example demonstrates a more complex correlation which combines the information from several different metrics. In this case the exception (exception.lemon_agent_wrong) is used to check if the agent and its sensors are behaving correctly with regards to the number of resources that they consume and how many error messages of different severities have occurred in the last 300 seconds.
30903
    MetricName exception.lemon_agent_wrong
    MetricClass alarm.exception
    Timing 300 0
    Parameters
        Correlation 10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
So, the correlation reads as:
If (the uptime of the agent (10004:1) is greater than 600 seconds) AND any one of the following is true:
(the CPU utilisation of the sensors (10004:7) over the last sampling period is greater than 10%) OR
(the memory consumed by the sensors (10004:8) is greater than 150 megabytes for machines of architecture type (4109:3) i386, or 600 megabytes for machines of architecture type x86_64) OR
(the number of warning messages (10007:2) recorded over the last sampling period is greater than 50) OR
(the number of error messages (10007:3) recorded over the last sampling period is greater than 10) OR
(the number of fatal messages (10007:4) recorded over the last sampling period is greater than 0), then raise an alarm.
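The same logic can be written as an ordinary boolean expression, which makes the grouping of the AND and OR parts explicit. The function below is only a restatement of the correlation in Python. The argument names are hypothetical, and the memory value is assumed to be in kilobytes since the text equates 150000 with 150 megabytes.

```python
import re


def lemon_agent_wrong(uptime, cpu_pct, mem_kb, arch, warnings, errors, fatals):
    """Mirror of the exception.lemon_agent_wrong correlation.

    uptime   -> 10004:1      warnings -> 10007:2
    cpu_pct  -> 10004:7      errors   -> 10007:3
    mem_kb   -> 10004:8      fatals   -> 10007:4
    arch     -> 4109:3
    """
    return uptime > 600 and (
        cpu_pct > 10
        or (mem_kb > 150000 and arch == "i386")
        or (mem_kb > 600000 and re.search("64", arch) is not None)
        or warnings > 50
        or errors > 10
        or fatals > 0
    )


# A 64-bit host using 200 MB is fine, the same usage on i386 is not.
print(lemon_agent_wrong(3600, 2, 200000, "x86_64", 0, 0, 0))
print(lemon_agent_wrong(3600, 2, 200000, "i386", 0, 0, 0))
```

Because the uptime test is joined with AND to the whole group, no alarm can be raised during the first 600 seconds after the agent starts.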
The following example demonstrates how to join two metrics together using a common field.
The involved metrics are system.networkInterfaceIO metric 9208 which has the following format:
and system.networkInterfaceInfo metric 9200 which has the following format:
To join the two metrics together on the InterfaceName you would do:
Correlation (9200:1 == 9208:1)
The exception sensor also has the ability to generate alarms on behalf of other monitored entities. Note: To raise an alarm on behalf of another entity, the lemon-agent's local cache must contain the metrics recorded on behalf of the other entity. In the example below we are using the remote sensor's remote.http metric class to check HTTP web servers remotely. The sensor records the HTTP response code (e.g. 200, 301, 401, 404) on behalf of a service name which is configured in an external configuration file. A generic exception is then used to raise an alarm when the response is not equal to 200 or 301 for multiple services.
33000
    MetricName exception.http_service_down
    MetricClass alarm.exception
    Timing 300 30
    Parameters
        Correlation (*:9501:5 != 200) && (*:9501:5 != 301)
The key difference between this correlation and the previous examples is the inclusion of the <entity_name> option. In this case it is a wildcard, which states that for any entity with data for metric 9501 in the local cache, an alarm is raised for that entity if the value is not equal to 200 or 301.
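The effect of the wildcard can be pictured as evaluating the same condition once per entity found in the local cache. The sketch below assumes a hypothetical in-memory mapping from entity name to the response code held in field 5 of metric 9501. The real local cache layout is not described here, and the service names are invented for illustration.

```python
def entities_in_alarm(response_codes):
    """Apply (*:9501:5 != 200) && (*:9501:5 != 301) to every entity.

    response_codes maps an entity (service) name to its last recorded
    HTTP response code. Returns the entities for which an alarm applies.
    """
    return sorted(
        entity
        for entity, code in response_codes.items()
        if code != 200 and code != 301
    )


# Hypothetical cache contents for three monitored web services.
cache = {"www-a": 200, "www-b": 404, "www-c": 301}
print(entities_in_alarm(cache))
```

Each entity is judged independently, so one failing service raises an alarm for itself only.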
When actuators are executed by the exception sensor, they are run inside a forked process (similar to the way in which the agent runs its sensors) to prevent a blocking actuator from causing the sensor to hang. Any information written to stdout or stderr by the actuator will be recorded in the lemon-agent's log file (/var/log/edg-fmon-agent.log). Due to the nature of the system call used inside the sensor when executing an actuator, shell-style conveniences such as redirection, globbing and pipes are not available. The system call executes the command and passes each remaining word of the actuator line to it as an argument. Words such as *, && and |, which have a special meaning in a shell, are therefore passed as literal arguments and are not interpreted. To use shell-style arguments you must invoke the shell yourself. For example,
Actuator /bin/sh -c \\"/bin/echo 'This is a message from $HOSTNAME' \\"
The example above simply invokes the shell and causes echo to print the message, substituting $HOSTNAME with the name of the machine on which the sensor runs. The \\" character sequences are not shell conventions but are required so that the sensor handles the cmdline correctly.
It is also possible to pass arguments to the actuator describing why the actuator was triggered. For example, if you had the correlation ((9104:1 eq '/tmp') && (9104:5 < 80)), the actual value of the first condition can be accessed by passing the actuator the value $act_value_01, the second using $act_value_02, and so on. This allows you to create generic scripts which can adapt or perform different actions depending on the cause of the alarm or the severity of the problem.
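The difference between direct execution and execution through a shell can be reproduced outside the sensor. The following Python sketch assumes a POSIX system with /bin/echo and /bin/sh. It uses subprocess as a stand-in for the sensor's fork and exec, and it assumes the command line is split on whitespace, so it models the behaviour described above rather than the sensor's actual code. The helper name run_actuator is hypothetical.

```python
import subprocess


def run_actuator(cmdline):
    """Split the line into words and execute it directly, without a shell."""
    argv = cmdline.split()
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# Without a shell, &&, | and * reach echo as ordinary arguments.
literal = run_actuator("/bin/echo a && b | *")
print(repr(literal))

# With an explicit shell the same operators are interpreted.
via_shell = subprocess.run(
    ["/bin/sh", "-c", "/bin/echo one && /bin/echo two"],
    capture_output=True, text=True, check=True,
).stdout
print(repr(via_shell))
```

The first call prints the operators back verbatim, while the second runs two separate echo commands.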
Actuator /bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\"
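The substitution of $act_value_NN placeholders can be sketched as a simple text replacement performed before the actuator is launched. This is an assumption about the mechanism and not the sensor's code: the function name is hypothetical, placeholders are assumed to use two-digit, one-based indices as in the text, and placeholders without a matching value are left untouched.

```python
import re


def substitute_act_values(cmdline, values):
    """Replace $act_value_01, $act_value_02, ... with the sampled values
    of the corresponding conditions in the correlation."""
    def replace(match):
        index = int(match.group(1))
        if 1 <= index <= len(values):
            return str(values[index - 1])
        return match.group(0)   # no such condition: keep the placeholder

    return re.sub(r"\$act_value_(\d{2})", replace, cmdline)


# Values sampled for (9104:1 ...) and (9104:5 ...) respectively.
cmd = "/bin/echo '$act_value_01 $act_value_02'"
print(substitute_act_values(cmd, ["/tmp", 85]))
```

A generic clean-up script could use the first value to decide which partition to clean and the second to decide how aggressively to do so.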
The following metric classes are exposed by this sensor:
Description: generic exception class used to record the state of exceptions
The output of the metric uses the following format: (fields are separated by a single space ' ' character)
This package is licensed under the EU Datagrid License outlined below: