lemon-sensor-exception - v1.2.1-2

Name: lemon-sensor-exception   Metric Classes: (1)
Version: 1.2.1
Release 2
Summary LEMON sensor for generating exception metrics
Requires edg-fabricMonitoring-agent >= 2.12.1-2, pcre
Command Line /usr/libexec/sensors/lemon-sensor-exception
Language CPP
Supports CFG CHK SOD VER
Copyright Copyright (c) 2001 EU DataGrid
URL http://cern.ch/lemon
Group System Environment/Base
Maintainers Project Lemon

Sensor exception is a C++ sensor developed in collaboration between CERN and BARC. The sensor is used for generating exception metrics and launching recovery actions (aka actuators) in response to detected problems. The built-in correlation engine allows alarms to be raised based on the information reported by multiple metrics and supports mathematical expressions, string comparisons, regular expressions and generating alarms on behalf of other monitored entities. lemon-sensor-exception is an officially supported lemon sensor and is mandatory if you intend to use LAS, the Lemon Alarm System.

The sensor uses the lemon alarm protocol for generating alarms and communicating its actions and status to other lemon components such as LAS and lemon-host-check.

The protocol uses the following format:

Field Name Type Description
Exception State Integer (1) Status of the exception, one of: [ 0 = no exception, 1 = exception detected, -1 = error, 2 = disabled exception]
Code Integer (3) Code of the corresponding action/message. This may be an automated recovery action, general information or notification for another service.
FreeText String Comparison performed.

The following table outlines all the exception states and codes:

Exception State Code Associated string
0 000 No Exception detected
1 000 Exception detected
1 005 FreeText update
1 010 FreeText update for actuators
1 015 FreeText update while in pre-alarm mode
1 100 Failed to Launch Actuator
1 105 Actuator Running
1 110 Actuator Terminated (exit code == 0)
1 115 Actuator Terminated (exit code != 0)
1 120 Actuator killed
1 125 Actuator timeout
1 130 Actuator died <- an actuator may die and the status (reason why) is unknown
1 135 Actuator attempts exceeded
1 140 Exception present but hasn't reached the minimum number of occurrences for code 000 (pre-alarm)
-1 205 Correlation Syntax Error
-1 210 API Error (Not able to get the metric value from repository)
-1 215 Evaluation Error (Problem in evaluation of comparison)
-1 220 Resample error (Involved metric could not be re-sampled within 30s)
-1 225 No common entities found from correlation string
-1 230 Metric value is older than agent startup
-1 235 Index out of bound
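The table above can be captured as a small lookup, for instance in a script that post-processes exception samples. The sketch below is an illustration only: the names `ALARM_CODES` and `describe` are hypothetical and not part of the sensor, and the strings are copied (some abridged) from the table. Codes are kept as strings so that leading zeros survive.

```python
# Hypothetical lookup table: (exception state, code) -> associated string.
# Entries are copied (some abridged) from the table above; this helper is an
# illustration and is not shipped with lemon-sensor-exception.
ALARM_CODES = {
    (0, "000"): "No Exception detected",
    (1, "000"): "Exception detected",
    (1, "005"): "FreeText update",
    (1, "010"): "FreeText update for actuators",
    (1, "015"): "FreeText update while in pre-alarm mode",
    (1, "100"): "Failed to Launch Actuator",
    (1, "105"): "Actuator Running",
    (1, "110"): "Actuator Terminated (exit code == 0)",
    (1, "115"): "Actuator Terminated (exit code != 0)",
    (1, "120"): "Actuator killed",
    (1, "125"): "Actuator timeout",
    (1, "130"): "Actuator died",
    (1, "135"): "Actuator attempts exceeded",
    (1, "140"): "Exception present, minimum occurrences not reached (pre-alarm)",
    (-1, "205"): "Correlation Syntax Error",
    (-1, "210"): "API Error",
    (-1, "215"): "Evaluation Error",
    (-1, "220"): "Resample error",
    (-1, "225"): "No common entities found from correlation string",
    (-1, "230"): "Metric value is older than agent startup",
    (-1, "235"): "Index out of bound",
}


def describe(state: int, code: str) -> str:
    """Return the associated string for a state/code pair, or a fallback."""
    return ALARM_CODES.get((state, code), f"Unknown state/code: {state} {code}")
```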

Note: When the exception state is set to 2 (disabled) all the sub codes still apply. However, in this mode code 000 alone cannot tell you whether the underlying state is 0 or 1. If an alarm is present but disabled you will see 2 000 (correlation string). To determine whether an alarm is present while disabled you must therefore also check that the free text field is not set to (null).
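The rule in the note can be written as a short check. This is a minimal sketch, assuming that an empty free text field is reported literally as `(null)`; the function name `alarm_present` is hypothetical, and how error codes behave in the disabled state is not covered here.

```python
def alarm_present(state: int, code: str, freetext: str) -> bool:
    """Decide whether a sample represents a detected exception.

    Assumption: an absent free text field is reported as the literal "(null)".
    """
    if state == 1:
        return True
    if state == 2:
        # Disabled: code 000 is ambiguous, so fall back to the free text field.
        return freetext.strip() != "(null)"
    # 0 = no exception, -1 = error while evaluating the correlation.
    return False
```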

The following example demonstrates a simple sequence of transitions, from no alarm, through running actuators, to the final exception, for a /tmp occupancy problem. The correlation for the alarm in this case would be (9104:1 eq /tmp) && (9104:5 > 80), which reads: if the first field of the system.partitionInfo metric (9104) is equal to /tmp and the fifth field of the same metric is greater than 80% then an alarm is present. The values enclosed in square brackets [ ] are the actual values of the metrics at the time of sampling.

0 000 (null) No exception present
1 105 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5[90]_>_80)) Exception detected; running automatic recovery action, e.g. a tmp cleaning script.
1 010 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5[85]_>_80)) A free text update during the period when an actuator is running. In this example it shows that the occupancy of /tmp has dropped from 90% to 85%.
1 110 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5[85]_>_80)) Actuator terminated successfully. Just because the actuator returned successfully doesn't mean that the problem has been rectified.

1 135 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5[85]_>_80)) The maximum number of actuator attempts has been exceeded. Raise operator alarm on the LAS console.

If no actuators are defined the sensor skips the actuator transition states and goes straight to 1 000 ((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5[90]_>_80)).
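The sampled values embedded in the free text can be pulled out with a regular expression. The sketch below assumes the layout seen in the example rows, `entity:metric_id:field[value]`; the helper name `sampled_values` and the pattern are illustrative and not part of lemon.

```python
import re

# entity:metric_id:field_position[value], as it appears in the free text field.
# The entity character class is an assumption based on the host name in the example.
TERM = re.compile(
    r"(?P<entity>[A-Za-z0-9.-]+):(?P<metric>\d+):(?P<field>\d+)\[(?P<value>[^\]]*)\]"
)


def sampled_values(freetext: str):
    """Return (entity, metric_id, field_position, value) for each term found."""
    return [
        (m["entity"], int(m["metric"]), int(m["field"]), m["value"])
        for m in TERM.finditer(freetext)
    ]


example = "((lxs5013:9104:1[/tmp]_eq_/tmp)_ &&_(lxs5013:9104:5[90]_>_80))"
print(sampled_values(example))
```

A free text of `(null)` simply yields an empty list.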

The lemon-sensor-exception binaries are primarily distributed in RPM format but may also be available as a .tar.gz file. The versioning convention used within the lemon project is a standard triplet of integers followed by a patch level: MAJOR.MINOR.SUBMINOR-PATCH. Changes in the MAJOR number represent large-scale upgrades which are not backwards compatible with previous versions and may require additional effort to integrate into the current framework. MINOR versions should retain binary compatibility with older versions, and changes in the PATCH level are both forwards and backwards compatible.

All rpms, source tarballs and pkgs providing binary installations are named using the format 'lemon-sensor-<name>-<version>.<architecture>.<package format>', e.g. lemon-sensor-exception-1.2.1-2.arch.rpm. Before installing any software provided by the lemon project please take care to read the documentation and license.
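As an illustration of the naming and versioning convention, the following sketch splits a package file name into its parts. It follows the shape of the example file name (name and version separated by a hyphen); the `PackageName` type and `parse_package` function are hypothetical helpers, not lemon tooling.

```python
import re
from dataclasses import dataclass
from typing import Optional

PACKAGE = re.compile(
    r"^lemon-sensor-(?P<name>.+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<subminor>\d+)"
    r"-(?P<patch>\d+)\.(?P<arch>[^.]+)\.(?P<fmt>.+)$"
)


@dataclass
class PackageName:
    name: str
    major: int
    minor: int
    subminor: int
    patch: int
    arch: str
    fmt: str


def parse_package(filename: str) -> Optional[PackageName]:
    """Split a lemon sensor package file name, or return None if it does not match."""
    m = PACKAGE.match(filename)
    if m is None:
        return None
    return PackageName(
        m["name"], int(m["major"]), int(m["minor"]), int(m["subminor"]),
        int(m["patch"]), m["arch"], m["fmt"],
    )


print(parse_package("lemon-sensor-exception-1.2.1-2.src.rpm"))
```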

3.1 Building an RPM from CVS (Concurrent Versions System)

The lemon CVS repository is publicly available using anonymous (pserver) CVS access. To checkout the latest version of lemon-sensor-exception from CVS you need to export the CVSROOT environment variable and checkout the source code using a cvs client. For example:

> export CVSROOT=:pserver:anonymous@isscvs.cern.ch:/local/reps/elfms/
> cvs checkout lemon/sensors/sensor-exception

Note: UNIX file and directory names are case sensitive. The path to the project CVSROOT must be specified using lowercase characters (i.e. /cvsroot/).

To build the sensor, issue the following command from within the sensor's directory, e.g.:

> cd lemon/sensors/sensor-exception
> make rpm

The build framework should now build an RPM in the build/rpm/RPMS/<architecture> directory.

3.2 Building an RPM from the Source RPM

In order to rebuild an RPM from source you need to download the latest source RPM from the lemon software repository. Once the srpm has been obtained it can be rebuilt using:

> rpm -Uvh lemon-sensor-exception-1.2.1-2.src.rpm
> rpmbuild --rebuild lemon-sensor-exception-1.2.1-2.src.rpm

The rpm should be created in your RPMS directory (e.g. /usr/src/redhat/RPMS/noarch/).

Note: Only officially distributed and supported software is provided by the lemon software repository. Should the source rpm be missing, please contact the maintainer(s) listed above, who should be able to provide you with the file.

3.3 Installing the sensor

To install the sensor using the RPM you would issue the command:

> rpm -Uvh lemon-sensor-exception-1.2.1-2.arch.rpm

By default this will install:

  • binaries into /usr/libexec/sensors
  • configuration files into /etc/lemon/agent/sensors
  • documentation into /usr/share/doc/lemon-sensor-exception/

Configuration of the sensor is done through the lemon-agent using the tabulated configuration file format defined by the lemon-agent documentation. This distribution ships with the following configuration files:

In order to benefit from all the functionality of sensor exception the lemon-agent must:

The alarm.exception class, which is the only metric class provided by the sensor, is used to define new exceptions to be monitored. The metric class supports the following configuration options:

Option Format Description
Correlation <string> The correlation option is the power behind sensor exception's alarm-generating capabilities and is mandatory for all metric instances of the alarm.exception class. This option tells the sensor which metrics are involved in deciding whether an alarm is present and how these metrics should be interpreted.
Actuator [cmdline] The path to the actuator to run if the correlation string evaluates to true.
MaxRuns <attempts> [window] The attempts argument defines the maximum number of times to run an actuator consecutively before raising an alarm. The window argument changes the meaning of attempts slightly by restricting the maximum number of runs to a specific time window. For example, a MaxRuns of 3 86400 means that an actuator is allowed to run 3 times in 86400 seconds (1 day), regardless of whether the problem was corrected in between actuator attempts.
Timeout <seconds> The maximum amount of time in seconds an actuator is allowed to run for before being killed.
ReSampleOffset <seconds> Once an actuator has successfully run, sensor exception will attempt to reschedule the metrics involved in the correlation and then reschedule the exception itself x seconds later. This option allows you to change the default delay between requesting the raw metrics and the exception. The default is: 30 seconds
MinOccurs <occurrences> The minimum occurrences option allows you to specify the number of times an alarm must be detected consecutively on the machine before raising an alarm. During the period when a problem is detected but not in its final alarm state, it is considered to be a pre-alarm. This option is useful if you are experiencing transient alarms.
Silent [yes|no] An exception which is considered silent effectively sets the exception state defined in section 2 to the value 2 for all transitions. The exception is disabled, no actuators will run and no alarms will be displayed on the LAS console.
Local [yes|no] Note: Only for ncm-fmonagent (Quattor users). Technically this value has no effect within the sensor but is an instruction to the lemon-agent not to transmit data for this exception to the remote application servers. As remote transmission does not occur the outcome of the exception can never appear on LAS (Lemon Alarm System) and is only visible locally on the machine using lemon-host-check.
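The MaxRuns and MinOccurs semantics described in the table can be modelled as two small counters. This is a sketch under stated assumptions, not the sensor's implementation: the class names are hypothetical, and treating the window as a sliding window over run start times is an assumption, since the text above does not say how the window is anchored.

```python
from collections import deque
from typing import Optional


class RunLimiter:
    """Sketch of MaxRuns <attempts> [window]; names are hypothetical."""

    def __init__(self, attempts: int, window: Optional[float] = None):
        self.attempts = attempts
        self.window = window
        self._recent = deque()   # start times of runs still inside the window
        self._consecutive = 0    # runs since the problem last cleared

    def problem_cleared(self) -> None:
        # Only the consecutive counter resets. With a window, earlier runs keep
        # counting regardless of whether the problem was corrected in between.
        self._consecutive = 0

    def allow(self, now: float = 0.0) -> bool:
        """Record a run and return True if one more run is permitted at `now`."""
        if self.window is None:
            if self._consecutive >= self.attempts:
                return False
            self._consecutive += 1
            return True
        while self._recent and now - self._recent[0] >= self.window:
            self._recent.popleft()
        if len(self._recent) >= self.attempts:
            return False
        self._recent.append(now)
        return True


class OccurrenceCounter:
    """Sketch of MinOccurs: consecutive detections needed before the alarm."""

    def __init__(self, min_occurs: int):
        self.min_occurs = min_occurs
        self._count = 0

    def update(self, detected: bool) -> str:
        if not detected:
            self._count = 0
            return "no exception"
        self._count += 1
        return "alarm" if self._count >= self.min_occurs else "pre-alarm"
```

With `RunLimiter(3, 86400)` a fourth run inside the same day is refused, which is the point at which the sensor would report that the actuator attempts have been exceeded.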



4.1 Correlation Syntax

The basic format of a correlation string is:

[entity_name]:<metric_id>:<field_position> <operator> <reference_value> ...

Where:

Argument Description
entity_name Is an optional parameter and should only be used for generating alarms on behalf of other machines/entities. (wildcards '*' are supported).
metric_id Is the id of the metric you wish to check
field_position Is the field within the metric you wish to check. A field_position of 0 represents the whole metric.
operator One of:
  == equal to
  != not equal to
  > or gt greater than
  >= or ge greater than or equal to
  < or lt less than
  <= or le less than or equal to
  range value is in range A-B
  !range value is not in range A-B
  equal or eq or == string equal to
  nequal or ne or != string not equal to
  regex or re matches regular expression
  !regex or !re does not match regular expression
   
  Note: multiple [entity_name]:<metric_id>:<field_position> combinations can be joined together using the &&, || or | operators
reference_value The reference value used within the comparison. If the reference_value is a string it must be enclosed in single quotes e.g. 'my reference value is'.
Note: The reference value itself may be another [entity_name]:<metric_id>:<field_position> combination. This is useful if you want to join two metrics together on a common field.
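To make the syntax concrete, the following sketch parses a single comparison and applies an operator from the table to a sampled value. It is an approximation for illustration: `parse_term` and `compare` are hypothetical names, Python's `re` module stands in for the pcre library the sensor depends on, range bounds are assumed to be non-negative and inclusive, and falling back to a string comparison for == and != is an assumption.

```python
import operator
import re

TERM = re.compile(
    r"^(?:(?P<entity>[^\s:]+):)?(?P<metric>\d+):(?P<field>\d+)"
    r"\s+(?P<op>\S+)\s+(?P<ref>.+)$"
)
NUMERIC = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "gt": operator.gt, ">=": operator.ge, "ge": operator.ge,
    "<": operator.lt, "lt": operator.lt, "<=": operator.le, "le": operator.le,
}


def parse_term(text: str):
    """Split one comparison into (entity, metric_id, field_position, op, reference)."""
    m = TERM.match(text.strip())
    if m is None:
        raise ValueError(f"not a comparison: {text!r}")
    return m["entity"], int(m["metric"]), int(m["field"]), m["op"], m["ref"].strip()


def compare(value: str, op: str, ref: str) -> bool:
    """Apply one operator from the table to a sampled value and a reference value."""
    if len(ref) >= 2 and ref[0] == ref[-1] == "'":
        ref = ref[1:-1]  # strip the single quotes around string references
    if op in ("equal", "eq"):
        return value == ref
    if op in ("nequal", "ne"):
        return value != ref
    if op in ("regex", "re", "!regex", "!re"):
        matched = re.search(ref, value) is not None
        return matched != op.startswith("!")
    if op in ("range", "!range"):
        low, high = (float(x) for x in ref.split("-", 1))
        inside = low <= float(value) <= high
        return inside != op.startswith("!")
    if op in NUMERIC:
        try:
            return NUMERIC[op](float(value), float(ref))
        except ValueError:
            if op in ("==", "!="):  # assumption: compare as strings instead
                return NUMERIC[op](value, ref)
            raise
    raise ValueError(f"unknown operator: {op}")
```

For example, `parse_term("9104:5 > 80")` yields the metric id 9104, field 5, operator `>` and reference `80`, and `compare("90", ">", "80")` is true.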

4.2 Correlation Example 1

The following example demonstrates how to run an actuator when the occupancy of the tmp partition is greater than 80%, using the lemon-agent configuration file format. The configuration should be defined in /etc/lemon/agent/metrics/exception.conf. If the actuator fails to correct the problem 3 times consecutively, or would need to run more than 3 times within 900 seconds, an alarm will be raised.

30010	
	MetricName	exception.tmp_full
	MetricClass	alarm.exception
	Timing	300	30
	Parameters
		Correlation	((9104:1 eq '/tmp') && (9104:5 > 80))
		Actuator		/usr/local/sbin/clean-tmp-partition -o 75
		MaxRuns		3 900
		Timeout		300

For Quattor users the above configuration in pan language would look like:

"/system/monitoring/exception/_30010" = nlist(
        "name",         "tmp_full",
        "descr",        "tmp utilization exceeds limit",
        "active",       true,
        "latestonly",   false,
        "importance",   2,
        "correlation",  "((9104:1 eq '/tmp') && (9104:5 > 80))",
        "actuator",     nlist("execve",  "/usr/local/sbin/clean-tmp-partition -o 75",
                              "maxruns", 3,
                              "timeout", 300,
                              "window",  900,							
                              "active",  true)
);

Note: in the above pan configuration the timing or period of the exception sample is not defined. For exceptions the period field is not mandatory, as the ncm-fmonagent component will automatically set a sampling period equal to that of the highest sampling frequency of the metrics involved in the correlation, plus a 30 second offset.

4.3 Correlation Example 2

The following example demonstrates a more complex correlation that combines the information from several different metrics. In this case the exception (exception.lemon_agent_wrong) is used to check whether the agent and its sensors are behaving correctly with regard to the amount of resources they consume and how many error messages of different severities have occurred in the last 300 seconds.

30903
        MetricName      exception.lemon_agent_wrong
        MetricClass     alarm.exception
        Timing	300 0
        Parameters
                Correlation     10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)

So, the correlation reads as:

If (the uptime of the agent (10004:1) is greater than 600 seconds) AND any of the following holds:
(the cpu utilisation of the sensors (10004:7) over the last sampling period is greater than 10%) OR
(the memory consumed by the sensors (10004:8) is greater than 150 megabytes for machines of architecture type (4109:3) i386, or 600 megabytes for machines of architecture type x86_64) OR
(the number of warning messages (10007:2) recorded over the last sampling period is greater than 50) OR
(the number of error messages (10007:3) recorded over the last sampling period is greater than 10) OR
(the number of fatal messages (10007:4) recorded over the last sampling period is greater than 0)
then raise an alarm.
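The grouping of the conditions is easier to see when the correlation is transcribed into an ordinary boolean expression. The function below is such a transcription for illustration only; the parameter names are made up, and the thresholds are the raw numbers from the Correlation line.

```python
import re


def lemon_agent_wrong(uptime, cpu, memory, arch, warnings, errors, fatals):
    """Boolean transcription of the correlation string of exception 30903."""
    return uptime > 600 and (                                    # 10004:1
        cpu > 10                                                 # 10004:7
        or (memory > 150000 and arch == "i386")                  # 10004:8, 4109:3
        or (memory > 600000 and re.search("64", arch) is not None)
        or warnings > 50                                         # 10007:2
        or errors > 10                                           # 10007:3
        or fatals > 0                                            # 10007:4
    )
```

Note that the uptime condition guards everything else: an agent that has been up for less than 600 seconds never raises this alarm, whatever the other metrics report.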

4.4 Correlation Example 3

The following example demonstrates how to join two metrics together using a common field.

The involved metrics are system.networkInterfaceIO metric 9208 which has the following format:

Index Name Description Data Type Format Units Scale
1 InterfaceName N/A String %10s N/A N/A
2 NumKBReadTotal N/A Integer %ld N/A 1024
3 NumKBReadAvg N/A Float %.2f N/A 1024
4 NumKBWriteTotal N/A Integer %ld N/A 1024
5 NumKBWriteAvg N/A Float %.2f N/A 1024

and system.networkInterfaceInfo metric 9200 which has the following format:

Index Name Description Data Type Format Units Scale
1 InterfaceName N/A String %32s N/A N/A
2 IPAddress N/A String %32s N/A N/A
3 Mask N/A String %32s N/A N/A
4 Broadcast N/A String %32s N/A N/A
5 Gateway N/A String %32s N/A N/A
6 MAC N/A String %32s N/A N/A
7 MTU N/A Integer %ld N/A N/A
8 Duplex N/A Integer %ld N/A N/A
9 Speed N/A Integer %ld N/A 1024

To join the two metrics together on the InterfaceName you would do:

Correlation (9200:1 == 9208:1)
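Conceptually the condition pairs every row of metric 9200 with the row of metric 9208 that carries the same InterfaceName. The sketch below illustrates that pairing outside the sensor; the function name, the sample rows and the truncated field lists are all made up for the example, and the sensor's internal handling may differ.

```python
def join_on_field(left_rows, right_rows, left_pos, right_pos):
    """Pair rows whose fields at the given 1-based positions are equal."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_pos - 1], []).append(row)
    return [
        (left, right)
        for left in left_rows
        for right in index.get(left[left_pos - 1], [])
    ]


# Made-up samples; only the first few fields of each metric are shown.
info_9200 = [["eth0", "192.0.2.10"], ["lo", "127.0.0.1"]]
io_9208 = [["eth0", "1024", "12.50"], ["eth1", "0", "0.00"]]

pairs = join_on_field(info_9200, io_9208, 1, 1)
print(pairs)
```

Only eth0 appears in both samples, so it is the only interface for which further conditions on the joined fields could be evaluated.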

4.5 Correlation Example 4

The exception sensor also has the ability to generate alarms on behalf of other monitored entities. Note: To raise an alarm on behalf of another entity the lemon-agent's local cache must contain the metrics recorded on behalf of the other entity. In the example below we are using the remote sensor's remote.http metric class to check HTTP web servers remotely. The sensor records the HTTP response code (e.g. 200, 301, 401, 404) on behalf of a service name which is configured in an external configuration file. A generic exception is then used to raise an alarm when the response is neither 200 nor 301, for multiple services.

33000
        MetricName      exception.http_service_down
        MetricClass     alarm.exception
        Timing  300     30		
        Parameters
                Correlation     (*:9501:5 != 200) && (*:9501:5 != 301)

The key difference between this correlation and the previous examples is the inclusion of the <entity_name> option. In this case it is a wildcard, stating that if the value for any entity for which data exists for metric 9501 in the local cache is neither 200 nor 301, an alarm is raised for that entity.
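The effect of the wildcard can be pictured as applying the same two comparisons to each entity in turn. In this sketch the cache contents and entity names are invented, and `entities_in_alarm` is a hypothetical helper rather than part of the sensor.

```python
# Hypothetical local cache: entity name -> value of field 5 of metric 9501.
cache = {"web-a": 200, "web-b": 503, "web-c": 301}


def entities_in_alarm(responses):
    """Apply (*:9501:5 != 200) && (*:9501:5 != 301) to each entity in turn."""
    return sorted(
        name for name, code in responses.items() if code != 200 and code != 301
    )


print(entities_in_alarm(cache))
```

Each entity is evaluated on its own, so one failing service raises an alarm for that service only.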

4.6 Actuators

When actuators are executed by the exception sensor they are run inside a forked process (similar to the way in which the agent runs its sensors) to prevent a blocking actuator from causing the sensor to hang. Any information written to stdout or stderr by the actuator will be recorded in the lemon-agent's log file (/var/log/edg-fmon-agent.log). Due to the nature of the system call used inside the sensor when executing an actuator, shell-style conveniences such as redirection, directory globbing and pipes are not allowed. This is because the system call executes the command and passes each remaining word of the actuator string as an argument. Words such as *, && and |, which have special meaning in shells, are therefore passed as literal arguments and not evaluated. To use shell-style arguments you must invoke the shell yourself. For example,

Actuator	/bin/sh -c \\"/bin/echo 'This is a message from $HOSTNAME' \\"

The example above would simply invoke the shell and cause echo to print out the message, substituting $HOSTNAME with the name of the machine on which the sensor runs. The \\" set of characters are not shell conventions but are required so that the sensor handles the cmdline correctly.
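The difference between running a command directly and running it through a shell can be reproduced outside the sensor. The sketch below uses Python's `subprocess` module as a stand-in for the sensor's own process launching, which is an assumption for illustration; the child process just prints the arguments it received.

```python
import subprocess
import sys

# Child program that prints the arguments it was given, one list per run.
SHOW_ARGS = "import sys; print(sys.argv[1:])"


def show_args(words):
    """Run a child process without a shell and return what it received."""
    result = subprocess.run(
        [sys.executable, "-c", SHOW_ARGS, *words],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# No shell is involved, so these words arrive as literal arguments:
# nothing is globbed, chained or piped.
print(show_args(["*", "&&", "|"]))
```

This is why an actuator that needs globbing, pipes or redirection has to start /bin/sh -c itself.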

It is also possible to pass arguments to the actuator describing why it was triggered. For example, given the correlation ((9104:1 eq '/tmp') && (9104:5 > 80)), the actual value of the first condition can be accessed by passing the actuator $act_value_01, the second using $act_value_02, and so on. This allows you to create generic scripts which adapt or perform different actions depending on the cause of the alarm or the severity of the problem.

Actuator	/bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\"
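The placeholder expansion can be pictured as a simple textual substitution. This is a minimal sketch of the idea, assuming that placeholders are numbered from 01 in the order the conditions appear; the function name is hypothetical and unknown placeholders are left untouched here, which may differ from what the sensor does.

```python
import re


def substitute_act_values(cmdline: str, values):
    """Replace $act_value_NN with the sampled value of condition NN (1-based)."""
    def replace(match):
        index = int(match.group(1))
        if 1 <= index <= len(values):
            return str(values[index - 1])
        return match.group(0)  # leave placeholders without a value untouched

    return re.sub(r"\$act_value_(\d+)", replace, cmdline)


print(substitute_act_values("/bin/echo $act_value_01 $act_value_02", ["/tmp", "90"]))
```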

The following metric classes are exposed by this sensor:

5.1 alarm.exception

Description: generic exception class used to record the state of exceptions

The output of the metric uses the following format: (fields are separated by a single space ' ' character)

Index Name Description Data Type Format Units Scale
1 exceptstate N/A Integer %ld N/A N/A
2 code N/A String %3s N/A N/A
3 freetxt N/A String %256s N/A N/A
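A consumer of this metric can split a sample into its three fields as sketched below. The `ExceptionSample` type and `parse_sample` function are hypothetical; the split is limited to the first two spaces because the free text itself can contain a space, as in the transition example earlier in this document.

```python
from typing import NamedTuple


class ExceptionSample(NamedTuple):
    exceptstate: int
    code: str
    freetxt: str


def parse_sample(value: str) -> ExceptionSample:
    """Split 'exceptstate code freetxt' on the first two single spaces."""
    state, code, freetxt = value.split(" ", 2)
    return ExceptionSample(int(state), code, freetxt)


print(parse_sample("0 000 (null)"))
```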

This package is licensed under the EU Datagrid License outlined below:

EU DataGrid Software License

Copyright (c) 2001 EU DataGrid. All rights reserved.

This software includes voluntary contributions made to the EU DataGrid. For more information on the EU DataGrid, please see http://www.eu-datagrid.org/.

Installation, use, reproduction, display, modification and redistribution of this software, with or without modification, in source and binary forms, are permitted. Any exercise of rights under this license by you or your sub-licensees is subject to the following conditions:

1. Redistributions of this software, with or without modification, must reproduce the above copyright notice and the above license statement as well as this list of conditions, in the software, the user documentation and any other materials provided with the software.

2. The user documentation, if any, included with a redistribution, must include the following notice: "This product includes software developed by the EU DataGrid (http://www.eu-datagrid.org/)."

Alternatively, if that is where third-party acknowledgments normally appear, this acknowledgment must be reproduced in the software itself.

3. The names "EDG", "EDG Toolkit", and "EU DataGrid Project" may not be used to endorse or promote software, or products derived therefrom, except with prior written permission by hep-project-grid-edg-license@cern.ch.

4. You are under no obligation to provide anyone with any bug fixes, patches, upgrades or other modifications, enhancements or derivatives of the features,functionality or performance of this software that you may develop. However, if you publish or distribute your modifications, enhancements or derivative works without contemporaneously requiring users to enter into a separate written license agreement, then you are deemed to have granted participants in the EU DataGrid a worldwide, non-exclusive, royalty-free, perpetual license to install, use, reproduce, display, modify, redistribute and sub-license your modifications, enhancements or derivative works, whether in binary or source code form, under the license conditions stated in this list of conditions.

5. DISCLAIMER

THIS SOFTWARE IS PROVIDED BY THE EU DATAGRID AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF SATISFACTORY QUALITY, AND FITNESS FOR A PARTICULAR PURPOSE OR USE ARE DISCLAIMED. THE EU DATAGRID AND CONTRIBUTORS MAKE NO REPRESENTATION THAT THE SOFTWARE, MODIFICATIONS, ENHANCEMENTS OR DERIVATIVE WORKS THEREOF, WILL NOT INFRINGE ANY PATENT, COPYRIGHT, TRADE SECRET OR OTHER PROPRIETARY RIGHT.

6. LIMITATION OF LIABILITY

THE EU DATAGRID AND CONTRIBUTORS SHALL HAVE NO LIABILITY TO LICENSEE OR OTHER PERSONS FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, CONSEQUENTIAL, EXEMPLARY, OR PUNITIVE DAMAGES OF ANY CHARACTER INCLUDING, WITHOUT LIMITATION, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, LOSS OF USE, DATA OR PROFITS, OR BUSINESS INTERRUPTION, HOWEVER CAUSED AND ON ANY THEORY OF CONTRACT, WARRANTY, TORT (INCLUDING NEGLIGENCE), PRODUCT LIABILITY OR OTHERWISE, ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Should you have any questions or comments, please contact: Project Lemon