The status of each application is checked periodically, and an alarm is triggered when pre-configured conditions (rules) are satisfied.
The pinpoint-batch server runs the check every 3 minutes against the last 5 minutes of data. If the conditions are satisfied, it sends an SMS, email, or webhook message to the users listed in the user group.
Sending an email/SMS/webhook message every time a threshold is exceeded would flood users with alarm messages.
Therefore, we decided to gradually increase the interval between alarm transmissions.
ex) If an alarm occurs continuously, the transmission interval doubles each time: 3 min -> 6 min -> 12 min -> 24 min
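The doubling schedule in the example above can be sketched as follows. This is only an illustration, not Pinpoint's actual implementation: the class and method names are hypothetical, and it assumes that a check runs every 3 minutes and that a message is sent whenever the number of consecutive detections is a power of two.

```java
import java.util.ArrayList;
import java.util.List;

class AlarmBackoffSketch {
    // Assumed check period, taken from the "every 3 minutes" statement above.
    static final int CHECK_INTERVAL_MINUTES = 3;

    /** Sends on the 1st, 2nd, 4th, 8th, ... consecutive detection. */
    static boolean shouldSend(int consecutiveDetections) {
        return consecutiveDetections > 0 && Integer.bitCount(consecutiveDetections) == 1;
    }

    /** Minutes (from the first detection) at which a message would be sent. */
    static List<Integer> sendTimesInMinutes(int checks) {
        List<Integer> times = new ArrayList<>();
        for (int n = 1; n <= checks; n++) {
            if (shouldSend(n)) {
                times.add((n - 1) * CHECK_INTERVAL_MINUTES);
            }
        }
        return times;
    }

    public static void main(String[] args) {
        // Prints [0, 3, 9, 21, 45]: the gaps are 3, 6, 12 and 24 minutes.
        System.out.println(sendTimesInMinutes(16));
    }
}
```

With this scheme the gap between two messages doubles each time while the condition keeps being detected.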
NOTICE!
Up to v2.2.0, the batch ran in the background of the pinpoint-web server. From v2.2.1, it is handled by the pinpoint-batch server. Since the batch logic (code) in pinpoint-web will be deprecated in the future, we advise you to move batch execution to the pinpoint-batch server.
1. User Guide
Alarm Rules
SLOW COUNT
Triggered when the number of slow requests sent to the application exceeds the configured threshold.
SLOW RATE
Triggered when the percentage(%) of slow requests sent to the application exceeds the configured threshold.
ERROR COUNT
Triggered when the number of failed requests sent to the application exceeds the configured threshold.
ERROR RATE
Triggered when the percentage(%) of failed requests sent to the application exceeds the configured threshold.
TOTAL COUNT
Triggered when the number of all requests sent to the application exceeds the configured threshold.
SLOW COUNT TO CALLEE
Triggered when the number of slow requests sent by the application exceeds the configured threshold.
You must specify the domain or the address(ip, port) in the configuration UI's "Note..." box
ex) www.naver.com, 127.0.0.1:8080
SLOW RATE TO CALLEE
Triggered when the percentage(%) of slow requests sent by the application exceeds the configured threshold.
You must specify the domain or the address(ip, port) in the configuration UI's "Note..." box
ex) www.naver.com, 127.0.0.1:8080
ERROR COUNT TO CALLEE
Triggered when the number of failed requests sent by the application exceeds the configured threshold.
You must specify the domain or the address(ip, port) in the configuration UI's "Note..." box
ex) www.naver.com, 127.0.0.1:8080
ERROR RATE TO CALLEE
Triggered when the percentage(%) of failed requests sent by the application exceeds the configured threshold.
You must specify the domain or the address(ip, port) in the configuration UI's "Note..." box
ex) www.naver.com, 127.0.0.1:8080
TOTAL COUNT TO CALLEE
Triggered when the number of all requests sent by the application exceeds the configured threshold.
You must specify the domain or the address(ip, port) in the configuration UI's "Note..." box
ex) www.naver.com, 127.0.0.1:8080
HEAP USAGE RATE
Triggered when the application's heap usage(%) exceeds the configured threshold.
JVM CPU USAGE RATE
Triggered when the application's CPU usage(%) exceeds the configured threshold.
SYSTEM CPU USAGE RATE
Triggered when the CPU usage(%) of the system running the application exceeds the configured threshold.
DATASOURCE CONNECTION USAGE RATE
Triggered when the application's DataSource connection usage(%) exceeds the configured threshold.
FILE DESCRIPTOR COUNT
Sends an alarm when the number of open file descriptors exceeds the configured threshold.
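As a rough illustration of how the count and rate rules above differ, the sketch below compares a raw count and a percentage against a threshold. The names are hypothetical and this is not Pinpoint's checker code. It assumes that "exceeds" means strictly greater than, and that a window with no requests yields a rate of 0.

```java
class RateRuleSketch {
    /** Percentage of matching requests, or 0 when there was no traffic (an assumption). */
    static double ratePercent(long matching, long total) {
        if (total <= 0) {
            return 0.0;
        }
        return matching * 100.0 / total;
    }

    /** A rule fires only when the measured value is strictly above the threshold. */
    static boolean exceeds(double value, double threshold) {
        return value > threshold;
    }

    public static void main(String[] args) {
        long total = 200;
        long slow = 30;
        long failed = 4;

        // SLOW RATE with a threshold of 10 (%): 30 of 200 requests is 15%, so it fires.
        System.out.println("SLOW RATE triggered: " + exceeds(ratePercent(slow, total), 10));
        // ERROR COUNT with a threshold of 5: 4 failed requests does not exceed it.
        System.out.println("ERROR COUNT triggered: " + exceeds(failed, 5));
    }
}
```

A rate rule is usually the better fit for applications whose traffic varies a lot, while a count rule catches absolute spikes.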
2. Configuration & Implementation
Alarms generated by Pinpoint can be configured to be sent over email, SMS, and webhook.
Sending alarms over email is simple: you only need to configure the property file. Sending alarms over SMS requires some implementation; read on to find out how to do this. Sending alarms over webhook requires a webhook receiver service to receive the webhook messages. Pinpoint does not provide this service, so you should either implement it yourself or use the sample project.
A few modifications are required in pinpoint-batch and pinpoint-web to use the alarm feature. Add the implementations and settings to pinpoint-batch, and configure pinpoint-web so that users can set up alarms.
2.1 Configuration & Implementation in pinpoint-batch
2.1.1) Email configuration, sms and webhook implementation
A. Email alarm service
To use the mailing feature, you need to configure the SMTP server information and information to be included in the email in the batch-root.properties file.
pinpoint.url= #pinpoint-web server url
alarm.mail.server.url= #smtp server address
alarm.mail.server.port= #smtp server port
alarm.mail.server.username= #username for smtp server authentication
alarm.mail.server.password= #password for smtp server authentication
alarm.mail.sender.address= #sender's email address
ex)
pinpoint.url=http://pinpoint.com
alarm.mail.server.url=smtp.server.com
alarm.mail.server.port=587
alarm.mail.server.username=pinpoint
alarm.mail.server.password=pinpoint
alarm.mail.sender.address=pinpoint_operator@pinpoint.com
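Before starting the batch server, it can help to verify that every mail property listed above has a value. The following is a small sketch using `java.util.Properties`; the class name and the idea of treating all six keys as expected are assumptions made for illustration, not something Pinpoint ships.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class MailConfigCheck {
    // The keys shown in the batch-root.properties example above.
    static final List<String> EXPECTED_KEYS = List.of(
            "pinpoint.url",
            "alarm.mail.server.url",
            "alarm.mail.server.port",
            "alarm.mail.server.username",
            "alarm.mail.server.password",
            "alarm.mail.sender.address");

    /** Returns the expected keys that are absent or blank. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Inline sample; a real check would load batch-root.properties instead.
        props.load(new StringReader(
                "pinpoint.url=http://pinpoint.com\nalarm.mail.server.port=587\n"));
        System.out.println("Missing: " + missingKeys(props));
    }
}
```

Note that the Java properties format only treats `#` as a comment marker at the start of a line, so the trailing descriptions in the template above are placeholders to be replaced with real values, not comments that can be left in place.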
If you would like to implement your own mail sender, simply replace the SpringSmtpMailSender and JavaMailSenderImpl beans above with your own implementation of the com.navercorp.pinpoint.web.alarm.MailSender interface.
public interface MailSender {
void sendEmail(AlarmChecker checker, int sequenceCount);
}
B. Sms alarm service
To send alarms over SMS, you need to implement your own SMS sender by implementing the com.navercorp.pinpoint.batch.alarm.SmsSender interface. If there is no SmsSender implementation, alarms will not be sent over SMS.
public interface SmsSender {
public void sendSms(AlarmChecker checker, int sequenceCount);
}
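A custom SMS sender could look roughly like the sketch below. To keep it compilable on its own, it declares stand-in versions of `AlarmChecker` and `SmsSender`; in a real deployment you would implement Pinpoint's own interface instead. The `summary()` accessor, the `SmsGateway` abstraction, and the message format are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins so the sketch compiles on its own. In a real deployment, use Pinpoint's
// AlarmChecker and com.navercorp.pinpoint.batch.alarm.SmsSender instead.
interface AlarmChecker {
    String summary(); // hypothetical accessor, not part of Pinpoint's API
}

interface SmsSender {
    void sendSms(AlarmChecker checker, int sequenceCount);
}

// Hypothetical abstraction over whatever SMS provider is used.
interface SmsGateway {
    void send(String text);
}

class GatewaySmsSender implements SmsSender {
    private final SmsGateway gateway;

    GatewaySmsSender(SmsGateway gateway) {
        this.gateway = gateway;
    }

    @Override
    public void sendSms(AlarmChecker checker, int sequenceCount) {
        // sequenceCount tells the receiver how many times the alarm has occurred so far.
        gateway.send("[PINPOINT] " + checker.summary() + " (occurrence #" + sequenceCount + ")");
    }
}

class SmsSenderDemo {
    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        SmsSender sender = new GatewaySmsSender(sent::add); // collect instead of sending
        sender.sendSms(() -> "ERROR COUNT exceeded", 2);
        System.out.println(sent);
    }
}
```

Keeping the provider behind a small gateway interface makes the sender easy to test without sending real messages.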
C. Webhook alarm service
The webhook alarm service transmits Pinpoint's alarm messages through the Webhook API.
You should either implement the webhook receiver service that receives the webhook messages yourself, or use the provided sample project (in this case, Slack).
Because the webhook alarm service has been available since Pinpoint 2.1.1, you should add the 'webhook_send' column to the 'alarm_rule' table in Pinpoint's MySQL database if you upgraded from a release prior to Pinpoint 2.2.1.
The pinpoint-batch project is based on Spring Boot and can be executed with the following command. After the build, the executable file is placed under the target/deploy folder of pinpoint-batch.
2) Ways to improve alarm batch performance
The alarm batch was designed to run concurrently. If you have a lot of applications with alarms registered, you may increase the size of the executor's thread pool by modifying pool-size in the applicationContext-alarmJob.xml file.
Note that increasing this value will result in higher resource usage.
If there are a lot of alarms registered to applications, you may set the alarmStep registered in applicationContext-alarmJob.xml file to run concurrently.
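To illustrate the trade-off behind the pool size, the sketch below runs per-application checks on a fixed thread pool. It is a plain `ExecutorService` example and not Pinpoint's actual batch job configuration; the `check` method and the application ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentAlarmCheckSketch {
    // Hypothetical per-application check; returns true when a condition is detected.
    static boolean check(String applicationId) {
        return applicationId.endsWith("-slow");
    }

    /** Checks all applications using at most poolSize worker threads. */
    static List<String> detect(List<String> applicationIds, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String id : applicationIds) {
                Callable<Boolean> task = () -> check(id);
                results.add(pool.submit(task));
            }
            List<String> detected = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                if (results.get(i).get()) {
                    detected.add(applicationIds.get(i));
                }
            }
            return detected;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while checking alarms", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("alarm check failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(detect(List.of("shop", "pay-slow", "search"), 4));
    }
}
```

A larger pool finishes a long list of checks sooner, but each extra thread consumes memory and adds load on the data store being queried, which is why the value should be raised with care.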
3) Use quickstart's web
Pinpoint Web uses MySQL to persist users, user groups, and alarm configurations.
However, Quickstart uses MockDAO to reduce memory usage.
Therefore, if you want to use MySQL for Quickstart, please refer to Pinpoint Web's applicationContext-dao-config.xml and jdbc.properties.
3.2 Details on Webhook
3.2.1) webhook receiver sample project
Slack-Receiver is an example project of a webhook receiver. The project receives alarms from the Pinpoint webhook and sends the messages to Slack. For more details, see the project repository.
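If you prefer to write your own receiver rather than use the sample project, a minimal one can be sketched with the JDK's built-in HTTP server. The sketch assumes the alarm arrives as an HTTP POST with a text (for example JSON) body; the `/alarm` path, the class name, and the status-code choices are hypothetical and not dictated by Pinpoint.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class WebhookReceiverSketch {
    /** Decides the HTTP status for a request and forwards accepted payloads. */
    static int handle(String method, String body, Consumer<String> onPayload) {
        if (!"POST".equals(method)) {
            return 405; // only POST is accepted in this sketch
        }
        if (body == null || body.isBlank()) {
            return 400;
        }
        onPayload.accept(body);
        return 200;
    }

    /** Starts a receiver on the given port (0 picks a free port). */
    static HttpServer start(int port, Consumer<String> onPayload) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/alarm", exchange -> {
            try {
                String body = new String(
                        exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                int status = handle(exchange.getRequestMethod(), body, onPayload);
                exchange.sendResponseHeaders(status, -1); // -1: no response body
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0, payload -> System.out.println("alarm received: " + payload));
        try {
            // Simulate a webhook call against the receiver that was just started.
            int port = server.getAddress().getPort();
            HttpRequest request = HttpRequest
                    .newBuilder(URI.create("http://localhost:" + port + "/alarm"))
                    .POST(HttpRequest.BodyPublishers.ofString("{\"sequenceCount\":1}"))
                    .build();
            HttpResponse<Void> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.discarding());
            System.out.println("status: " + response.statusCode());
        } finally {
            server.stop(0);
        }
    }
}
```

A production receiver would parse the payload with a JSON library and forward it to the target channel, as the Slack sample does.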
3.2.2) The Specification of webhook payloads and the examples
| Name | Type | Description | Nullable |
| --- | --- | --- | --- |
| | | The threshold of value detected by checker during a set time | X |
| notes | String | The notes in the alarm setting page | O |
| sequenceCount | Integer | The number of alarm occurrences | X |
| userGroupId | String | The user group id in the user group page | X |
| userGroupMembers | UserMember[] | Members info of a specific user group | X |
| name | String | The name of checker in the alarm setting page | X |
| type | String | The type of checker abstracted by value detected by checker. "LongValueAlarmChecker" type is the abstracted checker type of "Slow Count", "Slow Rate", "Error Count", "Error Rate", "Total Count", "Slow Count To Callee", "Slow Rate To Callee", "Error Count To Callee", "Error Rate To Callee", "Total Count to Callee". "LongValueAgentChecker" type is the abstracted checker type of "Heap Usage Rate", "Jvm Cpu Usage Rate", "System Cpu Usage Rate", "File Descriptor Count". "BooleanValueAgentChecker" type is the abstracted checker type of "Deadlock or not". "DataSourceAlarmListValueAgentChecker" type is the abstracted checker type of "DataSource Connection Usage Rate". | X |
| detectedValue | Integer or DetectedAgent[] | The value detected by checker. If "type" is "LongValueAlarmChecker", "detectedValue" is Integer type. If "type" is not "LongValueAlarmChecker", "detectedValue" is DetectedAgent[] type. | X |

UserMember

| Name | Type | Description | Nullable |
| --- | --- | --- | --- |
| id | String | Member id | X |
| name | String | Member name | X |
| email | String | Member email | O |
| department | String | Member department | O |
| phoneNumber | String | Member phone number | O |
| phoneCountryCode | String | Member phone country code | O |

DetectedAgent

| Name | Type | Description | Nullable |
| --- | --- | --- | --- |
| agentId | String | Agent id detected by checker | X |
| agentValue | Integer or Boolean or DataSourceAlarm[] | The value of Agent detected by checker. If "type" is "LongValueAgentChecker", "agentValue" is Integer type. If "type" is "BooleanValueAgentChecker", "agentValue" is Boolean type. If "type" is "DataSourceAlarmListValueAgentChecker", "agentValue" is DataSourceAlarm[] type. | X |
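Because the shape of "detectedValue" depends on "type", a receiver has to branch on the checker type before reading the value. The sketch below shows one way to do that. It assumes the payload has already been parsed by a JSON library into plain Java numbers, lists, and maps; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

class DetectedValueSketch {
    /** Renders detectedValue as text, following the type rules described above. */
    static String describe(String type, Object detectedValue) {
        if ("LongValueAlarmChecker".equals(type)) {
            // Application-level checkers report a single number.
            return "application value " + ((Number) detectedValue).longValue();
        }
        // Every other type reports one DetectedAgent entry per agent.
        StringJoiner joined = new StringJoiner(", ");
        for (Object entry : (List<?>) detectedValue) {
            Map<?, ?> agent = (Map<?, ?>) entry;
            joined.add(agent.get("agentId") + "=" + agent.get("agentValue"));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("LongValueAlarmChecker", 120));
        System.out.println(describe("LongValueAgentChecker",
                List.of(Map.of("agentId", "agent-1", "agentValue", 85),
                        Map.of("agentId", "agent-2", "agentValue", 91))));
    }
}
```

The same branching applies one level down: "agentValue" is a number, a boolean, or a list of DataSourceAlarm entries depending on the checker type.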