To help with your backbone alert settings, here is an overview of what is available, along with some best practices and reasons to use each alert type.
First, how does alerting work? When a test finishes running, the alert thresholds are checked. If the data returned exceeds one of those thresholds, that node is put into an alert state for that test and alert type. Whenever a node's alert state changes, the node threshold is evaluated against the alert states of the other nodes running the test. If the node threshold has been exceeded, the alert is triggered. The same kind of state change is used to send the Improved alert once the node returns data that no longer exceeds the threshold.
Each alert also contains a ‘node threshold’ setting. This lets you indicate how many nodes must be in an alert state before you are notified of the change in performance. For all node thresholds, I recommend using a percentage of nodes instead of a flat number. That way, future test changes can never leave the node threshold higher than the number of nodes running the test (which would mean you are never alerted). If you are unsure of a level to begin at, I suggest 50%.
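As a rough illustration of why a percentage is safer than a flat number, here is a minimal Python sketch of the node threshold check. The function names and the "meets or exceeds" comparison are my assumptions for illustration, not the product's actual implementation.

```python
import math

def nodes_required(total_nodes: int, percent: float) -> int:
    """Smallest number of alerting nodes that satisfies a percentage threshold.

    Assumption: the threshold counts as reached when the share of alerting
    nodes meets or exceeds the configured percentage.
    """
    return max(1, math.ceil(total_nodes * percent / 100))

def percent_threshold_reached(node_in_alert: dict, percent: float) -> bool:
    """node_in_alert maps a (hypothetical) node name to True if it is alerting."""
    alerting = sum(1 for state in node_in_alert.values() if state)
    return alerting >= nodes_required(len(node_in_alert), percent)

def flat_threshold_reached(node_in_alert: dict, flat_count: int) -> bool:
    """Same check with a flat node count instead of a percentage."""
    alerting = sum(1 for state in node_in_alert.values() if state)
    return alerting >= flat_count

# A test that was reduced from 10 nodes to 4, with every node alerting.
states = {"node-a": True, "node-b": True, "node-c": True, "node-d": True}
print(percent_threshold_reached(states, 50))  # 50% still scales with the node count
print(flat_threshold_reached(states, 5))      # a flat 5 can never be reached with 4 nodes
```

With the percentage setting the requirement shrinks along with the test, while the flat value of 5 silently becomes unreachable.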
For managing performance, we offer 4 different types of alerts. These are Response Time, Transaction Failure, Object Failure and Byte Limit Failure.
Response Time Alerts inform you of higher than desired response times. They look at the total response time for a successful test run and compare it to your set thresholds. Here you have 2 options for thresholds: Dynamic and Static. Dynamic calculates the status for a given test at a given node by comparing the current average response time to the average response time experienced over a longer period. This difference can be evaluated as an absolute number of seconds or as a relative percentage. Dynamic thresholds can be very useful if the nodes you run have drastically different response times, or if you only want to be alerted on larger, consistent events. Static thresholds ask for a flat value (in seconds) for Warning and Severe. This is what you would want to select if you have strict SLAs to maintain or simply want to be notified when the response time exceeds a certain amount of time. With either option, a response time is classed as Warning or Severe based on the configured values.
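To make the difference between the two options concrete, here is a small Python sketch. It is only my reading of the description above: the function names, the strict "greater than" comparison, and the sample numbers are assumptions, and the real calculation of the long-term average is not shown.

```python
def static_status(response_s: float, warning_s: float, severe_s: float) -> str:
    """Compare a response time (seconds) to flat Warning and Severe values."""
    if response_s > severe_s:
        return "severe"
    if response_s > warning_s:
        return "warning"
    return "ok"

def dynamic_status(current_avg_s: float, baseline_avg_s: float,
                   warning: float, severe: float, relative: bool = False) -> str:
    """Compare the distance from a node's own long-term average.

    With relative=False the thresholds are seconds above the baseline;
    with relative=True they are a percentage above the baseline.
    """
    distance = current_avg_s - baseline_avg_s
    if relative:
        if baseline_avg_s <= 0:
            raise ValueError("baseline average must be positive")
        distance = distance / baseline_avg_s * 100
    if distance > severe:
        return "severe"
    if distance > warning:
        return "warning"
    return "ok"

# A node that is always slow (8.0 s on average) and is now at 8.5 s.
print(static_status(8.5, warning_s=5.0, severe_s=10.0))        # flat limit flags it
print(dynamic_status(8.5, 8.0, warning=25, severe=40, relative=True))  # within its own norm
```

The slow node would sit in Warning permanently under the static values, while the dynamic comparison only reacts when it drifts away from its own history.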
Before configuring thresholds, it is a best practice to let the test run for at least a week or two (three to four is even better) to establish a decent trend line. It can also be useful to first configure the alerts without destinations. That way you can let them trigger to the log for a while and then review whether the levels are the ones that would be of most use for your needs. Many people use standard deviations or percentiles (from that trend line) to decide their settings.
When reviewing the response time data, make sure to chart the response time by node. If there is a large gap in performance between nodes, I recommend dynamic thresholds. Regardless of what you select, start with higher limits to avoid over-alerting.
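If you export the response times from that trend period, deriving starting values from standard deviation or percentiles can be sketched in Python as below. The multipliers (2 and 3 standard deviations) and the percentiles (95th and 99th) are my assumptions for illustration; pick whatever fits your own data.

```python
import statistics

def thresholds_from_stdev(history_s, warning_k=2.0, severe_k=3.0):
    """Warning/Severe as mean plus k standard deviations (k values are assumptions)."""
    mean = statistics.mean(history_s)
    stdev = statistics.stdev(history_s)
    return mean + warning_k * stdev, mean + severe_k * stdev

def thresholds_from_percentiles(history_s, warning_pct=95, severe_pct=99):
    """Warning/Severe taken from percentiles of the observed response times."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cuts = statistics.quantiles(history_s, n=100, method="inclusive")
    return cuts[warning_pct - 1], cuts[severe_pct - 1]

# Hypothetical response times in seconds from the trend period.
history = [2.0, 2.2, 1.9, 2.1, 2.0, 2.3, 1.8, 2.1, 2.0, 2.6]
print(thresholds_from_stdev(history))
print(thresholds_from_percentiles(history))
```

Running this per node rather than across all nodes keeps one slow location from inflating the values for everyone else.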
Transaction Failure Alerts inform you of test failures. They look at tests that were unable to complete and resulted in a fatal error. For this alert type, only the node threshold is configured. The sensitivity of the application and how frequently the test runs are typically considered when deciding how quickly you want to be notified of failing tests (some clients want to be notified of every failure, while others not until each node has failed twice).
Object Failure Alerts inform you of issues with non-essential objects on the page. They look at all the objects within a test. Here you can configure an alert to be sent if a certain number or percentage of objects on the page encounter errors, even though the page itself is able to load. Object failure thresholds will depend on your existing page and how ‘clean’ you want it to remain. I have some clients who want to know if any single object fails so they can diagnose and fix it, while many others have regularly occurring object failures. I typically recommend starting this configuration at 50%. That way, if half the objects on your page fail, you will be notified.
Byte Limit Failure Alerts notify you of changes to page size. They look only at the tested page size and let you set an upper and a lower page weight value (in bytes). Byte Limit Failure Alerts are not frequently used; however, they can be very useful. Do you want to know if there is a change in page weight? Maybe something isn’t loading or is loading twice. If the page has a static weight, consider adding a few thousand bytes for the upper limit and subtracting a few thousand for the lower. If you are unsure where to begin, I recommend looking at your current page weight (by charting bytes or looking in a waterfall chart). For the upper limit, double it, and for the lower limit, cut it in half.
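The two starting points above work out as follows in a short Python sketch. The function names are hypothetical, and the 3,000-byte margin is just my stand-in for "a few thousand bytes".

```python
def byte_limits(current_bytes: int, static_page: bool = False,
                margin_bytes: int = 3000) -> tuple:
    """Return (lower, upper) page weight limits in bytes.

    static_page=True: current weight plus/minus a small margin (assumed 3000).
    static_page=False: half and double the current weight.
    """
    if static_page:
        return max(0, current_bytes - margin_bytes), current_bytes + margin_bytes
    return current_bytes // 2, current_bytes * 2

def byte_limit_failed(page_bytes: int, lower: int, upper: int) -> bool:
    """True when the tested page weight falls outside the configured limits."""
    return page_bytes < lower or page_bytes > upper

lower, upper = byte_limits(1_200_000)
print(lower, upper)                                # 600000 2400000
print(byte_limit_failed(2_500_000, lower, upper))  # True: something may be loading twice
```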
Each alert type also has the option to alert on consecutive ‘failures’. You can require each node to exceed the threshold from 1 to 5 times in a row before the node is put into an alert state. This can be useful if you do not want to be alerted on a single anomaly in the data, but only when a continuous issue is occurring. It can also be extremely useful if circumstances require that you run tests on fewer nodes.
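Here is a minimal Python sketch of how a consecutive setting filters out a single bad run. The function name and the list-of-results representation are assumptions made for illustration only.

```python
def node_in_alert(exceeded_runs: list, consecutive: int) -> bool:
    """Decide whether a node is in an alert state.

    exceeded_runs holds one boolean per test run on this node, oldest first,
    where True means that run exceeded the threshold.
    """
    if not 1 <= consecutive <= 5:
        raise ValueError("consecutive must be between 1 and 5")
    if len(exceeded_runs) < consecutive:
        return False
    # Only the most recent `consecutive` runs count, and all must have exceeded.
    return all(exceeded_runs[-consecutive:])

print(node_in_alert([False, True, False], 2))  # False: a single anomaly is ignored
print(node_in_alert([False, True, True], 2))   # True: two runs in a row exceeded
```

Keep in mind that a higher consecutive value delays the alert by that many test intervals on each node.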
Once thresholds are configured, it is time to indicate who should be sent the alert. We handle this through ‘destinations’. A destination includes the email format and the list of recipients. Select the level from the drop-down and then the destination. More destinations can be added than are shown. You can also change any ‘email subject’ in the destination configuration to something unique at this time. Destinations can also be configured to send ‘Reminders’ during an incident at a set interval between 5 minutes and 12 hours. Make sure the interval selected in the drop-down is at least twice your test frequency AND select the ‘reminder’ box next to the destinations you want to receive them.
While everything above applies at the total test level, we have also introduced Step level alerting for each of those alert types. This is useful when, for example, 3 different teams each control one step in a 3 page script. Or perhaps there is a strict SLA for response time on just step 3. Maybe you are only interested in the landing page after a login. Step level alerting allows you to be notified at that more granular level. Simply click the link for ‘Configure Step Level Alerts...’ found to the right of any alert setting to configure these.
I hope you found this helpful.