Bug #2136

freshness checks are generating stale alerts, even if result was received in time

Added by dnsmichi over 2 years ago. Updated over 2 years ago.

Status:ResolvedStart date:12/01/2011
Priority:UrgentDue date:
Assignee:dnsmichi% Done:

100%

Category:Passive Checks
Target version:Icinga 1.x - 1.6.1
Icinga Version: OS Version:

Description

the recent patch #2027 on the freshness checking on startup is heavily affecting the normal scheduled run under weird circumstances.

[1322755814] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.636009
[1322755815] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756114] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635003
[1322756115] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756414] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;localhost;fuckup;0;connected|state=1 time=8.635272
[1322756415] PASSIVE SERVICE CHECK: localhost;fuckup;0;connected
[1322756475] Warning: The results of service 'fuckup' on host 'localhost' are stale by 0d 0h 1m 0s (threshold=0d 0h 20m 0s).  I'm forcing an immediate check of the service.
[1322756478] SERVICE ALERT: localhost;fuckup;CRITICAL;HARD;1;CRITICAL: whohooo failure
[1322756478] SERVICE NOTIFICATION: demo;localhost;fuckup;CRITICAL;service-notify-by-email;CRITICAL: whohooo failure

as remarked in #2027 description, it could most likely fail.

a tested reverted fix in OMD 0.52 nightly is working, so this actually needs to be pushed an released into 1.6.1 then.

0001-remove-wrong-expiration-calculation-on-freshness-che.patch Magnifier (3.32 KB) dnsmichi, 12/02/2011 09:58 am


Related issues

Related to Core - Feature #1978: read last_program_stop from retention.dat and use that fo... Closed 10/06/2011

Associated revisions

Revision 2208b714
Added by dnsmichi over 2 years ago

remove wrong expiration calculation on freshness checks #2136

this is a revert, and needs a proper fix, but will do for now.

refs #2136

History

#1 Updated by dnsmichi over 2 years ago

  • Target version set to 1.6.1

#2 Updated by dnsmichi over 2 years ago

seeking out into the snippet, it might be that the 60 seconds delay on the alert has something to do with the hardcoded 60 seconds in the addition (missing brackets, different and operator?).

-       if (temp_service->check_type == SERVICE_CHECK_PASSIVE) {
-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }
-       }

requires further investigation though. the patch for removal will be in r1.6

#5 Updated by dnsmichi over 2 years ago

  • Subject changed from freshness checks are generating stale alerts after 60 seconds, even if result was received to freshness checks are generating stale alerts, even if result was received
  • Priority changed from High to Urgent

#6 Updated by dnsmichi over 2 years ago

  • Subject changed from freshness checks are generating stale alerts, even if result was received to freshness checks are generating stale alerts, even if result was received in time

further debug logs (cleaned private stuff)

[1322813318.089879] [016.2] [pid=17065] Checking freshness of service 'foo' on host 'bar'...
[1322813318.089883] [016.2] [pid=17065] Freshness thresholds: service=300, use=300
[1322813318.089887] [016.2] [pid=17065] HBC: 1, PS: 1322812951, ES: 1322812958, LC: 1322813213, CT: 1322813318, ET: 1322813258
[1322813318.089966] [016.1] [pid=17065] Check results for service 'foo' on host 'bar' are stale by 0d 0h 1m 0s (threshold=0d 0h 5m 0s).  Forcing an immediate check of the service...
[1322813318.090017] [016.0] [pid=17065] Scheduling a forced, active check of service 'foo' on host 'bar' @ Fri Dec  2 09:08:38 2011
[1322813318.090059] [016.2] [pid=17065] Scheduling new service check event.

has_been_checked: 1
program_start: 1322812951
event_start: 1322812958
last_check: 1322813213
current_time: 1322813318
expiration_time: 1322813258
Freshness thresholds: service=300, use=300

from the new calculation algorithm

-               if (event_start < program_start + 60 && event_start - last_program_stop < (freshness_threshold * 0.618)) {
-                       expiration_time = event_start + freshness_threshold;
-               }

if (1322812958 < 1322812951 + 60 && 1322812958 - x < 300 * 0.618)
expiration_time = 1322812958 + 300

if (expiration_time < current_time)

if (1322813*258* < 1322813*318*)
this is true

analze the above

if (1322812958 < 1322813011 && 1322812958-x < 185,4)
=>
1322812958-x < 185,4
=>
1322812958-185,4 < x
=>
x > 1322812772,6
=>
x > Fri, 02 Dec 2011 07:59:32 +0000

the overall conclusion is that the freshness threshold with the retention.dat creation date calculation between the event_time and a 61,8 percent freshness threshold will cause an error if

event_start is within

  • program_start+60
  • last_program_stop happened below 300s*61,8%, so not so long ago

actually the problem is not the last_program_stop, but the condition on the program_start and the 60 seconds. this value is just an assumption and therefore not correct for further usage.

#7 Updated by ABauer over 2 years ago

Hi michi,

thanks for figuring this out, in detail!

I ran into the same problem today, after updating to 1.6.

I applied your patch, now everything is fine and running smoothly.

#8 Updated by dnsmichi over 2 years ago

thanks for the feedback!

1.6.1 has been put onto sourceforge, and packagers got a seperate mail for that.

#9 Updated by dnsmichi over 2 years ago

  • Status changed from Assigned to Resolved
  • % Done changed from 0 to 100

Also available in: Atom PDF