Feature #2182

avoid insane looping through event list when rescheduling checks

Added by dnsmichi over 2 years ago. Updated about 2 years ago.

Status:ResolvedStart date:12/12/2011
Priority:NormalDue date:
Assignee:dnsmichi% Done:

100%

Category:Check Scheduling
Target version:Icinga 1.x - 1.7

Description

needs proper applying, testing, stresstesting.

Revision: 1850
          http://nagios.svn.sourceforge.net/nagios/?rev=1850&view=rev
Author:   ageric
Date:     2011-12-12 14:27:16 +0000 (Mon, 12 Dec 2011)
Log Message:
-----------
core: Avoid insane looping when rescheduling checks

Previously we used to iterate over the entire event queue to
see if the object we were about to reschedule a check for was
already scheduled or not. Since checks are normally rescheduled
just after they've been run, that means we never hit an event
inside the list and so we walk all of it without ever finding
it, causing O(n) worst-case complexity and a probability of
hitting that worst-case complexity of very nearly 100%.

With this patch, we reuse the currently unused nexthash field
of hosts and service objects to stash the next scheduled check
event, providing constant-time access to it and avoiding the
insane looping, which should reduce Nagios' cpu consumption
substantially, since almost 80% of Nagios' user cpu time is
spent on this loop.

Relevant gprof output before patch:
% time  cumulat   self   calls  ms/call ms/call name
 17.24   0.05     0.05   3490   0.01     0.01   schedule_service_check
 17.24   0.10     0.05    1    50.00   261.97   event_execution_loop

Relevant gprof output after patch:
% time  cumulat   self   calls  ms/call ms/call name
 12.20   0.05     0.05       1  50.00  352.26   event_execution_loop
  9.76   0.09     0.04   10565   0.00    0.00   clear_host_macros_r
....
  0.00   0.41     0.00    3594   0.00    0.01   schedule_service_check

>From 17.24% of the time used to less than 0.01 is quite good,
so we stick with that and let it be enough. Testing is ofcourse
required to make sure checks aren't being run extremely frequently,
but that shouldn't be the case since the new check is assigned to
the scheduled object in case we replace it with a forced one.

Reported-by: Mathias Kettner <mk@mathias-kettner.de>
Signed-off-by: Andreas Ericsson <ae@op5.se>

Modified Paths:
--------------
    nagioscore/trunk/base/checks.c
    nagioscore/trunk/include/objects.h

Modified: nagioscore/trunk/base/checks.c
===================================================================
--- nagioscore/trunk/base/checks.c    2011-12-12 14:26:56 UTC (rev 1849)
+++ nagioscore/trunk/base/checks.c    2011-12-12 14:27:16 UTC (rev 1850)
@@ -249,6 +249,12 @@
     log_debug_info(DEBUGL_FUNCTIONS, 0, "run_scheduled_service_check() start\n");
     log_debug_info(DEBUGL_CHECKS, 0, "Attempting to run scheduled check of service '%s' on host '%s': check options=%d, latency=%lf\n", svc->description, svc->host_name, check_options, latency);

+    /*
+     * reset the next_check_event so we know it's
+     * no longer in the scheduling queue
+     */
+    svc->next_check_event = NULL;
+
     /* attempt to run the check */
     result = run_async_service_check(svc, check_options, latency, TRUE, TRUE, &time_is_valid, &preferred_time);

@@ -1656,7 +1662,6 @@
 void schedule_service_check(service *svc, time_t check_time, int options) {
     timed_event *temp_event = NULL;
     timed_event *new_event = NULL;
-    int found = FALSE;
     int use_original_event = TRUE;

     log_debug_info(DEBUGL_FUNCTIONS, 0, "schedule_service_check()\n");
@@ -1683,25 +1688,15 @@

     /* default is to use the new event */
     use_original_event = FALSE;
-    found = FALSE;

-#ifdef PERFORMANCE_INCREASE_BUT_VERY_BAD_IDEA_INDEED
-    /* WARNING! 1/19/07 on-demand async service checks will end up causing mutliple scheduled checks of a service to appear in the queue if the code below is skipped */
-    /* if(use_large_installation_tweaks==FALSE)... skip code below */
-#endif
+    temp_event = (timed_event *)svc->next_check_event;

-    /* see if there are any other scheduled checks of this service in the queue */
-    for(temp_event = event_list_low; temp_event != NULL; temp_event = temp_event->next) {
+    /*
+     * If the service already has a check scheduled,
+     * we need to decide which of the events to use
+     */
+    if(temp_event != NULL) {

-        if(temp_event->event_type == EVENT_SERVICE_CHECK && svc == (service *)temp_event->event_data) {
-            found = TRUE;
-            break;
-            }
-        }
-
-    /* we found another service check event for this service in the queue - what should we do? */
-    if(found == TRUE && temp_event != NULL) {
-
         log_debug_info(DEBUGL_CHECKS, 2, "Found another service check event for this service @ %s", ctime(&temp_event->run_time));

         /* use the originally scheduled check unless we decide otherwise */
@@ -1747,6 +1742,7 @@
         else {
             remove_event(temp_event, &event_list_low, &event_list_low_tail);
             my_free(temp_event);
+            svc->next_check_event = new_event;
             }
         }

@@ -2163,7 +2159,6 @@
 void schedule_host_check(host *hst, time_t check_time, int options) {
     timed_event *temp_event = NULL;
     timed_event *new_event = NULL;
-    int found = FALSE;
     int use_original_event = TRUE;

@@ -2190,24 +2185,15 @@

     /* default is to use the new event */
     use_original_event = FALSE;
-    found = FALSE;

-#ifdef PERFORMANCE_INCREASE_BUT_VERY_BAD_IDEA_INDEED
-    /* WARNING! 1/19/07 on-demand async host checks will end up causing mutliple scheduled checks of a host to appear in the queue if the code below is skipped */
-    /* if(use_large_installation_tweaks==FALSE)... skip code below */
-#endif
+    temp_event = (timed_event *)hst->next_check_event;

-    /* see if there are any other scheduled checks of this host in the queue */
-    for(temp_event = event_list_low; temp_event != NULL; temp_event = temp_event->next) {
-        if(temp_event->event_type == EVENT_HOST_CHECK && hst == (host *)temp_event->event_data) {
-            found = TRUE;
-            break;
-            }
-        }
+    /*
+     * If the host already had a check scheduled we need
+     * to decide which check event to use
+     */
+    if(temp_event != NULL) {

-    /* we found another host check event for this host in the queue - what should we do? */
-    if(found == TRUE && temp_event != NULL) {
-
         log_debug_info(DEBUGL_CHECKS, 2, "Found another host check event for this host @ %s", ctime(&temp_event->run_time));

         /* use the originally scheduled check unless we decide otherwise */
@@ -2253,6 +2239,7 @@
         else {
             remove_event(temp_event, &event_list_low, &event_list_low_tail);
             my_free(temp_event);
+            hst->next_check_event = new_event;
             }
         }

@@ -2881,6 +2868,12 @@

     log_debug_info(DEBUGL_CHECKS, 0, "Attempting to run scheduled check of host '%s': check options=%d, latency=%lf\n", hst->name, check_options, latency);

+    /*
+     * reset the next_check_event so we know this host
+     * check is no longer in the scheduling queue
+     */
+    hst->next_check_event = NULL;
+
     /* attempt to run the check */
     result = run_async_host_check_3x(hst, check_options, latency, TRUE, TRUE, &time_is_valid, &preferred_time);

Modified: nagioscore/trunk/include/objects.h
===================================================================
--- nagioscore/trunk/include/objects.h    2011-12-12 14:26:56 UTC (rev 1849)
+++ nagioscore/trunk/include/objects.h    2011-12-12 14:27:16 UTC (rev 1850)
@@ -383,7 +383,7 @@
     objectlist *hostgroups_ptr;
 #endif
     struct  host_struct *next;
-    struct  host_struct *nexthash;
+    void *next_check_event;
     };

@@ -514,7 +514,7 @@
     objectlist *servicegroups_ptr;
 #endif
     struct service_struct *next;
-    struct service_struct *nexthash;
+    void *next_check_event;
     };

reschedule_checks_oldnewdesign.jpg (939 KB) dnsmichi, 01/27/2012 05:37 pm

reschedule_checks_newdesign.jpg (920 KB) dnsmichi, 01/27/2012 05:37 pm

icinga_without_2183_patch.txt Magnifier (159 KB) dnsmichi, 01/27/2012 07:46 pm

icinga_with_2183_patch.txt Magnifier (170 KB) dnsmichi, 01/27/2012 07:46 pm


Related issues

Related to Core - Feature #2289: add new runtime flag, i.e. --debug in order not to drop p... Rejected 01/27/2012
Related to Core - Bug #2993: duplicate events when scheduling forced host|service check Resolved 08/18/2012
Related to Core - Bug #2676: unify check scheduling replacement logic Resolved 06/11/2012

Associated revisions

Revision 809e6bb3
Added by dnsmichi about 2 years ago

  • core: avoid insane looping through event list when rescheduling checks (Mathias Kettner, Andreas Ericsson) #2182

please refer to https://dev.icinga.org/issues/2182
for a detailed analysis of the problem, the used
algorithms and the revamped logic.

kudos to Mathias and Andreas.

refs #2182

Revision ec9c5e32
Added by dnsmichi over 1 year ago

core: avoid duplicate events when scheduling forced host|service check (Imri Zvik) #2993

previously, we had introduced a hash-like implementation of
host|service->next_check_event directly pointing to the next
scheduled event. this algorithm is being used within
schedule_host|service_check, determing wether to use the
already assigned event, or scheduling a new event. Since we
did not populate the event_data (host or service object) when
adding a new event to the scheduler, this became insame, always
rescheduling a new event, but not discarding the old one.

This has been partly fixed in #2676 with refactoring that detection
and saving the next_check_event accordingly. But on already scheduled
events which were forced (overridden), another bug was unveiled.

Now that we add the reverse pointer from the host|service event_data
back to the newly created event when forcing a check, we will make sure
that those events are checked correctly, and executed/discarded in the
first place, and not always creating a new event, seperated from the rest.

basically, using the previous implementation (with and without the fix
from #2676) we've created an event bomb under various circumstances,
especially when future events would have been overridden (forced checks).
as events usually result in checks, which can result into perfdata, this
could possibly the root cause for #2924 as well, as other users reported
on the portal as well.

http://www.monitoring-portal.org/wbb/index.php?page=Thread&threadID=26352

With the kind patch provided by Imri Zvik, this now works like expected.
Adapted the debug output a bit myself, so with debug_level=272 it will be
easy to see what happens on events scheduling - and not the non-telling
mess before.

kudos to Imri Zvik for the patch.

refs #2993
refs #2676
refs #2182
refs #2924

Revision f32fbf84
Added by dnsmichi over 1 year ago

core: avoid duplicate events when scheduling forced host|service check (Imri Zvik) #2993

previously, we had introduced a hash-like implementation of
host|service->next_check_event directly pointing to the next
scheduled event. this algorithm is being used within
schedule_host|service_check, determing wether to use the
already assigned event, or scheduling a new event. Since we
did not populate the event_data (host or service object) when
adding a new event to the scheduler, this became insame, always
rescheduling a new event, but not discarding the old one.

This has been partly fixed in #2676 with refactoring that detection
and saving the next_check_event accordingly. But on already scheduled
events which were forced (overridden), another bug was unveiled.

Now that we add the reverse pointer from the host|service event_data
back to the newly created event when forcing a check, we will make sure
that those events are checked correctly, and executed/discarded in the
first place, and not always creating a new event, seperated from the rest.

basically, using the previous implementation (with and without the fix
from #2676) we've created an event bomb under various circumstances,
especially when future events would have been overridden (forced checks).
as events usually result in checks, which can result into perfdata, this
could possibly the root cause for #2924 as well, as other users reported
on the portal as well.

http://www.monitoring-portal.org/wbb/index.php?page=Thread&threadID=26352

With the kind patch provided by Imri Zvik, this now works like expected.
Adapted the debug output a bit myself, so with debug_level=272 it will be
easy to see what happens on events scheduling - and not the non-telling
mess before.

kudos to Imri Zvik for the patch.

refs #2993
refs #2676
refs #2182
refs #2924

Conflicts:
Changelog

History

#1 Updated by dnsmichi over 2 years ago

  • Category set to Check Scheduling
  • Status changed from Feedback to Assigned
  • Assignee set to dnsmichi
  • Target version set to 1.7

#2 Updated by dnsmichi about 2 years ago

getting this patch settled needs a bit more information. i've done some pen and paper analysis which needs some notes too.

currently, the scheduler logic does the following

  • loop through the events list to verify that there is an already scheduled event for that host/service check
  • if found, break and set found=TRUE, otherwise found=FALSE

this will run into O(n) given the complete event loop, if nothing is found.

afterwards, if an event was found and not a null ptr, it's then decided how to proceed:

  • take the new event?
  • or just use the old event?

this depends on how the check should be executed. by default, the original event is to be preferred (use_original_event=true).

  • FORCED original event?
    • new event is FORCED, but sooner?
      • use_original_event=FALSE
  • non FORCED original event?
    • new event is FORCED?
      • use_original_event=FALSE
    • new event is non FORCED, but earlier?
      • use_original_event=FALSE
    • new event is older?
      • use_original_event=TRUE

depending on that logic, the scheduling routine then decides, which event will be scheduled.

  • TRUE
    • original event wins
    • free/delete new event
  • FALSE
    • new event wins
    • remove the old event, free old temp_event

#3 Updated by dnsmichi about 2 years ago

so to speak, we loop through the event list once, searching for a duplicated event - there only can be one, such as the above logic describes.

given that relationship, the initial patch in the issue comes to mind.

what if saving the new_event pointer for later, when not using the original event?

so the new_event is used instead of the original event means that the next run should be aware of that, comparing to another scheduled event then. meaning to say, removing the looping with O(n) but instead depending on 1:1 linked events will increase performance.

check if there's another check event scheduled
temp_event = svc->next_check_event

if decided to use the new one, re-assign
svc->next_check_event = new_event

furthermore, upon actually running a scheduled check, the next_check_event must be reset then in order to indicate that it's not in the scheduling queue anymore.

#4 Updated by dnsmichi about 2 years ago

  • File reschedule_checks_oldnewdesign.jpg added
  • File reschedule_checks_newdesign.jpg added

#5 Updated by dnsmichi about 2 years ago

  • File deleted (reschedule_checks_oldnewdesign.jpg)

#6 Updated by dnsmichi about 2 years ago

  • File deleted (reschedule_checks_newdesign.jpg)

#8 Updated by dnsmichi about 2 years ago

  • Subject changed from Avoid insane looping when rescheduling checks to avoid insane looping through event list when rescheduling checks

#9 Updated by dnsmichi about 2 years ago

these profiling tests have been run on a quadcore with 4gb ram on debian sid.

icinga_without_2183_patch.txt

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 21.57      0.11     0.11     3135     0.04     0.04  schedule_service_check
  9.80      0.16     0.05     7328     0.01     0.01  free_check_result
  7.84      0.20     0.04        1    40.00    41.41  xrddefault_read_retention_file_information
  5.88      0.23     0.03     3243     0.01     0.03  run_async_service_check
  5.88      0.26     0.03      487     0.06     0.06  schedule_host_check
  5.88      0.29     0.03                             mmap_fgets
  3.92      0.31     0.02    12024     0.00     0.00  check_time_against_period
  3.92      0.33     0.02     3741     0.01     0.01  clear_volatile_macros_r
  3.92      0.35     0.02       11     1.82     1.82  xsddefault_save_status_data
  3.92      0.37     0.02                             event_execution_loop
  1.96      0.38     0.01   751717     0.00     0.00  skiplist_compare_text
  1.96      0.39     0.01   573625     0.00     0.00  log_debug_info
  1.96      0.40     0.01   259900     0.00     0.00  skiplist_compare_service
  1.96      0.41     0.01    54925     0.00     0.00  skiplist_find_first
  1.96      0.42     0.01    21922     0.00     0.00  grab_macrox_value_r
  1.96      0.43     0.01     5755     0.00     0.00  skiplist_insert
  1.96      0.44     0.01     4220     0.00     0.00  grab_host_macros_r
  1.96      0.45     0.01     3596     0.00     0.00  process_check_result_file
  1.96      0.46     0.01       34     0.29     0.31  xodtemplate_process_config_file
  1.96      0.47     0.01        2     5.00     8.12  xodtemplate_recombobulate_servicegroups
  1.96      0.48     0.01        1    10.00    18.84  pre_flight_object_check
  1.96      0.49     0.01        1    10.00    10.00  xrddefault_save_state_information

icinga_with_2183_patch.txt

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 12.99      0.10     0.10    10419     0.01     0.01  get_raw_command_line_r
  7.79      0.16     0.06       27     2.22     2.22  xsddefault_save_status_data
  7.79      0.22     0.06                             event_execution_loop
  6.49      0.27     0.05                             mmap_fgets
  5.19      0.31     0.04    83374     0.00     0.00  grab_macro_value_r
  5.19      0.35     0.04    26809     0.00     0.00  check_time_against_period
  5.19      0.39     0.04    21015     0.00     0.00  free_check_result
  5.19      0.43     0.04    10429     0.00     0.00  clear_volatile_macros_r
  5.19      0.47     0.04     8827     0.00     0.03  run_scheduled_service_check
  5.19      0.51     0.04        1    40.00    41.34  xrddefault_read_retention_file_information
  3.90      0.54     0.03    10601     0.00     0.00  add_check_result_to_list
  2.60      0.56     0.02   177404     0.00     0.00  skiplist_compare_servicedependency
  2.60      0.58     0.02    61390     0.00     0.00  grab_macrox_value_r
  2.60      0.60     0.02    10809     0.00     0.00  check_for_host_flapping
  2.60      0.62     0.02     8892     0.00     0.01  handle_async_service_check_result
  1.30      0.63     0.01    70390     0.00     0.00  skiplist_find_first
  1.30      0.64     0.01    35008     0.00     0.00  grab_standard_host_macro_r
  1.30      0.65     0.01    14173     0.00     0.00  update_check_stats
  1.30      0.66     0.01    13170     0.00     0.00  add_event
  1.30      0.67     0.01    12184     0.00     0.00  grab_host_macros_r
  1.30      0.68     0.01    10601     0.00     0.00  process_check_result_file
  1.30      0.69     0.01     5362     0.00     0.00  xodtemplate_begin_object_definition
  1.30      0.70     0.01     4448     0.00     0.00  find_timeperiod
  1.30      0.71     0.01       34     0.29     0.59  xodtemplate_process_config_file
  1.30      0.72     0.01       26     0.38     6.25  reap_check_results
  1.30      0.73     0.01        4     2.50     2.50  check_for_orphaned_services
  1.30      0.74     0.01        2     5.00     7.45  xodtemplate_recombobulate_servicegroups
  1.30      0.75     0.01        1    10.00    10.00  pre_flight_circular_check

...

  0.00      0.77     0.00     8940     0.00     0.00  schedule_service_check

looks good but needs more profiling and tests. as well as some extreme tests with masses of rescheduled forced checks from the gui.

#10 Updated by dnsmichi about 2 years ago

  • Status changed from Assigned to Feedback
  • % Done changed from 0 to 100

working on my test stages fine for 3 weeks now.

requires further tests and profiling.

#11 Updated by dnsmichi about 2 years ago

  • Status changed from Feedback to Resolved

everyone devloping on their dev stages must have run that for now, so i expect no real happenings anymore. but everyone not having tested it, please do so.

Also available in: Atom PDF