bind-9.3.1 stops answering queries for nearly a minute

Hello,
we've been noticing for a while periods when our heaviest-loaded name- server stops answering queries. It is running 9.3.1 and handles around 20 million queries per day, i.e. a few hundred per second. It's running on a dual-processor FreeBSD 4 system and compiled without threads.

What we see is a period of 30-60 seconds when the name-server stops answering queries. Monitoring with top shows a sudden spike in CPU activity for named while this is happening. While this basically happens every hour, it is just slightly longer, as if a one-hour timer is started when the activity underlying this finishes.

The results can been seen from the output of trace-nameserver, a local tool which sends a dns query every second for a record the name-server is author- itative for, and plots the results; a 'T' when response took > 800ms otherwise a '.'. Each line corresponds to a minute with the average/maximum response time in msecs seen during that minute listed at the end.
NB queries marked with a 'T' are not counted in these numbers, which means quite a few of those marked with a '.' around 13:09 were slow thereby increasing the average from 2-3ms to 119ms.

12:58: ............................................................ 2/ 9
12:59: ............................................................ 2/ 6
13:00: ............................................................ 2/ 5
13:01: ............................................................ 3/ 15
13:02: ............................................................ 2/ 18
13:03: ............................................................ 2/ 6
13:04: ............................................................ 2/ 7
13:05: ............................................................ 2/ 4
13:06: ............................................................ 3/ 8
13:07: ............................................................ 3/ 6
13:08: ............................................................ 3/ 7
13:09: ....T.TTTTTT.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT........T. 119/787
13:10: ............................................................ 3/ 28
13:11: ............................................................ 2/ 5
13:12: ............................................................ 2/ 10
13:13: ............................................................ 2/ 8
13:14: ............................................................ 3/ 9
13:15: ............................................................ 2/ 5
13:16: ............................................................ 2/ 13
13:17: ............................................................ 2/ 8


Because the period is not exactly hourly, we knew it was not cron-related.
Nor did it seem to be related to zone transfers nor was any noteworthy activity apparent in the query logs.

I finally enabled tracing at level 10 a few minutes before an event was due. A 100+Meg named.run was produced in around 5 minutes.
A quick scroll through identified large numbers of lines of the form:
13:06:20.663 no_references: delete from rbt: 0xc1d6d00 ns1.gpl.com.au A short perl script counted the number of such lines in each second and produced the following, i.e. the time axis is rotated by 90 degrees compared to the output of trace-nameserver.

There is a clear correlation between the figure above and below so I suspect the unresponsiveness to be coming during house-cleaning (deleting from RedBlack trees ?).

We'd like to know if there is something that could be done to prevent such long periods of unresponsiveness. Mind you, the effects would be less important if UNIX resolvers re-ordered the list of name-servers specified in resolv.conf so the fastest is consulted first. This dynamic behaviour is what the Windows resolver apparently does, Win2K and later.

Could smaller cleanups be done every minute ?
Is this a case when threading would help, or would it just mean extra CPUs would be working on garbage-collection ?
Are there any build options which might help, e.g. I've seen the metion of ISC_MEM_USE_INTERNAL_MALLOC for performance, which is slightly different.

I suspect the timeouts we're seeing are a different issue to
1740. [bug] Replace rbt's hash algorithm as it performed badly
with certain zones. [RT #12729]
or several others mentioning bugs fixed wrt RBTs.

cheers,
Danny Thomas


each column in this tables corresponds to a minute, starting from 13:06:00 through 13:10:59. The numbers suddenly increase at 13:09:10, i.e. a few seconds after trace-nameserver starts recording time-outs
06:00 - 07:00 - 08:00 - 09:00 - 10:00 -
06:01 - 07:01 - 08:01 - 09:01 - 10:01 -
06:02 - 07:02 - 08:02 - 09:02 - 10:02 -
06:03 - 07:03 - 08:03 - 09:03 - 10:03 -
06:04 - 07:04 - 08:04 - 09:04 - 10:04 -
06:05 - 07:05 - 08:05 - 09:05 - 10:05 -
06:06 - 07:06 - 08:06 - 09:06 - 10:06 -
06:07 - 07:07 - 08:07 - 09:07 - 10:07 -
06:08 - 07:08 - 08:08 - 09:08 - 10:08 -
06:09 - 07:09 - 08:09 - 09:09 - 10:09 -
06:10 - 07:10 - 08:10 - 09:10 1,141 10:10 4
06:11 - 07:11 - 08:11 1 09:11 1,112 10:11 14
06:12 - 07:12 1 08:12 - 09:12 1,043 10:12 -
06:13 1 07:13 - 08:13 - 09:13 884 10:13 2
06:14 - 07:14 2 08:14 - 09:14 1,369 10:14 5
06:15 - 07:15 1 08:15 - 09:15 1,452 10:15 2
06:16 - 07:16 - 08:16 - 09:16 1,434 10:16 1
06:17 - 07:17 - 08:17 9 09:17 1,756 10:17 3
06:18 2 07:18 - 08:18 6 09:18 1,861 10:18 1
06:19 - 07:19 6 08:19 2 09:19 1,711 10:19 6
06:20 4 07:20 7 08:20 2 09:20 1,543 10:20 -
06:21 - 07:21 - 08:21 6 09:21 1,301 10:21 1
06:22 6 07:22 - 08:22 - 09:22 1,323 10:22 6
06:23 - 07:23 1 08:23 - 09:23 1,441 10:23 1
06:24 6 07:24 1 08:24 6 09:24 1,405 10:24 -
06:25 3 07:25 - 08:25 - 09:25 1,500 10:25 1
06:26 3 07:26 - 08:26 1 09:26 1,535 10:26 2
06:27 - 07:27 - 08:27 12 09:27 1,659 10:27 1
06:28 1 07:28 - 08:28 18 09:28 1,709 10:28 -
06:29 6 07:29 1 08:29 6 09:29 1,456 10:29 1
06:30 18 07:30 - 08:30 - 09:30 1,863 10:30 2
06:31 - 07:31 12 08:31 - 09:31 1,764 10:31 -
06:32 - 07:32 - 08:32 12 09:32 2,031 10:32 1
06:33 - 07:33 1 08:33 - 09:33 1,960 10:33 2
06:34 - 07:34 8 08:34 25 09:34 2,330 10:34 18
06:35 - 07:35 2 08:35 - 09:35 2,199 10:35 -
06:36 - 07:36 - 08:36 7 09:36 1,988 10:36 2
06:37 6 07:37 - 08:37 6 09:37 2,346 10:37 -
06:38 6 07:38 - 08:38 1 09:38 2,685 10:38 -
06:39 - 07:39 - 08:39 3 09:39 2,368 10:39 -
06:40 - 07:40 - 08:40 - 09:40 3,021 10:40 -
06:41 - 07:41 - 08:41 13 09:41 4,766 10:41 -
06:42 3 07:42 - 08:42 - 09:42 5,256 10:42 -
06:43 - 07:43 - 08:43 5 09:43 5,594 10:43 -
06:44 2 07:44 2 08:44 1 09:44 5,824 10:44 -
06:45 1 07:45 - 08:45 - 09:45 4,422 10:45 -
06:46 6 07:46 1 08:46 1 09:46 5,084 10:46 -
06:47 - 07:47 - 08:47 - 09:47 7,487 10:47 -
06:48 - 07:48 1 08:48 - 09:48 6,626 10:48 -
06:49 - 07:49 1 08:49 - 09:49 2,931 10:49 -
06:50 - 07:50 1 08:50 - 09:50 - 10:50 -
06:51 - 07:51 1 08:51 - 09:51 - 10:51 -
06:52 7 07:52 1 08:52 - 09:52 3 10:52 -
06:53 - 07:53 6 08:53 8 09:53 4 10:53 -
06:54 - 07:54 - 08:54 - 09:54 - 10:54 -
06:55 2 07:55 - 08:55 - 09:55 5 10:55 -
06:56 1 07:56 3 08:56 - 09:56 - 10:56 -
06:57 2 07:57 2 08:57 3 09:57 - 10:57 -
06:58 2 07:58 11 08:58 - 09:58 - 10:58 -
06:59 - 07:59 7 08:59 - 09:59 1 10:59 -

 


  Latest articles

Question about negative answers from the cache of BIND9

Slave bind skips delegation record in master zone

Slave zones not updating

SPF RRType

Trying to get full domain info in nslookup

Invalid DNS entries in Netlogon.dns

Ze Network © 2007 Free Space Australia Inc. All rights reserved.

   Wallpaper World   Tran Community