| bind-9.3.1
stops answering queries for nearly a minute
Hello,
we've been noticing for a while periods when our heaviest-loaded
name- server stops answering queries. It is running 9.3.1 and handles
around 20 million queries per day, i.e. a few hundred per second.
It's running on a dual-processor FreeBSD 4 system and compiled without
threads.
What we see is a period of 30-60 seconds when the
name-server stops answering queries. Monitoring with top shows a
sudden spike in CPU activity for named while this is happening.
While this basically happens every hour, it is just slightly longer,
as if a one-hour timer is started when the activity underlying this
finishes.
The results can been seen from the output of trace-nameserver,
a local tool which sends a dns query every second for a record the
name-server is author- itative for, and plots the results; a 'T'
when response took > 800ms otherwise a '.'. Each line corresponds
to a minute with the average/maximum response time in msecs seen
during that minute listed at the end.
NB queries marked with a 'T' are not counted in these numbers, which
means quite a few of those marked with a '.' around 13:09 were slow
thereby increasing the average from 2-3ms to 119ms.
12:58: ............................................................
2/ 9
12:59: ............................................................
2/ 6
13:00: ............................................................
2/ 5
13:01: ............................................................
3/ 15
13:02: ............................................................
2/ 18
13:03: ............................................................
2/ 6
13:04: ............................................................
2/ 7
13:05: ............................................................
2/ 4
13:06: ............................................................
3/ 8
13:07: ............................................................
3/ 6
13:08: ............................................................
3/ 7
13:09: ....T.TTTTTT.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT........T.
119/787
13:10: ............................................................
3/ 28
13:11: ............................................................
2/ 5
13:12: ............................................................
2/ 10
13:13: ............................................................
2/ 8
13:14: ............................................................
3/ 9
13:15: ............................................................
2/ 5
13:16: ............................................................
2/ 13
13:17: ............................................................
2/ 8
Because the period is not exactly hourly, we knew it was not cron-related.
Nor did it seem to be related to zone transfers nor was any noteworthy
activity apparent in the query logs.
I finally enabled tracing at level 10 a few minutes
before an event was due. A 100+Meg named.run was produced in around
5 minutes.
A quick scroll through identified large numbers of lines of the
form:
13:06:20.663 no_references: delete from rbt: 0xc1d6d00 ns1.gpl.com.au
A short perl script counted the number of such lines in each second
and produced the following, i.e. the time axis is rotated by 90
degrees compared to the output of trace-nameserver.
There is a clear correlation between the figure
above and below so I suspect the unresponsiveness to be coming during
house-cleaning (deleting from RedBlack trees ?).
We'd like to know if there is something that could
be done to prevent such long periods of unresponsiveness. Mind you,
the effects would be less important if UNIX resolvers re-ordered
the list of name-servers specified in resolv.conf so the fastest
is consulted first. This dynamic behaviour is what the Windows resolver
apparently does, Win2K and later.
Could smaller cleanups be done every minute ?
Is this a case when threading would help, or would it just mean
extra CPUs would be working on garbage-collection ?
Are there any build options which might help, e.g. I've seen the
metion of ISC_MEM_USE_INTERNAL_MALLOC for performance, which is
slightly different.
I suspect the timeouts we're seeing are a different
issue to
1740. [bug] Replace rbt's hash algorithm as it performed badly
with certain zones. [RT #12729]
or several others mentioning bugs fixed wrt RBTs.
cheers,
Danny Thomas
each column in this tables corresponds to a minute, starting from
13:06:00 through 13:10:59. The numbers suddenly increase at 13:09:10,
i.e. a few seconds after trace-nameserver starts recording time-outs
06:00 - 07:00 - 08:00 - 09:00 - 10:00 -
06:01 - 07:01 - 08:01 - 09:01 - 10:01 -
06:02 - 07:02 - 08:02 - 09:02 - 10:02 -
06:03 - 07:03 - 08:03 - 09:03 - 10:03 -
06:04 - 07:04 - 08:04 - 09:04 - 10:04 -
06:05 - 07:05 - 08:05 - 09:05 - 10:05 -
06:06 - 07:06 - 08:06 - 09:06 - 10:06 -
06:07 - 07:07 - 08:07 - 09:07 - 10:07 -
06:08 - 07:08 - 08:08 - 09:08 - 10:08 -
06:09 - 07:09 - 08:09 - 09:09 - 10:09 -
06:10 - 07:10 - 08:10 - 09:10 1,141 10:10 4
06:11 - 07:11 - 08:11 1 09:11 1,112 10:11 14
06:12 - 07:12 1 08:12 - 09:12 1,043 10:12 -
06:13 1 07:13 - 08:13 - 09:13 884 10:13 2
06:14 - 07:14 2 08:14 - 09:14 1,369 10:14 5
06:15 - 07:15 1 08:15 - 09:15 1,452 10:15 2
06:16 - 07:16 - 08:16 - 09:16 1,434 10:16 1
06:17 - 07:17 - 08:17 9 09:17 1,756 10:17 3
06:18 2 07:18 - 08:18 6 09:18 1,861 10:18 1
06:19 - 07:19 6 08:19 2 09:19 1,711 10:19 6
06:20 4 07:20 7 08:20 2 09:20 1,543 10:20 -
06:21 - 07:21 - 08:21 6 09:21 1,301 10:21 1
06:22 6 07:22 - 08:22 - 09:22 1,323 10:22 6
06:23 - 07:23 1 08:23 - 09:23 1,441 10:23 1
06:24 6 07:24 1 08:24 6 09:24 1,405 10:24 -
06:25 3 07:25 - 08:25 - 09:25 1,500 10:25 1
06:26 3 07:26 - 08:26 1 09:26 1,535 10:26 2
06:27 - 07:27 - 08:27 12 09:27 1,659 10:27 1
06:28 1 07:28 - 08:28 18 09:28 1,709 10:28 -
06:29 6 07:29 1 08:29 6 09:29 1,456 10:29 1
06:30 18 07:30 - 08:30 - 09:30 1,863 10:30 2
06:31 - 07:31 12 08:31 - 09:31 1,764 10:31 -
06:32 - 07:32 - 08:32 12 09:32 2,031 10:32 1
06:33 - 07:33 1 08:33 - 09:33 1,960 10:33 2
06:34 - 07:34 8 08:34 25 09:34 2,330 10:34 18
06:35 - 07:35 2 08:35 - 09:35 2,199 10:35 -
06:36 - 07:36 - 08:36 7 09:36 1,988 10:36 2
06:37 6 07:37 - 08:37 6 09:37 2,346 10:37 -
06:38 6 07:38 - 08:38 1 09:38 2,685 10:38 -
06:39 - 07:39 - 08:39 3 09:39 2,368 10:39 -
06:40 - 07:40 - 08:40 - 09:40 3,021 10:40 -
06:41 - 07:41 - 08:41 13 09:41 4,766 10:41 -
06:42 3 07:42 - 08:42 - 09:42 5,256 10:42 -
06:43 - 07:43 - 08:43 5 09:43 5,594 10:43 -
06:44 2 07:44 2 08:44 1 09:44 5,824 10:44 -
06:45 1 07:45 - 08:45 - 09:45 4,422 10:45 -
06:46 6 07:46 1 08:46 1 09:46 5,084 10:46 -
06:47 - 07:47 - 08:47 - 09:47 7,487 10:47 -
06:48 - 07:48 1 08:48 - 09:48 6,626 10:48 -
06:49 - 07:49 1 08:49 - 09:49 2,931 10:49 -
06:50 - 07:50 1 08:50 - 09:50 - 10:50 -
06:51 - 07:51 1 08:51 - 09:51 - 10:51 -
06:52 7 07:52 1 08:52 - 09:52 3 10:52 -
06:53 - 07:53 6 08:53 8 09:53 4 10:53 -
06:54 - 07:54 - 08:54 - 09:54 - 10:54 -
06:55 2 07:55 - 08:55 - 09:55 5 10:55 -
06:56 1 07:56 3 08:56 - 09:56 - 10:56 -
06:57 2 07:57 2 08:57 3 09:57 - 10:57 -
06:58 2 07:58 11 08:58 - 09:58 - 10:58 -
06:59 - 07:59 7 08:59 - 09:59 1 10:59 -
|
 |
Latest
articles
Question
about negative answers from the cache of BIND9
Slave
bind skips delegation record in master zone
Slave
zones not updating
SPF RRType
Trying
to get full domain info in nslookup
Invalid
DNS entries in Netlogon.dns
|
 |
 |