4

I am using the Curator framework to connect to a ZooKeeper server, but I am running into a weird DNS resolution issue. Here is the jstack dump for the thread:

#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000]
java.lang.Thread.State: RUNNABLE
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(StaticHostProvider.java:117)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:81)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1096)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1006)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:804)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:679)
    at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:72)
    - locked <0x00000000fd761f40> (a com.netflix.curator.HandleHolder$1)
    at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:46)
    at com.netflix.curator.ConnectionState.reset(ConnectionState.java:122)
    at com.netflix.curator.ConnectionState.start(ConnectionState.java:95)
    at com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:137)
    at com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:167)

The thread seems to be stuck in the native method and never returns. It also occurs very randomly, so I haven't been able to reproduce it consistently. Any ideas?
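For reference, here is a minimal sketch of the kind of setup that reaches this code path (the connect string, retry settings, and class name are placeholders, not my actual configuration):

import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class ZkClientStartup {
    public static void main(String[] args) {
        // Placeholder connect string; every host in it is resolved via
        // InetAddress.getAllByName() inside StaticHostProvider when start() is called.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1.example.com:2181,zk2.example.com:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start(); // the stack above shows this call stuck in the native DNS lookup
    }
}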

vikky.rk
  • Not sure if it's an issue with DNS or not. – vikky.rk Mar 09 '16 at 03:29
  • check this one: http://stackoverflow.com/questions/1608503/domain-name-resolution-not-working-in-java-applications-on-ubuntu64-9-04-machine – Yigal Reiss Mar 09 '16 at 07:27
  • I'm currently having the same problem at random times. Our -Djava.net.preferIPv4Stack=true is defined and we run on a Red Hat server. Can we define a timeout for that DNS resolution call? – Simon LG Mar 22 '16 at 14:56

1 Answer

7

We are also trying to solve this problem. It looks like this is due to a glibc bug: https://bugzilla.kernel.org/show_bug.cgi?id=99671 or a kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=1209433, depending on who you ask ;)

Also worth reading: https://access.redhat.com/security/cve/cve-2013-7423 and https://alas.aws.amazon.com/ALAS-2015-617.html

To confirm that this is indeed the case, attach gdb to the Java process:

gdb --pid <JavaProcessPid>

then from gdb:

info threads 

find a thread that is sitting in recvmsg and switch to it:

thread <HangingThreadId>

and then

backtrace 

and if you see something like this, then you know that a glibc/kernel upgrade will help:

#0  0x00007fc726ff27cd in recvmsg () from /lib64/libc.so.6
#1  0x00007fc727018765 in make_request () from /lib64/libc.so.6
#2  0x00007fc727018b9a in __check_pf () from /lib64/libc.so.6
#3  0x00007fc726fdbd57 in getaddrinfo () from /lib64/libc.so.6
#4  0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-0.b17.el6_7.x86_64/jre/lib/amd64/libnet.so

Update: It looks like the kernel wins. Please see this thread for details: http://www.gossamer-threads.com/lists/linux/kernel/2264958. Also, to verify that your system is affected by the kernel bug, you can use this simple program: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

To verify:

curl -o pf_dump.c https://gist.githubusercontent.com/stevenschlansker/6ad46c5ccb22bc4f3473/raw/22cfe72f6708de1e3468c1e0fa3888aafae42db4/pf_dump.c
gcc pf_dump.c -pthread -o pf_dump
./pf_dump

And if the output is:

[26170] glibc: check_pf: netlink socket read timeout
Aborted

Then the system is affected. If the output is something like:

exit success
[7618] exit success
[7265] exit success

then the system is OK. In the AWS context, upgrading AMIs to 2016.3.2 with the new kernel seems to have fixed the problem.
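On a related note, regarding the timeout question in the comments above: while waiting for a kernel/glibc upgrade, you can at least bound how long callers wait on resolution by running the lookup on a worker thread. This is only a sketch of that idea (a workaround of my own, not part of the fix), and note that the stuck native call itself cannot be interrupted, so the worker thread may still linger until it returns:

import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedLookup {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Resolve host, waiting at most timeoutMs; a hung native lookup surfaces
    // as a TimeoutException instead of blocking the caller forever.
    static InetAddress[] resolve(final String host, long timeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        Callable<InetAddress[]> task = new Callable<InetAddress[]>() {
            public InetAddress[] call() throws Exception {
                return InetAddress.getAllByName(host);
            }
        };
        Future<InetAddress[]> future = POOL.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: Inet4AddressImpl.lookupAllHostAddr does not
            // respond to interruption, so the worker thread may stay stuck.
            future.cancel(true);
            throw e;
        }
    }
}

Keep in mind that Curator resolves hosts internally in StaticHostProvider, so this does not fix the hang shown in the question; it only helps where you control the lookup yourself.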

Jacek Tomaka
  • Please do not write link-only answers. Either make it a comment or include the essential parts in your text. – Jan Doggen Aug 11 '16 at 11:03
  • Yes indeed, glibc upgrade fixed the issue! I forgot to update this thread. – vikky.rk Aug 11 '16 at 16:53
  • Thanks @Jacek Tomaka. I think `curl -O` should be `curl -o`. – seanf May 25 '17 at 08:09
  • For those with an AMI context (e.g. I'm using 2014.03): the issue reproduces even though the above script said "exit success", meaning that for those AMIs the check with the gist script is not precise. – Robocide Jan 30 '18 at 10:24