5

While using MySql (Aurora) I started noticing strange queries going missing and workers simply hanging forever. After some investigation it seems that the same code running in AWS is OK, but on Azure it simply hangs, forever!

Seems like the reason for this is Azure simply killing connections that are idle (like a long running query). Note that I can reproduce this on a VM that doesn't even have a load balancer ahead of it.

This is reproducible with the following date && time mysql -h$SERVER -u$USER -D mydb -p$PASS -e "SELECT SLEEP(260);"

Note that I tested it with 240, 250 and at 260 it dies. But not just dies, hangs, forever! Looks like Azure doesn't even bother to tell the socket to die so the MySql client hangs.

We have queries running from NodeJs and Python, so I need a solution that works in both if possible.

See here for a good example: https://i.stack.imgur.com/b5vVi.jpg (note I had to kill mysql in another session for it to actually release)

user1730969
  • 462
  • 3
  • 10
  • 1
    Are there any config params like this https://stackoverflow.com/a/40913764/3968921 that can be used to configure threshold for long running queries. Also I think you should kill the query rather than closing the socket connection. Socket can be reused by different queries. – Ankit Deshpande Jan 24 '19 at 09:15
  • @AnkitDeshpande while it make the query fail and not hang, I'm hoping for something to actually fix it, not just mitigate the hang. – user1730969 Jan 24 '19 at 10:25
  • What kind of fix are you looking for, Queries should run infinitely till completion ? – Ankit Deshpande Jan 24 '19 at 11:38
  • Yes, sometimes we have queries that need more than 4 minutes to complete. – user1730969 Jan 24 '19 at 12:05
  • Usually it is a good practice to avoid long running queries. Try optimising your queries and you can probably break it down into batches. There should be some threshold that you need to configure to allow queries running upto 4 mins. – Ankit Deshpande Jan 24 '19 at 12:43

1 Answers1

4

After some research I found a low level workaround which should always work, I inject a lib into any binary that makes all the sockets KEEP-ALIVE using LD_PRELOAD. The lib I inject is libdontdie, a fork of an older lib: libkeepalive.

After building the lib I run: date && time DD_DEBUG=1 DD_TCP_KEEPALIVE_TIME=4 DD_TCP_KEEPALIVE_INTVL=5 DD_TCP_KEEPALIVE_PROBES=6 LD_PRELOAD=/usr/lib/libdontdie.so mysql -h$SERVER -u$USER -D mydb -p$PASS -e "SELECT SLEEP(300);"

And it works as expected (tested on both a VM in azure and a docker image inside AKS).

user1730969
  • 462
  • 3
  • 10