Have you ever experienced a situation where one moment you can connect to the MySQL database and the next moment you cannot, only to be able to connect again a second later? As you may know, one cannot open an infinite number of connections to MySQL. There’s a practical limit, and more often than not it is imposed by the underlying operating system. If you’re getting:
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.0.10' (99)
…there’s a good chance you’re hitting such a limit. What might be misleading in the information above is who (which side) is preventing the connection from being established.
Understanding the problem at hand
Whenever a client uses the network to connect to a service running on a given port of a server, this connection is established through the creation of a socket:
A socket is a tuple of 4 terms: the source and destination IPs and ports.
The destination port is obviously the one where the service is running on the server: for instance, usually port 22 for SSH and port 3306 for MySQL. The source port is an arbitrary local network port on the client side, which will show up in tools like netstat and ss:
Note: I’ve used netstat and ss alternately in this post, but if you read the man page for netstat these days you’ll see a note that says: “This program is obsolete. Replacement for netstat is ss”. I was advised to give preference to ss over netstat on a busy server: besides being faster and providing a richer set of information, ss puts less stress on the server as it talks directly to the kernel, while what netstat does is to scan /proc.

$ ss -na dport = :3306
State      Recv-Q Send-Q      Local Address:Port        Peer Address:Port
ESTAB      0      0           192.168.0.100:48681       192.168.0.10:3306
The example above shows that server 192.168.0.100 has a TCP connection established with MySQL (3306) running on server 192.168.0.10 through local port 48681. The range of local ports that can be used for TCP and UDP traffic is found in /proc/sys/net/ipv4/ip_local_port_range:
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    61000
Those values above are the defaults found in many Linux distributions: 32768 is the first local port that can be used and 61000 is the last one, for a total of 28233 available ports. It may look like a lot, but whether it is enough truly depends on the nature of the local applications connecting to services on other servers through the network.
When it comes to connecting to database servers, many applications choose to open a new connection for a single request only, closing it right after the request is processed. Even though the connection is closed by the client (application), the local port it was using is not immediately released by the OS to be reused by another connection: it will sit in TIME_WAIT state for (usually) 60 seconds – this value cannot be easily changed, as it is hard coded in the kernel:
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                  * state, about 60 seconds     */
And contrary to what you may have heard or read, tuning /proc/sys/net/ipv4/tcp_fin_timeout is of no use here, as it rules a different type of timeout and has no impact on releasing connections hanging in TIME_WAIT state. To better understand the role played by this state I suggest you read Vincent Bernat’s post, from which I reproduce:
There are two purposes for the TIME-WAIT state:

The most known one is to prevent delayed segments from one connection being accepted by a later connection relying on the same quadruplet (source address, source port, destination address, destination port) (…)

The other purpose is to ensure the remote end has closed the connection. When the last ACK is lost, the remote end stays in the LAST-ACK state. Without the TIME-WAIT state, a connection could be reopened while the remote end still thinks the previous connection is valid. (…)
The problem with this situation is that if you keep accumulating connections in TIME_WAIT state you’ll quickly saturate the available local ports. And once all ports are taken, any attempt to open a new connection will result in an error similar to the one shown above.
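A quick way to get a feel for how many sockets are currently sitting in each state is ss’s summary mode; the TCP line of its output includes a timewait counter:

$ ss -s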
Reproducing the problem
It’s easy to reproduce this scenario: it suffices to decrease the local port range on the computer that will be initiating the connections (usually the application server) to, for example, only half a dozen ports:
$ echo 61001 61006 > /proc/sys/net/ipv4/ip_local_port_range
Then we proceed with opening 6 connections from the application server (192.168.0.100) to the database server (192.168.0.10).
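One crude way of doing that from the shell is to let each mysql client hold its connection open for a while; the user, password and sleep duration below are placeholders for illustration, not part of the original test:

$ for i in $(seq 1 6); do mysql -h 192.168.0.10 -u app -psecret -e "SELECT SLEEP(60)" & done

With the six connections established, ss on the client shows: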
$ ss -na dport = :3306
State      Recv-Q Send-Q      Local Address:Port        Peer Address:Port
ESTAB      0      0           192.168.0.100:61005       192.168.0.10:3306
ESTAB      0      0           192.168.0.100:61003       192.168.0.10:3306
ESTAB      0      0           192.168.0.100:61001       192.168.0.10:3306
ESTAB      0      0           192.168.0.100:61004       192.168.0.10:3306
ESTAB      0      0           192.168.0.100:61006       192.168.0.10:3306
ESTAB      0      0           192.168.0.100:61002       192.168.0.10:3306
Now, when we try to open a seventh connection to the database we’ll hit the error stated at the beginning of this post. What the error actually means is:
$ perror 99
OS error code  99:  Cannot assign requested address
In fact, it should be complemented with “… because there are no available local network ports left“.
And even if we close all 6 MySQL connections right away, they’ll all move from ESTABLISHED to TIME_WAIT state and we’ll still need to wait for them to expire before we can open a new connection:
$ ss -na dport = :3306
State      Recv-Q Send-Q      Local Address:Port        Peer Address:Port
TIME-WAIT  0      0           192.168.0.100:61005       192.168.0.10:3306
TIME-WAIT  0      0           192.168.0.100:61003       192.168.0.10:3306
TIME-WAIT  0      0           192.168.0.100:61001       192.168.0.10:3306
TIME-WAIT  0      0           192.168.0.100:61004       192.168.0.10:3306
TIME-WAIT  0      0           192.168.0.100:61006       192.168.0.10:3306
TIME-WAIT  0      0           192.168.0.100:61002       192.168.0.10:3306
That’s where scalability problems arise: if your application server keeps opening new connections faster than old ones are released, it will eventually run out of local ports. For example, considering the default port range from 32768 to 61000 and a TIME_WAIT of 60 seconds, in theory we can only open and close around 470 new network connections each second ((61000 – 32768 + 1)/60 = 470.55) before we saturate the available local network ports; that’s not much…
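To check how close you are to that ceiling at any given moment, you can count the client’s connections to MySQL currently held in TIME_WAIT; the filter syntax is the same one used with ss above:

$ ss -nt state time-wait dport = :3306 | grep -c :3306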
Possible Solutions
This is by no means an exhaustive list, but here are a few possible approaches to consider. If you have something to add about them, or happen to know about any other, please let me know by leaving a comment below.
Increasing port range
If the server initiating the connections is operating with the default port range, you can start by increasing it somewhat. The first 1023 ports are privileged, meaning only root can start an application listening on one of these initial ports. At the other extreme, the highest port you can have assigned is 65535 (2^16-1). In practice, then, you can increase the port range to the maximum of 1024-65535, which provides 64512 ports, allowing in theory around 1075 short-lived connections to be opened and closed per second:
$ sysctl -w net.ipv4.ip_local_port_range="1024 65535"
To make this change permanent and survive a server reboot you need to add the following line to /etc/sysctl.conf:
net.ipv4.ip_local_port_range=1024 65535
You should, however, pay attention when stretching these values to their limits: a local service that expects to bind to a specific port in that range may find it already taken by an outgoing connection.
Adding extra IP addresses and listening to multiple ports
Something that wasn’t clear to me at first is that the port range limitation is applied per quadruplet (<source address>:<source port>, <destination address>:<destination port>). As such, if you have the port range set from 61001 to 61006 on the client you can open no more than 6 connections to a given destination <address>:<port> pair, but you can have 6 MySQL connections, plus 6 SSH connections, plus 6 nc connections, and so on, all at the same time:
tcp        0      0 192.168.0.100:61006     192.168.0.10:3306       ESTABLISHED
tcp        0      0 192.168.0.100:61005     192.168.0.10:9999       ESTABLISHED
tcp        0      0 192.168.0.100:61004     192.168.0.10:22         ESTABLISHED
tcp        0      0 192.168.0.100:61003     192.168.0.10:22         ESTABLISHED
tcp        0      0 192.168.0.100:61005     192.168.0.10:3306       ESTABLISHED
tcp        0      0 192.168.0.100:61005     192.168.0.10:22         ESTABLISHED
tcp        0      0 192.168.0.100:61002     192.168.0.10:3306       ESTABLISHED
tcp        0      0 192.168.0.100:61004     192.168.0.10:9999       ESTABLISHED
tcp        0      0 192.168.0.100:61006     192.168.0.10:22         ESTABLISHED
tcp        0      0 192.168.0.100:61002     192.168.0.10:9999       ESTABLISHED
tcp        0      0 192.168.0.100:61001     192.168.0.10:9999       ESTABLISHED
tcp        0      0 192.168.0.100:61003     192.168.0.10:9999       ESTABLISHED
tcp        0      0 192.168.0.100:61006     192.168.0.10:9999       ESTABLISHED
tcp        0      0 192.168.0.100:61004     192.168.0.10:3306       ESTABLISHED
tcp        0      0 192.168.0.100:61001     192.168.0.10:22         ESTABLISHED
tcp        0      0 192.168.0.100:61002     192.168.0.11:3306       ESTABLISHED <-- here!
tcp        0      0 192.168.0.100:61001     192.168.0.10:3306       ESTABLISHED
tcp        0      0 192.168.0.100:61002     192.168.0.10:22         ESTABLISHED
tcp        0      0 192.168.0.100:61003     192.168.0.10:3306       ESTABLISHED
Note in the list above there’s more than one connection established from the same source <address>:<port> pair, though bound to different destination <address>:<port> pairs. There’s even a 7th connection to MySQL, though to one running on a different server (192.168.0.11:3306).
So, another way of increasing the number of concurrent connections is to add an additional IP address to either the outgoing side (client) or the incoming side (server) and make part of the connections happen through it. Likewise, even though you cannot make MySQL listen on multiple ports simultaneously, you can configure your firewall to accept connections on other ports and redirect them to port 3306 (or whichever one you’re using). In the example below I configure iptables on the receiving side to redirect all connections destined to port 80 to port 3306:
$ iptables -A PREROUTING -t nat -p tcp --dport 80 -j REDIRECT --to-port 3306
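On the client side, connecting through the redirected port, or through an extra IP address added as an alias, could then look something like the sketch below. The interface name, the alias address and the credentials are assumptions for illustration, and --bind-address requires a reasonably recent mysql client:

# connect to MySQL through the port redirected by the iptables rule above
$ mysql -h 192.168.0.10 -P 80 --protocol=TCP -u app -p

# or add a second IP address on the client and use it as the source of new connections
$ ip addr add 192.168.0.101/24 dev eth0
$ mysql -h 192.168.0.10 --bind-address=192.168.0.101 -u app -p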
Modifying the connection behavior of the application(s)
Expanding the number of possible quadruplets will certainly help increase the number of possible concurrent connections, but it won’t scale indefinitely. In the long term you may need to review your application’s configuration and behavior and set it up (or re-engineer it) in such a way that it avoids opening and closing many connections over the network too often – if that’s the problem after all. You may resort to some sort of connection pooling instead, but be sure to evaluate it well first.
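As a crude illustration of the difference (host and credentials are placeholders), the first loop below opens and closes one TCP connection per statement, leaving a TIME_WAIT socket behind each time, while the second sends the same 1000 statements through a single connection:

# one short-lived connection per statement
$ for i in $(seq 1 1000); do mysql -h 192.168.0.10 -u app -psecret -e "SELECT 1"; done

# the same 1000 statements over a single connection
$ seq 1 1000 | sed 's/.*/SELECT 1;/' | mysql -h 192.168.0.10 -u app -psecret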
Tweaking TCP parameter settings
Even though you cannot easily decrease the timeout for the TIME_WAIT state, there are at least three different TCP parameters you can use to “bypass” this limitation. You should explore these options with caution, though, as they can affect the reliability of TCP connections.
tcp_tw_reuse
tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6) Allow to reuse TIME_WAIT sockets for new connections when it is safe from protocol viewpoint. It should not be changed without advice/request of technical experts.
It is possible to force the kernel to reuse a connection hanging in TIME_WAIT state by setting /proc/sys/net/ipv4/tcp_tw_reuse to 1. What happens in practice is that you’ll keep seeing the closed connections hanging in TIME_WAIT until either they expire or a new connection is requested. In the latter case, the connection will be “revived”.
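For reference, enabling it looks like this (either form works; to persist it across reboots add the corresponding line to /etc/sysctl.conf, as shown earlier):

$ echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
$ sysctl -w net.ipv4.tcp_tw_reuse=1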
Here’s an example from the previous scenario where I had limited the port range to only 6 ports. I enabled tcp_tw_reuse and opened 6 connections with MySQL, closing five of them soon afterwards:
$ netstat -tn|grep 3306
tcp        0      0 192.168.0.100:61006     192.168.0.10:3306       TIME_WAIT
tcp        0      0 192.168.0.100:61005     192.168.0.10:3306       TIME_WAIT
tcp        0      0 192.168.0.100:61002     192.168.0.10:3306       ESTABLISHED
tcp        0      0 192.168.0.100:61004     192.168.0.10:3306       TIME_WAIT
tcp        0      0 192.168.0.100:61001     192.168.0.10:3306       TIME_WAIT
tcp        0      0 192.168.0.100:61003     192.168.0.10:3306       TIME_WAIT
The TIME_WAIT timers were still counting down (both ss and netstat accept the -o option to display them) when I opened a new connection:
$ netstat -ton|grep 3306
tcp        0      0 192.168.0.100:61006     192.168.0.10:3306       TIME_WAIT   timewait (35.07/0/0)
tcp        0      0 192.168.0.100:61005     192.168.0.10:3306       TIME_WAIT   timewait (34.52/0/0)
tcp        0      0 192.168.0.100:61002     192.168.0.10:3306       ESTABLISHED keepalive (3586.59/0/0)
tcp        0      0 192.168.0.100:61004     192.168.0.10:3306       TIME_WAIT   timewait (33.91/0/0)
tcp        0      0 192.168.0.100:61001     192.168.0.10:3306       TIME_WAIT   timewait (35.65/0/0)
tcp        0      0 192.168.0.100:61003     192.168.0.10:3306       ESTABLISHED keepalive (7196.66/0/0)
Note the new connection was established reusing the port of the connection that had been sitting in TIME_WAIT state the longest (local port 61003).
tcp_tw_recycle
tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4) Enable fast recycling of TIME_WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).
When you enable /proc/sys/net/ipv4/tcp_tw_recycle, closed connections will not show under TIME_WAIT anymore – they disappear from netstat altogether. But as soon as you open a new connection (within the 60-second window) it will recycle one of them. Everyone writing about this alternative seems to advise against its use, though. The bottom line is: it’s preferable to reuse a connection than to recycle it.
tcp_max_tw_buckets
tcp_max_tw_buckets (integer; default: see below; since Linux 2.4) The maximum number of sockets in TIME_WAIT state allowed in the system. This limit exists only to prevent simple denial-of-service attacks. The default value of NR_FILE*2 is adjusted depending on the memory in the system. If this number is exceeded, the socket is closed and a warning is printed.
This parameter rules how many connections can remain in TIME_WAIT state concurrently: the kernel will simply kill connections hanging in such state beyond that number. For example, continuing with the previous scenario where the client server was configured with a port range of only 6 ports, if I set /proc/sys/net/ipv4/tcp_max_tw_buckets to 5, open 6 concurrent connections with MySQL and then immediately close all 6, I’ll find only 5 of them hanging in TIME_WAIT state – as with tcp_tw_recycle, one of them simply disappears from netstat. This allows me to immediately open a new connection without having to wait a minute. However, I won’t be able to open a second one until one of the other 5 connections in TIME_WAIT expires and frees the local port it was using. The secret here, then, is to find a compromise between the number of available network ports and the number of connections we allow to remain in TIME_WAIT state. The default value of this setting is 65536, which means that by default the system allows all closed connections to go through the TIME_WAIT state.
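For reference, the value used in that experiment could be applied as follows; 5 is absurdly low and only useful for demonstration, so restore the default afterwards:

# for the demonstration only
$ sysctl -w net.ipv4.tcp_max_tw_buckets=5
# restore the default mentioned above when done
$ sysctl -w net.ipv4.tcp_max_tw_buckets=65536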
If “LAMP” server, use local socket
Finally, it’s important to mention that if your application is hosted on the same server as the database you may (should?) open connections using MySQL’s socket file directly, without going over the network, and thus avoid this port range/TIME_WAIT problem altogether. Interestingly, you can accomplish this in one of two ways: by specifying the socket with the --socket option or by using --host=localhost. As MySQL’s manual mentions:
On Unix, MySQL programs treat the host name localhost specially, in a way that is likely different from what you expect compared to other network-based programs. For connections to localhost, MySQL programs attempt to connect to the local server by using a Unix socket file.
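In practice, either of the following avoids TCP (and thus the ephemeral port range) entirely; the socket path below is a common default but may differ on your distribution, so check the socket setting in your my.cnf:

$ mysql --socket=/var/run/mysqld/mysqld.sock -u app -p
$ mysql --host=localhost -u app -p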
Clarification
At the risk of overstating the point, to avoid any confusion: the error described in this post and all the situations surrounding it are to be found on the computer that initiates the connections (usually the application server), not on the server hosting the services. Whenever a client closes the connection, while it shows as hanging in TIME_WAIT state on the client, there won’t be any remaining trace of the connection showing in netstat on the server side. However, if for some reason it is the server that closes the connection first, “it gets the TIME-WAIT state while the client will consider the corresponding quadruplet free and hence may reuse it for a new connection.“