Detecting dead Oracle database connections

In my 20+ years of working with Oracle databases, I have heard the same question asked many times: “When does the Oracle database clean up dead client connections?” In every case I can remember, I always heard the same answer.

“Oh sure, the database will clean up dead (or abandoned) connections…sometime.” For many years, that answer was good enough. Recently, a customer who wanted a real answer asked me the same question. Since I didn’t have the information he was looking for, I decided to research this myself.

Parameters

The first place I started looking at was the Oracle documentation on SQL*NET and the sqlnet.ora parameters. There is a parameter you can set in this file that seems like it would do the trick: SQLNET_EXPIRE_TIME. According to the documentation, this setting will tell the database to expire a dead connection after the specified number of minutes. This looked like a slam dunk, so I set up a test environment consisting of a database server (11.2.0.2 running on OEL 5.5) and an 11.2.0.2 client on a separate machine. I made sure to set the SQLNET_EXPIRE_TIME=1 on the server-side before I began testing.

Testing environment

My test consisted of opening an SQL*Plus session on the client, connecting to my test database, and then pulling the network cable out of the client. Using the V$SESSION view, I could see that my connection was still being managed in the database. Its status was INACTIVE, but it was still there. After one minute, I refreshed my V$SESSION query, expecting to see my connection gone, but my test failed. The connection was still there in an INACTIVE state, so I was fairly confident that the SQLNET_EXPIRE_TIME setting either had no effect or not the effect I was expecting.

After digging through the documentation with a little more focus, I noticed that the TCPKeepAlive operating system parameter kept showing up. It was not described as being in control of Dead Client Detection, but I went with that thought and started digging into the TCP settings on OEL 5.5. What I found was that there are three TCP parameters that have a lot to do with how Oracle handles sessions:

tcp_keepalive_time:

The default setting is 7200 seconds (two hours). This is the length of the interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe. After the connection is marked to need keepalive, this counter is not used any further.

tcp_keepalive_intvl:

The default setting is 75 seconds. This is the length of the interval between subsequent keepalive probes, regardless of what the connection has exchanged in the meantime.

tcp_keepalive_probes:

The Default setting is nine probes. This is the number of unacknowledged probes to send before considering the connection dead and notifying the application layer.

The TCP parameters explained:

If you assume that these settings have something to do with Oracle’s dead client detection, it looks like it will be at least two hours before either the operating system or Oracle will consider this connection dead. In fact, it appears that it will be two hours before the operating system sends the first TCPKeepAlive packet. If that packet sends fails, it will wait another 75 seconds before it sends another “probe,” which it will do nine times. If you add all this up, you get the following:

tcp_keepalive_time + (tcp_keepalive_intvl * tcp_keepalive_probes)

7200 + (75*9) = 7875 seconds

That’s 2.18 hours before the operating system declares a TCP connection dead. So I went to lunch, and after I came back, I continued to refresh my V$SESSION query. Sure enough, in a little over two hours, my “dead” connection disappeared from V$SESSION. That’s important: it disappeared. It was not marked as KILLED or SNIPED, it just went away.

Re-testing

This led me to believe that Oracle has little to do with dead client detection and that it just relies on the operating system’s TCP implementation. The only way to know for sure was to change the TCP settings on the server and re-test, so that’s what I did. I removed the SQLNET_EXPIRE_TIME from the sqlnet.ora file and set the server’s TCP KeepAlive parameters using the following commands:

echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 300 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 2 > /proc/sys/net/ipv4/tcp_keepalive_probes

My expectation with these settings was that the first KeepAlive packet would send after 600 seconds (10 minutes) and the interval between KeepAlives would be 300 seconds (five minutes). I set the interval to five minutes. That way, I could eventually set the SQLNET_EXPIRE_TIME to the minimum value, which is 1 minute, and then I could see if the SQLNET setting affected the process. Setting the “probes” parameter to two meant that the database should mark my connection as dead fairly quickly:

tcp_keepalive_time + (tcp_keepalive_intvl * tcp_keepalive_probes)

600 + (300*2) = 1200 seconds

To prove my theory, I started by bringing up an SQL*Plus connection on my client machine and then immediately disconnecting the client’s network cable. Why disconnect the network immediately? My theory was that the server’s TCP implementation would start the TCPKeepAlive counter as soon as a connection was made and that the counter would reset following any data transmission. So I pulled the network connection. This disallowed data transmission from the client so that I could start a stopwatch to record the duration of my Oracle session.

What my test showed was what I expected—my Oracle session lasted about 1200 seconds, or 15 minutes, consistently over five iterations. How did I prove that? I created a small stored procedure that would query V$SESSION every second and output my connection’s status, the start time of the session, and the current time. I decided to write the output to a file so I could see what was happening using “tail–f.” Although the precision of my test is one second at best, it was accurate enough to prove my theory.

The code and output examples:

create or replace procedure testConn (p_username varchar2

, p_timer integer default 1000) as

  v_output   utl_file.file_type;

  v_start    varchar2(8) := to_char(sysdate,'hh:mi:ss');

  v_timer    integer := 0;

  v_status   varchar2(10);

  v_time     varchar2(8);

begin

  v_output := utl_file.fopen( 'OCI', 'testconn.out', 'w' );

while v_timer < p_timer loop

      dbms_lock.sleep(1);

      v_timer := v_timer + 1;

      begin

        select status

              ,to_char(sysdate,'hh:mi:ss')

        into   v_status

              ,v_time

        from   v$session

   where  username = p_username;

      exception when no_data_found then

        exit;

      end;

      utl_file.put(v_output, p_username||' - '||

                   v_status||' - '||

                   v_start||' - '||

                   v_time);

      utl_file.new_line(v_output);

      utl_file.fflush(v_output);

  end loop;

  utl_file.fclose( v_output );

end;

The output generated by this code (in abbreviated form below) shows that this connection lasted just over 15 minutes, as expected.

TFOX - INACTIVE - 10:54:15 - 10:54:16

TFOX - INACTIVE - 10:54:15 - 10:54:17

TFOX - INACTIVE - 10:54:15 - 10:54:18

TFOX - INACTIVE - 10:54:15 - 10:54:19

TFOX - INACTIVE - 10:54:15 - 10:54:20

…

TFOX - INACTIVE - 10:54:15 - 11:09:28 <- Last time when the TFOX session existed

This proved that the TCP settings at the server were definitely in control of the duration of a dead client’s database session. To test whether or not the sqlnet.ora file settings had any effect on dead client detection, I re-ran the same test three more times, each with one parameter set in sqlnet.ora. I chose the parameters based on the documentation’s description of their usage. Only three parameters (shown below) appeared to be related to connection timeouts.

Test 2: SQLNET_EXPIRE_TIME=1

Test 3: SQLNET_SEND_TIMEOUT=1

Test 4: SQLNET_RECV_TIMEOUT=1

Conclusions

In my three subsequent tests, I used the same procedure to simulate a dead client and in all three cases, the duration of the client session was nearly identical—about 15 minutes.

After executing this battery of tests, I concluded that the server’s TCP KeepAlive settings play a significant role in determining when Oracle removes a dead client from the database. I can only say “significant” because I did not test every combination of SQL*Net settings.

What I learned was that, surprisingly, most Linux environments don’t start looking for dead client connections for at least two hours. This may not be significant in a tightly controlled environment. However, it could certainly become a problem in environments where connection abandonment is common and the database doesn’t use connection pools. The problem is that a shadow process maintains every abandoned or dead connection, which uses CPU and memory on the database server. It is very possible for the number of client connections to exceed the PROCESSES setting in the database, causing it to reject connections. Worse yet, the database server could become memory-starved if the number of sessions becomes excessive.

To help to control the number of sessions on your database, start by monitoring for INACTIVE connections and tuning the TCPKeepAlive settings appropriately for your environment. Using an enterprise-class connection pool will help limit the overall number of sessions against the database, but you also need to configure it appropriately.

Tim Fox, Infolob Solutions’ chief technologist, provided this article. You can reach him with questions or comments at tim.fox@infolob.com or 214-621-9595.