The Case of the Delinquent Node.js Service and How We Solved It

Written by on April 12, 2017


This is the story about how we analyzed and fixed an unstable Node.js service in our infrastructure.

Background into the Node.js issue

A few months ago, we decided to have a scheduled downtime and migrate all our existing services from EC2 Classic to EC2 VPC for security, stability and control. We’d been systematically planning this migration two months before D-day — with over a 100 services that were making a switch from Classic to VPC, we didn’t want to leave any room for glitches. We migrated each service to VPC, tested each of them for stability and selectively routed some of our production traffic to these new instances — monitoring the system stability and metrics throughout.

But after few hours, we observed that a specific feature (which was dependent on this service that handled long-lived WebSocket connections and was hosted behind ELB and nginx) was intermittently slow or unavailable.


We grappled with this service that was acting up — restarting the service seemed to fix it but only to have more issues later.

Digging deeper

The ELB CloudWatch Metrics was our starting point.

Clearly, there were spikes in the number of TCP connections each time the service failed. These plateaued out each time we restarted the service, only to work up a frenzy of new TCP connections after a short while.

It had to be the network right? And, so we scoured through all our network settings and group policies but couldn’t build a case against it. They seemed to be standard and nothing seemed out of place.

We ran the tcpaccept tool from the BCC/eBPF suite in one of our Node.js instance to see at what rate the connections were being accepted.

[root@nodejs-app5 ~]# /usr/share/bcc/tools/tcpaccept -t
4194.519 24848 node 6 ::ffff: ::ffff: 81
4194.528 24848 node 6 ::ffff: ::ffff: 81
4194.529 24848 node 6 ::ffff: ::ffff: 81
4194.535 24848 node 6 ::ffff: ::ffff: 8
4194.539 24848 node 6 ::ffff: ::ffff: 81
4194.541 24848 node 6 ::ffff: ::ffff: 81
4194.548 24848 node 6 ::ffff: ::ffff: 81
4194.553 24848 node 6 ::ffff: ::ffff: 81
4194.625 24848 node 6 ::ffff: ::ffff: 81
4194.625 24848 node 6 ::ffff: ::ffff: 81
4196.074 9706 node 6 ::ffff: ::ffff: 81
4196.075 9706 node 6 ::ffff: ::ffff: 81
4196.499 9706 node 6 ::ffff: ::ffff: 81
4196.511 9706 node 6 ::ffff: ::ffff: 81
4196.514 9706 node 6 ::ffff: ::ffff: 81
4196.515 9706 node 6 ::ffff: ::ffff: 81
4196.516 9706 node 6 ::ffff: ::ffff: 81
4196.785 9706 node 6 ::ffff: ::ffff: 81
4197.837 1763 sshd 4 22
4198.073 9706 node 6 ::ffff: ::ffff: 81

The connection acceptance rates looked normal (remember, we were only redirecting small subset of our traffic) and didn’t show anything unusual. Sigh.

Still unsure about where to start looking, we started with the obvious — doing packet capture because the connection count kept going up. We waited for the issue to resurface (after a restart), connected it to one of the machines that hosted the nginx service and ran tshark to capture the packets.

[root@nginx1 ~]# tshark -f ‘host and (tcp[tcpflags] & (tcp-rst|tcp-fin) != 0)’
36.348790 -> TCP 66 81 > 10954 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899165 TSecr=906255810
36.369605 -> TCP 66 81 > 10955 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899170 TSecr=906255815
36.372122 -> TCP 66 81 > 10956 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899171 TSecr=906255816
36.399220 -> TCP 66 81 > 10957 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899178 TSecr=906255822
36.408761 -> TCP 66 81 > 10958 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899180 TSecr=906255825
36.416402 -> TCP 66 81 > 10959 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899182 TSecr=906255827
36.419439 -> TCP 66 81 > 10960 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899183 TSecr=906255827
36.457502 -> TCP 66 81 > 10961 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899192 TSecr=906255837
36.458672 -> TCP 66 81 > 10962 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899192 TSecr=906255837
36.459733 -> TCP 66 81 > 10963 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899193 TSecr=906255837
36.463407 -> TCP 66 81 > 10964 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899194 TSecr=906255838
36.485216 -> TCP 66 81 > 10965 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899199 TSecr=906255844
36.487000 -> TCP 66 81 > 10966 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899199 TSecr=906255844
36.491632 -> TCP 66 81 > 10967 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899201 TSecr=906255845
36.499982 -> TCP 66 81 > 10941 [FIN, ACK] Seq=1 Ack=1 Win=16 Len=0 TSval=3899203 TSecr=906255847
36.500013 -> TCP 66 10941 > 81 [FIN, ACK] Seq=1 Ack=2 Win=16 Len=0 TSval=906255848 TSecr=3899203
36.526724 -> TCP 66 81 > 10969 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899209 TSecr=906255854
36.555598 -> TCP 66 81 > 10970 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899217 TSecr=906255860
36.559707 -> TCP 66 81 > 10971 [RST, ACK] Seq=1 Ack=1 Win=10 Len=0 TSval=3899218 TSecr=906255862

Wading through the captured packets, we figured the TCP connections were being reset by the remote app machine and new connections were being established. Repeatedly.

So it looked like there was something wrong with our machine setup and/or our app (as it always turns out to be).

To see why these supposedly long-lived TCP (Websocket) connections were being closed, we ran strace on the Node process.

[root@nodejs-app5 ~]# strace -ff -p `pidof node`
clock_gettime(CLOCK_MONOTONIC, {8666, 663543136}) = 0
clock_gettime(CLOCK_MONOTONIC, {8666, 663607713}) = 0
epoll_wait(5, 7ffd00abd8d0, 1024, 11) = -1 EINTR (Interrupted system call)
— — SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=11784, si_uid=0} — –
tgkill(2100, 2100, SIGTERM) = 0
rt_sigreturn() = -1 EINTR (Interrupted system call)
— — SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=2100, si_uid=0} — –
+++ killed by SIGTERM +++

Surprisingly, the Node.js app was actually being killed and getting restarted automatically since we had Monit watching it over. Now we knew why the TCP connections were being reset.

A closer look at the tcpaccept output and you’ll notice that the PID of the process accepting connections kept changing — definitely something that we’d overlooked earlier.

The hunch: An ominous SIGTERM

At this point, we had a lot of questions!

What was killing the Node.js process?

We looked at dmesg and our memory usage seemed normal (besides we have Nagios with check_oomkiller, to notify us should OOM killer be on the rampage), so it couldn’t be OOM killer.

Was the app doing something unexpected?

To find out, we attached a GDB to the delinquent process, set breakpoints on all the exit syscalls and then waited.

[root@nodejs-app5 ~]# gdb -ex ‘catch syscall exit exit_group’ -ex ‘break exit’ -ex ‘continue’ -p `pidof node`
Loaded symbols for /lib64/
0x00007fbc8c8f6419 in syscall () from /lib64/
Catchpoint 1 (syscalls ‘exit’ [60] ‘exit_group’ [231])
Breakpoint 2 at 0x7fbc8c83dea0
Program received signal SIGTERM, Terminated.
0x00007fbc8c8f6419 in syscall () from /lib64/

Our hunch was confirmed when the app itself didn’t crash or call exit but was killed by a SIGTERM.

Looking around to see where the signal was coming from, we started stracing the Monit process to see what it does. Since we know Monit was watching over the Node.js process and could kill/restart it, should there be any problems with the process.

[root@nodejs-app5 ~]# strace -ff -p `pidof monit`
connect(4, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr(“”)}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLIN|POLLOUT}], 1, 10000) = 1 ([{fd=4, revents=POLLIN|POLLOUT}])
getsockopt(4, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, “”, 1024) = 0
write(4, “GET / HTTP/1.1\r\nHost:”…, 95) = 95
read(4, “”, 1024) = 0
write(2, “HTTP: error receiving data — Su”…, 38) = 38
sendto(3, “<11>Nov 15 10:08:01 monit[8186]:”…, 71, MSG_NOSIGNAL, NULL, 0) = 71
shutdown(4, SHUT_RDWR) = -1 ENOTCONN (Transport endpoint is not connected)
close(4) = 0

From the trace, we understood that Monit tries to connect to the Node.js process but fails. Following this, it sends a SIGTERM and restarts the app.

check host node_web_app_0 with address
GROUP node_web_app
start program = “/bin/sh -c ‘cmd_to_start_the_service’”
stop program = “cmd_to_stop_the_service”
if failed port 81 protocol HTTP
request /
with timeout 10 seconds
then restart

We then discovered that Monit upon trying to check the Node.js app’s health status (by running a “http GET of /”), fails and kills/restarts the app.

Here’s a closer look at the at the strace on the Node process to see network activity,

[root@nodejs-app5 ~]# strace -ff -e trace=network -p `pidof node`
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EMFILE (Too many open files)
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 9
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EMFILE (Too many open files)
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = 9
[pid 15505] accept4(14, 0, NULL, SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable

With an EMFILE error popping up often, we wondered if the app repeatedly reached its max file descriptor limit configured in limits.conf and was then prompted by Monit to restart because it couldn’t reach the Node.js process.

[root@nodejs-app5 ~]# cat /etc/security/limits.conf | grep -i nofile
# — nofile — max number of open files
root hard nofile 500000
root soft nofile 500000
* hard nofile 500000
* soft nofile 500000

So, we decided to get to the bottom of things and

[root@nodejs-app5 ~]# ls -l /proc/`pidof node`/fd | wc -l

it looked like the limit kept falling back to 1024 despite us setting it to a much higher value.

[root@nodejs-app5 ~]# grep ‘open files’ /proc/`pidof node`/limits
Max open files 1024 4096 files

Connecting to another machine which has just been started, we verified that the Node.js process has its limits set to what we have in limits.conf.

Why did the limits fall back to these low values? We know for sure that these values are inherited from the parent process.

Each time we connected to a machine that had just started, our Chef OpsWorks-based infrastructure set the right limits for the Node.js process. It’s when Monit takes over and restarts the process that something goes wrong.

The Monit process’s limits matched the low values in the Node.js app. We wondered how.

Started from the bottom, now we’re here

Once the kernel was done initializing, it starts the first user space program, init. All the hard-coded values for various limits were still intact — file -nr with the open file descriptor limit at soft=1024 and hard=4096. On cue, all our programs configured to start during system startup, would now go live.

So all these system services/daemons inherited similar rlimit values but this was only applied to the log-in shell — in effect, just the programs started by the interactive shell would reflect these higher values.

In case you lost us somewhere, here’s a quick rundown,

  • Our app started with high rlimit values (as set by OpsWorks/the Chef script).
  • A few hours in, our app crashed due to an unrelated bug (this, we narrowed down to an internal call that was introduced only in this new setup).
  • Monit on detecting the app crash would rush in, only to restart the app at lower rlimit values (similar to its own).
  • After the app reached 1024 connections, it quashed incoming connections.
  • Monit on its patrol (http GET of /), fails to receive the app’s health status and restarts it, resetting all connections.
  • From the browser angle, since the connections (which uses are closed, it would start a re-connection logic. With each connection running this logic, only a few would succeed while the rest succumb (reaching the 1024 limit) to a restart.

Out of the multiple ways to fix this, we ended up updating our Monit script to set higher file limits. 🙂

check host node_web_app_0 with address
GROUP node_web_app
start program = “/bin/sh -c ‘ulimit -n 100000; cmd_to_start_the_service’”
stop program = “cmd_to_stop_the_service”
if failed port 81 protocol HTTP
request /
with timeout 10 seconds
then restart

If you found this useful, subscribe to our newsletter for more awesome content!


About The Authors

Suman Kumar Dey is Technical Lead — DevOps, Freshdesk and Suresh Kumar is a Technical Architect with Freshdesk.

Alina Benny is a Corporate Marketing Associate with Freshdesk. Say hello to her on Twitter and LinkedIn.


Originally published on Hacker Noon.

Subscribe for blog updates