Monday, December 12, 2011

ps returns incorrect etime

I run ps -Ao pid,pcpu,rss,etime,args to check for long-running processes. If the etime (elapsed time) of a process is greater than, say, 10 hours, I kill the process. Lately I have been seeing valid processes getting killed. I noticed in the logs that etime was returning 49710-06:28:15, which is 4294967295 seconds, or 2^32-1. Any time I see one of these magic numbers, 2^N or 2^N-1, I know something weird is going on. Turns out I was right. The procps fix states: "the ps utility's "etime" field shows the elapsed time since a process was started. On heavily-loaded systems, it was possible for this value to return negative due to an integer overflow." A value of -1 read as an unsigned 32-bit integer is exactly 2^32-1, which matches what I saw.
I didn't update procps; instead I fixed my Python script.
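
A minimal sketch of such a guard might look like the following. It is not the original script: the names parse_etime and should_kill and both thresholds are hypothetical, chosen only for illustration. It assumes etime arrives in the usual [[dd-]hh:]mm:ss form and treats any implausibly large value as a likely overflow rather than a real age.

```python
# Hypothetical thresholds, not taken from the original script.
MAX_AGE_SECONDS = 10 * 3600   # processes older than 10 hours get killed
SUSPECT_SECONDS = 2 ** 31     # roughly 68 years; assumed to be an overflow


def parse_etime(etime):
    """Convert a ps etime string ([[dd-]hh:]mm:ss) to seconds."""
    # Split off the optional "dd-" day prefix.
    days, _, clock = etime.strip().rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    # Pad missing leading fields (hours) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def should_kill(etime):
    """Return True only for a plausible elapsed time above the limit."""
    elapsed = parse_etime(etime)
    if elapsed >= SUSPECT_SECONDS:
        # Implausible value such as 49710-06:28:15; skip instead of killing.
        return False
    return elapsed > MAX_AGE_SECONDS
```

With this check, should_kill("49710-06:28:15") comes back False, while an ordinary 11-hour process ("11:00:00") is still flagged. Comparing against the system uptime would be a stricter sanity test, but a fixed ceiling keeps the sketch simple.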

1 comment:

Unknown said...

We run the exact same thing to deal with long-running encoding processes, so I went ahead and implemented a preemptive fix as well : )
Thanks for this!