Torsten Förtsch
IT System Development & Security

Measuring memory consumption

When working with mod_perl, mod_php or other scripting languages embedded in the Apache httpd, you have to pay attention to the memory consumption of the whole system. Every now and then people complain on various mailing lists and forums about a WEB server that works for hours, days or weeks and then suddenly becomes inaccessible. Even SSH or a login on the console times out or takes almost forever. After an httpd restart everything works as usual again. These are the symptoms one encounters if the httpd working set does not fit into the available RAM. In other words, memory consumption is too high.

But how do I measure and possibly predict the RAM required for my working set?

I am talking about Apache httpd on Linux here. Other systems may or may not have similar facilities.

The wrong way

Some administrators look at the output of the top command and add up either the VIRT or the RES column. Currently 10 instances of httpd are running on my notebook.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8937 r2        20   0  320m  28m 3772 S    0  0.7   0:00.74 /opt/apache/sbin/httpd -k start
 8934 r2        20   0  320m  28m 3756 S    0  0.7   0:01.37 /opt/apache/sbin/httpd -k start
 5811 r2        20   0  319m  27m 3396 S    0  0.7   0:00.66 /opt/apache/sbin/httpd -k start
 5444 root      20   0  315m  25m 5536 S    0  0.7   0:00.91 /opt/apache/sbin/httpd -k start
 8936 r2        20   0  317m  25m 3708 S    0  0.6   0:00.08 /opt/apache/sbin/httpd -k start
 5812 r2        20   0  317m  24m 3616 S    0  0.6   0:00.19 /opt/apache/sbin/httpd -k start
 8100 r2        20   0  317m  24m 3628 S    0  0.6   0:00.17 /opt/apache/sbin/httpd -k start
 8099 r2        20   0  317m  24m 3588 S    0  0.6   0:00.06 /opt/apache/sbin/httpd -k start
 5810 r2        20   0  316m  24m 3608 S    0  0.6   0:00.06 /opt/apache/sbin/httpd -k start
 8093 r2        20   0  316m  24m 3516 S    0  0.6   0:00.04 /opt/apache/sbin/httpd -k start

Do they really consume about 3.1 GB (the sum of the VIRT column)? Or 253 MB (the sum of the RES column)? Neither of these figures is true!
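For reference, the naive calculation described above can be sketched in a few lines of Perl. The helper names to_kb and naive_sums are made up for this illustration, and the sketch assumes the column layout of the listing above (VIRT in column 5, RES in column 6, bare numbers in kB). It only reproduces the misleading sums; it does not measure anything useful.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a top-style size ("320m", "3772", "1g") to kB.
# Assumption: a bare number is already in kB.
sub to_kb {
  my ($v)=@_;
  my %mult=(k=>1, m=>1024, g=>1024*1024);
  return $v=~/^([\d.]+)([kmg])$/i ? $1*$mult{lc $2} : $v;
}

# Add up the VIRT and RES columns of all httpd lines (result in kB).
sub naive_sums {
  my ($virt, $res)=(0, 0);
  for my $line (@_) {
    my @f=split ' ', $line;
    next unless @f>=12 and $f[0]=~/^\d+$/ and $f[11]=~m!(?:^|/)httpd$!;
    $virt+=to_kb($f[4]);
    $res+=to_kb($f[5]);
  }
  return ($virt, $res);
}

# The 10 httpd lines from the listing above.
my @top=split /\n/, <<'EOF';
 8937 r2        20   0  320m  28m 3772 S    0  0.7   0:00.74 /opt/apache/sbin/httpd -k start
 8934 r2        20   0  320m  28m 3756 S    0  0.7   0:01.37 /opt/apache/sbin/httpd -k start
 5811 r2        20   0  319m  27m 3396 S    0  0.7   0:00.66 /opt/apache/sbin/httpd -k start
 5444 root      20   0  315m  25m 5536 S    0  0.7   0:00.91 /opt/apache/sbin/httpd -k start
 8936 r2        20   0  317m  25m 3708 S    0  0.6   0:00.08 /opt/apache/sbin/httpd -k start
 5812 r2        20   0  317m  24m 3616 S    0  0.6   0:00.19 /opt/apache/sbin/httpd -k start
 8100 r2        20   0  317m  24m 3628 S    0  0.6   0:00.17 /opt/apache/sbin/httpd -k start
 8099 r2        20   0  317m  24m 3588 S    0  0.6   0:00.06 /opt/apache/sbin/httpd -k start
 5810 r2        20   0  316m  24m 3608 S    0  0.6   0:00.06 /opt/apache/sbin/httpd -k start
 8093 r2        20   0  316m  24m 3516 S    0  0.6   0:00.04 /opt/apache/sbin/httpd -k start
EOF

my ($virt, $res)=naive_sums(@top);
printf "sum of VIRT: %d MB, sum of RES: %d MB\n", $virt/1024, $res/1024;
```

For the listing above this prints 3174 MB for VIRT and 253 MB for RES. Both sums count shared pages once per process, which is why neither says much about real consumption.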

Understanding memory

Memory management in modern operating systems is really complex. I will try to dive into it deep enough to make the key points comprehensible. Bear with (and correct) me if I fail.

The address space of a process – types of memory segments

At the lowest level a program running on a computer is just a bunch of machine instructions plus some bits of data. In principle data and commands can be intermixed at will. The CPU can be instructed to read a command at address x. This command may say read the data at x+1 and at x+2, add them up and store the result at x. That means after the CPU has executed the command the instruction itself is purged from memory. Where it had been is now the sum.

This kind of self-modifying program can lead to very compact code for a task, but it certainly leads to unmaintainable code. Soon even a novice programmer will learn to separate the algorithmic part of the program from the data. Then he will fetch the 2 summands from somewhere else and store the result near them.

From that thought a new quality emerges. The program is divided into 2 segments, one that holds the algorithmic part and one for the data. And they are different in certain aspects. The algorithmic part (from now on called text segment) must not be changed by the program. Now place the program in the environment of an operating system where perhaps multiple instances of the same program run in parallel and another idea suggests itself. The code segment can be shared between all instances since it is read-only.

There are many more segment types that a program is made up of. Normally they are created as part of a file when a program is compiled and linked. The objdump or readelf utilities can give you lists of those segments.

From a general point of view there are 5 (or 6) important types. Some of them are created at compile time, some only at run time.

Additionally a program can at run time allocate further segments and give them certain characteristics such as read/write access, the ability to contain and execute machine instructions, shareability etc.

Maybe this is already too much detail. The main idea is that a program consists of various segments with different attributes.

To complicate things – shared libraries

Programmers are normally lazy folks. They try to invent the wheel only once. That's why program libraries exist. Now, a library function also consists of an algorithmic part and some data. The same ideas apply as before. A shared library is kind of a program module that exists in the memory only once (at least the text segment) but is used by many different programs at the same time. Shared libraries are shared not only by all instances of the same program but by all programs in the system. If all instances of the same program are deleted from memory it does not mean a certain shared library goes away.

The process

Now, let's introduce an important term, the process. So far I have used the phrase instance of a program. What I meant was process. A process is an entity that can be identified by a process ID. It comprises a certain amount of memory and one or more threads of execution, besides other resources. A process has a contiguous virtual address space that is normally not completely accessible (meaning it has holes).

The term virtual address space means a process can read memory at address, say, 4712 and get 17 while another process at the same time can read the same address and get 22. Address spaces of different processes are separated. This is achieved by splitting memory into so-called pages and by a big mapping table that belongs to the data structure the operating system maintains for the process. When the process is assigned a CPU this table is loaded into the CPU. The process then accesses address 4712. The CPU splits this address into a page address (4096) and an offset within the page (616). Then it looks up the page in the mapping table, finds the real starting address, adds the offset and now it can access the memory. This is all done in hardware and you have to care only if you want to write an operating system. With modern CPUs it is still a bit more complex.
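The address arithmetic can be illustrated with a toy model in Perl. This is only a sketch: the 4 kB page size is an assumption, the two page tables and their physical addresses are invented, and the names split_address and translate are hypothetical. A real MMU does this in hardware with multi-level tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $PAGE_SIZE=4096;   # assumption: 4 kB pages

# Split a virtual address into page start address and offset within the page.
sub split_address {
  my ($addr)=@_;
  my $offset=$addr % $PAGE_SIZE;
  return ($addr-$offset, $offset);
}

# Look up the page in a (toy) per-process page table and add the offset.
# Returns undef if the page is not mapped, that is, the address lies in a hole.
sub translate {
  my ($page_table, $addr)=@_;
  my ($page, $offset)=split_address($addr);
  my $frame=$page_table->{$page};
  return defined $frame ? $frame+$offset : undef;
}

# Invented page tables: the same virtual page maps to different real memory.
my %proc_a=(4096=>40960);
my %proc_b=(4096=>81920);

printf "page %d, offset %d\n", split_address(4712);
printf "process A: %d, process B: %d\n",
       translate(\%proc_a, 4712), translate(\%proc_b, 4712);
```

Both toy processes use the virtual address 4712, but the lookup leads to different real addresses, which is why they can read different values there.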

So, what do we have so far? We have a segmented virtual address space for each process. It is divided into pages. Each segment consists of a whole number of pages. These pages are mapped into real memory via a process-specific page table. The next picture shows this in a very simplified way. There are 3 processes. Two of them run /bin/bash, one /bin/cat. All 3 share the text segment of the libc shared library. But each has its own heap. The 2 bash processes share their text segment.

[Figure: address spaces]

In reality the page mapping is done on a page-by-page basis rather than on whole segments. For example this means that processes can partly share their data segments even if they are in principle not shareable and each process can write its data segments at will.

Process creation – copy-on-write

On Linux a new process is created as a child of another process by fork or clone. Now, what does this syscall do regarding memory? It has to create the virtual address space for the new process. Hence it has to create a new page table. But does it have to copy the memory itself? Or at least the non-shared parts? No, here a trick called copy-on-write is used. Each page can be made read-only. If the CPU (or rather the memory management unit within the CPU) encounters a write access to such a page, an interrupt is generated. That means the operating system can intervene at this point. So, what copy-on-write means is that at first the whole address space of both the parent and the child process is made read-only. Of course, soon after that one of the processes wants to write to its memory. This causes a page fault interrupt. Now the operating system can decide whether that write access was really illegal or whether it should rather be allowed. In the latter case the affected page is duplicated, one copy for each process.
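The visible effect of this mechanism can be tried out from Perl. The following is a minimal sketch, not a measurement: the function name fork_and_write is made up, and the script only demonstrates the semantics (a write in the child does not show up in the parent). The page duplication itself happens inside the kernel and is not observable this way.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Fork a child that overwrites its copy of $data and reports the new
# value through a pipe. Returns (value seen by parent, value sent by child).
sub fork_and_write {
  my ($data)=@_;
  pipe(my $rd, my $wr) or die "Cannot pipe: $!\n";
  my $pid=fork;
  die "Cannot fork: $!\n" unless defined $pid;
  if( $pid==0 ) {                 # child
    close $rd;
    $data='changed by child';     # write access: the touched page gets copied
    print $wr $data;
    close $wr;                    # flushes the pipe
    POSIX::_exit(0);              # leave without flushing inherited buffers
  }
  close $wr;                      # parent
  my $from_child=do { local $/; <$rd> };
  close $rd;
  waitpid $pid, 0;
  return ($data, $from_child);
}

my ($parent, $child)=fork_and_write('original');
print "parent still sees: $parent\n";
print "child wrote:       $child\n";
```

After the fork both processes start with the same pages. Only the child's write forces a private copy, so the parent keeps seeing its original value.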

So, assuming process 1045 in the picture above is a child of process 1044 they can share part of their memory:

[Figure: copy-on-write]

So, back to the original question. Is it now understandable that a tool like top cannot answer the question how much memory my WEB server needs? With all those levels of memory sharing even the question how much memory one particular process consumes does not make much sense.

Memory for a WEB server

Why not try another route? First, let's make a few assumptions. The Apache WEB server we are talking about uses the prefork multi-processing module. This means the WEB server is started as one process which then forks off a certain number of worker children, so each child shares quite a big part of its memory with the parent. The parent process then watches the number of children and the number of parallel requests. It stops some children if they are idle for too long, or creates new ones if the number of children ready to serve drops too low. It tries to adapt the number of children to the current workload.

So, let's find out by how much the overall used memory in the system increases for each new worker child.

Assume that all worker children do more or less the same thing during a request cycle. They may ship different documents, but the memory requirements of doing so do not vary much between documents. I know it makes a difference whether Apache ships static documents via its default handler or creates them on demand from the content of a database or the like. We simply take the worst case.

With these assumptions the needed memory can be modeled as a linear function:

mem = nproc * A + B

nproc is the maximum number of Apache children, A is the amount of memory each process needs, and B is the memory needed for the parent process plus the memory shared among the parent and the children. If the machine acts only or mostly as a WEB server, then B may also include the memory needed by all the other processes that normally run in such a setup, like syslog, ssh, various monitoring tools etc.
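Once A and B are known, the model can be solved for nproc to get an upper bound for the number of children that fit into a given amount of memory. A small sketch, with a made-up function name max_children and purely hypothetical numbers (2 GB of memory, A = 4000 kB, B = 120000 kB):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(floor);

# Solve mem = nproc * A + B for nproc and round down.
# All arguments are in kB.
sub max_children {
  my ($mem_kb, $a_kb, $b_kb)=@_;
  return 0 if $a_kb<=0 or $mem_kb<=$b_kb;
  return floor(($mem_kb-$b_kb)/$a_kb);
}

# Hypothetical figures, not measured values.
printf "at most %d children\n", max_children(2*1024*1024, 4000, 120000);
```

With these invented figures the result is 494. In practice one would stay well below such a bound, because the model ignores everything else that may need memory on the machine.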

The test setup

Now, measuring A becomes quite simple. The basic idea is to start with a few worker children serving the worst document or documents, have the number of parallel requests slowly grow and watch how the amount of free memory goes down.

Here is an example configuration. It starts with 2 worker children. The children count can grow up to 1024. Keep-alive is turned on with a quite high timeout. (Don't use that in production!) This allows a client to keep a worker child busy from the point of view of the apache parent process. But a worker in the keep-alive state does not consume CPU cycles. So it's easier to reach higher worker counts.

StartServers        2
MinSpareServers     2
MaxSpareServers     1024
ServerLimit         1024
MaxClients          1024
KeepAlive           On
KeepAliveTimeout    300

Measuring memory consumption and the number of processes

There are various tools to get the current memory status or information about processes. vmstat almost does what I want but it does not report the current process count. top may be a candidate but parsing its output ... So, I put together a simple script that does exactly what I need at a very low price tag (in CPU cycles):

#!/usr/bin/perl

use strict;

# 1) count the invariant difference between nlink(/proc) and the process number
#    due to race conditions this is done 10 times and the result is rounded.

my $nlink_base=0;
my $times=10;

opendir my $proc, '/proc' or die "Cannot opendir /proc: $!\n";
for( 1..$times ) {
  seekdir $proc, 0 or die "Cannot seekdir /proc: $!\n";
  my $p=0;
  while( my $e=readdir $proc ) {
    if ($e=~/^\d+$/ and readlink("/proc/$e/exe")=~m!/httpd$!) {
      $p++;
    }
  }
  $nlink_base+=(stat "/proc")[3]-$p;
}
$nlink_base=sprintf "%.0f", $nlink_base/$times;

# 2) enter main loop

$|=1;

$ARGV[0]=.5 unless( $ARGV[0] );

my ($mtot, $mfree, $mbuf, $mcache, $stot, $sfree);
open my $minfo, "/proc/meminfo" or die "Cannot open /proc/meminfo: $!\n";
while( 1 ) {
  sysseek $minfo, 0, 0 or die "Cannot sysseek (/proc/meminfo): $!\n";
  my $buf='';
  sysread $minfo, $buf, 8192 or die "Cannot sysread (/proc/meminfo): $!\n";
  for(split /\n/, $buf) {
    # Anchor each pattern at the start of the line. Otherwise /cached:/
    # also matches the "SwapCached:" line and overwrites $mcache.
    if( /^memtotal:\s*(\d+)/i ) {
      $mtot=$1;
    } elsif( /^memfree:\s*(\d+)/i ) {
      $mfree=$1;
    } elsif( /^buffers:\s*(\d+)/i ) {
      $mbuf=$1;
    } elsif( /^cached:\s*(\d+)/i ) {
      $mcache=$1;
    } elsif( /^swaptotal:\s*(\d+)/i ) {
      $stot=$1;
    } elsif( /^swapfree:\s*(\d+)/i ) {
      $sfree=$1;
    }
  }
  printf( "%6d %8d %8d %8d %8d %8d %8d %8d\n",
	  (stat "/proc")[3]-$nlink_base,
	  $mtot+$stot-$mfree-$mcache-$sfree,
	  $mtot, $mfree, $mbuf, $mcache, $stot, $sfree );
  select undef, undef, undef, $ARGV[0];
}

To get the number of processes the script reads the link count of the /proc directory. This directory contains various files and subdirectories. In particular, it contains for each process a subdirectory named after its process ID. Each new process subdirectory increments the link count. The first loop (for( 1..$times )) figures out how much to subtract from the link count to get the number of httpd processes. To do that it reads the /proc/*/exe symbolic links. You must be either the owner of the processes (the user httpd runs as) or root to do that. The way it is done also assumes that only one WEB server is running on the machine at a time. Feel free to adapt the script to your conditions.

The main loop then reads the link count, converts it to the number of processes, reads the current memory status and prints it out. It is a good idea to run the script with a higher scheduling priority. nice -n -10 ... will do the job.

The RAM usage is computed as:

usedRAM = totalRAM - freeRAM - cachedRAM + totalSWAP - freeSWAP

By used RAM I mean here the amount of RAM+SWAP that is not usable for new processes.
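The same computation can be written as a standalone function, which makes it easy to check against a saved copy of /proc/meminfo. This is a sketch that follows the printf expression of the monitor script (total RAM plus total swap minus free RAM, cache and free swap). The function names parse_meminfo and used_kb are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse the text of /proc/meminfo into a hash (values in kB).
# Keys are matched from the start of the line, so "Cached" and
# "SwapCached" end up as separate entries.
sub parse_meminfo {
  my ($text)=@_;
  my %info;
  for( split /\n/, $text ) {
    $info{$1}=$2 if /^(\w+):\s*(\d+)/;
  }
  return \%info;
}

# RAM+SWAP that is not usable for new processes, in kB.
sub used_kb {
  my ($i)=@_;
  return $i->{MemTotal}-$i->{MemFree}-$i->{Cached}
        +$i->{SwapTotal}-$i->{SwapFree};
}

# On Linux, print the current value.
if( open my $fh, '<', '/proc/meminfo' ) {
  my $text=do { local $/; <$fh> };
  print used_kb(parse_meminfo($text)), " kB used\n";
}
```

Unlike the monitor, this version reopens and parses the whole file on every call, so it is meant for checking the arithmetic rather than for sampling several times per second.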

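The script produces an 8-column output that can easily be read by gnuplot or similar tools. The columns are the number of httpd processes, the used memory as computed above, and the raw values for total RAM, free RAM, buffers, cache, total swap and free swap, all in kB. We need only the first 2 columns here.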

     7   146476  1021488   884196     3200     4520  1534168  1520464
     8   149448  1021488   881224     3200     4520  1534168  1520464
     8   151680  1021488   878992     3200     4520  1534168  1520464
    10   156060  1021488   874612     3200     4520  1534168  1520464
    10   161892  1021488   868780     3208     4520  1534168  1520464

Generating load

Now we need a load generator that can do a slow ramp-up, use keep-alive requests and wait for some time between requests. I have used Apache flood several times. But lately I have written my own load generator because flood, if used with a high number of threads, consumes enormous amounts of memory and tends to become unstable.

The result

The following script can then be used to visualize the result:

#!/usr/bin/perl

use strict;

# usage
# vmproc-log logfile1 title1 logfile2 title2 ... imagename.png

sub bestfit {
  my ($fname)=@_;
  open my $fh, '<', $fname or die "Cannot open $fname: $!\n";
  my ($n, $prod, $sum1, $sum2, $sum1_sq)=(0) x 5;
  while( defined( my $line=readline $fh ) ) {
    $line=~s/^\s*//;
    my ($proc, $mem)=split /\s+/, $line;
    $prod+=$proc*$mem;
    $sum1+=$proc;
    $sum1_sq+=$proc*$proc;
    $sum2+=$mem;
    $n++;
  }
  my $x=($n*$prod-$sum1*$sum2)/($n*$sum1_sq-$sum1*$sum1);
  return (sprintf('%.0f',$x), sprintf('%.0f',($sum2-$x*$sum1)/$n));
}

my $img=pop @ARGV;

my @logs;
while( my ($file, $title)=splice @ARGV, 0, 2 ) {
  push @logs, [$file, $title, bestfit($file)];
}

open my $gnuplot, '|-', 'gnuplot' or die "Cannot start gnuplot: $!\n";

print $gnuplot <<"EOF";
set term png transparent size 800,600
set output '$img'
set grid
set xlabel "Number of processes"
set ylabel "Memory in kb"
set key top left Left reverse
plot [0:] [0:] \\
EOF

print $gnuplot "     ",join(", \\\n     ", map {
  (qq{"$_->[0]" using 1:2 title "$_->[1]"},
   qq{$_->[2]*x+$_->[3] title "best fit: $_->[2]kb * nproc + $_->[3]kb" w l lw 2})
} @logs),"\n";

Besides generating the following image it uses the method of least squares to compute A and B of our linear model.
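To see the least squares step in isolation, here is a small worked example that applies the same formulas as bestfit to the 5 sample records of the monitor output shown earlier. The function name fit_line is made up for this sketch. Five records are far too few for a meaningful fit, so the numbers only illustrate the arithmetic.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Least-squares line y = A*x + B through a list of [x, y] points.
# Returns (A, B).
sub fit_line {
  my @pts=@_;
  my ($n, $sx, $sy, $sxx, $sxy)=(scalar @pts, 0, 0, 0, 0);
  for my $p (@pts) {
    my ($x, $y)=@$p;
    $sx+=$x;
    $sy+=$y;
    $sxx+=$x*$x;
    $sxy+=$x*$y;
  }
  my $den=$n*$sxx-$sx*$sx;
  die "Need at least 2 different x values\n" unless $den;
  my $slope=($n*$sxy-$sx*$sy)/$den;
  return ($slope, ($sy-$slope*$sx)/$n);
}

# [number of processes, used memory in kB] from the sample output
my @sample=([7, 146476], [8, 149448], [8, 151680], [10, 156060], [10, 161892]);

my ($A, $B)=fit_line(@sample);
printf "A = %.0f kB per process, B = %.0f kB\n", $A, $B;
```

For these 5 records the script prints A = 4180 kB and B = 117165 kB. A real run with hundreds of records spread over a wide range of process counts gives a much more trustworthy line.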

[Figure: result]

The picture above shows the memory consumption for 3 different setups. The red dots represent a WEB server shipping a dynamic document using mod_perl and the Template Toolkit. The blue dots are the same server setup but serving a very short (~50 bytes) static document. The turquoise dots come from a server with almost the Apache default configuration (except for the directives above).

All 3 lines start at about 100 MB fill level. Then each worker shipping the dynamic document consumes about 5 MB, while a worker shipping the static one needs less than half of that amount. With the standard config each new worker child needs only about 0.85 MB. So, perhaps it makes sense to split the configuration into 2 servers, one with mod_perl enabled to serve dynamic documents and a low MaxClients setting, and a second one for static documents where MaxClients can be much higher.

Memory requirements per request

Over the lifetime of a WEB server its content often changes. Sometimes new WEB applications are launched. At such times the question arises what increase in required memory is to be expected. To estimate this number one can use a slight modification of the approach above. Now the question is not how much memory is required for a new worker child but, assuming there is a certain number of workers, how much memory is needed to perform a certain request.

So, let's state the question this way: A) assuming that a new child is born, how much memory does it need to ship the short static document above, and B) how much is then needed to generate the dynamic one?

My server machine has 1 GB RAM. From the experiment above we know that at ~150 parallel processes shipping the dynamic document end-of-RAM is reached. So, let's configure the WEB server so that it launches 150 worker children at startup and allows for that number of idle workers. That way, after a restart of the WEB server we have 150 processes, none of which has handled a single request yet.

StartServers            150
MinSpareServers         1
MaxSpareServers         150
ServerLimit             150
MaxClients              150

Now, we have to create requests one for each worker child. This can be tricky. On my Linux the ready worker queue apparently works on a first-in-first-out basis. So, if requests are generated slowly then 150 are sufficient to reach each worker once. I haven't checked why that is the case or if it always is. So, before continuing check that.

Now, the idea is this. We generate 150 requests at a certain slow rate and record the increasing memory consumption. To avoid manual post-processing of the vmproc log we need to start the monitor and the request generation simultaneously. For example:

ssh root@fn.home 'nice -n -10 ./vmproc 0.2 >vmproc.log& echo $! >PID' </dev/null >/dev/null
for (( i=0; i<150; i++ )); do
  curl http://fn.home/
  usleep 600000
done
ssh root@fn.home 'kill $(<PID)' </dev/null >/dev/null

This script starts the monitor on the server and records its process ID in a file. The argument 0.2 asks the monitor to record memory consumption not twice per second but 5 times. Then, in a loop, we issue an HTTP request via curl and sleep for 0.6 sec. So, the memory is logged at least 3 times for each request.

The first column of the vmproc.log file is the number of processes on the WEB server machine. But this time it does not change. We need something there that can be used as x-axis. Assuming each request is always executed in almost the same time, we can convert the record number in the log file into the number of processes that have already processed the request as

procnr = recnr * 150 / rectotal

where procnr is something that varies from 1 to 150 that can be used as x-axis, recnr is the current record number in the log file and rectotal is the total number of log records. The following one-liner does that:

perl -ane 'print ++$x*150/'"$(sed -n '$=' vmproc.log)"', "\t$F[1]\n"' vmproc.log >vmprocx.log
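For readers who find the one-liner hard to follow, the same conversion can be spelled out as a small script. This is a sketch with a made-up function name rescale. It takes the number of workers as a parameter instead of hard-coding 150 and skips blank lines, otherwise it follows the formula above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace column 1 by recnr * nworkers / rectotal and keep column 2
# (the used memory). Returns one "x<TAB>mem" string per log record.
sub rescale {
  my ($nworkers, @lines)=@_;
  @lines=grep { /\S/ } @lines;          # ignore blank lines
  my $rectotal=@lines;
  my ($recnr, @out)=(0);
  for my $line (@lines) {
    my @f=split ' ', $line;
    $recnr++;
    push @out, sprintf "%g\t%s", $recnr*$nworkers/$rectotal, $f[1];
  }
  return @out;
}

# usage: rescale.pl vmproc.log >vmprocx.log
if( @ARGV ) {
  print "$_\n" for rescale(150, <>);
}
```

Note that %g prints at most 6 significant digits, which is more than enough for an x-axis value between 1 and 150.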

Now, the vmproc-log script is used to create the result image:

Note that this growth in memory consumption happens of course only for the first request of a certain kind per worker. Otherwise the application would be buggy.

Verification with /proc/$PID/smaps

The main part of these scripts and the ideas were developed around the year 2004 when Linux kernel 2.6.5 was state of the art. Since then things have evolved. With kernel 2.6.14 came a new file, /proc/$PID/smaps, which allows for more detailed per-process memory inspection.

The file consists of multiple records of the following form:

00400000-0046d000 r-xp 00000000 08:07 32283        /opt/apache22-prefork/sbin/httpd
Size:                436 kB
Rss:                 148 kB
Pss:                  24 kB
Shared_Clean:        148 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

Each record describes a segment of the virtual address space the process is currently using. I am not going to describe all of the fields. If you have installed the Linux source tree, see Documentation/filesystems/proc.txt for more information. Only the 2 Private_* entries are of interest to us. Here the kernel reports for each segment how much memory is not shared with other processes. How can that help us? When a new worker child is born it shares almost all of its memory with the parent due to the copy-on-write effect. With the first request some part of that memory will become unshared because it is written to, and perhaps another part is newly allocated. All of that memory is then reported as Private_*.

So, here is what to do. Start the WEB server with StartServers 2 and MinSpareServers 2. Then sum up the amount of private memory for each of the workers. Issue 2 requests for the static document in slow succession. Recalculate the sum of the private memory and take the difference from the former sum. The result must approximately match 1318 kB, the amount of memory a newly born child eats up for such a request according to the experiment above. Then issue 2 requests for the dynamic document and recalculate the sum of private memory. Now the difference from the previous sum must match 2922 kB according to the experiment:

fn:~ # perl -anle '/Priv/ and $x+=$F[1]; END{print $x}' /proc/9386/smaps
176
fn:~ # perl -anle '/Priv/ and $x+=$F[1]; END{print $x}' /proc/9386/smaps
1476
fn:~ # perl -anle '/Priv/ and $x+=$F[1]; END{print $x}' /proc/9386/smaps
4360

1476 - 176 = 1300, which is only 1.4% away from 1318. 4360 - 1476 = 2884, which is only 1.3% away from 2922. So, the measured values match the expected ones pretty well.

There is a small pitfall here. When a page is swapped out it is not present in main memory. When the process accesses this page the memory management unit within the CPU generates an interrupt. The kernel reads the page from secondary storage and the process repeats the access. So, it is quite normal for a page not to reside in RAM while the process is running. The pitfall here is that the kernel cannot decide whether a page is shared or not unless it is present in RAM. Hence, for the measurements above we have to make sure none of the pages of the WEB server processes goes to swap space during the experiment. The simplest way to do that is to turn off swap. swapoff is the command to use.
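A slightly more verbose variant of the one-liner above can report the Swap field next to the private memory, so a measurement that was disturbed by swapping is easy to spot. This is a sketch under the assumption that the smaps records look like the example shown earlier. The function name smaps_sums is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sum the Private_Clean, Private_Dirty and Swap fields (in kB)
# over all records of an smaps text.
sub smaps_sums {
  my ($text)=@_;
  my %sum=(Private_Clean=>0, Private_Dirty=>0, Swap=>0);
  for( split /\n/, $text ) {
    $sum{$1}+=$2 if /^(Private_Clean|Private_Dirty|Swap):\s*(\d+)\s*kB/;
  }
  return \%sum;
}

# usage: smaps-sum.pl PID   (run as root or as the owner of the process)
if( @ARGV ) {
  open my $fh, '<', "/proc/$ARGV[0]/smaps"
    or die "Cannot open /proc/$ARGV[0]/smaps: $!\n";
  my $s=smaps_sums(do { local $/; <$fh> });
  printf "private: %d kB, swapped: %d kB\n",
         $s->{Private_Clean}+$s->{Private_Dirty}, $s->{Swap};
  warn "Pages are swapped out, the private figure is unreliable\n"
    if $s->{Swap};
}
```

If the swapped figure is not 0 for any of the worker processes, the experiment should be repeated with swap turned off.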

Last updated: 19.12.2010