Using A Stack Instead Of Recursion For Greater Scalability
March 24, 2007
by John S. Reid
Have you ever needed to get a list of all objects in your Active Directory or LDAP
database? How about getting all files that end in ".dll" from your file system?
Or have you needed to traverse a MCMS installation? What these tasks all have in
common is that each solution involves traversing a tree structure.
At my day job I needed to code a solution that did just that against our MCMS tree.
The previous programmer wasn't really a .NET/webservice guy so he had coded up a
solution that just wasn't going to work for the load we expected to have, and here's
why: he used recursion.
Recursion is the most common type of coding practice used to perform such tree walk
functions because it's quick and easy to write and is fairly straightforward. In
recursion the method doing the work calls itself with new parameters to get the
next layer down as in the code sample below, based partially on a Microsoft
support article.
public class MyFileInfo
{
public string Name;
public string Path;
public long Size;
public DateTime
CreationTime;
public string Version;
}
List<
MyFileInfo>
m_listMfi =
new List<
MyFileInfo>();
void DirSearch(
string
strStartDir)
{
string[] dirs =
Directory.GetDirectories(strStartDir);
foreach (
string
d
in dirs)
{
string[] files =
Directory.GetFiles(d,
"*.dll");
foreach (
string
f
in files)
{
FileInfo fi = new
FileInfo(f);
// create and populate the MyFileInfo object
MyFileInfo mfi = new
MyFileInfo();
mfi.Path = f;
mfi.Name = fi.Name;
mfi.Size = fi.Length;
mfi.CreationTime = fi.CreationTime;
mfi.Version = FileVersionInfo.GetVersionInfo(f).FileVersion;
m_listMfi.Add(mfi);
}
DirSearch(d);
}
}
As you can see the method DirSearch calls DirSearch repeatedly for each subdirectory
in the tree. When all subdirectories have been processed, control returns to the
parent element where it checks the next element at the same level and so on and
so forth. Pretty easy right? So what's the issue?
Recursion Doesn't Scale.
The biggest problem with recursion is stack space. Most C and C++ programmers have
run into the infamous Stack Overflow error, but for those who have not yet experienced
it a simple explanation is that you've run out of memory and your program terminates.
But how can I run out of memory when I have 2GB installed in my computer? That's
all well and good, but the stack doesn't use it.
The stack is an address space that is allocated at program launch and is used for
value type variables, pointers to reference types, call stack frames and hidden
variables.
|
Stack Variable |
Description |
|
value type |
A variable that contains a value, rather than a variable that points to or references
a value. http://msdn2.microsoft.com/en-us/library/s1ax56ch.aspx |
|
reference pointer |
A reference variable is one that points to a value that is stored on the heap. But
even though the value is stored on the heap the pointer that references that value
is stored on the stack. In a 32 bit system pointers are 4 bytes. |
|
call stack |
When a method calls another method the program needs to know where to execute instructions
once the newly called method completes. The address of the execution step after
the method call is stored on the stack. |
|
hidden variable |
The compiler creates hidden variables to store data that is between variables you
have defined. You are not able to reference these, but they exist nonetheless. Let's
look at an example:
int i = int.Parse(myString[myString.Length
- 1]);
How many variables are in the above statement? If you said 2 you need to crack open
your computer science books again. There are at least 4.
- The literal "1".
- The myString.Length (probably reused for the sum also, otherwise there would be
yet another variable).
- The char that's returned by the string index.
- The int i.
|
How many variables can you count in the DirSearch method of the recursion code sample
above? In our very simple example I count at least 10 including the stack frame.
That's obviously not a huge number (40 bytes), but let's assume a directory tree
that is 10 levels deep (400 bytes), and let's assume that this processing is taking
place in a web service that is being hit by 100 or 1000 clients at a time (~39 KB
or ~391 KB). Therein lays the problem, because the stack is only 1 MB.
You can increase the stack size of your application during compilation, but no one
does (and by how much anyway?), and only if your application is a compiled exe.
If your application is a dll it will use the stack of the binary that loads it,
which means it isn't possible to increase the stack size of ASP.NET dlls.
Now imagine that you're not just getting a few pieces of information within DirSearch,
but are processing some advanced mathematics, or calling some other methods, or
you have to fill some freaky huge classes or worse - structures (structures are
stored on the stack while classes are stored on the heap). That stack space can
get used up pretty fast, and remember that your method isn't the only one to use
the stack space.
Microsoft is actually aware of this and has an overload to the Directory.GetDirectories()
which will grab subdirectories for you, and this overload wisely doesn't use recursion.
Using A Heap-Based Stack
If we create our own stack we can eliminate the "layered" nature of recursion and
cut back on our stack memory usage by completely eliminating the sum of the bytes
that would be used by all recursive calls it has to make. In the DirSearch sample
above those 10 layers are reduced to 1, effectively cutting our stack memory usage
to 40 bytes per caller instead of 400 bytes per caller. In a web service
scenario that can mean the difference between supporting 100 users or 1000.
Here's the sample:
List<
MyFileInfo>
m_listMfi =
new List<
MyFileInfo>();
void DirSearch(
string
strStartDir)
{
// create the stack and add the starting directory
Stack<
string>
stkDirs =
new Stack<
string>();
stkDirs.Push(strStartDir);
while (stkDirs.Count != 0)
{
string strDir = stkDirs.Pop();
// get the subdirectories and add them to the stack.
// you can place these lines after getting the file
// information if you want to maintain the same file
// order that the recursive example would have given
string[] dirs =
Directory.GetDirectories(strDir);
foreach (
string
d
in dirs)
stkDirs.Push(d);
string[] files =
Directory.GetFiles(d,
"*.dll");
foreach (
string
f
in files)
{
FileInfo fi = new
FileInfo(f);
// create and populate the MyFileInfo object
MyFileInfo mfi = new
MyFileInfo();
mfi.Path = f;
mfi.Name = fi.Name;
mfi.Size = fi.Length;
mfi.CreationTime = fi.CreationTime;
mfi.Version = FileVersionInfo.GetVersionInfo(f).FileVersion;
m_listMfi.Add(mfi);
}
}
}
The code really isn't all that different, but you will have a notable gain in scalability
due to the reduced memory footprint.
A System.Collections.Stack variable like the one I used above actually uses a List<>
internally so you could use that instead if you wished. The call to Directory.GetDirectories()
uses a List<> when you specify to get subdirectories, but I find that the
syntax of the Stack makes the task of programming this type of behavior much simpler.
Stack memory is faster to allocate and access and recursion does indeed have it's
place. In a single user application you aren't likely to hit the limits, but the
scalability factor can't be ignored in a massively multi-user environment.
So the next time you're tasked with fixing a recursive function and your colleagues
are complaining about stack overflows, you'll be able to approach the problem with
one more tool in your belt. Just try not to be too smug.
* About a week after I wrote this I found code in our production environment at my day job
that was essentially a web spider, crawling from link to sublink to gather search information
on our intranet. The coder used recursion. The method used about 120 bytes on the stack for each
iteration and who knows how many levels it goes. We're lucky it's a single-caller application
that only searches our intranet. Can you imagine Google trying to get away with that?