Have you ever needed to get a list of all objects in your Active Directory or LDAP database? How about getting all files that end in ".dll" from your file system? Or have you needed to traverse a MCMS installation? What these tasks all have in common is that each solution involves traversing a tree structure.

At my day job I needed to code a solution that did just that against our MCMS tree. The previous programmer wasn't really a .NET/webservice guy so he had coded up a solution that just wasn't going to work for the load we expected to have, and here's why: he used recursion.

Recursion is the most common type of coding practice used to perform such tree walk functions because it's quick and easy to write and is fairly straightforward. In recursion the method doing the work calls itself with new parameters to get the next layer down as in the code sample below, based partially on a Microsoft support article.

public class MyFileInfo
{
public string Name;
public string Path;
public long Size;
public DateTime CreationTime;
public string Version;
}

List<MyFileInfo> m_listMfi = new List<MyFileInfo>();

void DirSearch(string strStartDir)
{
string[] dirs = Directory.GetDirectories(strStartDir);
foreach (string d in dirs)
{
string[] files = Directory.GetFiles(d, "*.dll");
foreach (string f in files)
{
FileInfo fi = new FileInfo(f);

// create and populate the MyFileInfo object
MyFileInfo mfi = new MyFileInfo();
mfi.Path = f;
mfi.Name = fi.Name;
mfi.Size = fi.Length;
mfi.CreationTime = fi.CreationTime;
mfi.Version = FileVersionInfo.GetVersionInfo(f).FileVersion;
m_listMfi.Add(mfi);
}

DirSearch(d);
}
}

As you can see the method DirSearch calls DirSearch repeatedly for each subdirectory in the tree. When all subdirectories have been processed, control returns to the parent element where it checks the next element at the same level and so on and so forth. Pretty easy right? So what's the issue?

Recursion Doesn't Scale.

The biggest problem with recursion is stack space. Most C and C++ programmers have run into the infamous Stack Overflow error, but for those who have not yet experienced it a simple explanation is that you've run out of memory and your program terminates. But how can I run out of memory when I have 2GB installed in my computer? That's all well and good, but the stack doesn't use it.

The stack is an address space that is allocated at program launch and is used for value type variables, pointers to reference types, call stack frames and hidden variables.

Stack Variable Description
value type A variable that contains a value, rather than a variable that points to or references a value. http://msdn2.microsoft.com/en-us/library/s1ax56ch.aspx
reference pointer A reference variable is one that points to a value that is stored on the heap. But even though the value is stored on the heap the pointer that references that value is stored on the stack. In a 32 bit system pointers are 4 bytes.
call stack When a method calls another method the program needs to know where to execute instructions once the newly called method completes. The address of the execution step after the method call is stored on the stack.
hidden variable The compiler creates hidden variables to store data that is between variables you have defined. You are not able to reference these, but they exist nonetheless. Let's look at an example:

int i = int.Parse(myString[myString.Length - 1]);

How many variables are in the above statement? If you said 2 you need to crack open your computer science books again. There are at least 4.
  1. The literal "1".
  2. The myString.Length (probably reused for the sum also, otherwise there would be yet another variable).
  3. The char that's returned by the string index.
  4. The int i.

How many variables can you count in the DirSearch method of the recursion code sample above? In our very simple example I count at least 10 including the stack frame. That's obviously not a huge number (40 bytes), but let's assume a directory tree that is 10 levels deep (400 bytes), and let's assume that this processing is taking place in a web service that is being hit by 100 or 1000 clients at a time (~39 KB or ~391 KB). Therein lays the problem, because the stack is only 1 MB.

You can increase the stack size of your application during compilation, but no one does (and by how much anyway?), and only if your application is a compiled exe. If your application is a dll it will use the stack of the binary that loads it, which means it isn't possible to increase the stack size of ASP.NET dlls.

Now imagine that you're not just getting a few pieces of information within DirSearch, but are processing some advanced mathematics, or calling some other methods, or you have to fill some freaky huge classes or worse - structures (structures are stored on the stack while classes are stored on the heap). That stack space can get used up pretty fast, and remember that your method isn't the only one to use the stack space.

Microsoft is actually aware of this and has an overload to the Directory.GetDirectories() which will grab subdirectories for you, and this overload wisely doesn't use recursion.

Using A Heap-Based Stack

If we create our own stack we can eliminate the "layered" nature of recursion and cut back on our stack memory usage by completely eliminating the sum of the bytes that would be used by all recursive calls it has to make. In the DirSearch sample above those 10 layers are reduced to 1, effectively cutting our stack memory usage to 40 bytes per caller instead of 400 bytes per caller. In a web service scenario that can mean the difference between supporting 100 users or 1000.

Here's the sample:

List<MyFileInfo> m_listMfi = new List<MyFileInfo>();

void DirSearch(string strStartDir)
{
// create the stack and add the starting directory
Stack<string> stkDirs = new Stack<string>();
stkDirs.Push(strStartDir);

while (stkDirs.Count != 0)
{
string strDir = stkDirs.Pop();

// get the subdirectories and add them to the stack.
// you can place these lines after getting the file
// information if you want to maintain the same file
// order that the recursive example would have given
string[] dirs = Directory.GetDirectories(strDir);
foreach (string d in dirs)
stkDirs.Push(d);

string[] files = Directory.GetFiles(d, "*.dll");
foreach (string f in files)
{
FileInfo fi = new FileInfo(f);

// create and populate the MyFileInfo object
MyFileInfo mfi = new MyFileInfo();
mfi.Path = f;
mfi.Name = fi.Name;
mfi.Size = fi.Length;
mfi.CreationTime = fi.CreationTime;
mfi.Version = FileVersionInfo.GetVersionInfo(f).FileVersion;
m_listMfi.Add(mfi);
}
}
}

The code really isn't all that different, but you will have a notable gain in scalability due to the reduced memory footprint.

A System.Collections.Stack variable like the one I used above actually uses a List<> internally so you could use that instead if you wished. The call to Directory.GetDirectories() uses a List<> when you specify to get subdirectories, but I find that the syntax of the Stack makes the task of programming this type of behavior much simpler.

Stack memory is faster to allocate and access and recursion does indeed have it's place. In a single user application you aren't likely to hit the limits, but the scalability factor can't be ignored in a massively multi-user environment.

So the next time you're tasked with fixing a recursive function and your colleagues are complaining about stack overflows, you'll be able to approach the problem with one more tool in your belt. Just try not to be too smug.

* About a week after I wrote this I found code in our production environment at my day job that was essentially a web spider, crawling from link to sublink to gather search information on our intranet. The coder used recursion. The method used about 120 bytes on the stack for each iteration and who knows how many levels it goes. We're lucky it's a single-caller application that only searches our intranet. Can you imagine Google trying to get away with that?