Stephen A. Fuqua (SAF) is a Bahá'í, software developer, and conservation and interfaith advocate in the DFW area of Texas.

Performance #6: Reading Directly Into the Parser

July 23, 2007

This article is part of the series An Exercise in Performance Tuning in C#.Net.

As I look at the code I now have, I wonder if the fileLines variable is an unnecessary intermediate step. Can I rewrite the code so that the result of stream.ReadLine() is passed directly into the parsing? If I do so, I'll be leaving the file open longer, but since no other application should be attempting to access the file, I'm okay with that. This means moving the code that opens the file into MyClass.ProcessFile().

This is what we left off with:

// Read input file into a list of lines
List<string> fileLines = new List<string>();
using (StreamReader stream = new StreamReader(inputFileName, Encoding.ASCII, true, 800))
{
     string line;
     while ((line = stream.ReadLine()) != null)
          fileLines.Add(line);
}

// Parse the input file
MyClass upload = new MyClass(fileLines);
upload.ProcessFile();


ProcessFile:

// perform various tasks on each line

And here is the new code:

using (StreamReader stream = new StreamReader(inputFileName, Encoding.ASCII, true, 800))
{
     while (stream.Peek() >= 0)
     {
          processSingleLine(stream.ReadLine());
     }
}

I actually made two changes here. The first is using stream.Peek() as the loop condition, and it is possible that this helped speed things up as well. The second is passing the result of stream.ReadLine() directly to processSingleLine(). The first change was necessary for the second: with no line variable left to test against null, the loop needs another way to detect the end of the file. This bit of code is now embedded in the ProcessFile() function rather than in the parent application.
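To see that the two loop conditions visit the same lines, here is a minimal sketch that runs both over an in-memory stream. The class and method names (LoopComparison, ReadWithPeek, ReadUntilNull) are made up for illustration, and the List<string> merely records what a per-line routine would receive; none of this is the original application's code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper class, not part of the original application.
public static class LoopComparison
{
    // Loop style from the new code: check Peek(), then read.
    public static List<string> ReadWithPeek(Stream input)
    {
        var seen = new List<string>();
        using (var reader = new StreamReader(input, Encoding.ASCII))
        {
            // Peek() returns -1 when there are no more characters to read.
            while (reader.Peek() >= 0)
                seen.Add(reader.ReadLine());
        }
        return seen;
    }

    // Loop style from the old code: read, then test for null.
    public static List<string> ReadUntilNull(Stream input)
    {
        var seen = new List<string>();
        using (var reader = new StreamReader(input, Encoding.ASCII))
        {
            string line;
            // ReadLine() returns null once the end of the stream is reached.
            while ((line = reader.ReadLine()) != null)
                seen.Add(line);
        }
        return seen;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("alpha\nbeta\ngamma\n");
        List<string> viaPeek = ReadWithPeek(new MemoryStream(data));
        List<string> viaNull = ReadUntilNull(new MemoryStream(data));
        Console.WriteLine("Peek loop read {0} lines; null-check loop read {1} lines.",
            viaPeek.Count, viaNull.Count);
    }
}
```

For an ordinary file on disk the two styles should behave the same. The documentation for StreamReader.Peek() notes that it can also return -1 when the underlying stream does not support seeking, so the null-check style may be the safer choice for other kinds of streams.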

Result: 82% improvement in processing time! That's incredible. Maybe I should explain just a bit more. The processSingleLine() routine is parsing out the input file's data into various objects, and doing some manipulation along the way. In the original version, the file was being opened by the application, and its lines added to an array. This array (or List<string>) was then passed into the ProcessFile() function, which basically just called processSingleLine().

By calling processSingleLine() directly, I end up leaving the file open for as long as the parsing takes. In the original situation I wanted to "open late and close early": keep the file open only long enough to read its contents into memory. But here is a crucial point: in this situation, no other application will be trying to read the file, so it does not really matter if the file remains open. By keeping it open and passing the output from ReadLine() into processSingleLine(), I eliminated the creation of a large list of strings, which carried considerable overhead (moving data in and out of memory).
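As a rough sketch of the overall shape after this change, the class below opens the file itself and holds only one line at a time. LineParser, its two counters, and the Demo class are hypothetical stand-ins: the real MyClass and its parsing logic are not shown in this post, so the per-line routine here only counts lines and characters.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for MyClass; the real parsing is not shown in the post.
public class LineParser
{
    public int LinesProcessed { get; private set; }
    public long CharactersSeen { get; private set; }

    // Opens the file here, rather than in the calling application,
    // and hands each line straight to the per-line routine.
    public void ProcessFile(string inputFileName)
    {
        using (var stream = new StreamReader(inputFileName, Encoding.ASCII, true, 800))
        {
            while (stream.Peek() >= 0)
            {
                ProcessSingleLine(stream.ReadLine());
            }
        }
    }

    // Placeholder work: the real routine builds objects from each line.
    private void ProcessSingleLine(string line)
    {
        LinesProcessed++;
        CharactersSeen += line.Length;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Write a small temporary file so the sketch is self-contained.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "one", "two", "three" });

            var parser = new LineParser();
            parser.ProcessFile(path);

            Console.WriteLine("{0} lines, {1} characters",
                parser.LinesProcessed, parser.CharactersSeen);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The trade-off is the one described above: the file stays open for the whole parse, but no intermediate list of lines is ever built.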