Best way to Process Large Files

Status
Not open for further replies.

kev1597770

Programmer
Jan 25, 2006
3
US
I have text files of 500 MB or more that I need to process line by line. Currently I am using the basic code structure below, and the process takes a few hours.

I tried using threads to process parts of the same source file, and I tried reading characters from the file into a buffer of a few thousand characters. Both approaches slowed the process down.

What is the most efficient C# algorithm for processing large files?

//########################################################

StreamReader sr = new StreamReader( sourcePath );
StreamWriter sw = new StreamWriter( destinationPath );
string strFileLine = null;
//########################################################

while (( strFileLine = sr.ReadLine()) != null )
{
    // Perform some parsing and formatting on strFileLine

    sw.WriteLine(...);
}

//########################################################
 
You might want to divide it into two threads - one for reading the file, and one for acting on its content. Google for the "Producer Consumer" pattern.

Chip H.


____________________________________________________________________
If you want to get the best response to a question, please read FAQ222-2244 first
 
You can almost think of it as a printer spooler. You pass data from one thread to the "printer spooler" and it takes care of the processing for you.
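
The two-thread split described above can be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes a newer .NET version (4.5 or later) that provides BlockingCollection<T> and Task.Run, and the names LinePipeline, Run, transform and capacity are hypothetical. The transform delegate stands in for the parsing and formatting step.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class LinePipeline
{
    // Reads lines on a background task and transforms/writes them on the
    // calling thread. 'transform' is a placeholder for the real parsing step.
    public static void Run(TextReader source, TextWriter destination,
                           Func<string, string> transform, int capacity = 10000)
    {
        // Bounded, so the reader cannot run arbitrarily far ahead of the writer.
        using (var lines = new BlockingCollection<string>(capacity))
        {
            Task producer = Task.Run(() =>
            {
                try
                {
                    string line;
                    while ((line = source.ReadLine()) != null)
                    {
                        lines.Add(line); // blocks while the queue is full
                    }
                }
                finally
                {
                    // Lets the consumer loop finish once the queue drains.
                    lines.CompleteAdding();
                }
            });

            // Consumer: blocks while the queue is empty, ends after CompleteAdding.
            foreach (string line in lines.GetConsumingEnumerable())
            {
                destination.WriteLine(transform(line));
            }

            producer.Wait(); // rethrows any exception raised by the reader
        }
    }
}
```

With a StreamReader and a StreamWriter opened on the source and destination paths, a call would look like LinePipeline.Run(sr, sw, line => line.Trim()). Whether this is faster than a single loop depends on how expensive the per-line work is compared with the I/O, so it is worth timing both.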
 
I think reading the file line by line is the bottleneck. That bottleneck will still exist if I merely move the reading to a producer thread and have a consumer thread process the data.

How can I rework my C# code to make a limited number of large data requests to the hard drive instead of tens of thousands of small ones?

The producer-consumer algorithm that I'm using appears below. From the program output, which also appears below, it seems that the producer and consumer threads aren't working in parallel. Perhaps this is because they can't both operate on the same queue at once, hence the lock. Is there a way to modify this algorithm so that production and consumption are not mutually exclusive?

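using System.Collections;
using System.Threading;

public class ProducerConsumer
{
    readonly object listLock = new object();
    Queue queue = new Queue();

    public void Produce(object o)
    {
        lock (listLock)
        {
            queue.Enqueue(o);
            // Wake a waiting consumer only when the queue goes from empty to non-empty.
            if (queue.Count == 1)
            {
                Monitor.Pulse(listLock);
            }
        }
    }

    public object Consume()
    {
        lock (listLock)
        {
            // Wait releases the lock while blocked and reacquires it before returning.
            while (queue.Count == 0)
            {
                Monitor.Wait(listLock);
            }
            return queue.Dequeue();
        }
    }
}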

Producing: /
Producing:
Producing: total 1233
Producing: 2 drwxr-xr-x 48 root root
Producing: 2 drwxr-xr-x 48 root root
Producing: 379679 -rw------- 1 root other
Producing: 379678 -rw------- 1 root other
Producing: 186622 drwxr-xr-x 11 root other
Producing: 3236 -rwxr-xr-x 1 root other
Producing: 236269 drwxr-xr-x 3 root other
Producing: 186673 drwxr-xr-x 3 root other
Producing: 379239 -rw------- 1 root other
Consuming: /
Consuming:
Consuming: total 1233
Consuming: 2 drwxr-xr-x 48 root root
Consuming: 2 drwxr-xr-x 48 root root
Consuming: 379679 -rw------- 1 root
Consuming: 379678 -rw------- 1 root
Consuming: 186622 drwxr-xr-x 11 root
Consuming: 3236 -rwxr-xr-x 1 root
Consuming: 236269 drwxr-xr-x 3 root
Consuming: 186673 drwxr-xr-x 3 root
Producing: 3237 -rw-r--r-- 1 root other
Producing: 1113165 drwx------ 2 root other
Producing: 728166 drwxr-xr-x 3 root other
Producing: 187449 -rw-rw-r-- 1 root other
Producing: 806222 drwx------ 2 root other
Producing: 1006952 drwxr-xr-x 3 root other
Producing: 1113162 drwxrwxr-x 4 root other
Producing: 186641 drwxr-xr-x 2 root root
Producing: 2259 lrwxrwxrwx 1 root root
Producing: 620650 drwxrwxr-x 8 root other
Consumer1 Consuming: 379239 -rw------- 1 root
.
.
.
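
On the question above about making fewer, larger requests to the disk: StreamReader already reads through an internal buffer, so each ReadLine call does not necessarily go to the disk, but the default buffers are small. One thing that may be worth trying is to pass explicit, larger buffer sizes to the streams. The sketch below is an assumption-laden illustration, not a measured fix: the names BufferedCopy, ProcessFile and transform are hypothetical, the 1 MB buffer size is an arbitrary starting point, and UTF-8 input is assumed.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedCopy
{
    // Illustrative value only (1 MB); measure with the real files.
    private const int BufferSize = 1 << 20;

    // 'transform' is a placeholder for the parsing and formatting step.
    public static void ProcessFile(string sourcePath, string destinationPath,
                                   Func<string, string> transform)
    {
        // UTF-8 without a byte order mark; the input encoding is an assumption.
        var utf8 = new UTF8Encoding(false);

        // SequentialScan hints to the OS that the file is read front to back.
        using (var input = new FileStream(sourcePath, FileMode.Open, FileAccess.Read,
                                          FileShare.Read, BufferSize,
                                          FileOptions.SequentialScan))
        using (var reader = new StreamReader(input, utf8, true, BufferSize))
        using (var output = new FileStream(destinationPath, FileMode.Create,
                                           FileAccess.Write, FileShare.None, BufferSize))
        using (var writer = new StreamWriter(output, utf8, BufferSize))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(transform(line));
            }
        } // Dispose flushes the writer and closes both files.
    }
}
```

If a run with larger buffers takes about as long as before, the time is probably going into the per-line parsing rather than the disk, and that is where profiling would be better spent.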
 
Are you running on a dual-core or hyper-threaded machine?

Chip H.


 