
System.OutOfMemoryException


andegre

MIS
Oct 20, 2005
I get this error because I'm returning a dataset that contains 5.6 million rows. Where can I "up" the memory on the machine/server that I'm running on so that it has enough to finish the program?

TIA
 
the problem isn't memory, the problem is loading 5.6 million rows. why would you need to load 5.6 million records at once?
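if you truly need to touch every row, at least stream them with a data reader instead of filling a DataSet, so only one record is in memory at a time. a quick sketch (the connection string, table and columns here are just placeholders):
Code:
using System;
using System.Data.SqlClient;

internal class StreamCustomers
{
	private static void Main()
	{
		// placeholder connection string; adjust for your server
		using (var connection = new SqlConnection(@"Server=.\SqlExpress;Initial Catalog=EtlSpike;Integrated Security=SSPI;"))
		using (var command = new SqlCommand("select Id, Name from Customers", connection))
		{
			connection.Open();
			using (var reader = command.ExecuteReader())
			{
				// each row is processed and discarded; the full result set
				// is never held in memory the way a DataSet would hold it
				while (reader.Read())
				{
					var id = reader.GetInt64(0);
					var name = reader.IsDBNull(1) ? "" : reader.GetString(1);
					Console.WriteLine("{0}: {1}", id, name);
				}
			}
		}
	}
}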

Jason Meckley
Programmer
Specialty Bakers, Inc.

faq855-7190
faq732-7259
 
that's nothing uncommon, but there are better ways of doing it than [tt]select top 5.6M * from table[/tt] in one shot. first i would use a tool that is explicitly designed for ETL (extract, transform, load). the two I'm aware of are SSIS* and Rhino.Etl. ETL tools are designed to manage huge amounts of data, so this helps reduce the memory issue.

using one of these tools i would first try to build the process all in one go: select everything, transform, and dump to a single file. if the performance is unacceptable I would then batch the exports. for example:
select 1M records, transform, export to a unique file, and repeat until all rows are processed. this would generate 6 files; zip them together and send the archive to the client (see the sketch below).
another option:
process 1M rows into separate files, then merge the files into one using a library built for parsing large files.
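to make the batching idea concrete, here is a rough plain-ADO.NET sketch of the key-range version (table, columns and file names are assumptions; it walks the Id range in 1M chunks and writes one csv per chunk):
Code:
using System;
using System.Data.SqlClient;
using System.IO;

internal class BatchedExport
{
	private const long BatchSize = 1000000;

	private static void Main()
	{
		// placeholder connection string, table and columns
		using (var connection = new SqlConnection(@"Server=.\SqlExpress;Initial Catalog=EtlSpike;Integrated Security=SSPI;"))
		{
			connection.Open();

			// find the top of the key range (assumes a non-empty bigint key)
			long maxId;
			using (var max = new SqlCommand("select max(Id) from Customers", connection))
			{
				maxId = (long)max.ExecuteScalar();
			}

			var batch = 0;
			for (long low = 0; low < maxId; low += BatchSize, batch++)
			{
				using (var command = new SqlCommand(
					"select Id, Name from Customers where Id > @low and Id <= @high", connection))
				using (var writer = new StreamWriter("customers_" + batch + ".csv"))
				{
					command.Parameters.AddWithValue("@low", low);
					command.Parameters.AddWithValue("@high", low + BatchSize);
					using (var reader = command.ExecuteReader())
					{
						while (reader.Read())
						{
							// the transform step would go here; this just dumps 2 columns
							var name = reader.IsDBNull(1) ? "" : reader.GetString(1);
							writer.WriteLine("{0},{1}", reader.GetInt64(0), name);
						}
					}
				}
			}
		}
	}
}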

I much prefer to code my process rather than use UI wizards so I would opt for Rhino.ETL every time.

*SSIS is available with Sql 2005+. in Sql 2000 you would use DTS, which I would avoid.

Jason Meckley
Programmer
Specialty Bakers, Inc.

faq855-7190
faq732-7259
 
promising results. I put this together as a proof of concept. here is the scenario:
I have a database with a single table
Code:
create table Customers (
   Id bigint not null primary key,
   Name varchar(50),
   Line1 varchar(50),
   Line2 varchar(50),
   City varchar(50),
   State varchar(50),
   ZipCode int
)
I loaded this table with 5.6M records.
I then built a console app to run my etl process using Rhino.ETL.
Code:
internal class Program
{
	private static void Main()
	{
		using (new Logging())
		{
			using (var process = new DatabaseToFileProcess())
			{
				process.Execute();
			}
			Console.ReadLine();
		}
	}
}
public class Logging : IDisposable
{
	public Logging()
	{
		BasicConfigurator.Configure();
	}

	public void Dispose()
	{
		LogManager.Shutdown();
	}
}
public class DatabaseToFileProcess : EtlProcess
{
	protected override void Initialize()
	{
		// write the extract next to the executable
		var path = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "customers.csv");

		// pull every customer over the "source" connection string and
		// hand each row to the file-writing operation
		Register(new ConventionInputCommandOperation("source")
					 {
						 Command = "select * from Customers"
					 })
			.Register(new PutDataInFile(path));
	}
}
public class PutDataInFile : AbstractOperation
{
	private readonly string destinationFile;

	public PutDataInFile(string destinationFile)
	{
		this.destinationFile = destinationFile;
	}

	public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
	{
		// rows stream through one at a time; nothing is buffered in memory
		using (var file = FluentFile.For<Customer>().AppendTo(destinationFile))
		{
			foreach (var row in rows)
			{
				file.Write(row.ToObject<Customer>());
				yield return row;
			}
		}
	}
}
[DelimitedRecord(",")]
public class Customer
{
	public long Id;

	[FieldQuoted]
	public string Name;

	[FieldQuoted]
	public string Line1;

	[FieldQuoted]
	public string Line2;

	[FieldQuoted]
	public string City;

	[FieldQuoted]
	public string State;
	
	public int ZipCode;
}
Code:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <connectionStrings>
    <add name="source" connectionString="Server=.\SqlExpress;initial catalog=EtlSpike;Integrated Security=SSPI;" providerName="System.Data.SqlClient"/>
  </connectionStrings>
</configuration>
the application references the following 3rd party binaries, all of which come with the rhino.etl source:
log4net
rhino.etl.core
filehelpers
boo.lang
boo.lang.parsers

some crude stats:
1. cpu usage was between 40-50%, and that is with VS, SqlExpress, Outlook and FF running as well as the application itself.
2. the application only consumed 15,888K on my box (laptop); compare this to sqlexpress, which was using 1,149,468K. the footprint is very small.
3. total execution time: 5600000 Rows in 00:38:21.0533014
4. file size: 458MB
5. i would imagine disabling debug messages (log4net) would increase the throughput.

These results are very promising. I haven't thrown this much data at Rhino.ETL before. I'm very impressed. set this to run "off line" or overnight and the customer will either receive the file automatically, or it will be ready for download in the morning.

other stuff:
There are many more options for how to create output files. This is controlled by the filehelpers library. the example above creates a comma delimited text file where string fields are quoted and there is no header text. nothing too fancy.
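for example, if the client wanted a header row, filehelpers can emit one through the engine's HeaderText property. a small sketch using the Customer class above (I'm going from memory on the exact engine API, so treat it as an assumption):
Code:
using FileHelpers;

public static class CustomerFileWriter
{
	public static void WriteWithHeader(string path, Customer[] customers)
	{
		var engine = new FileHelperEngine<Customer>();
		// hand-written header matching the field order of Customer
		engine.HeaderText = "Id,Name,Line1,Line2,City,State,ZipCode";
		engine.WriteFile(path, customers);
	}
}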

Rhino.Etl can take advantage of conventions to reduce the amount of code required. in the example above the columns/aliases of my sql statement match the names & types of the Customer object's fields, so mapping the Row object to the Customer is 1:1 and the system "just knows".
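if the names didn't match, you could slot a small transform operation between the input and PutDataInFile to rename columns before the convention mapping runs. a rough sketch ([tt]CustomerName[/tt] is a made-up source column):
Code:
using System.Collections.Generic;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Operations;

public class RenameColumns : AbstractOperation
{
	public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
	{
		foreach (var row in rows)
		{
			// copy the mismatched source column into the name Customer expects
			row["Name"] = row["CustomerName"];
			yield return row;
		}
	}
}
it would be registered between the other two operations: [tt]Register(input).Register(new RenameColumns()).Register(new PutDataInFile(path))[/tt].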

you could take this one step further and have the "put in file" process ftp/email the file to the customer. it would look like this:
Code:
public class PutDataInFileAndSendEmail : AbstractOperation
{
	private readonly string destinationFile;

	public PutDataInFileAndSendEmail(string destinationFile)
	{
		this.destinationFile = destinationFile;
	}

	public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
	{
		using (var file = FluentFile.For<Customer>().AppendTo(destinationFile))
		{
			foreach (var row in rows)
			{
				file.Write(row.ToObject<Customer>());
				yield return row;
			}
		}
		EmailFile();
	}
	
	private void EmailFile()
	{
		var message = new MailMessage
							{
								//to, from, subject, body
							};
		message.Attachments.Add(new Attachment(destinationFile));
		new SmtpClient().SendAsync(message, null);
	}
}
I'm not 100% on the EmailFile method, but it would look something like that anyway.

Jason Meckley
Programmer
Specialty Bakers, Inc.

faq855-7190
faq732-7259
 
You've got to be one of the most helpful programmers I've ever encountered...thanks again.

As for this example: after posting my original question, I went and modified my stored proc to take @LowEnd and @HighEnd input parameters so I only process 1 million customers at a time...kind of the same concept as your original suggestion, and it's working pretty well right now.
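in case it helps anyone else, the calling side of that boils down to two extra parameters on the command. a minimal sketch (the proc name here is made up):
Code:
using System.Data;
using System.Data.SqlClient;

public static class ExtractRunner
{
	public static DataTable GetCustomerBatch(string connectionString, long lowEnd, long highEnd)
	{
		using (var connection = new SqlConnection(connectionString))
		using (var command = new SqlCommand("GetCustomerExtract", connection))
		{
			// hypothetical proc name; @LowEnd/@HighEnd bound the key range
			command.CommandType = CommandType.StoredProcedure;
			command.Parameters.AddWithValue("@LowEnd", lowEnd);
			command.Parameters.AddWithValue("@HighEnd", highEnd);

			var table = new DataTable();
			using (var adapter = new SqlDataAdapter(command))
			{
				adapter.Fill(table);   // one ~1M-row batch at a time
			}
			return table;
		}
	}
}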

As for Rhino.ETL (btw, I'm not fond of DTS or SSIS, so I wouldn't mind looking into Rhino.ETL): I'm curious because I have 3 or 4 other extract jobs that create files anywhere from 3 GB to 22 GB. I wouldn't mind trying it on one of those big ones, because I HATE trying to fix old DTS packages when we're running SQL Server 2005...

Have you had much experience with Rhino for files of that size?
 
that size, no, but seeing the results of this process I don't think rhino would have a problem handling files that large. under the hood of [tt]FluentFile.For&lt;Customer&gt;().AppendTo(destinationFile)[/tt] Rhino uses the async engine from FileHelpers to read files asynchronously. I assume it reads the file one line at a time and sends each row through the pipeline, rather than reading the entire file into memory first.
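for what it's worth, you can see the same record-at-a-time pattern if you use filehelpers' async engine directly. a small sketch reusing the Customer class from earlier (I'm assuming the generic engine here):
Code:
using FileHelpers;

public static class StreamingRead
{
	public static void Process(string path)
	{
		// BeginReadFile opens the file; records are parsed one at a
		// time as the foreach pulls them, never the whole file at once
		using (var engine = new FileHelperAsyncEngine<Customer>())
		{
			engine.BeginReadFile(path);
			foreach (var customer in engine)
			{
				// handle each Customer as it is parsed
			}
		}
	}
}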

for more information on rhino.etl check out
Jason Meckley
Programmer
Specialty Bakers, Inc.

faq855-7190
faq732-7259
 