
Optimising a User Defined Function


jrsansom (Programmer)
Apr 16, 2003
Hi Everyone,

I have a User Defined Function that currently takes about 6 minutes to process 100,000 records. Given that I need to process about 3 million records, I am naturally keen to optimise this UDF if at all possible.

The UDF basically creates a string token from two strings passed as parameters. It is relatively self-explanatory, I hope, and I have included it below for your reference.

I would be grateful if anyone could suggest how I could go about optimising this.

Many Thanks,
John

Code:
CREATE FUNCTION cdsCalculateMatchKey1
	(@companyColumn varchar(255), @postcodeColumn varchar(100) )
RETURNS varchar(20)  AS  
BEGIN 
	--Declare counters
	declare @consonantCounter	int
	declare @consonantLimit		int
	declare @vowelCounter		int
	declare @vowelLimit		int
	declare @numeralCounter		int
	declare @numeralLimit		int
	declare @stringLength		int
	declare @stringIndex		int

	--Declare string processing variables
	declare @companyToken		varchar(255)
	declare @currentCharacter	varchar(1)

	--Set initial variable statuses
	set @consonantCounter	= 0
	set @consonantLimit	= 5
	set @vowelCounter	= 0
	set @vowelLimit		= 3
	set @numeralCounter	= 0
	set @numeralLimit	= 2
	--substring() is 1-based, so start at 1 (starting at 0 just wastes an iteration)
	set @stringIndex	= 1
	set @companyColumn	= upper(replace(@companyColumn,' ',''))
	set @postcodeColumn	= upper(replace(@postcodeColumn,' ',''))
	set @companyToken	= ''
	set @stringLength	= len(@companyColumn)

	--Walk the @companyColumn string, extracting the desired character data
	while(@stringIndex <= @stringLength and 
		(@consonantCounter<@consonantLimit or @vowelCounter<@vowelLimit or @numeralCounter<@numeralLimit))
	begin
		set @currentCharacter = substring(@companyColumn,@stringIndex,1)

		--The string was upper-cased above, so test against upper-case letters;
		--the lower-case lists only matched under a case-insensitive collation
		if @currentCharacter in('B','C','D','F','G','H','J','K','L','M',
					'N','P','Q','R','S','T','V','W','X','Y','Z')
		begin
			if(@consonantCounter<@consonantLimit)
			begin
				set @consonantCounter = @consonantCounter + 1
				set @companyToken = @companyToken + @currentCharacter
			end
		end
		else if @currentCharacter in('A','E','I','O','U')
		begin
			if(@vowelCounter<@vowelLimit)
			begin
				set @vowelCounter = @vowelCounter + 1
				set @companyToken = @companyToken + @currentCharacter
			end
		end
		--like '[0-9]' accepts digits only; isnumeric() also accepts '.', '$', '+' and '-'
		else if @currentCharacter like '[0-9]'
		begin
			if(@numeralCounter<@numeralLimit)
			begin
				set @numeralCounter = @numeralCounter + 1
				set @companyToken = @companyToken + @currentCharacter
			end
		end

		set @stringIndex = @stringIndex + 1
	end

	--Define the match key and return its value
	return @postcodeColumn+@companyToken
END
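
For reference, an example call (the values are made up):

Code:
--'Acme Widgets Ltd' -> 'ACMEWIDGETSLTD' gives the token 'ACMEWIDG'
--(3 vowels A,E,I and 5 consonants C,M,W,D,G), so this should return 'AB12CDACMEWIDG'
select dbo.cdsCalculateMatchKey1('Acme Widgets Ltd', 'AB1 2CD')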
 
Running any UDF over a large batch update, especially one that does string processing, is going to be costly.
Try having a look at the FAQ, which will tell you how to split your single update into batches.
 
Hi,

Thanks for the reply.

Having read the FAQ, would it then be fair to say that a batched update will only give a performance benefit if the log file is not already large enough to support the query, i.e. it needs to grow to accommodate it?

The reason I ask is that my testing thus far shows the batched update actually taking longer than running the UDF as a standard update query.
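
Incidentally, I have been keeping an eye on log usage during these tests with the following (on the assumption that log growth is the thing to watch):

Code:
--Reports the percentage of log space in use for each database
DBCC SQLPERF(LOGSPACE)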

I have included the test scenario below.

Code:
Update tableName --With (tablockx, holdlock)
	    Set COMP_KEY1 = dbo.cdsCalculateMatchKey1(Company_Name,Postcode)
--Elapsed: 7:53

Code:
set rowcount 50000

declare @counter int
set @counter=50000

--Loop while full batches of 50000 rows are still being updated
while @counter=50000
begin
	Begin Transaction

	  Update tableName With (tablockx, holdlock)
	    Set COMP_KEY1 = dbo.cdsCalculateMatchKey1(Company_Name,Postcode)
	  where COMP_KEY1 is null
	  --(NB: a row whose inputs are NULL gets a NULL key and would be
	  --picked up again on every pass, so the loop assumes non-NULL inputs)

	  --Get the number of rows updated; the process
	  --continues until a batch updates fewer than 50000
	  Select @counter=@@rowcount

	  --Commit the transaction
	  Commit
End
set rowcount 0
--Elapsed: 8:19
 
The batched approach also has other benefits. You can see what stage you are at by including output statements, i.e. PRINT (see the sketch below), which gives you confidence in what operations are happening.
I don't think you will be able to achieve a huge performance increase without restructuring your query.
At 6 minutes for 100,000 rows, the query should be done in about 3 hours. Is this a recurring query?
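
On the PRINT idea, something along these lines inside your loop, using the variables from your test above:

Code:
	  --Report progress after each batch commits
	  Select @counter=@@rowcount
	  Print 'Batch done: ' + convert(varchar(10),@counter)
	  	+ ' rows updated at ' + convert(varchar(12),getdate(),114)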


 
Hi,

Thank you again for your reply.

The UDF is one of six UDFs that will need to be run once a month, so it is fair to say that on our current hardware this will take about an hour collectively per data set. In isolation that is not a problem, but when you consider that this process is one of many in a much larger system, the accumulated time can soon mount up.

We are in the process of constructing a large data load/consolidation process (using DTS), and I want to ensure that what is produced is as efficient as possible.

Do you have any recommendations as to how the query could be re-structured to improve performance?

Many Thanks,
 
I think if you are using DTS you would be better off writing the string function in the DTS package and applying a VBScript function as you insert your data. This will be considerably faster than asking SQL Server to perform the multiple string operations in the database.
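As a rough sketch, the ActiveX transformation might look like this (the source/destination column names are assumptions taken from your posts; the token logic just mirrors your UDF):

Code:
Function Main()
	'Sketch only: mirrors the T-SQL UDF's token logic;
	'column names Company_Name, Postcode and COMP_KEY1 are assumed
	Dim s, i, ch, token, cons, vow, num
	s = UCase(Replace(DTSSource("Company_Name") & "", " ", ""))
	token = ""
	cons = 0
	vow = 0
	num = 0
	For i = 1 To Len(s)
		ch = Mid(s, i, 1)
		If InStr("BCDFGHJKLMNPQRSTVWXYZ", ch) > 0 Then
			If cons < 5 Then
				cons = cons + 1
				token = token & ch
			End If
		ElseIf InStr("AEIOU", ch) > 0 Then
			If vow < 3 Then
				vow = vow + 1
				token = token & ch
			End If
		ElseIf ch >= "0" And ch <= "9" Then
			If num < 2 Then
				num = num + 1
				token = token & ch
			End If
		End If
	Next
	DTSDestination("COMP_KEY1") = UCase(Replace(DTSSource("Postcode") & "", " ", "")) & token
	Main = DTSTransformStat_OK
End Function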

If this field is effectively a computed column over two existing columns, what purpose is it used for, and why would you not just apply the function in the query which accesses the data?
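
That is, something like:

Code:
--Compute the key at query time rather than storing it
select Company_Name, Postcode,
	dbo.cdsCalculateMatchKey1(Company_Name, Postcode) as COMP_KEY1
from tableName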

 
Hi,

I was thinking along similar lines about this today. So, in short, you are saying it is more efficient to perform string manipulation via ActiveX scripting than directly on the DB itself using T-SQL?

To provide a little more detail: the field will be generated on numerous tables in order to provide a match key between the data in those tables and an existing database. I am looking to identify duplicate data, with a view to eventually consolidating all the tables (differing sources of data) into a single relational database.
 
If you are trying to provide a match check between tables, you would be better off adding an index on the company name and postcode columns in each relevant table and then doing a select statement using a join for the comparison.
This means no string manipulation and a direct comparison.
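
For example (the table names are hypothetical):

Code:
--Index the raw columns in each table...
create index IX_tableA_match on tableA (Company_Name, Postcode)
create index IX_tableB_match on tableB (Company_Name, Postcode)

--...then find candidate duplicates with a direct join
select a.Company_Name, a.Postcode
from tableA a
	join tableB b
		on  b.Company_Name = a.Company_Name
		and b.Postcode = a.Postcode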

In answer to the first part of your last response: yes, an ActiveX component will outperform SQL Server on string operations by a significant amount.
 