Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Is JOINT using varchar columns less efficient ? 1

Status
Not open for further replies.

PROPAR

Programmer
Oct 25, 2001
51
0
0
FR
As I build my SQL Server 2005 database model, I was criticized for using varchar columns to do a SQL joint instead of integer identity columns (that I should create for this purpose only).

I planned to do :

SELECT *
FROM TABLE1,TABLE2
WHERE TABLE1.string = TABLE2.string

(TABLE1.string and TABLE2.string are both indexed varchar(50) primary keys)

I was told I should more efficiently do :

SELECT *
FROM TABLE1,TABLE2
WHERE TABLE1.id_integer = TABLE2.id_integer

and add *.id_integer dedicated identity integer columns to improve joint efficiency (as int compare is supposed to be speedier than varchar compare).

Is there any truth in this ?
Thanks for your help.

 
criticized? by whom?

your boss? make the change

DBA in charge of reviewing the design? make the change

coworkers on a related project? consider making the change

some guy on the internet? ignore him

:)



r937.com | rudy.ca
 
I don't (didn't) have much of a solid answer or discussion piece but interested in seeing results sense that will be the best method on performance of the join over an int and char. The only answer you can give to this question is hard evidence in testing as r937 pointed out in so many words

Crude but comparable test.

I created 2 tables and inserted a million rows per table

both tables are identical as follows. tbl1 and tbl2
Code:
USE [test]
CREATE TABLE [dbo].[tbl1](
	[id] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
	[text] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
	[primaryInt] [numeric](18, 0) NOT NULL,
	[primaryVar] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]

I then ran a simple join on them
Code:
select * from tbl1 a
join tbl2 b on a.primaryVar = b.primaryVar

select * from tbl1 a
join tbl2 b on a.primaryInt = b.primaryInt

The execution plan thus followed showing
exe_plan.gif



Thoughts from our more experienced database developers?

[sub]____________ signature below ______________
The worst mistake you'll ever make is to do something simply the way you know how while ignoring the way it should be done[/sub]
 
Thanks onpt.
So the varchar joint seems to be only marginally less efficient, no ?
 
Read the following and make the decision yourself. This should tell you everything you need to know about join performance.

For best join performance, the indexes on the columns being joined should ideally be numeric data types, not CHAR or VARCHAR, or other non-numeric data types. The overhead is lower and join performance is faster.


Well Done is better than well said
- Ben Franklin
 
I've been playing around with this for a little bit. Here's what I've come up with...

1. Best is a relative term, and can be hard to quantify. In order to qualify for 'best' status, something can be a lot better, or only a little better than the number 2 thing. Ex: 2 Olympic runner and I run a race. The times are 2:10, 2:11, 8:40. (I'm not a good runner [smile]). In my opinion, the difference between the number 1 guy and the number 2 guy is not enough to be definitive. The same applies here. The difference in time between joins on integers and varchars is insiginificantly small.

2. When joining tables, it is more important to make sure indexes are used for the join columns. Without proper indexing in the database, the difference in time would be like comparing my race time with an olympic gold medalist's time. The difference in time is huge.

There are many database systems out there that use GUID's for primary keys instead of integers. The reason these systems don't fail is because joining on GUID's is almost as fast as joining on integers, in fact the time difference in negligible.

That being said, I would still advise that you use some arbitrary value as the primary key (which I presume you are joining on with other tables). For example, suppose you decide to use a combination of first name and last name for your joins. Everything works fine until the day you add another person that has the exact same name as someone else. Of course, if this is the primary key, you won't be able to add the person (problem #1), and even if you were, your joins would fail because it would return the wrong data (problem #2). And then, someone will get married and want to change their name (problem #3).

By using arbitrary values as your primary key, you avoid all of these issues. In fact, most systems don't show the key value to end users because it doesn't mean anything. They exist solely for the purpose of maintaining data integrity.

If you agree with everything I said, then consider this... since you should be using arbitrary values for your keys (which you usually join to other tables on), then why not pick the data type of the one that gives the best performance. Even if integers are only slightly better than varchars, I would prefer to use them. You see, integers take 4 bytes to store, varchars take 1 byte per character, and nvarchar takes 2 bytes per character. By using integers, you use less storage for the data, but you also use less data for the indexes.

This has other effects on the database. SQL Server stores data in 8K pages. The more records you can fit in to an 8K page, the less physicall I/O (hard drive access) you'll have. The less I/O you have, the faster your overall performance.

And, finally... SQL Server caches data in memory. The more data that is cached in memory, the faster your performance will be. By reducing the amount of memory overhead, you will effectively be able to store more meaningful information in memory and your performance will be better.

So... While it is true that join performance differences between integers and varchars is minimal, you should consider the overall system performance. Integers are better, and should therefore be used.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Thanks George!


Borislav Borissov
VFP9 SP1, SQL Server 2000/2005.
MVP VFP
 
onpnt,

Unfortunately, I do see a flaw with your test. The flaw is that you created 2 identical tables. If you do your joins on integer columns, then the 2nd table wouldn't have to have the varchar column (and vice versa). When you remove the un-necessary column, the performance difference is recognizable. Here's how I performed my test.

Code:
[COLOR=blue]CREATE[/color] [COLOR=blue]TABLE[/color] [dbo].[tblParent](
    [id] [[COLOR=blue]numeric[/color]](18, 0) [COLOR=blue]IDENTITY[/color](1,1) NOT NULL,
    [primaryInt] [COLOR=blue]Int[/color] NOT NULL,
    [primaryVar] [nvarchar](50) [COLOR=blue]COLLATE[/color] SQL_Latin1_General_CP1_CI_AS NOT NULL
) [COLOR=blue]ON[/color] [[COLOR=blue]PRIMARY[/color]]

Code:
[COLOR=blue]Declare[/color] @i [COLOR=blue]int[/color]
[COLOR=blue]Set[/color] @i = 1

[COLOR=blue]While[/color] @i <= 250000
  [COLOR=blue]Begin[/color]
    [COLOR=blue]Insert[/color] [COLOR=blue]Into[/color] tblParent(PrimaryInt, primaryVar)
    [COLOR=blue]Values[/color] (@i, [COLOR=#FF00FF]Left[/color]([COLOR=#FF00FF]Convert[/color]([COLOR=blue]Varchar[/color](50), newid()) + [COLOR=#FF00FF]Convert[/color]([COLOR=blue]Varchar[/color](50), newid()), 50))

    [COLOR=blue]Set[/color] @i = @i + 1
  [COLOR=blue]End[/color]

Code:
[COLOR=blue]Create[/color] [COLOR=blue]index[/color] tblParent_PrimaryInt [COLOR=blue]On[/color] tblParent(PrimaryInt)
[COLOR=blue]Create[/color] [COLOR=blue]index[/color] tblParent_PrimaryVar [COLOR=blue]On[/color] tblParent(PrimaryVar)

Code:
[COLOR=blue]Select[/color] id, primaryInt [COLOR=blue]Into[/color] tblInt [COLOR=blue]From[/color] tblParent
[COLOR=blue]Create[/color] [COLOR=blue]index[/color] tblInt_PrimaryInt [COLOR=blue]On[/color] tblInt(PrimaryInt)

Code:
[COLOR=blue]Select[/color] id, PrimaryVar [COLOR=blue]Into[/color] tblVar [COLOR=blue]From[/color] tblParent
[COLOR=blue]Create[/color] [COLOR=blue]index[/color] tblVar_PrimaryVar [COLOR=blue]On[/color] tblVar(primaryVar)

Code:
[COLOR=blue]Declare[/color] @Start [COLOR=#FF00FF]DateTime[/color]
[COLOR=blue]Set[/color] @Start = [COLOR=#FF00FF]GetDate[/color]()

[COLOR=blue]Select[/color] tblParent.Id
[COLOR=blue]From[/color]   tblParent
       [COLOR=blue]Inner[/color] [COLOR=blue]Join[/color] tblVar 
         [COLOR=blue]On[/color] tblParent.PrimaryVar = tblVar.PrimaryVar

[COLOR=blue]Select[/color] .001 * [COLOR=#FF00FF]DateDiff[/color]([COLOR=#FF00FF]Millisecond[/color], @Start, [COLOR=#FF00FF]GetDate[/color]())

Code:
[COLOR=blue]Declare[/color] @Start [COLOR=#FF00FF]DateTime[/color]
[COLOR=blue]Set[/color] @Start = [COLOR=#FF00FF]GetDate[/color]()

[COLOR=blue]Select[/color] tblParent.Id
[COLOR=blue]From[/color]   tblParent
       [COLOR=blue]Inner[/color] [COLOR=blue]Join[/color] tblInt
         [COLOR=blue]On[/color] tblParent.PrimaryInt = tblInt.PrimaryInt

[COLOR=blue]Select[/color] .001 * [COLOR=#FF00FF]DateDiff[/color]([COLOR=#FF00FF]Millisecond[/color], @Start, [COLOR=#FF00FF]GetDate[/color]())

According to test on my computer, the varchar table runs in approximately 3.2 seconds while the test on the integer table runs in approximately 1.8 seconds. While this is not a huge difference in time, it is significant. I suspect the reason for the difference is because of the file I/O differences between the int table and the varchar table. The int table would have less physical reads while the varchar table would have more.

Make sense?

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Thanks George. When I did that it seemed like a bad idea to make the tables identical like that.

Awesome post and explanation as always *

[sub]____________ signature below ______________
The worst mistake you'll ever make is to do something simply the way you know how while ignoring the way it should be done[/sub]
 
That is about 40% faster for the int after creating a clustered index first on int and then on the varchar column
pages and extent differences are minimal


here is what I used

Code:
create TABLE [dbo].[tblParent](
    [id] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
    [primaryInt] Int NOT NULL,
    [primaryVar] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]


--put a tran here and use nocount on , makes a BIG difference in speed ;-)
set nocount on
begin tran
Declare @i int
Set @i = 1

While @i <= 250000
  Begin
    Insert Into tblParent(PrimaryInt, primaryVar)
    Values (@i, Left(Convert(Varchar(50), newid()) + Convert(Varchar(50), newid()), 50))

    Set @i = @i + 1
  End
commit

 --> Index the parent table
Create index tblParent_PrimaryInt On tblParent(PrimaryInt)
Create index tblParent_PrimaryVar On tblParent(PrimaryVar)


--> Create the integer table
Select id, primaryInt Into tblInt From tblParent
Create index tblInt_PrimaryInt On tblInt(PrimaryInt)


--> Create the varchar table
Select id, PrimaryVar Into tblVar From tblParent
Create index tblVar_PrimaryVar On tblVar(primaryVar)


--> Test the varchar table
Declare @Start DateTime
Set @Start = GetDate()

Select tblParent.Id
From   tblParent
       Inner Join tblVar 
         On tblParent.PrimaryVar = tblVar.PrimaryVar

Select .001 * DateDiff(Millisecond, @Start, GetDate())


CODE --> Test the integer table
Declare @Start DateTime
Set @Start = GetDate()

Select tblParent.Id
From   tblParent
       Inner Join tblInt
         On tblParent.PrimaryInt = tblInt.PrimaryInt

Select .001 * DateDiff(Millisecond, @Start, GetDate())


--now let's add a clustered index on the varchar column
create clustered index pk_Whatever on tblParent(PrimaryInt)
dbcc showcontig ('tblParent')

--check pages and extents
DBCC SHOWCONTIG scanning 'tblParent' table...
Table: 'tblParent' (1707153127); index ID: 1, database ID: 26
TABLE level scan performed.
- Pages Scanned................................: 3969
- Extents Scanned..............................: 498
- Extent Switches..............................: 497
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.80% [497:498]
- Logical Scan Fragmentation ..................: 12.50%
- Extent Scan Fragmentation ...................: 2.41%
- Avg. Bytes Free per Page.....................: 33.5
- Avg. Page Density (full).....................: 99.59%

--run query
Declare @Start DateTime
Set @Start = GetDate()

Select tblParent.Id
From   tblParent
       Inner Join tblInt
         On tblParent.PrimaryInt = tblInt.PrimaryInt

Select .001 * DateDiff(Millisecond, @Start, GetDate())
2e246cw.jpg



Code:
--drop index
drop index tblParent.pk_Whatever

--create it on the varchar column
create clustered index pk_Whatever on tblParent(primaryVar)
dbcc showcontig ('tblParent')

--check pages and extents
DBCC SHOWCONTIG scanning 'tblParent' table...
Table: 'tblParent' (1707153127); index ID: 1, database ID: 26
TABLE level scan performed.
- Pages Scanned................................: 3970
- Extents Scanned..............................: 504
- Extent Switches..............................: 503
- Avg. Pages per Extent........................: 7.9
- Scan Density [Best Count:Actual Count].......: 98.61% [497:504]
- Logical Scan Fragmentation ..................: 0.03%
- Extent Scan Fragmentation ...................: 0.20%
- Avg. Bytes Free per Page.....................: 35.5
- Avg. Page Density (full).....................: 99.56%


--run query
Declare @Start DateTime
Set @Start = GetDate()

Select tblParent.Id
From   tblParent
       Inner Join tblVar 
         On tblParent.PrimaryVar = tblVar.PrimaryVar

Select .001 * DateDiff(Millisecond, @Start, GetDate())

2n1tds4.jpg









Denis The SQL Menace
--------------------
SQL Server Code,Tips and Tricks, Performance Tuning
Google Interview Questions
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top