
Field types question


Andrzejek

Programmer
Jan 10, 2006
I'm new to SQL Server, so I thought I would ask the experts.
I see a lot of tables with fields defined as nvarchar(MAX) (up to 2 GB of data?) or datetime2(7) (precise down to 100 nanoseconds), etc. I like to use a 'lean-mean' approach to defining data types in my tables: if my data will not (should not) exceed 7 characters, I define the field as varchar(7); if I just need a date (no time portion), I use the DATE data type. I then validate my data at the entry point in my application to enforce those limits.

Is this the right approach?

Or should I follow the 'other' way, as many others do, and use (max) fields so I can handle whatever is thrown my way and deal with it later?

Aside from performance issues, I also face some problems with data migrations when I grab data from other places and try to validate it before fitting it into my structure. Checking the length, for example: I have a 7-character limit, but who knows how long the data from the other source will be? It should be up to 7, but since they allow (max) I have to check it, and that takes time and resources and costs performance when I have to deal with a lot of fields.
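To make the contrast concrete, here is roughly what I mean by the two styles (the table and column names are just made up for illustration):

-- 'Lean-mean' style: types sized to the business rule
CREATE TABLE dbo.PersonLean
(
    PersonId  int        NOT NULL PRIMARY KEY,
    LoginName varchar(7) NOT NULL,  -- business rule says at most 7 characters
    BirthDate date       NULL       -- no time portion needed
);

-- 'Take whatever comes' style
CREATE TABLE dbo.PersonWide
(
    PersonId  int           NOT NULL PRIMARY KEY,
    LoginName nvarchar(MAX) NULL,
    BirthDate datetime2(7)  NULL
);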


---- Andy

"Hmm...they have the internet on computers now"--Homer Simpson
 
In general, using the shortest data types sufficient for your needs fits more records per data page, and that's fine.
Records longer than 8060 bytes are split across in-row and row-overflow pages. That's not the end of the world, and if you store long texts, say book chapters, it's not worth splitting the text into varchar(8000) chunks just to avoid it.

There are more important things to learn first, if you ask me. If there is one topic about data storage that matters above any other detail, it's the difference between tables with and without a clustered index.

The effects can be drastic. To give you one example, read this short piece by Brent Ozar on MAX columns:

If you store text in order to query it, you will want to use full-text indexing and its specific features and functions anyway, not an oversimplified search. Varchar(MAX) and BLOB fields are also fine if you never need to search inside them and only fetch them along with a record when needed, say to hold a product description or even a whole PDF of product specs. They are just not good for searching in without full-text indexing features.
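Just as a sketch of what I mean; the table, the column, and the existence of a full-text index are all assumptions for illustration:

-- Assumes a full-text index already exists on dbo.Products(Description)
SELECT ProductId, Description
FROM   dbo.Products
WHERE  CONTAINS(Description, N'widget');   -- can use the full-text index

-- versus the oversimplified search that cannot use a normal index efficiently:
-- WHERE Description LIKE N'%widget%'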

If you want to know how data is stored, there is metadata available about it; one source of information is the system view sys.dm_db_partition_stats.
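For example, something along these lines shows how a table's pages are split between in-row, row-overflow, and LOB storage (dbo.YourTable is a placeholder name):

SELECT  OBJECT_NAME(object_id)        AS table_name,
        index_id,
        row_count,
        in_row_data_page_count,
        row_overflow_used_page_count, -- pages for rows longer than ~8060 bytes
        lob_used_page_count           -- pages for (MAX)/LOB values stored off-row
FROM    sys.dm_db_partition_stats
WHERE   object_id = OBJECT_ID(N'dbo.YourTable');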

Chriss
 
I've changed my mind numerous times throughout my career. Initially I was in the "shortest fields possible" camp, but in recent years that has morphed into "shortest possible with wiggle room".

To be clear, I rarely use varchar(max) unless it's a Notes column, because then you never know. Max columns have performance overhead that I would rather avoid.

I've gone to the "with wiggle room" approach because your present self cannot predict how your future users will use the data. Disk space is cheap and relatively fast. Do you know what's really slow? Changing a varchar(10) column to varchar(12). There's likely to be dozens of places where the code needs to change, but the worst part is finding all those places.

For example, in the US, it takes a minimum of 10 characters to store a phone number. International phone numbers are longer. As soon as a user wants to store a longer phone number, you will likely spend a considerable amount of time modifying your app to accommodate it. Additionally, some people like to store phone numbers with formatting, like (800) 111-2222 or 800-111-2222, and the formatting takes extra characters. And then there are folks who like to store extensions in the phone number, like 800-111-2222 x 3333. Currently, I allow 20 characters for any phone number column, and if you asked some of my customers, they would (on rare occasions) prefer longer fields, like "800-111-2222 x 3333 (Mother Work)". You can argue that this is just plain wrong because it violates database normalization principles (storing many bits of information in one column). You can also argue that formatting belongs on the front end. Both of these things are true. To that, I would refer you to my previous statement... your present self cannot predict how future users will use your app.
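For what it's worth, widening the column itself is usually the easy part; it's everything that assumed the old length that hurts. A rough sketch (dbo.Customers and Phone are made-up names):

-- Widening a varchar is typically a quick change on the table itself...
ALTER TABLE dbo.Customers
    ALTER COLUMN Phone varchar(20) NOT NULL;
-- ...finding every report, import, and UI field that assumed varchar(10) is the slow part.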

I agree with Chris. If you are facing performance issues, it is probably not because of the data types, and probably has more to do with indexing.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
gmmastros said:
I agree with Chris. If you are facing performance issues, it is probably not because of the data types, and probably has more to do with indexing.

Thanks, George. I also agree with your wiggle-room idea. My thought was mainly about using date vs. datetime. If you know you want to store something attached to a date, you rarely find later that you need a datetime instead. A calendar event or appointment naturally calls for datetime, of course, but there are enough other things that need precisely a date with no time portion, and those I would not put into the wiggle-room category.
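Something like this is what I have in mind (the table and column names are invented):

CREATE TABLE dbo.Tasks
(
    TaskId     int          NOT NULL PRIMARY KEY,
    DueOn      date         NOT NULL, -- "due on this day" never needs a time portion
    ReminderAt datetime2(0) NULL      -- an actual point in time, so datetime2 fits
);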

It's not a good idea to shorten everything to only what you need right now. But there's also no point in first defining a varchar(max) column to see what users insert and then, after a year, changing it to the longest current value. Murphy's law tells you the next value will be longer; or, in a milder form, that longer values will appear sometime in the future.

Varchar values really are stored with variable length in the data pages, so there is no wasted space if you raise the maximum possible length. That still doesn't make varchar(8000) a default you can use for anything, even though varchar(8000) differs from varchar(max) in the same way as any other varchar(n) with n up to 8000 does. It's actually varchar(n) with a small n that can become more wasteful than fixed-length char, because of the per-value length overhead. You can, as said, verify how storage space is used and experiment with generated test data to reach such findings earlier than through real usage, even though you will likely have other problems in the future anyway.
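You can see the variable-length storage for yourself, for example:

-- DATALENGTH returns the bytes actually stored, regardless of the declared maximum
DECLARE @narrow varchar(7)    = 'abc';
DECLARE @wide   varchar(8000) = 'abc';
SELECT  DATALENGTH(@narrow) AS narrow_bytes, -- 3
        DATALENGTH(@wide)   AS wide_bytes;   -- 3 as well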

Changing a table is not necessarily the worst-case scenario you always want to avoid. It is possible to alter your database; you're not setting things in stone. I would also question code that cannot handle a widened varchar when it already works with varying lengths anyway. For example, I don't know of any programming language that has a string length limitation other than the memory available to the process. You can have things like input size limits in fixed-size widgets, and that is usually motivated by the same rules the column sizes come from, so naturally you need to adapt both the application and the database. A data-driven approach helps here: you then only need to adapt the database, including the metadata about it, and the user interface can retrieve which limitations have to be forwarded to the user.
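As one way to do that data-driven approach, the column metadata is already there to query (dbo.Customers is just an example name):

SELECT  COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   TABLE_SCHEMA = 'dbo'
  AND   TABLE_NAME   = 'Customers'; -- feed these limits to the UI instead of hard-coding them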

Chriss
 
Just to shed more light on what I am trying to accomplish...
I am grabbing data from another source and storing it so I can give some functionality to the users. I just found out that all fields in the 'source' are set to varchar(255). That answers most of my questions about how to set up my fields. I will still set some fields on my end as INT, since I always expect a number; if something else comes my way, I will just reject it as an error. You have to draw the line somewhere, right?
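Something like this is how I plan to catch the non-numbers (the staging table and column names are just placeholders):

SELECT  SourceValue,
        TRY_CONVERT(int, SourceValue) AS AsInt -- NULL when the value is not a valid INT
FROM    staging.ImportedData
WHERE   TRY_CONVERT(int, SourceValue) IS NULL; -- these rows get rejected as errors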

I do agree with "Shortest possible with wiggle room". I have been using this approach for years now...

---- Andy

"Hmm...they have the internet on computers now"--Homer Simpson
 
Data with all columns being varchar(255) sounds like a staging table for importing any CSV, not one specifically adapted to one type of CSV data. It makes importing CSV files easy, since no thought is needed about data types, except that it fails on longer values. But it doesn't help with finding out what data types to use for an actual table.
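A typical shape for that kind of staging table, versus the properly typed table the data should end up in (all names invented for illustration):

CREATE TABLE staging.CsvImport
(
    Col1 varchar(255) NULL,
    Col2 varchar(255) NULL,
    Col3 varchar(255) NULL
);

CREATE TABLE dbo.Orders
(
    OrderId   int         NOT NULL PRIMARY KEY,
    OrderedOn date        NOT NULL,
    Customer  varchar(50) NOT NULL
);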

Chriss
 
When I say: "all fields in the 'source' are set to varchar(255)", that is a $50M(+++) software that was implemented a few years ago. For this price you cannot expect much, right? You get what you've payed for. [upsidedown]

---- Andy

"Hmm...they have the internet on computers now"--Homer Simpson
 
Well, it's not a bad idea for a CSV staging table. A staging table is never the endpoint of the data; it is only staged there. What that means is explained here, for example:
But what you need after staging data from an import process (or, as the wiki article says, an ETL process) is to get this data into actual tables.

So I wouldn't be surprised to find a staging table like that in an expensive piece of software, if it allows for ETL processes, but I'd also expect it to finalize the import or load process, knowing the data types of specific CSV files or inferring data types with complex routines, and then pulling that data into other tables of the database from there. So maybe it's the wrong idea to read from that staging table; instead you should find where the data eventually ends up and can be retrieved from with normal data types, not all varchar(255).
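The "finalize the load" step I mean would look roughly like this, reusing the made-up names from my staging sketch above (TRY_CONVERT simply yields NULL for values that don't fit):

INSERT INTO dbo.Orders (OrderId, OrderedOn, Customer)
SELECT  TRY_CONVERT(int,  Col1),
        TRY_CONVERT(date, Col2),
        LEFT(Col3, 50)
FROM    staging.CsvImport
WHERE   TRY_CONVERT(int, Col1) IS NOT NULL;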

Chriss
 
Chriss, you are right on certain points. I do have an ETL process, but I don't think I am grabbing data from a staging table. I checked, and even at data entry there is no validation of the data. For example, a login name can be (should be) up to 7 characters according to the business rules, but the 'expensive' software allows 255 characters anyway, and if/when notified, some IT people will go and 'fix' the wrong data.
[banghead]
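On my side of the import I can at least enforce the rule myself, for example with a check constraint (dbo.Users and LoginName are placeholder names):

-- Enforce the 7-character business rule in the database itself
ALTER TABLE dbo.Users
    ADD CONSTRAINT CK_Users_LoginName_Len CHECK (LEN(LoginName) <= 7);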

But, thank you for your input. :)

---- Andy

"Hmm...they have the internet on computers now"--Homer Simpson
 
Well, I can only make assumptions, and I'm not the one to blame an obviously serious enterprise application that likely doesn't come from a small ISV. Anyway, the other reason I have found for such bad table designs is that they are not designed by humans; they arise from the persistence layers of frameworks that are not bad in themselves. The database then just becomes the persistence layer of that framework, and developers using the framework may only have limited influence on which tables are finally used for data storage, aka the persistence layer.

For example, your rule about user login names of up to 7 characters could be enforced by an application layer that checks business rules. No question, if the persistence layer then still allows 255-character login names, that doesn't match the goal, but you can still only blame them for doing a bad job if you can log in with a longer login name that you "sneaked" into the database layer directly. In that case their protection of the data persistence layer is clearly bad, and that should be reported as a fault.

Chriss
 