SQL 2000 - Design question (hierarchy table)


cfStarlight (Technical User)
Mar 27, 2007
I need to build a type of 'virtual' folder structure in a database table, like Windows Explorer but table-driven. The physical files will be stored on disk, but the tree (or hierarchy) will be stored in a db table. It will contain a fair amount of data.

I've been reading about several methods for representing this type of tree:

1. Nested sets model
2. Adjacency model
3. Materialized path

Each seems to have pros and cons. I don't have much experience with them, but my concerns are:

1. Nested sets: Concerned about blocking, since moving or inserting nodes requires updating left/right values across a large portion of the table.

2. Adjacency model: MS SQL 2000 does not support CTEs, so retrieving an entire branch involves looping, which may be slow or resource-intensive.

3. Materialized path: Looks like a possibility, but it would require a large varchar column to store the path. AFAIK, MS SQL 2000 does not support indexes on large varchar columns, so I'm concerned about query performance.
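
To make the comparison concrete, here's roughly how I picture each model's table (just a sketch; the table and column names are made up for illustration):

Code:
-- 1. Nested sets: each node stores left/right numbers that bracket its whole subtree
create table FolderNested (
   FolderID int identity primary key,
   FolderName varchar(256) not null,
   Lft int not null,
   Rgt int not null
)

-- 2. Adjacency list: each node just points at its parent
create table FolderAdjacency (
   FolderID int identity primary key,
   ParentID int null,
   FolderName varchar(256) not null
)

-- 3. Materialized path: each node stores the chain of ancestor keys, e.g. '1.4.17.'
create table FolderPath (
   FolderID int identity primary key,
   FolderName varchar(256) not null,
   Path varchar(900) not null
)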

Any suggestions, advice, real life experiences (or even horror stories ;) that might help me in choosing one of these models?
 
I've seen a combination of adjacency list + materialized path. It gives you some of the benefits of both, which helps make up for some of the deficits of each. For example, you still have to loop to directly examine all the properties of a node's parents or children (such as counting them or making sure they're all active), but selecting an entire branch becomes a simple LIKE condition on a single column. The model I saw also had a flag marking leaf nodes, which further simplified selection without greatly increasing the number of rows that need updating when a leaf node is added or deleted.
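
As a rough illustration of the combined shape (the table and column names here are invented, and I'm using a dotted-text path just to keep the example readable):

Code:
-- Hypothetical combined adjacency list + materialized path table
create table Folder (
   FolderID   int identity primary key,
   ParentID   int null references Folder(FolderID), -- adjacency list part
   FolderName varchar(256) not null,
   Path       varchar(900) not null,                -- materialized path, e.g. '1.4.17.'
   IsLeaf     bit not null default (1)              -- leaf flag to simplify some selections
)

-- Direct children: the adjacency column is enough
select * from Folder where ParentID = 4

-- The entire branch under folder 4 (whose path is '1.4.'): one LIKE on the path column
select * from Folder where Path like '1.4.%'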

In regards to the large varchar column to store the path, my idea is to use a binary representation of the integer key (in either varchar or varbinary) instead of text: you skip the separator character entirely and pad every key to a fixed four bytes. Some experimentation would be in order.

MS SQL Server 2000 supports index key lengths of up to 900 bytes. With 4-byte int keys, that means your materialized path could go 225 levels deep; with 8-byte bigint keys, 112 levels. Selecting from the thing would be interesting, but a numbers table to join against could help, especially one containing only the values 4n + 1, where n goes from 0 to 224.
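
For instance, unpacking a binary path back into its individual ancestor keys might look something like this (just a sketch of the idea; @Path stands in for a stored path value, and Numbers is a helper table holding 1, 5, 9, ... as described above):

Code:
-- 900 bytes / 4 bytes per int key = 225 levels (900 / 8 = 112 levels for bigint keys)
declare @Path varbinary(900)
set @Path = convert(varbinary(4), 1) + convert(varbinary(4), 17) + convert(varbinary(4), 36)

-- Each key is a fixed 4 bytes, so ancestor keys start at byte positions 1, 5, 9, ...
select
   TreeLevel  = (N.Number + 3) / 4,
   AncestorID = convert(int, substring(@Path, N.Number, 4))
from Numbers N
where N.Number <= datalength(@Path) - 3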

If you HAD to use a text representation of the number plus separators (and I recommend separators both before and after the string, for selection speed), you'd eventually be down to only 81 levels allowable in 900 bytes. But that might be enough. A hex representation of the key with separators would get you to 89 levels, and a base-36 representation would allow 149 levels.


... just some ideas. I honestly don't have a ton of real-world experience with each different method, but I do know that these exist out there in production databases. I believe the nested sets model is the hardest to understand and maintain.

P.S. There seems to be a fourth option detailed here. But I don't know if it's feasible.

 
Thanks for the response. Your idea of 'a binary representation of an integer key' sounds interesting. Can you give me a small example to make sure I'm understanding it correctly?

Yes, nested sets may lack the intuitive quality of materialized path, but after reading a few articles it seems to make sense. The maintenance side could be encapsulated in stored procedures, so I'm not too concerned with that part. I was more concerned with overall performance.
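
For example, I'm picturing something like this for an insert under the nested sets model (just a rough sketch I haven't tested, using the made-up FolderNested table from my first post):

Code:
create procedure AddFolder
   @ParentID int,
   @FolderName varchar(256)
as
begin
   declare @ParentRgt int

   begin tran

   -- lock the parent row and read its right boundary
   select @ParentRgt = Rgt
   from FolderNested with (updlock, holdlock)
   where FolderID = @ParentID

   -- open a two-wide gap just inside the parent's right boundary
   update FolderNested set Rgt = Rgt + 2 where Rgt >= @ParentRgt
   update FolderNested set Lft = Lft + 2 where Lft > @ParentRgt

   -- the new leaf occupies the gap
   insert FolderNested (FolderName, Lft, Rgt)
   values (@FolderName, @ParentRgt, @ParentRgt + 1)

   commit tran
end

This is also exactly where my blocking worry comes from: every insert shifts Lft/Rgt values for a good chunk of the table.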

I'm still trying to wrap my brain around Tropashko's article; interesting stuff, but not smooth reading ;-)

If anyone has any experiences/thoughts on the different models they'd like to share, all thoughts are welcome.
 
Tropashko's article seems to fail, though, because of the small size of ints. If you click into the responses you'll find a very technical but interesting paper from Oracle on the subject. I'm trying to imagine another encoding for the same sort of thing that would take less space, but so far I'm failing. :)

Also, I've been thinking about this, and an index on a materialized path seems pretty useless. It couldn't be used unless you both excluded the node itself from its path (even though including it has several big advantages) and were only looking for the root node. Unless you reversed the order of the path, but then the index can only be used to get the direct parent, which in a combined method is already present in another column. Plus, reversing the order of the path makes it unintuitive and harder to use, because position counted from the left no longer represents tree depth.



 
I missed the Oracle paper. I'll check it out.

Hm. I guess when I thought about indexing I was thinking more of the nested sets or adjacency models, where it makes sense. Why do you say an index on a materialized path is useless? Because you'd need to use LIKE, which probably wouldn't utilize an index anyway?
 
LIKE doesn't use indexes when the pattern begins with a wildcard character. For a LIKE clause to use an index (to be sargable), the comparison has to start with literal characters.
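
For example, against a dotted-text path column like the hypothetical Folder table above:

Code:
-- Sargable: the pattern starts with literal characters, so an index on Path (if there is one) can seek
select * from Folder where Path like '1.4.%'

-- Not sargable: the leading wildcard forces a scan
select * from Folder where Path like '%.4.%'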

 
Here's some fun for everyone:
Code:
-- create database Tree
-- use Tree
go
set nocount on
go
/*
drop table Tree
drop table Numbers4
drop view TextTree
drop view NodeAncestor
drop view NodeAncestorBinary
drop view NodeDescendant
drop view NodeDescendantBinary
*/
go
create table Tree (
   NodeID int not null identity (1,1) constraint PK_Tree primary key clustered,
   NodeBinary as Convert(varbinary(4), NodeID), -- not sure if this is helpful. Probably not.
   ParentID int null -- Nodes with null as Parent are root Nodes. Other schemes are possible.
      constraint FK_Tree_ParentID_NodeID foreign key (ParentID) references Tree(NodeID),
   NodeName varchar(256) not null,
   Path varbinary(900) not null constraint DF_Tree_Path default (0x),
   Depth int not null constraint DF_Tree_Depth default (1),
   constraint CK_Tree_ParentID_Ne_NodeID check (IsNull(ParentID, 0) <> NodeID)
)
-- Path column will include self first for ease of including Parent Node in whole branch selection
-- and for ease of ordering nodes by hierarchy without extra calculations
-- tradeoff: reduces maximum Depth by 1, lengthens all Paths by 4 bytes,
-- makes what sargable queries there are against this column pretty useless (and the index useless).
go
create unique nonclustered index IX_Tree_NodeName on Tree (NodeName)
create nonclustered index IX_Tree_ParentID on Tree (ParentID)
-- not sure the following is helpful at ALL. It may even be harmful.
create unique nonclustered index IX_Tree_Path on Tree (Path)
go
create trigger TR_Tree_I on Tree for insert
as
set nocount on
update T
set
   T.Path = IsNull(P.Path, 0x) + Convert(varbinary(4), T.NodeID),
   T.Depth = IsNull(P.Depth, 0) + 1 -- saves math later to be 1 instead of 0
from
   Inserted I
   inner join Tree T on I.NodeID = T.NodeID
   left join Tree P on T.ParentID = P.NodeID
go
create trigger TR_Tree_U on Tree for update
-- for simplicity one could require that updates to ParentIDs only occur one at a time
-- but an interesting problem is to handle updates of the ParentIds of almost every node in the tree simultaneously:
as
set nocount on
if not update (parentid) return
-- collect all children of affected rows as its own sub-tree
create table #Affected (
   NodeID int,
   ParentID int,
   NewPath varbinary(900),
   Depth int,
   AncestorID int
)

declare @rowcount int

insert #Affected
select
   I.NodeID,
   I.ParentID,
   NewPath = convert(varbinary(4), I.NodeID),
   Depth = 1,
   AncestorID = I.ParentID -- the greatest ancestor known for all affected nodes
from
   Inserted I
   -- Don't bother with rows that haven't had ParentID updated
   -- Depends on particular patterns of table updates, but I'm gambling here that
   -- these not-ParentID updated rows won't be children of ParentID-updated rows
   -- Since otherwise they'll just be added in the next step with perhaps more overhead
   inner join Deleted D ON I.NodeID = D.NodeID and IsNull(I.ParentID, 0) <> IsNull(D.ParentID, 0)

set @rowcount = @@rowcount

if @rowcount = 0 return

while @rowcount > 0 begin
   insert #Affected
   select
      C.NodeID,
      C.ParentID,
      NewPath = P.NewPath + Convert(varbinary(4), C.NodeID),
      Depth = P.Depth + 1,
      AncestorID = P.AncestorID -- get as far as we can now
   from
      Tree C
      inner join #Affected P ON C.ParentID = P.NodeID
      left join #Affected X ON C.NodeID = X.NodeID
   where
      X.NodeID IS NULL -- don't insert nodes that already exist
   set @rowcount = @@rowcount
end

-- calculate paths and depths for just the #Affected sub-tree
set @rowcount = 1

while @rowcount > 0 begin
   update C
   set
      C.NewPath = P.NewPath + Convert(varbinary(4), C.NodeID),
      C.Depth = C.Depth + 1,
      C.AncestorID = P.AncestorID
   from
      #Affected C
      inner join #Affected P on C.ParentID = P.NodeID
   where
      C.NodeID IS NOT NULL -- don't update new root nodes
      AND C.NewPath <> P.NewPath + Convert(varbinary(4), C.NodeID)
      -- ( ... OR IsNull(C.AncestorID, 0) <> IsNull(P.AncestorID, 0)) -- can't use simpler clause because of circular reference issues

   set @rowcount = @@rowcount

   if not exists(select * from #Affected where Depth = 1) begin
      -- no depth 1 nodes in affected branch = circular reference
      raiserror('Update failed. A node cannot be its own ancestor.', 16, 1)
      rollback tran
      return
   end
end

-- now finally correct all updated nodes' paths and depths
update T
set
   T.Path = IsNull(R.Path, 0x) + A.NewPath,
   T.Depth = IsNull(R.Depth, 0) + A.Depth
from
   Tree T
   inner join #Affected A on T.NodeID = A.NodeID
   left join Tree R on A.AncestorID = R.NodeID
go
create trigger TR_Tree_D on Tree instead of delete
as
set nocount on
-- collect all children of affected rows as its own sub-tree
create table #Deleted (NodeId int)
insert #Deleted select NodeID from Deleted

while @@rowcount > 0 begin
   insert #Deleted
   select
      T.NodeID
   from
      Tree T
      inner join #Deleted D ON T.ParentID = D.NodeID
      left join #Deleted X ON T.NodeID = X.NodeID
   where
      X.NodeID IS NULL -- don't insert nodes that already exist in #Deleted
end

delete T
from
   Tree T
   inner join #Deleted D on T.NodeID = D.NodeID
go
-- helper table to make Path investigation easier, although I found a less costly way for most queries
create table Numbers4 (Number int not null identity(1, 4))
insert Numbers4 default values
while scope_identity() < 897 insert Numbers4 default values
go
-- some views for convenience. It might be better to issue the queries directly in many cases, though
go
create view TextTree
as
select top 100 percent
   NodeDescr = replicate(' ', Depth * 3 - 3) + '+- ' + Convert(varchar(11), NodeID) + ' ' + NodeName,
   *
from Tree
order by Path
go
-- this query was my first stab
/*
create view NodeAncestor
as
select
   DescendantID = T.NodeId,
   AncestorID = convert(int, (substring(T.Path, N.Number, 4))),
   AncestorDepth =  N.Number / 4, -- absolute depth of ancestor
   DescendantDepth = T.Depth, -- absolute depth of descendant
   RelativeDepth = T.Depth - N.Number / 4
from dbo.Tree T
inner join dbo.Numbers4 N
   on N.Number < len(T.Path) - 4 --exclude self
-- minus 4 to exclude self, or leave out because some queries are easier that way
-- but then one has to use RelativeDepth = 0 for some other queries
*/
go
--but I found a more efficient way (about 1/2 to 1/3rd the cost because no table scan and no join to Numbers)
go
create view NodeAncestor
as
select
   DescendantID = C.NodeID,
   AncestorID = N.NodeID,
   AncestorDepth = N.Depth, -- absolute depth of ancestor
   DescendantDepth = C.Depth, -- absolute depth of descendant
   RelativeDepth = C.Depth - N.Depth
from
   Tree N
   inner join Tree C on left(C.Path, N.Depth * 4) = N.Path AND C.NodeID <> N.NodeID -- excluding self is optional
go
create view NodeAncestorBinary
as
select
   DescendantBinary = C.NodeBinary,
   AncestorBinary = N.NodeBinary,
   AncestorDepth = N.Depth,
   DescendantDepth = C.Depth,
   RelativeDepth = C.Depth - N.Depth
from
   Tree N
   inner join Tree C on left(C.Path, N.Depth * 4) = N.Path AND C.NodeID <> N.NodeID
go
create view NodeDescendant
as
select
   AncestorID = N.NodeID,
   DescendantID = C.NodeID,
   DescendantDepth = C.Depth,
   AncestorDepth = N.Depth,
   RelativeDepth = C.Depth - N.Depth
from
   Tree N
   inner join Tree C on left(C.Path, N.Depth * 4) = N.Path AND C.NodeID <> N.NodeID
go
create view NodeDescendantBinary
as
select
   AncestorBinary = N.NodeBinary,
   DescendantBinary = C.NodeBinary,
   DescendantDepth = C.Depth, -- absolute depth of descendant
   AncestorDepth = N.Depth, -- absolute depth of ancestor,
   RelativeDepth = C.Depth - N.Depth
from
   Tree N
   inner join Tree C on left(C.Path, N.Depth * 4) = N.Path AND C.NodeID <> N.NodeID
go


-------- test 1
insert Tree (ParentID, NodeName) select NULL, 'Root 1'
insert Tree (ParentID, NodeName) select 1, 'Trunk 1'
insert Tree (ParentID, NodeName) select 2, 'Branch 1'
insert Tree (ParentID, NodeName) select 2, 'Branch 2'
insert Tree (ParentID, NodeName) select 2, 'Branch 3'
insert Tree (ParentID, NodeName) select 2, 'Branch 4'
insert Tree (ParentID, NodeName) select 3, 'Twig 1'
insert Tree (ParentID, NodeName) select 3, 'Twig 2'
insert Tree (ParentID, NodeName) select 4, 'Twig 3'
insert Tree (ParentID, NodeName) select 5, 'Twig 4'
insert Tree (ParentID, NodeName) select 5, 'Twig 5'
insert Tree (ParentID, NodeName) select 5, 'Twig 6'
insert Tree (ParentID, NodeName) select 6, 'Twig 7'
insert Tree (ParentID, NodeName) select 6, 'Twig 8'
insert Tree (ParentID, NodeName) select 6, 'Twig 9'
insert Tree (ParentID, NodeName) select 6, 'Twig 10'
insert Tree (ParentID, NodeName) select 1, 'Corporeal'
insert Tree (ParentID, NodeName) select 1, 'Incorporeal'
insert Tree (ParentID, NodeName) select 17, 'Animal'
insert Tree (ParentID, NodeName) select 17, 'Mineral'
insert Tree (ParentID, NodeName) select 17, 'Vegetable'
insert Tree (ParentID, NodeName) select 19, 'Cat'
insert Tree (ParentID, NodeName) select 19, 'Hippopotamus'
insert Tree (ParentID, NodeName) select 19, 'Dinosaur'
insert Tree (ParentID, NodeName) select 13, 'Tiny Twig 1'
insert Tree (ParentID, NodeName) select 13, 'Tiny Twig 2'
insert Tree (ParentID, NodeName) select 13, 'Tiny Twig 3'
insert Tree (ParentID, NodeName) select 13, 'Tiny Twig 4'
insert Tree (ParentID, NodeName) select 27, 'Tinier Twig 1'
insert Tree (ParentID, NodeName) select 27, 'Tinier Twig 2'
insert Tree (ParentID, NodeName) select 29, 'Leaf 1'
insert Tree (ParentID, NodeName) select 29, 'Leaf 2'
insert Tree (ParentID, NodeName) select 29, 'Leaf 3'
insert Tree (ParentID, NodeName) select 30, 'Leaf 4'
insert Tree (ParentID, NodeName) select 30, 'Leaf 5'
insert Tree (ParentID, NodeName) select 18, 'Feelings'
insert Tree (ParentID, NodeName) select 36, 'Love'
insert Tree (ParentID, NodeName) select 36, 'Hate'
insert Tree (ParentID, NodeName) select 36, 'Apathy'
go
-- create statistics that the execution plans in QA say are missing
-- but not sure if Path is very useful since all start with root Node values such as 0x00000001
CREATE STATISTICS [Statistic_Path] ON [dbo].[Tree] ([Path]) WITH FULLSCAN 
CREATE STATISTICS [Statistic_Number] ON [dbo].[Numbers4] ([Number]) WITH FULLSCAN 
go



--see tree
select NodeDescr from TextTree
go
-- All nodes under a specified Parent
-- Not sure if the less-than clause is better... I think it is for very large trees, but so far it has been slightly more costly
-- This first stab not as good as later method below
select C.*
from Tree C
inner join Numbers4 N
   on N.Number < len(C.Path)
   and substring(C.Path, N.Number, 4) = (select NodeBinary from Tree where NodeName = 'Branch 4')
order by C.Path
go

-- All nodes under a specified Parent, a better way (no numbers table)
-- execution plan looks more than twice as good, and actual cpu cost/reads cut by a factor of 7 - 10!
select C.*
from
   Tree C
   inner join Tree P on left(C.Path, P.Depth * 4) = P.Path
where
   P.NodeName = 'Branch 4'
order by C.Path


-- test 2 -------- fill tree as deep as possible
truncate table tree

insert tree (parentid, nodename) select null, 'Root'
while scope_identity() < 225
   insert tree (parentid, nodename) select scope_identity(), 'Child ' + convert(varchar(11), scope_identity())

-- break tree on node depth 226
insert tree (parentid, nodename) select scope_identity(), 'Child ' + convert(varchar(11), scope_identity())


---------- test 3 - fill tree very wide
truncate table tree
insert tree (parentid, nodename) select null, 'Root'
while scope_identity() < 65535
 insert tree (parentid, nodename) select 1, 'Child ' + convert(varchar(11), scope_identity()) 
go
--try a massive update
update tree set parentid = floor(sqrt(nodeid)) where nodeid > 3 and parentid = 1
-- evidently 65532 rows at once takes too long, try batches:

set rowcount 100
update tree set parentid = floor(sqrt(nodeid)) where nodeid > 3 and parentid = 1
while @@rowcount > 0
   update tree set parentid = floor(sqrt(nodeid)) where nodeid > 3 and parentid = 1
set rowcount 0
-- not too bad.
select * from texttree


--------- test 4 - insert a lot of nodes at once (wide)
-- CURRENTLY FAILING (failure = insert takes too long)
-- I haven't been able to figure this one out yet, it should be working!
declare @tree table (nodeid int identity(2,1))
insert @tree default values
while scope_identity() < 65535 insert @tree default values
-- alter table tree disable trigger TR_Tree_I 
insert tree (parentid, nodename, path)
select
   1,
   'Child ' + convert(varchar(11),   nodeid),
   0x00000000 + convert(varbinary(4), nodeid) -- on multi-row inserts this is required to avoid problems with Path column unique index
from @tree

---- test 5 - insert an entire tree at once (wide + deep)
select * into #temptree from tree
truncate table tree
insert tree (parentid, nodename, path) select parentid, nodename, path from #temptree
drop table #temptree

go
use tempdb
drop database tree
go


 
LIKE doesn't use indexes when the pattern begins with a wildcard character. For a LIKE clause to use an index (to be sargable), the comparison has to start with literal characters.

Agreed. That's usually the case. But the materialized path comparison to extract a branch would usually start with a literal anyway.

WHERE Path LIKE '1.2.%'

Thanks for such a thorough example. Time to play ;)
 
That's only the case if you're starting from the root, though. So you're right: if you need a branch that doesn't start at the root, do an extra query to determine the FULL starting path first, and then the index can help you.

The problem with my binary method is that I'm not sure LIKE will work on that column -- if it has to convert to varchar to do the comparison (which I believe it does) then it will get messed up on things that happen to be expressed as characters such as % _ [ ].
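
One way around that might be to skip LIKE entirely and turn the prefix test into a range seek. A rough sketch against the Tree table from my script above (untested, so treat it as an idea rather than a guarantee):

Code:
-- First look up the full binary path of the branch's starting node...
declare @StartPath varbinary(900)
select @StartPath = Path from Tree where NodeName = 'Branch 4'

-- ...then express "starts with @StartPath" as a range, which the index on Path can seek on.
-- Every descendant's path is @StartPath followed by 4-byte NodeIDs whose first byte is at
-- most 0x7F (positive ints), so appending 0xFF gives a safe upper bound.
select *
from Tree
where Path >= @StartPath          -- includes the starting node itself (its Path equals @StartPath)
  and Path < @StartPath + 0xFF
order by Path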

 