Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Consolidate multiple rows into single row 1

Status
Not open for further replies.

peac3

Technical User
Jan 17, 2009
226
AU
Hi SQL guru,

We are trying to locate the client information that has been changed.
However, in our historical dataset, the change definition is deleted record and added record


Hence the original dataset is as below:
Code:
RowNumber ParentID ChildID   IdentityCode ActionType      Date
1         ABC        123     T            Deleted    2024-05-01 
2         ABC        123     P            Added      2024-05-03 
3         BCD        234     T            Deleted    2024-05-05 
4         CDE        345     P            Deleted    2024-05-06
5         CDE        345     T            Added      2024-05-07 
6         DEF        456     P            Added      2024-05-08
The good news is it has been sorted by row number and we would like to retrieve the Identity code whether it has been changed, added or deleted with information on parentID, childID and date.

The logic we are trying to achieve in pseudo-code is

CASE
WHEN (ParentID and ChildID are the same as previous row number) and (current row number ActionType is Added and previous row number ActionType is Deleted) then is CHANGE
ELSE original ActionType -- whether it is added or deleted
END AS New_ActionType

CASE
WHEN New_ActionType is CHANGE THEN date is the (Added date)
ELSE original Date
END AS New_ActionDate

Also, we need to know from what "From identity code" to "to identity code"

If ActionType is added means "From identity code" is null and "to identity code" is the current identity code
If ActionType is deleted means "From identity code" is the current identity code or blank and "to identity code" is NULL

Hence the output result is as below:
Code:
RowNumber ParentID ChildID   NewActionType      Date           FromIdentityCode ToIdentityCode
1         ABC        123     Change            2024-05-03              T                P
2         BCD        234     Deleted           2024-05-05              T                NULL
3         CDE        345     Change            2024-05-07              P                T
4         DEF        456     Added             2024-05-08              NULL             P

I am trying to play around with SQL below
Code:
SELECT * FROM
TABLE T1
LEFT OUTER JOIN TABLE T1 ON T1.RowNumber = T2.RowNumber +1
But somehow it is getting too fiddly, hopefully, you guys guru could help me with this.

Thanks in advance,
 
Code:
DECLARE @records TABLE( 
  RowNumber INT,
  ParentId Char(3),
  ChildID Int,
  IdentityCode Char(1),
  ActionType Char(7),
  [Date] Date);

Insert Into @records Values
(1,'ABC',123,'T','Deleted','20240501'),
(2,'ABC',123,'P','Added'  ,'20240503'), 
(3,'BCD',234,'T','Deleted','20240505'),
(4,'CDE',345,'P','Deleted','20240506'),
(5,'CDE',345,'T','Added'  ,'20240507'), 
(6,'DEF',456,'P','Added'  ,'20240508');

With
    Pairs as (Select RowNumber,  ParentID, ChildID, ActionType, Lead(ActionType) Over (Partition By ParentId, ChildId Order By RowNumber) as NextActionType,
[Date] as FirstDate, Lead([Date]) Over (Partition By ParentId, ChildId Order By RowNumber) as [Date],
IdentityCode,  Lead(IdentityCode) Over (Partition By ParentId, ChildId Order By RowNumber) as NextIdentityCode From @records),
    Relabeled as (Select RowNumber, ParentID, ChildID,
Case When ActionType='Deleted' And NextActionType='Added' Then 'Changed' Else ActionType End as NewActionType, 
ISNULL([Date],FirstDate) as [Date], 
Case When ActionType='Added' Then NULL Else IdentityCode End as FromIdentityCode, 
Case When ActionType='Added' Then IdentityCode Else NextIdentityCode End as ToIdentityCode From Pairs),
    Unmerged as (Select Lag(NewActionType) Over (Partition By ParentId, ChildId Order By RowNumber) as PreviousActionType, * From Relabeled)

-- the final consolidating query:
Select Row_Number() Over (Order By RowNumber) as RowNumber, 
ParentId, ChildId,
NewActionType, [Date], 
FromIdentityCode, ToIdentityCode 
From Unmerged Where Not ISNULL(PreviousActionType,'None')='Changed';

I hope it's clear the @records table variable is just scripted to have your data to experiment with without generating an actual database table. You can skip that part of the script and take the part following the first "WITH", replace @records with your table name to get there.

Some steps I make here might be consolidated into fewer steps, but as I already mentioned long ago in a previous question you had with the type of data aggregations you want to do depending on previous/next record, it's best to learn windows functions like LAG() and LEAD(), the OVER clause and then you get to such ideas.

Let me explain the steps at least a bit:
1. The 'Pairs' result will pair up every two succeeding records for each partition of (ParentId, ChildID) mainly to pair data
2. The 'Relabeled' result relabels 'Deleted'+'Added' to 'Changed' - I allowed that bit of consistency with the past tense actions here, instead of 'Change'.
3. The 'Unmerged' is just there to enable the final query to skip the records that follow the 'Changed' records.

The partial queries can't be easily combined because of where LEAD() and LAG() can't be used within queries. For example, to write out the major merging condition with ActionType='Deleted' And NextActionType='Added' you first have to have those columns, so the Pairs are necessary to express that condition, you can't substitute ActionType and NextActionType with the expressions for them used in the 'Pairs' part as LEAD can't be used within a WHERE or CASE clause. Even if that would be enabled in future T-SQL versions, it wouldn't mkae the query easier to read, even though it would be shorter with less steps.

Alwaqys keep in mind that in such CTEs (That's the techcnical term for WITH... queries) both the single queries and the overall script is optimized, so this whole script is handled with one overall SQL plan.

Anyway, another reason I'd not try to shorten this into less step is that the way this is staging the steps makes the overall script easier to extend/maintain for future requirements.

Last not least, a better idea to record the history of data is actually just using features of MSSQL to track data changes as described here:
And another feature here:

IIRC CDC will need an Enterprise edition, but temporal tables are even available in the free editions. What you need to do, though, is change from delete+insert to updates. I see a show stopper for you, though: As your records are deleted on one date and added back the next day, that would mean two histories of two records for all these features of CDC, data tracking and temporal tables, so your overall data recording strategy is questionable. I assume you will never know that you require the same key next day again, but that shouldn't be a usual way to work with data.

I (we) lack the bigger picture here, of course, to what this data means and comes from, it lacks an overall primary key, too, that would point out even more, that you shouldn't recycle keys as the pairs of parnt/child ids are for what it's worth in this context.

To share an opinion, you may take lightheartedly (i.e. don't take this too close to your heart): I often find developers taking pride in being unusual, which in their opinion constitutes progress or even genius, but my experience is any unusual non-standard way of structuring data and working with it only constitutes the need for processing steps that are unnecessary and just a waste of time, sometimes even making features existing for the same purpose which you wouldn't need to program at all, impossible to use.

To give a concrete, better idea of avoiding Delete/Add on two separate dates: Introduce a column "Recordstatus", for example with a "final" state that resembles being deleted. That easily enables reinstating the record as active and to still keep the database size lower you could delete records which you can consider not being reused after a week/month/year - whatever is a good measure for that. Another idea is to have an optional information for a record to be the continuation of a previously deleted record, which, if you use temporal tables or CDC could be referred to in the recorded history.

Chriss
 
To add to my last point - onoe thing I'd recommend to do instead of what you want (or your customer): Don'T renumber the rows just to have a gap free sequence. In your data the rownumber is the only thing that constituteds a single value primary key . even though it likely isn't in the bigger picture.

I renumber the rows by using Row_Number() Over (Order By RowNumber) as RowNumber to get your wanted result:

[pre]RowNumber ParentID ChildID NewActionType Date FromIdentityCode ToIdentityCode
1 ABC 123 Changed 2024-05-03 T P
2 BCD 234 Deleted 2024-05-05 T NULL
3 CDE 345 Changed 2024-05-07 P T
4 DEF 456 Added 2024-05-08 NULL P
[/pre]

Instead you can just keep the rownumber as is and get gaps in the final result, when using this final query instead:
Code:
Select RowNumber, 
ParentId, ChildId,
NewActionType, [Date], 
FromIdentityCode, ToIdentityCode 
From Unmerged Where Not ISNULL(PreviousActionType,'None')='Changed';

The final result then is

[pre]RowNumber ParentID ChildID NewActionType Date FromIdentityCode ToIdentityCode
1 ABC 123 Changed 2024-05-03 T P
3 BCD 234 Deleted 2024-05-05 T NULL
4 CDE 345 Changed 2024-05-07 P T
6 DEF 456 Added 2024-05-08 NULL P[/pre]

The gap numbers 2 and 5 indicate the row numbers of the records that were merged into the "Changed" records 1 and 4. Keeping that gap would also enable you to repeatedly apply a consolidation step, that does not only delete records following a "Changed" record, but only does so, when the record following the "Changed" action is actually the one in direct sequence by rownumber. That means the WHERE condition I used is only sufficient in the stage where I know I consolidated the step following the change and don't need to carefully look whether that has the same ParentId/ChildId combination and the same "ToIdentityCode" as was merged into the "Changed" action. I know that without verification as I used partiioning by ParentId/ChildId combinations, too.

In future uses of data consolidation your new base data might be that of previous consolidations and then such shortcuts are not valid, having a gap in the numbering would still enable to keep the conditions shorter and still concise and verifiably correct, because the consolidated records can be mapped back to the unconsolidated data by their rownumber, at least. And the other major use, ordering by row number, is still possible and correct, too.

Chriss
 
To learn from the soluion you should take a look at the intermediate results (which won't necessarily actually be created by query optimization, but which you can always "materialize" by executing only up to some step, for example, to list the "Pairs", do:
Code:
With
    Pairs as (Select RowNumber,  ParentID, ChildID, ActionType, Lead(ActionType) Over (Partition By ParentId, ChildId Order By RowNumber) as NextActionType,
[Date] as FirstDate, Lead([Date]) Over (Partition By ParentId, ChildId Order By RowNumber) as [Date],
IdentityCode,  Lead(IdentityCode) Over (Partition By ParentId, ChildId Order By RowNumber) as NextIdentityCode From @records)
              
Select * From Pairs;

result:
[pre]RowNumber ParentID ChildID ActionType NextActionType FirstDate Date IdentityCode NextIdentityCode
--------- -------- ------- ---------- -------------- ---------- ---------- ------------ ----------------
1 ABC 123 Deleted Added 2024-05-01 2024-05-03 T P
2 ABC 123 Added NULL 2024-05-03 NULL P NULL
3 BCD 234 Deleted NULL 2024-05-05 NULL T NULL
4 CDE 345 Deleted Added 2024-05-06 2024-05-07 P T
5 CDE 345 Added NULL 2024-05-07 NULL T NULL
6 DEF 456 Added NULL 2024-05-08 NULL P NULL[/pre]

The result will show the importance of using the PARTITION clause within the OVER clause, because you only want to pair up records that belong to the same partition of data with the same ParentID/ChildID combination.

At first glimpse it seems wrong you get two pairs from the ABC/123 records, but unlike your idea of JOIN TABLE T1 ON T1.RowNumber = T2.RowNumber +1 the partitioning adds separation records with NULL values between two partitions, which are accessed in case the LAG() function would ask for the value of the previous record in the first record of a partition which should not give it the value of the last record of the previous partition, and vice versa when a LEAD() function would ask for the value of the next record at the last record of a partition and should not take the value of the first record of the next partition but instead get the NULL value that states "there is no such value" because there is no previous or next record within the partition.

You could get about the same by modifying your join condition to T1.RowNumber = T2.RowNumber +1 AND T1.ParentID= T2.ParentID and T1.ChildID = T2.ChildId, but then would lose some data for sake of finding no partner record at the end of partitions.

So, in short you can see from that, the way window functions and specifically partitions work is already better thought of and implemented as you can do in traditional joins without using windows functions. So it's not basic, but still mandatory knowledge, especially as you do so much in terms of dependencies on previous or next records (or both).

Chriss
 
As you look for verification in your new thread, here's how that result can be used to verify it agianast the original data, ie show that each "Changed" row has replaced a Delete+Added combination:

Code:
DECLARE @records TABLE( 
  RowNumber INT,
  ParentId Char(3),
  ChildID Int,
  IdentityCode Char(1),
  ActionType Char(7),
  [Date] Date);

Insert Into @records Values
(1,'ABC',123,'T','Deleted','20240501'),
(2,'ABC',123,'P','Added'  ,'20240503'), 
(3,'BCD',234,'T','Deleted','20240505'),
(4,'CDE',345,'P','Deleted','20240506'),
(5,'CDE',345,'T','Added'  ,'20240507'), 
(6,'DEF',456,'P','Added'  ,'20240508');

With
    Pairs as (Select RowNumber,  ParentID, ChildID, ActionType, Lead(ActionType) Over (Partition By ParentId, ChildId Order By RowNumber) as NextActionType,
[Date] as FirstDate, Lead([Date]) Over (Partition By ParentId, ChildId Order By RowNumber) as [Date],
IdentityCode,  Lead(IdentityCode) Over (Partition By ParentId, ChildId Order By RowNumber) as NextIdentityCode From @records),
    Relabeled as (Select RowNumber, ParentID, ChildID,
Case When ActionType='Deleted' And NextActionType='Added' Then 'Changed' Else ActionType End as NewActionType, 
ISNULL([Date],FirstDate) as [Date], 
Case When ActionType='Added' Then NULL Else IdentityCode End as FromIdentityCode, 
Case When ActionType='Added' Then IdentityCode Else NextIdentityCode End as ToIdentityCode From Pairs),
    Unmerged as (Select Lag(NewActionType) Over (Partition By ParentId, ChildId Order By RowNumber) as PreviousActionType, * From Relabeled),
   Result as (Select RowNumber, 
ParentId, ChildId,
NewActionType, [Date], 
FromIdentityCode, ToIdentityCode 
From Unmerged Where Not ISNULL(PreviousActionType,'None')='Changed'),
Reexpanded as
(Select * From Result
Union All
Select Rownumber+1 as Rownumber, ParentId, ChildId,
NewActionType, [Date], FromIdentityCode, ToIdentityCode 
From Result Where NewActionType='Changed')

--Verification
Select * From @records as origin 
Left Join Reexpanded
On Reexpanded.Rownumber = origin.Rownumber

The last result will have a joined row from the reexpanded result that matches an original row, which proves the merged "Changed" records correspond to two previous Deleted+Added rows. You could also change the Verification query to list records of the origin @records table not finding a match in Reexpanded. The confirmation then is, that no such record misssing a reexpanded record exists.

Chriss
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top