
Help with query on huge table


commatom (Programmer)
Aug 30, 2007
I have a table with 1.2 billion rows.

This table tracks each user's transactions and which device was used to complete each transaction.

Not all transactions give us the device id, so we are trying to fill in the gaps with the correct device id by building a table which holds the date ranges during which each device was used, per user.

Originally it was written like this.

Code:
SELECT userid, deviceid, min(time_stamp)
  FROM t_transaction
 GROUP BY userid, deviceid;

This gave us 80% correct data because 20% of the time, users went back to the former device.

I know the following would work:

Code:
SELECT userid, deviceid, time_stamp
  FROM (SELECT userid,
               deviceid,
               time_stamp,
               lag(deviceid) over (partition by userid order by time_stamp) p_deviceid
          FROM t_transaction) x
 WHERE deviceid <> p_deviceid

I am concerned that this will be too large to run in production, and our dev environment is not in a position to test it. Are there any suggestions?

The table is partitioned by month.

Thanks in advance!
 
tom,

I suggest that you approach this as two distinct issues: firstly, getting the correct queries to do the job and testing that they do precisely what they're supposed to; secondly, dealing with the sheer volume of the change.

So, to begin, can you please post CREATE TABLE statements for all the tables involved, along with sample data showing what you're trying to do? Obviously, with 1.2 billion rows in the main table, I would expect just a few well-chosen rows which demonstrate the problem.

Since your post states that you're trying to fill in missing data, it suggests that an UPDATE statement should be involved somewhere along the line - is this correct?

Second, you have the problem of updating 1.2 billion rows. Needless to say, this will blow your undo tablespace straight into the stratosphere; instead of being a gigabyte gobbler, it might become a terabyte terminator. Usually I suggest CTAS (Create Table As Select) for these things, but without more detail (for example, whether any indexes are affected, any unusual data types such as LONG or RAW, or any dependent triggers, stored procedures, etc.) I can't realistically go any further.
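As an illustration only, a minimal CTAS sketch along those lines; the names t_transaction_new, device_map (the userid/deviceid date-range table being built), start_time and end_time are assumed for the example, not taken from the original post:

Code:
-- Build a corrected copy instead of updating 1.2 billion rows in place.
-- t_transaction_new and device_map are assumed names; the sketch lists only
-- the three columns mentioned in the thread - include the rest of your columns too.
CREATE TABLE t_transaction_new
  NOLOGGING PARALLEL
AS
SELECT t.userid,
       NVL(t.deviceid, m.deviceid) AS deviceid,
       t.time_stamp
  FROM t_transaction t
  LEFT JOIN device_map m
         ON m.userid = t.userid
        AND t.time_stamp >= m.start_time
        AND t.time_stamp <  m.end_time;
-- Afterwards: recreate indexes and constraints, then swap names with RENAME.

Because it writes into a brand-new segment, the CTAS generates only minimal undo, which is exactly where a mass UPDATE would hurt.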

Regards

T
 
I was also going to suggest building a separate table into which the rows with NULL device IDs could be copied off. This would at least give you a smaller table to analyze, with the offending rows isolated. Of course, this assumes that a relatively low percentage of rows are in the problem set. I do hope you have a primary key on your big table?!
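For illustration, a minimal sketch of that idea; the work-table name is assumed, and the column list is just the three from the thread:

Code:
-- Copy the offending rows off to a smaller work table (name assumed).
-- Include the primary key column(s) so corrected values can be joined back later.
CREATE TABLE t_transaction_nodevice
  NOLOGGING
AS
SELECT userid, deviceid, time_stamp
  FROM t_transaction
 WHERE deviceid IS NULL;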
 
I still think your approach may be flawed. What if there are two successive rows with a NULL device? For the second one, LAG will pick up the previous row, and the device for that will also be NULL.

A better approach would be something like this:

Code:
select userid,
       deviceid,
       case when deviceid is null
            then last_value(deviceid ignore nulls)
                   over (partition by userid order by time_stamp)
            else deviceid
       end as pdeviceid,
       time_stamp
  from t_transaction

This would fill in the blanks from the last entry which had a non-NULL device.


 
Dagon,

You are correct; what I was planning to do with my SQL was to build a table with date ranges. I would have had a "where deviceid is not null" clause in there.

I'm currently trying to figure out how big a problem this is. Thanks for the nudge in the right direction.
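For what it's worth, a minimal sketch of that date-range table; device_map, start_time and end_time are assumed names, and the ranges are treated as half-open (start inclusive, end exclusive):

Code:
-- Sketch of the per-user device date-range table (names assumed).
-- Keep only the rows where the device actually changes, then close each
-- range with the start of the next one.
CREATE TABLE device_map
  NOLOGGING
AS
SELECT userid,
       deviceid,
       time_stamp AS start_time,
       NVL(LEAD(time_stamp) OVER (PARTITION BY userid ORDER BY time_stamp),
           DATE '9999-12-31') AS end_time
  FROM (SELECT userid,
               deviceid,
               time_stamp,
               LAG(deviceid) OVER (PARTITION BY userid ORDER BY time_stamp) p_deviceid
          FROM t_transaction
         WHERE deviceid IS NOT NULL)
 WHERE p_deviceid IS NULL
    OR deviceid <> p_deviceid;

Rows with a missing device id can then be matched to the range covering their time_stamp.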
 