
Performance problem when counting records when the count is more than 10 million


Srinuvelati (Technical User), Dec 11, 2019
I have a performance problem.

I have two tables: one is master, the other is transaction data, and the transaction table's record count is more than 10 million.

I need to get the count of transaction records whose key field is in the master. Please suggest the best solution; I have tried joins and IN / NOT IN conditions. If you have any way out, let me know.
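For example, the kinds of queries I tried look like this (table and field names changed for illustration):

SELECT COUNT(*) FROM trans t ;
   WHERE t.keyfield IN (SELECT m.keyfield FROM master m)

SELECT COUNT(*) FROM trans t ;
   INNER JOIN master m ON t.keyfield = m.keyfield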
 
I am familiar with lakh; it's simply 1,00,000, and instead of having a comma every three digits you're putting them there. You already said it's 10 million records.

The quantity of data makes things slower. The time needed grows in different ways: sometimes exponentially, sometimes quadratically, sometimes only logarithmically. The best you can expect when you want to count everything is linear performance, because this count isn't maintained anywhere the way RECCOUNT() is maintained in the DBF header.

Anyway, the number of records has no bearing on which indexes are used for query optimization; that's a qualitative issue, and you need index tags on the expressions used to join or filter.
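As a minimal sketch, assuming the join field is called keyfield (the name is illustrative, not from your tables):

USE master EXCLUSIVE
INDEX ON keyfield TAG keyfield
USE trans EXCLUSIVE
INDEX ON keyfield TAG keyfield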

I think you won't get much faster than what you already have anyway. I tried 10 million records, both with separate key fields a and b and with the normal case of a foreign key that's identical to the main key.

It takes the same time for me, about 10 seconds. You'll never get this down to <1 second; this is always a full scan through the data, no matter if DBF or CDX.

Edit: Dropping all indexes, I get a duration of 15 seconds, so the indexing helps to speed up the join, but since a full count isn't a selective query, it doesn't profit as much.
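For reference, my timing was taken along these lines (a minimal sketch; trans, master and keyfield are placeholder names):

t0 = SECONDS()
SELECT COUNT(*) FROM trans t ;
   INNER JOIN master m ON t.keyfield = m.keyfield ;
   INTO CURSOR curCount
? "Count:", curCount.cnt, "in", SECONDS() - t0, "seconds"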

If you deleted detail data along with the main data, you could make use of VFP9's feature of better optimizing queries regarding DELETED() and would not need to join the main data for the count. It obviously depends on what you need the detail data for; as you tell it, you keep orphaned detail data. I don't know your reasons for that, and you haven't reacted much to the questions from the others, either.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Olaf said:
I am familiar with lakh; it's simply 1,00,000, and instead of having a comma every three digits you're putting them there

So am I. But I thought it was 10^5 rather than 10^6 (I might be wrong). It's true, though, that the placement of the commas is non-standard.

I first heard the term "lakh" in the 1960s, in Lionel Davidson's novel The Rose of Tibet, which involved a treasure worth 500 lakh rupees or something similar.

&& End of useless information

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
I also feel the performance of this cannot be increased. Is there any other way of finding how many orphans there are, by using a relation or something else? You are telling me 10 seconds for 10 million records; for me it is taking 10 seconds for 1 million records.

 
Srinuvelati said:
for me it is taking 10 seconds for 1 million records

But what we are saying is that the increase in time is not necessarily linear. Just because it takes 10 seconds for one million, that doesn't mean it will take 100 seconds for 10 million.

But even if it does take 100 seconds, that brings us back to my earlier question (which you still haven't answered): is this a one-off requirement, or something that you will be doing regularly? If the former ... well, you could have finished the job by now. If the latter, what is the cost of keeping the system busy for a couple of minutes?

And another question: Why are these orphans getting into the transaction file in the first place? This is presumably something you want to avoid (or is it?). If that's right, perhaps you should be devoting your attention to how to prevent them.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Well, the time for me may be better because of better hardware. I also don't have your table, only sample data with just the fields mentioned and nothing more, and my CDX doesn't have further tags, etc.

That all plays into the overall time needed, so any values are only valid for a certain system, not in general. It seems our systems differ by a factor of 10 in performance, of HDD and/or RAM. Or, despite saying so, you don't have the indexes necessary to speed things up a bit more, though you think you do. Anyway, you don't get a factor of 10 out of an index, per my experiments.

If you deleted orphaned detail transactions you wouldn't need to join, just count.

And Mike, yes, it is 10^5, hence the strange comma positions. But in the initial post, the talk was about 10 million anyway.

Bye, Olaf.

Olaf Doschke Software Engineering
 
There are still possibilities, because

1. you could try to reindex the tables; recreating the indexes may remove bloat and make the index tags cleaner and faster.

2. you may set up foreground/background memory via SYS(3050) (see the sketch after this list). The typical Rushmore bitmap will have 10 million bits, as your table is that long and every record is represented by 1 bit: hits with 1, misses with 0, but still 1 bit per record. That's 10 Mbit = 1.25 MB. Not very much, but that's just for this query. The CDX I created with just the one necessary index tag is ~50 MB, and while you don't need to keep all of it in memory, since you have no filter condition besides the join condition you'll go through all of it, so it likely helps to reserve more memory. Why not 100 MB or 256 MB?

3. you could do it like VFP, only better, and maintain a counter (also sketched below). When you use RECCOUNT(), that actually just reads a count from the DBF header and relies on it being the correct record count. In fact, you get errors if that doesn't match the file size as of HEADER()+RECCOUNT()*RECSIZE() (+2 IIRC, but you get the idea).

RECCOUNT() counts deleted rows, too; that's its weakness. But you can do better and, as I already suggested, use triggers to maintain the actual count of non-orphaned records, even if you don't cascade deletes. When a record of the main table is deleted, you can subtract as many records of the other table as refer to that record.
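To sketch points 2 and 3 (the counter table and all names here are hypothetical, nothing your application already has):

* Point 2: reserve more foreground memory, e.g. 256 MB
SYS(3050, 1, 256 * 1024 * 1024)

* Point 3: maintain the count of non-orphaned detail records yourself.
* Assumes a one-row table "counters" with integer field ndetail and a
* delete trigger on the master table:
* CREATE TRIGGER ON master FOR DELETE AS master_on_delete()
PROCEDURE master_on_delete
   LOCAL lnKey, laRefs[1]
   lnKey = master.keyfield
   SELECT COUNT(*) FROM trans WHERE trans.keyfield = lnKey INTO ARRAY laRefs
   UPDATE counters SET ndetail = ndetail - laRefs[1]
   RETURN .T.   && allow the deletion
ENDPROC

A matching insert trigger on the detail table adds 1 per new record, and SELECT ndetail FROM counters then replaces the expensive count.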

Bye, Olaf.


Olaf Doschke Software Engineering
 
Mike

I am transferring the FoxPro data to Oracle. I am facing this problem in the initial load, and I believe it will appear in the incremental loads too; I mean it's not a one-time job. I don't have any idea where these orphans are getting added, as this application is 25 years old. I want to avoid these orphans every time I transfer. At the same time, I know the problem would be resolved by deleting these records; however, I want to fix this type of issue for future transfers.

Olaf, I will check the SQL with increased memory.
 
I don't know about Oracle, but with SQL Server you'd go through CSV with a format file using the BCP tool.

In an incremental load you also won't have that count problem, will you? The biggest data migration I did going directly from DBF into MSSQL went in 10,000-record chunks using SQLPREPARE()d statements, partly because that doesn't need as much RAM and partly because SQL Server only handles a certain number of locks in parallel. You find a natural chunk size that works faster than going for all data at once or for single rows at a time.

Finding chunks does not necessarily mean you always have exactly N records in them; I queried certain ID ranges, so typically N was just the upper bound.
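For illustration, chunking by ID ranges could look roughly like this (the DSN, table and field names are placeholders, and error handling is omitted):

lnHandle = SQLCONNECT("target_dsn")   && hypothetical ODBC DSN
lnChunk = 10000
SELECT MAX(id) FROM trans INTO ARRAY laMax
lnFrom = 0
DO WHILE lnFrom < laMax[1]
   SELECT * FROM trans ;
      WHERE id > lnFrom AND id <= lnFrom + lnChunk ;
      INTO CURSOR curChunk
   * ... transfer curChunk to the server via SQLPREPARE()/SQLEXEC() ...
   lnFrom = lnFrom + lnChunk
ENDDO
SQLDISCONNECT(lnHandle)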

And last but not least: why not delete the rows you want to purge in the migration process? If they are really orphaned and also of no use in the current frontend, then get rid of them, and also PACK.

Even if you get new ones, due to having neither defined cascading deletions nor restricted master data deletions in referential integrity code, getting rid of data you don't need, even as a side process not integrated into the main application, can help.

Bye, Olaf.

Olaf Doschke Software Engineering
 
And just for the sake of fun: when you do delete orphaned rows and INDEX ON NOT DELETED() TAG deleted BINARY, the index size is now <1.5 MB.

And then SELECT COUNT(*) FROM t1, as you don't need to join t2. Since VFP9 Rushmore takes binary indexes on DELETED() into account, besides also supporting filtered FOR DELETED() indexes and more (see the VFP9 help topic "Indexes Based on Deleted Records"), it breaks down to this on my computer:

Time to count ~10 million records with 5% deleted rows:

Without index: 2.4 seconds
With index: 0.012 seconds
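Spelled out, the whole sequence is roughly this (t1 as the detail table, t2 as the master, keyfield illustrative, SET DELETED ON assumed):

USE t1 EXCLUSIVE
DELETE FROM t1 WHERE keyfield NOT IN (SELECT keyfield FROM t2)
INDEX ON NOT DELETED() TAG deleted BINARY
SELECT COUNT(*) FROM t1   && no join needed, Rushmore uses the binary tag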

So that's the most promising way of fast counting in the single table, and even if you want to refrain from modifying your production data, for the migration you could, and perhaps even should, use a data copy, i.e. restore a backup to work on without being influenced by the current application usage of the data. On the copy you can even work with exclusive access and have your migration EXE on the file server, etc.

And then you can not only delete and not only index on DELETED(), you can also PACK, which makes it even better, because RECCOUNT() then is 1:1 the count.

Since you brought up the migration topic, I'm not sure anymore whether it's really just about the count or about any handling of data without high selectivity, i.e. without boiling a result down to just a few records. Most importantly, that's not to be confused with just having one aggregate value: that's a small result set, but to get it you need to go through all the data. High selectivity means only addressing a small number of rows.

Well, from my migration experience I can only say that partitioning the data helps a lot, but I don't see why that would need such a join optimization. Aside from some small tables, when you go for the massive tables and migrate chunks of data, the most natural way in nested or hierarchically structured relationships is going from one head record to all its detail data, not vice versa.

Bye, Olaf.

Olaf Doschke Software Engineering
 