
Table with 2 million records

Status
Not open for further replies.

gkratnam

Programmer
Aug 9, 2007
37
US
I have a table with about 2 million records. Here are the approximate file sizes:
TIME.CDX – 300,000 KB
TIME.DBF – 200,000 KB
TIME.FPT – 210,000 KB

TIME table structure (there are other fields too...):
PROJECT_ID - N(8)
TASK_ID - N(8)
USER_ID - C(10)
TIME_HOURS - N(8,3)

TASK table structure (there are other fields too...):
PROJECT_ID - N(8)
TASK_ID - N(8)

Relationship: PROJECT -> TASK -> TIME

The code:
Code:
SELECT PTASK
SCAN WHILE project_id = proj.project_id
   SELECT PTIME
   SEEK STR(ptask.project_id, 8) + STR(ptask.task_id, 8)
   SUM hours WHILE project_id = ptask.project_id AND task_id = ptask.task_id TO ltime NOOPTIMIZE
ENDSCAN

This code is slow and takes about 5 minutes in some cases, depending on the number of tasks and time entries.
I am looking for ways to handle this without archiving the table; we still need real-time access to all the entries.
I am open to any suggestions or guidance. Thanks!
 
You don't mention what indexes you have. The very first thing to do - if you haven't already done so - would be to create indexes on your various ID fields. That alone might solve the problem.
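If they are missing, creating them is quick. A minimal sketch of what that might look like (the tag names here are just examples, not anything from your system):

```foxpro
* Plain tags directly on the ID fields, so Rushmore can use them
USE time EXCLUSIVE
INDEX ON project_id TAG proj_id
INDEX ON task_id TAG task_id
INDEX ON user_id TAG user_id
USE
```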

You should also consider changing the ID fields from N(8) to Integer. That will make the indexes (and the table) more compact, which in turn will speed things up.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Also, I can't see the point of your SEEK, where you convert the project and task IDs to strings. I assume you want to find a record where those two IDs match the corresponding fields in the Tasks table. If so, I think it would be better to do [tt]LOCATE FOR project_id = ptask.project_id AND task_id = ptask.task_id[/tt]. As long as you have indexes in place on the relevant fields, the LOCATE should be fully optimised.

And do you have any reason to have NOOPTIMIZE in your SUM command?

Mike



 
And another point: The code, as written, doesn't appear to produce a useful result. Each time round the main loop, you are storing a value in a variable, ltime, and then overwriting it the next time round. Presumably you have some code to actually use that value in some way?

You might instead consider something like this:

Code:
SELECT SUM(hours) AS total_hours, project_id, task_id ;
  FROM PTIME ;
  WHERE project_id = proj.project_id ;
  GROUP BY project_id, task_id ;
  INTO CURSOR csrTimes

(No SCAN loop is needed here: the query already covers every task of the current project in one pass, whereas a loop around it would just re-run the same query and overwrite the cursor each time.)

That will give you a cursor containing the total times for each task (assuming that's what you want).

Mike

 
Hi Mike, thanks so much for all the great points!

I do have indexes. Here are a few of them:

Code:
STR(project_id,8)+STR(task_id,8)+user_id+
STR(project_id,8)+STR(task_id,8)+STR(bill_code,3)

I will try your recommendation on storing the results in a cursor.

I don't know why optimization is turned off. I vaguely remember they were facing some incorrect calculations, but that was many years ago, and I'm not sure it still applies. Is it okay to turn optimization back on in this scenario?

And how about having 2 million records? Is that okay, or do we need to split the table at some point?

Changing the ID fields to Integer: the new features in the application use the VFP tables, but these tables are still in FoxPro 2.x, and Integer is not an option in the old version. How can I change them to Integer without causing any issues? There are cursors in the code which reference N(8); do we need to change them to I(8)?

Please educate me on these. I truly appreciate your input and time. Thanks!
 
So, if I have understood this right, you have some numeric fields (the various IDs), but your indexes are on expressions that convert those fields to strings. I can see the thinking behind this: you want to seek records based on a combination of those fields. But there is no reason to do it that way. It would be more efficient to create individual indexes on the actual numeric fields (not on STR() conversions), and then use (for example) LOCATE to find the required records, as in my second post above. Provided the correct indexes exist, LOCATE will be fully optimised.

Interestingly, we had an almost identical question the other day (thread184-1811607). You might want to have a glance at it, especially the replies from myself and Chris.

Regarding the fact that the tables "are still in Foxpro2.x": do you still have 2.x programs in use that access those tables? If not, it is perfectly possible to change the data types, including to data types not supported by 2.x. Just open the tables exclusively in VFP and do MODIFY STRUCTURE (you can probably ignore the prompt that comes up about choosing a code page).
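If you prefer to script it rather than go through the MODIFY STRUCTURE dialog, ALTER TABLE can make the same change. A sketch, assuming the values really are whole numbers:

```foxpro
* Back up the DBF/FPT/CDX first; this is not reversible for 2.x use
USE time EXCLUSIVE
ALTER TABLE time ALTER COLUMN project_id I
ALTER TABLE time ALTER COLUMN task_id I
USE
```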

But be warned: Once you have done that, you won't be able to go back to using the tables in 2.x. Also, this will only work if the values held in these fields are in fact integers. If they have any digits to the right of a decimal point, those will be lost on the conversion.

And of course you should test the whole thing thoroughly before putting it into production, just in case that change has any unexpected side effects.

And yes, you should change the data types of any relevant fields in any cursors in the same way.

By the way, the data type for Integer is simply I, not I(8). The "8" is implied.

Finally, you ask if 2 million records is OK. In general, I don't see that as a problem. Provided you have the correct indexes in place, VFP should be able to handle it. I've never worked with tables quite that large myself, but I have worked satisfactorily with a table of over one million records. Of course, it also depends on the size of each record. In the (unlikely?) event that the record length exceeds 1,000-odd bytes, you will come up against the overall limit of 2 GB per file (2 GB divided by 2 million records is roughly 1,000 bytes each). But if that were the case, it would presumably already have happened.
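You can check the headroom yourself with the table's own figures. A quick sketch:

```foxpro
* How many more rows fit under the 2 GB DBF limit at the current width
USE time
lnLimit    = 2147483647
lnHeadroom = INT((lnLimit - HEADER()) / RECSIZE()) - RECCOUNT()
? RECSIZE(), RECCOUNT(), lnHeadroom
USE
```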

Mike

 
Could you also explain why LOCATE is preferred over SEEK? Thanks!
 
Simply that when you SEEK, you have to specify a particular value or expression. So, if x is a variable containing the value you want to find, you can simply do SEEK x. But in your case, you want to find a record containing a combination of values, x and y. That's why you are concatenating them together, with STR(x) + STR(y).

With LOCATE, you specify a logical condition. The command will find records where that condition is true. So, in your case, you can do LOCATE FOR x = something AND y = something_else.

Both commands will give the same result. But the SEEK version involves converting the numeric values to strings, and creating an index based on those strings, rather than having indexes based directly on the numeric values, which would be more efficient.
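Side by side, the two approaches look something like this (the tag and variable names are illustrative only):

```foxpro
* SEEK needs an index whose key matches the sought expression exactly
SELECT PTIME
SET ORDER TO projtask   && tag on STR(project_id,8)+STR(task_id,8)
SEEK STR(m.lnProj, 8) + STR(m.lnTask, 8)

* LOCATE takes a logical condition; Rushmore can combine separate tags
SET ORDER TO 0
LOCATE FOR project_id = m.lnProj AND task_id = m.lnTask
```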

I might not have explained that very well. I hope I haven't confused you even more.

Mike

 
If I read your original post correctly, what you're trying to do is get the total for each task within each project. If that's it, don't use a loop. Use SQL:

Code:
SELECT Project_ID, Task_ID, SUM(Time_Hours) AS Total_Hours ;
  FROM Time ;
  GROUP BY 1, 2 ;
  INTO CURSOR csrTaskTime

You'll still want to have the right tags, one for Project_ID and one for Task_ID, but this should be really, really fast.

Tamar
 
Just to clarify the difference between SEEK and LOCATE.

The main point I was trying to make was that, with a SEEK, you need an index that matches the expression that you are seeking. In this case, the expression is a concatenation of two numeric fields, each converted to strings. That is less efficient than an index on an integer. With LOCATE, you don't have that problem. So, a LOCATE with an index on the integers is probably going to be faster than a SEEK on a concatenation. I didn't mean to suggest that LOCATE is inherently faster than SEEK.

Mike





 
Another point:

You mentioned earlier that you had set a relationship between the three tables. As far as I can see, you don't need that relationship for this code. Removing it might speed things up slightly, on the basis that it will reduce the number of record-pointer movements. It's worth a try.

Also, I said earlier that the integer data type is just I, not I(8), because the "8" is implied. In fact, the "8" is irrelevant. An integer data type is always four bytes, not eight.

Mike

 
I second Tamar. The SQL can also be extended to aggregate only the times of active projects: WHERE proj.isactive, or WHERE DATE() BETWEEN startdate AND enddate, for example. You could of course also make that a FOR condition when iterating the projects.

I think SET KEY TO could do this faster, if you want to stick with xBase coding. SQL only really shines when it does not need to process (almost) all of the data; a server database at least can move that work to the server and send back only the aggregated result, condensing perhaps tens of thousands of pTime records for thousands of pTasks and hundreds of projects into 100 rows.
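A sketch of the SET KEY TO idea, assuming the existing compound tag on STR(project_id,8)+STR(task_id,8) (the tag name here is made up):

```foxpro
SELECT PTIME
SET ORDER TO projtask   && the STR()+STR() compound tag
SET KEY TO STR(ptask.project_id, 8) + STR(ptask.task_id, 8)
SUM hours TO ltime      && only the matching index range is visible now
SET KEY TO              && remove the key filter again
```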

Instead of optimizing the code, you can also optimize the data. REINDEX to get rid of index bloat or "wear": indexes are B-trees, and they can become less efficient over time as VFP tries to save time when altering the tree to add, change, or remove nodes, or to detach and reattach whole branches. And why not store the result of the aggregation and only update it with the new ptime records from today, this week, or the last hour?
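A sketch of that incremental idea, assuming a date field on the time records (here called entry_date, which may not match your schema) and VFP 9's derived-table syntax; tasktotals is a hypothetical table of stored results:

```foxpro
* One-off: persist the totals for everything before today
SELECT project_id, task_id, SUM(hours) AS total_hours ;
  FROM time WHERE entry_date < DATE() ;
  GROUP BY project_id, task_id ;
  INTO TABLE tasktotals

* Per request: stored totals plus only today's new entries
SELECT tt.project_id, tt.task_id, ;
       tt.total_hours + NVL(td.new_hours, 0) AS total_hours ;
  FROM tasktotals tt ;
  LEFT JOIN (SELECT project_id, task_id, SUM(hours) AS new_hours ;
               FROM time WHERE entry_date = DATE() ;
               GROUP BY project_id, task_id) td ;
    ON tt.project_id = td.project_id AND tt.task_id = td.task_id ;
  INTO CURSOR csrTotals
```

(Tasks whose very first entries arrive today would need a UNION or a refresh of tasktotals to show up.)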

Such aggregation over time relies on older data not changing, which actually speaks for archiving and read-only use of that data. I don't mean that all data older than one day goes to the archive, but all data that hasn't changed for the last quarter, for example.

And to check data integrity, you can still occasionally process all the data to verify that the old data wasn't manipulated. Checksums or cryptographic signatures can be used for that, too. And along the way you might just find some well-known and wanted corrections.

An archive also offers new opportunities, like parallel processing. Even in a single process you can easily search two or more tables with the same conditions via UNION (ALL), or keep partition metadata about which table(s) hold data for a certain project ID, so after that first lookup you still work on a single table, or a small set of tables with partial data. It's a very common solution on SQL servers, but even easier to do in VFP: just add a bunch of directories and DBFs and a kind of inventory or table of contents on top.

Keep it simple and choose one easy way to partition: by project, by date, by employee, whatever suits best. Just don't go down the rabbit hole of partitioning by several dimensions; that leads to data cubes and data marts or warehouses, and then you risk not only redundant but also contradictory data partitions.
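For example, with a hypothetical read-only archive table time_arc alongside the live table, you can aggregate each partition separately and then merge the partial results:

```foxpro
* Aggregate per partition, then merge the partial sums
SELECT project_id, task_id, SUM(hours) AS sub_hours ;
    FROM time GROUP BY project_id, task_id ;
  UNION ALL ;
  SELECT project_id, task_id, SUM(hours) ;
    FROM time_arc GROUP BY project_id, task_id ;
  INTO CURSOR csrParts
SELECT project_id, task_id, SUM(sub_hours) AS total_hours ;
  FROM csrParts GROUP BY project_id, task_id ;
  INTO CURSOR csrAllTime
```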

Aggregations or queries over all the data become a bit more complicated, but they could also be done in parallel and merged (unioned/joined) at the end. And the case you have at hand will still work project by project, so you only switch tables at the point of moving on to the next project.

Chriss
 
Thank you all for the valuable feedback. I will try these suggestions.

One more follow-up question: how is the 2 GB table limit calculated? Is it all the files (CDX, DBF, FPT) combined?

Thanks!
 
No, the 2 GB limit applies to each individual file, not to the total of all three files.

And it's not just database files. Other files, including text files, are also subject to the same limit.

Mike

 
Thanks so much Mike! I have learned a lot from this thread.
 