Determine and Create most efficient .dbf structure


FoxNovice

Hi all! Thanks for all your help with resolving other issues I have encountered. I have a new one now.

We receive huge, variable-length, tab-delimited data files from a client several times a week. They often contain 1 to 2 million records each, and their default layout "expands" each record to about 3200 bytes - which puts the resulting table WELL over the 2GB limit for a .dbf file. There are a lot of blank spaces in most of the fields (sometimes fields come in completely blank but still have to be accounted for).

We have been chopping the file down into segments, importing each segment into the existing structure, and processing each segment separately, but that is very clumsy and time-consuming.

A recent change to the client's program requirements now mandates that all records be processed together (for proper postal discounts, reporting, etc.)

Is there a way to open the variable length, tab delimited file and - without importing it into a .dbf structure - determine the longest value for each data field, throughout the entire file??

I can do that now with a smaller .dbf by using LEN(ALLTRIM(fieldname)) on each field as I scan through the file. I would like to perform that same check on the tab-delimited file, use the results to create an "efficient" .dbf structure, and then import the tab-delimited file into that structure, hopefully staying under the 2GB limit.
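For reference, the check against the smaller .dbf looks roughly like this (the table name and the field/counter names are only examples):

[tt]
USE mydata                                   && placeholder name for the .dbf copy of the data
STORE 1 TO lnFld01Cnt, lnFld02Cnt            && blank fields still count as length 1
SCAN
   lnFld01Cnt = MAX(lnFld01Cnt, LEN(ALLTRIM(Name1)))
   lnFld02Cnt = MAX(lnFld02Cnt, LEN(ALLTRIM(Address1)))
   * ...and so on, one line per field...
ENDSCAN
USE
[/tt]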

There are currently 92 fields in each record. However, the client periodically adds fields to the end of the records as needed.

If the counter names for the fields are (for example) FLD01CNT, FLD02CNT, FLD03CNT, etc., I would like to be able to match them up with the corresponding .dbf field names (Name1, Address1, Address2, etc.) somehow. Maybe use the existing .dbf structure to capture the actual field names, perhaps using a COPY STRUCTURE EXTENDED sort of process??
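Something along these lines is what I have in mind for capturing the names (the table names here are just placeholders):

[tt]
USE emptymaster                              && the existing empty "template" .dbf
COPY STRUCTURE EXTENDED TO structinfo        && one record per field
USE structinfo
* structinfo contains FIELD_NAME, FIELD_TYPE, FIELD_LEN, FIELD_DEC, etc.,
* so record 1 would line up with FLD01CNT, record 2 with FLD02CNT, and so on
LIST FIELD_NAME, FIELD_TYPE, FIELD_LEN
[/tt]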

Each record has a CR/LF "end of record" sequence. Also known in Hex as 0D/0A or in Decimal 13/10.

Completely blank fields would need to have a field length of 1.

There are always 2 "header records" in each file which I would like to skip (they would make the actual data field length calculations incorrect).

Thanks for any help!

John
 

Not entirely sure what you have.

Does the tab delimited file have uniform column sizes (data padded with spaces before the next tab) or are the tabs at the end of the data making the column size "different" in every record?

Either way, I think the simplest way to attack the problem is to write a low-level FoxPro program that scans the entire tab-delimited file looking for the longest data in each column. A couple of hours spent writing that low-level program would save you many hours over how you are doing things now.

I would define an array with enough elements (plus extras for the fields the client adds later) and set them all to zero.

Then I would scan each record with the low-level functions (skipping the first two) and, whenever a field's length was larger than the corresponding array element, update that element with the new, longer length.

By the time you have scanned all the records, each array element should contain the length of the longest value in its column. From there it is easy to define your .dbf structure.

The only hitch I see is if your data is padded in the tab-delimited file. If that is the case, you would need to "trim" the trailing white space before comparing the length to your array element.

mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 
"the simplest way to attack the problem is to write a low-level FoxPro program that scans the entire tab-delimited file looking for the longest data in each column."

Sorry, Mmerlinn, but that won't work. The low-level file functions are still subject to the 2GB limit.

The only way to avoid that limit, as far as I know, is to step outside VFP's file handling. For example, you could use the Windows scripting host. Something like this:

Code:
oWS = CREATEOBJECT("Scripting.FileSystemObject")
oFile = oWS.GetFile("c:\work\myfile.txt")
oTxt = oFile.OpenAsTextStream()        && open the file for reading as a text stream
DO WHILE NOT oTxt.AtEndOfStream        && AtEndOfStream is a property, not a method
  cLine = oTxt.ReadLine()              && read one record (up to the CR/LF)
  * Do something here with the line of text
ENDDO
oTxt.Close()

This is just an example. I haven't tried to test it. The point is that, once you have the line of text (that is, the data record) in a variable (cLine in this example), you can manipulate it pretty well any way you like.
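For instance - and this is only a sketch, with laMax standing in for an array of running maximum lengths that you have already DIMENSIONed and initialised to 1 - you could split each line on the tab character and update the maximums like this:

Code:
cRest = cLine + CHR(9)                && cLine comes from oTxt.ReadLine()
nCol = 0
DO WHILE LEN(cRest) > 0
  nCol = nCol + 1
  nPos = AT(CHR(9), cRest)
  cVal = ALLTRIM(LEFT(cRest, nPos - 1))
  cRest = SUBSTR(cRest, nPos + 1)
  laMax[nCol] = MAX(laMax[nCol], MAX(LEN(cVal), 1))   && blank fields still count as 1
ENDDO

Again, untested, but it shows the general idea.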

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

My Visual FoxPro site: www.ml-consult.co.uk
 
Mike

It appears that his incoming tab-delimited files are under 2GB each: 'their default layout "expands" each record to about 3200 bytes - which puts the resulting table WELL over the 2GB limit for a .dbf file.'

Note the word "expands".

mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 
Thanks for the prompt feedback!

Yes, the tab delimited files ARE variable length (where the same data field could have different lengths between different records).

Also, we have not yet received a tab delimited file that approaches the 2GB mark. For example, the 1.6MM records I have right now arrived in a tab delimited file about 1GB in size.

I am currently utilizing the same approach mmerlinn outlined above (scan the file, use trim commands, update field lengths, etc.), but only for .dbf files.

I am somewhat "self-taught" when it comes to using FoxPro and working with .dbf files, but I am not as familiar with the low-level commands/programming (which is why I am looking for help in this area).

I would like to utilize mmerlinn's approach to scan the data between the tabs in the tab-delimited file, and use those results to create a .dbf structure that I can then import the data into.

Thanks again!

John

 
The 2GB limit (whether encountered now or in the future) can be a challenge.

While it is not the only way to handle it, I have found that importing the LARGE files into a SQL Server table (which does not have the 2GB limit) and then processing as needed to create tables of manageable size is what has worked for me.

Good Luck,
JRB-Bldr
 

FoxNovice

This weekend when I have some time I will post some low level code that I use to scan CSV files that you should be able to modify to suit.

mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 
Thanks again all!

The SQL Server approach sounds like it would be a good solution. Unfortunately, the operation I currently work for is sort of a "mom and pop" shop, and they are not really willing to spend money and time (for installation, configuration, training, support, etc.) for a SQL Server setup.

I don't know if this helps: I maintain an empty structure that I use to load the tab delimited data (when the data files are small enough). I can add fields to that empty structure easily since the client alerts us ahead of time when they add fields.

In my current approach for calculating the maximum length of each field (using a .dbf version of the data), I use the empty structure and COPY STRUCTURE EXTENDED to a new table to capture the field names (it's important to maintain those field names for the subsequent processing that needs to be run).

I set all the field_len values to 1 in the new table. Then, as I interrogate the length of each data field in the original file, I update the corresponding field_len value in the new table whenever I find a higher value, until I end up with the maximum length for each field. If a field is totally empty throughout, it still gets a length of 1 since every field must be accounted for.

When finished, I take that new table and create the "efficient" .dbf structure from it. Then I re-import the tab-delimited file into that efficient structure, which makes the subsequent processing (sorts, indexing, file splits, postal presorting, etc.) run much faster, with fewer memory overrun issues, and keeps the result under the 2GB limit.
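(The final step is basically a CREATE ... FROM - roughly like this, with placeholder table and file names:)

[tt]
* structinfo holds the adjusted structure and is assumed to be closed at this point
CREATE efficient FROM structinfo             && builds the trimmed-down .dbf
APPEND FROM bigfile.txt DELIMITED WITH TAB   && re-import the raw data (the two header records still have to be skipped)
[/tt]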

Ideally, I would like to do things in a similar manner, except using the tab-delimited file to update the field_len values in that new (structure) table. I just do not know how to cycle through each field in the tab-delimited file, determine its length, and update the corresponding field_len value in the structure table - I've tried, but I cannot seem to get it to work properly.

Thanks!

John
 
"The SQL Server approach sounds like it would be a good solution. Unfortunately, the operation I currently work for is sort of a "mom and pop" shop, and they are not really willing to spend money and time (for installation, configuration, training, support, etc.) for a SQL Server setup."

You don't need to spend money on SQL Server if you use the 'free' version - SQL Server 2005 Express Edition is the successor to MSDE and is a free download.

It is somewhat limited, but should do the job for the needs that you describe.
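Just to illustrate (the server, database, table, and file names below are only placeholders, and the target table has to exist and be readable by the SQL Server service), once Express is installed you could push the raw file in from VFP with a pass-through BULK INSERT:

[tt]
lnConn = SQLSTRINGCONNECT("Driver={SQL Server};Server=.\SQLEXPRESS;" + ;
                          "Database=MailWork;Trusted_Connection=Yes")
IF lnConn > 0
   * FIRSTROW = 3 skips the two header records
   SQLEXEC(lnConn, "BULK INSERT dbo.RawImport FROM 'c:\work\bigfile.txt' " + ;
                   "WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\r\n', FIRSTROW = 3)")
   SQLDISCONNECT(lnConn)
ENDIF
[/tt]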

Good Luck,
JRB-Bldr
 

Here is a quick and dirty UNTESTED way of scraping your incoming files for field lengths. Code for the ATXLEFT() & ATXRIGHT() UDFs can be found here: faq184-5975

[tt]

DIMENSION myarray[200]                    && Must be larger than max possible fields
STORE 0 TO myarray

zorigpath = "myfile.txt"                  && Complete path of file to be scraped
zbytecount = 2048                         && Must be larger than max line length

zorighandle = FOPEN(zorigpath, 0)

zfound = FGETS(zorighandle, zbytecount)   && Skip to real data
zfound = FGETS(zorighandle, zbytecount)   && Skip to real data

DO WHILE NOT FEOF(zorighandle)

   zfound = 0
   DO WHILE EMPTY(zfound)
      zfound = FGETS(zorighandle, zbytecount)

      IF FEOF(zorighandle) AND EMPTY(zfound)
         EXIT
      ENDIF

      IF EMPTY(zfound)
         LOOP
      ENDIF

      zfound = zfound + ','
      zcolumncounter = 0

      DO WHILE NOT EMPTY(zfound)
         zcolumncounter = zcolumncounter + 1
         zstr = ALLTRIM(ATXLEFT(zfound, ','))
         zfound = ATXRIGHT(zfound, ',')

         IF LEN(zstr) > myarray[zcolumncounter]
            myarray[zcolumncounter] = LEN(zstr)
         ENDIF

      ENDDO

   ENDDO

ENDDO

= FCLOSE(zorighandle)

FLUSH

&& Now define your DBF fields for your DBF using myarray for field lengths

[/tt]
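To tie those lengths back to your real field names, one possibility (also untested) is to walk your COPY STRUCTURE EXTENDED table in step with myarray:

[tt]
* Assumes structinfo was created with COPY STRUCTURE EXTENDED from your empty master table
* and that its records are in the same order as the columns in the text file
USE structinfo
SCAN
   REPLACE FIELD_LEN WITH MAX(myarray[RECNO()], 1)   && record N pairs with column N
ENDSCAN
USE
* Then CREATE yournewtable FROM structinfo to get the "efficient" structure
[/tt]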


mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 

[tt]
STORE 0 TO myarray
[/tt]


should be

[tt]
STORE 1 TO myarray
[/tt]


mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 

Dang, I just realized you said TAB delimited and not COMMA delimited.

So add this line at the top:

[tt]
zdelimiter = CHR(9)
[/tt]


and change these lines

[tt]
zfound = zfound + ','

zstr = ALLTRIM(ATXLEFT(zfound, ','))
zfound = ATXRIGHT(zfound, ',')
[/tt]


to these lines

[tt]
zfound = zfound + zdelimiter

zstr = ALLTRIM(ATXLEFT(zfound, zdelimiter))
zfound = ATXRIGHT(zfound, zdelimiter)
[/tt]




mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 
Thanks for all your ideas!

JRB-Bldr - I downloaded SQL Server 2008 Express and plan on working with it some this weekend. I know you mentioned SQL Server 2005 Express, but version 2008 was available so I opted for that one - I hope I made the proper choice.

myearwood - Your approach would work for me except for one thing: I believe the field lengths also need to be defined when creating the cursor, and those maximum field lengths are exactly what I am trying to figure out beforehand.

mmerlinn - I utilized the code and UDFs you provided, made some minor adjustments, and it does EXACTLY what I need. Thank you so much!

But at the risk of sounding a little ungrateful (I am definitely grateful!!), it runs through the tab-delimited file pretty slowly. Based on some preliminary benchmarking, it would take almost 2.5 hours to run through a file containing 1.5MM records. Is this a function of working with text files and low-level commands, or of the multiple calls to the UDFs, or both?

Thanks again for all your help!

John

 
I just want to point out that when parsing any kind of delimited string, ALINES() is pretty much always the best choice. For example, to break up a line of comma-separated items, you can do this:

LOCAL aItems[1], nItemCount
* Assumes the comma-separated string is in cString
nItemCount = ALINES(aItems, m.cString, 1, ",")   && 1 = trim each item, "," = the separator

and you have an array with one item per element.
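For the tab-delimited case in this thread - just a sketch, assuming cLine holds one record and myarray is the array of running maximum lengths from the earlier code - it would be something like:

LOCAL aFields[1], nFields, i
nFields = ALINES(aFields, m.cLine, 1, CHR(9))        && 1 = trim each value, CHR(9) = tab
FOR i = 1 TO MIN(m.nFields, ALEN(myarray))
   myarray[m.i] = MAX(myarray[m.i], MAX(LEN(aFields[m.i]), 1))   && blank fields still count as 1
ENDFOR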

Tamar
 

Most likely the extreme number of calls to the UDFs.

If you use SUBSTR() & AT() instead, it should be a LOT faster. The UDFs really should not be used in massive loops, and I never thought of that when I reduced my CSV scraping program to something you could use. There is even a note in the FAQ that indicates there might be a noticeable time penalty for using the UDFs.

Try replacing

[tt]
zstr = ALLTRIM(ATXLEFT(zfound, zdelimiter))
zfound = ATXRIGHT(zfound, zdelimiter)
[/tt]


with

[tt]
zstr = ALLTRIM(LEFT(zfound, AT(zdelimiter, zfound) - 1))
zfound = SUBSTR(zfound, AT(zdelimiter, zfound) + LEN(zdelimiter))
[/tt]


and see how much faster it is.

I am not very familiar with SUBSTR() & AT() because I seldom use them, so I don't know if you will need to do any further error checking in this instance.

But even though it is slow (though probably still faster than what you were doing before), I am glad that you were able to make that code work.

mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 
OK. I adjusted the code and re-ran the program. How much faster? Scary faster. About 12 minutes to cycle through the text file of 1.5MM records this time, instead of almost 2.5 hrs.

I actually did think the initial "slowness" was due to the numerous UDF calls, and was in the midst of making similar adjustments.

Thanks again for all your help!

John
 

15 million calls to the UDFs definitely slowed things down. That basically proves you should NEVER use ANY UDFs inside massive loops.

Now you are probably wishing you had been able to scrape the text files this way months sooner. I don't know exactly how much time this will save you, but it will be considerable, to say nothing of the headaches you no longer have to deal with.

mmerlinn


"We've found by experience that people who are careless and sloppy writers are usually also careless and sloppy at thinking and coding. Answering questions for careless and sloppy thinkers is not rewarding." - Eric Steven Raymond
 
So what did you use in the end? With ALINES() you can trim the values in the same step that separates them, so I agree with Tamar that it is the recommended approach.

Bye, Olaf.
 