
*Faster* file access / string extraction


jnpoyser (IS-IT--Management)
Sep 13, 2002
Hi all,

I'm trying to process some data files; I have around a gig of files in total. I'm using random file access as each file has a fixed record size. If I use a user-defined data type with the Get statement I can easily get a throughput of over 3,000 records per second. Great!

However, the structure needs to be defined at run time, so I can't (well, I don't think I can) use a structure. Instead, the format of the data file is stored in a config file that is loaded when the routine is run. With me so far?

This is the problem, though: I have to use the Mid function instead, which drops my speed down to 500 records per second. Not so good.

Any ideas?


James.
 
All the way to the "mid"dle. MichaelRed
m.red@att.net

There is never time to do it right but there is always time to do it over
 

Since I do not know what your code is like, let me ask you a question: could you use a structure larger than necessary that would handle the various sizes?
 
The structure would have to match the size of the record exactly or VB will throw an error. As there is no way to create a structure at run time and each file requires a different size structure, I cannot create a 'generic' structure to be used for all files.

James.
 
Have you considered using ADO's text driver to treat the text file as a database table? Then you could dynamically build a select statement in code to retrieve the records.
 
It seems clear from your post that your processing is based on meta-data that you're loading from a config file at runtime. You read in a string of bytes and then need to map that data into organized fields, as defined by the meta-data? I am thinking that using the VarPtr and StrPtr pointers in combination with the CopyMemory API may prove useful.

What does the meta-data look like?
Do you have an enumerable list of record structures? That is, do you know at design time that you have 10 different record types, but just don't know which one until run time?

You stated in your original post that you needed a Mid function. How did you know that? I know there is a lot of info that you don't know until runtime, but have you told us everything you do know at design time, so that we have some idea of what info we can take advantage of? Good Luck
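As a rough illustration of that idea (a sketch, not James's code; it assumes the record has already been read into a Byte array and that the offsets and lengths come from the loaded meta-data):

Code:
' Minimal sketch: pull one field out of a record buffer with CopyMemory,
' then convert the raw ANSI bytes to a normal VB string. All names here
' are illustrative.
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

Private Function GetField(bRecord() As Byte, ByVal lOffset As Long, _
                          ByVal lLength As Long) As String
    Dim bField() As Byte
    ReDim bField(0 To lLength - 1)
    'copy the field's bytes out of the record buffer
    CopyMemory bField(0), bRecord(lOffset), lLength
    'convert the ANSI bytes to a (Unicode) VB string
    GetField = StrConv(bField, vbUnicode)
End Function

Whether this actually beats Mid$ would need timing against the real files; the copy itself is cheap, but the byte-array-to-string conversion still costs something.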
--------------
As a circle of light increases so does the circumference of darkness around it. - Albert Einstein
 
Just as a helpful idea: if you're currently using the Variant-returning forms of the string functions, your speed can be greatly improved by switching to the String ($) versions. Sample:

Instead of -
Code:
Dim myLine

Line Input #1, myLine
myLine = Mid(myLine, InStr(1, myLine, ";"))

Use -
Code:
Dim myLine As String

Line Input #1, myLine
myLine = Mid$(myLine, InStr(1, myLine, ";"))

Note the use of the dollar sign on the end of Mid, and the explicit As String declaration. I don't know if this is what you are doing or not, but in some cases these calls can be 80% faster.

I'd suggest posting code / pseudo-code for us to see. It helps us to optimize your code. Good luck, -iNSTA
aim: instar4per
email: instar4per @ hotmail.com
 
Hi all,

Here's an example of a definition file that will be read in:

1,STK-COMP,AN
15,STOCKNO,AN
25,STKDES1,AN
25,STKDES2,AN
25,STKDES2-INNERS,AN
5,STKGROUP,AN
15,STK-THEIR2,AN
1,STKVATCD,N
2,STK-COMMENTS,N
6,STKPACK,N
6,STKPACK-WEIGHT,C4
6,STKPACK-VOLUME,C4
6,STKSELL,C4
6,STKSELL-INNER,C4
5,STKSELL-UNITD,AN
3,STKSELL-UNITQ,C4
6,STKCOST,C4
5,STKCOST-UNITD,AN
3,STKCOST-UNITQ,C4
6,STK-INNERS-QTY,C
5,STK-INNERS-UNITD,AN
1,STKDMK,N
3,STKSDP,N
2,STKBCDE-1,C4
6,STKBQTY-1,C4
6,STKBPRICE-1,C4
...etc for approx 100-250 fields

The first value is the length of the field, the second is the name and the third is the format.

The goal is to supply the app with the definition file and data file at run time and then produce either a properly formatted and converted tab-separated file or pump it into an SQL database.


The code so far...
The above def. file is read into a structured array holding the three fields.

In the data file each record is a fixed, known length. For the above example, I open the file in random mode with a record size of 937 bytes.

I read the record into a string and use 'mid' to extract the first byte and append it to the output, then the next 15 bytes, then the next 25 bytes etc etc:

i.e. I use Mid(strTempLine, iLinePos, FileItems(x).iSize) to extract the current field.

Once the record is complete, the output is written to a line in a new file, the string is cleared and the next record is read from the file.
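Roughly, the loop looks something like the sketch below (a reconstruction, not the actual code; the FieldDef type and names are illustrative, and loading the definition file into FileItems() is left out):

Code:
Private Type FieldDef
    iSize As Integer
    sName As String
    sFormat As String
End Type

Private FileItems() As FieldDef      'loaded from the definition file (omitted)

Private Sub ConvertFile()
    Dim strTempLine As String * 937  'fixed record length for this definition
    Dim strOut As String
    Dim iLinePos As Long, x As Long, lRec As Long

    Open "data.fil" For Random As #1 Len = Len(strTempLine)
    Open "out.txt" For Output As #2

    For lRec = 1 To LOF(1) \ Len(strTempLine)
        Get #1, lRec, strTempLine            'read one fixed-length record
        strOut = ""
        iLinePos = 1
        For x = LBound(FileItems) To UBound(FileItems)
            'extract the current field and append it, tab-separated
            strOut = strOut & Mid$(strTempLine, iLinePos, FileItems(x).iSize) & vbTab
            iLinePos = iLinePos + FileItems(x).iSize
        Next x
        Print #2, strOut
    Next lRec

    Close #2
    Close #1
End Sub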



Someone else has suggested VarPtr and StrPtr - could I use this to access specific bytes directly? How could I, for example, extract 25 bytes starting at byte offset 123?



James.



 
I have used CajunCenturion's suggestion successfully to process files of ~200-400 MB by reading the files in chunks (size depending on hardware, ~10 MB) into a byte array and manipulating that before dumping it to an output file.
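In outline, that chunked read might look something like this (a reconstructed sketch rather than the code actually used; names and sizes are illustrative):

Code:
' Minimal sketch of reading a large file in ~10 MB chunks into a Byte array.
' In a real run, make CHUNK_SIZE a whole multiple of the record length so
' that no record straddles two chunks.
Private Sub ProcessInChunks()
    Const CHUNK_SIZE As Long = 10485760          '~10 MB per chunk
    Dim bBuf() As Byte
    Dim hFile As Integer
    Dim lPos As Long, lBytes As Long, lFileLen As Long

    hFile = FreeFile
    Open "data.fil" For Binary Access Read As #hFile
    lFileLen = LOF(hFile)
    lPos = 1

    Do While lPos <= lFileLen
        lBytes = CHUNK_SIZE
        If lPos + lBytes - 1 > lFileLen Then lBytes = lFileLen - lPos + 1
        ReDim bBuf(0 To lBytes - 1)
        Get #hFile, lPos, bBuf                   'read the next chunk
        '...slice bBuf into records/fields here (e.g. CopyMemory/StrConv)...
        lPos = lPos + lBytes
    Loop

    Close #hFile
End Sub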
However, since you mention SQL Server, a faster method would be to use the definition file to dynamically create a format file that describes your data and then use that in a BULK INSERT command. If your final goal is to put the data into SQL Server, I don't think it can be done much faster than that.
Sunaj
'The gap between theory and practice is not as wide in theory as it is in practice'
 

As strongm says, why not use ADO/Jet or DAO/Jet?

You could connect to the text file, use GetRows to retrieve all the data at once into a 2 dimensional array (Field number, RecordNumber), which is pretty fast, or the GetString method which will retrieve all of the data at once into a string (including column and row delimiters).

And, you might want to use a Schema.Ini to hold the definition. Jet will automatically use this definition file to determine the structure of a text file located in the same folder. If there are different files with different structures in the same folder, then this can be handled within the same Schema.Ini file, listing each file's name and its structure separately.
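As a rough illustration of the Schema.Ini approach (the folder, file and column names below are made up, and the hyphens in the original field names are swapped for underscores):

Code:
' Hedged sketch: Schema.Ini plus ADO against the Jet text driver.
' A Schema.Ini sitting next to DATA.TXT might read:
'   [DATA.TXT]
'   Format=FixedLength
'   Col1=STK_COMP Text Width 1
'   Col2=STOCKNO Text Width 15
'   ...one ColN line per field...
Private Sub ReadViaJet()
    Dim cn As Object, rs As Object
    Dim vData As Variant

    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
            "Data Source=C:\Import\;" & _
            "Extended Properties=""text;HDR=No"""

    Set rs = cn.Execute("SELECT * FROM [DATA.TXT]")
    vData = rs.GetRows()                 '2-D array: (field index, record index)

    rs.Close
    cn.Close
End Sub

GetRows pulls the whole table into memory in one go, so for the very large files it may be better to page through the recordset instead.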
*******************************************************
General remarks:
If this post contains any suggestions for the use or distribution of code, components or files of any sort, it is still your responsibility to assure that you have the proper license and distribution rights to do so!
 
What I forgot to mention is that you could also use the Schema.Ini and Jet, and the Execute method of ADO/DAO, to get all the data from a text file and either insert it into a new text file or into a database table, using an INSERT INTO ~ SELECT ~ FROM ~ IN statement, all in one clean sweep....
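A rough sketch of what that one-sweep statement could look like (table, folder and file names are illustrative; the [Text;Database=...] spelling used here and the IN clause are two ways of pointing Jet at the same external source):

Code:
' Hedged sketch: push the text file straight into a Jet/Access table
' with a single INSERT INTO ... SELECT statement. Names are made up.
Private Sub ImportInOneSweep()
    Dim cn As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Target.mdb"

    cn.Execute "INSERT INTO tblStock " & _
               "SELECT * FROM [Text;Database=C:\Import\].[DATA.TXT]"

    cn.Close
End Sub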
 
Much of the above skirts one question.

Why are you using random access to retrieve the original information?

Many / most of the suggestions are oriented toward loading the various text files into some data structure in their entirety, while the use of random access implies that you are selecting specific records from the various files and creating smaller recordsets which are processed via standard (SQL) approaches.

In your original post you said "...I'm using random file access as each file has a fixed record size. If i use a user defined data type with the get statement I can easily get a throughput of over 3,000 records per second ...".
If the use of random access is simply to improve the throughput of the processing, I think it would be even faster to load the file in its entirety into a single string variable and process the string in memory (a rule of thumb is that memory operations are more than 10x faster than disc activity).

In general, you can apply whatever parsing you currently apply to the disc file to the string instead.


Troy Williams suggested this approach in a post some time ago, and I modified it to be more general, as follows:


Code:
Public Function basTextToArray(FName As String, _
                               Optional RecSep As String = vbCrLf) _
                                As Variant

    Dim Fil As Integer
    Dim NRecs As Long
    Dim RawFile As String                       'Holds the entire contents of the file
    Dim RawSplit() As String                    'the file split up on a line per line basis

    Fil = FreeFile                              'get the next free file number

    'This works for large files. I (Troy Williams)tried it
    'with a 50 meg file on a computer with 128 Mb of ram and it worked fine.
    'open the file and dump the contents into the rawfile variable

    Open FName For Binary As #Fil               'Open file
    RawFile = String$(LOF(Fil), 32)             'Create "empty" String of Length
    Get #Fil, 1, RawFile                        'Fill "Empty Str with File
    Close #Fil                                  'Close File

    'Get the number of records
    RawSplit = Split(RawFile, RecSep)           'Split the file up by lines
    NRecs = UBound(RawSplit) - 1                'Num Lines/Records in the file

    basTextToArray = RawSplit

End Function
Public Function basTestTextToArray(FName As String)

    Dim MyLines() As String
    Dim Idx As Integer

    MyLines = basTextToArray(FName)

    For Idx = 0 To UBound(MyLines)
        Debug.Print MyLines(Idx)
    Next Idx

End Function

Of course, if the inputs are truly Random access files, the Split function in the above would generally NOT be useful, but would be replaced with the record-parsing code appropriate to the individual file.

To 'try' the process, I also created the second function, as the first one simply returns an array of strings representing the individual lines in the original file.  I would suggest (the obvious) that you use the test program only on a SMALL sample file.  Its purpose is just to demonstrate the process.  In a 'real' application, the loop to print the individual lines would be replaced with the actual parsing process.

For CSV (or any character delimited) files, it is quite reasonable to again use the "Split" function to reduce the individual lines to an additional array.  Just "Split" the first line and use its UBound to determine the number of fields, and the UBound of the 'line' array as the number of records.
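As a small sketch of that (a few lines that could slot into basTestTextToArray above; the comma delimiter is assumed):

Code:
Dim Fields() As String
Dim NumFields As Long, NumRecords As Long

Fields = Split(MyLines(0), ",")      'split the first line into its fields
NumFields = UBound(Fields) + 1       'fields per record
NumRecords = UBound(MyLines) + 1     'lines/records in the file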

On a completely different note, you also state:

" ... 
6,STKBPRICE-1,C4
...etc for approx 100-250 fields

..."

This will approach the limit on the number of fields allowed in MS Access, and in most current relational databases.  It will need to be addressed, as tables created with this many fields (~250) will not be usable.

 MichaelRed
m.red@att.net

There is never time to do it right but there is always time to do it over
 
Thanks for all of the replies; again, I'll clear up a few little spots!

As for using an ADO text driver or dumping straight into SQL Server, I don't think this will be possible. A lot of the data is encoded and stored in odd formats that require converting into something Windows is happier with (e.g. a 12-digit signed value stored in a 6-byte sequence). As far as I know, not even DTS in SQL Server can manage this.
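(To give a flavour of the kind of conversion involved, and purely as a guess at one plausible encoding rather than the actual format: if such a 6-byte field were packed BCD with two digits per byte, it could be unpacked like this.)

Code:
' Purely hypothetical sketch: unpack a packed-BCD byte array into a digit
' string, two decimal digits per byte (sign handling left out).
Private Function UnpackBCD(bField() As Byte) As String
    Dim i As Long, s As String
    For i = LBound(bField) To UBound(bField)
        'high nibble then low nibble of each byte
        s = s & CStr(bField(i) \ 16) & CStr(bField(i) And 15)
    Next i
    UnpackBCD = s
End Function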

As for using a random method: the file is actually accessed sequentially; random mode is just left over from when I used a structure to hold the values.

Although the files have up to 250 fields, many of them will be merged or dropped when they reach the database, so that won't be a problem.


I don't know how practical loading the whole file into memory would be (some files are around 200-300 MB), but loading more workable chunks into a byte array sounds like a plan.

Just a quickie about byte arrays: how could I, for example, grab characters 123 to 133 and put them into a new string without adding them iteratively with concatenation?


Thanks again - James.

 