
Fastest Binary File Parser EDI 2

I am looking for the fastest method of parsing large binary files. I have to scan the data in the file to figure out the delimiters first, as they can differ between files received. Once I know the data and segment delimiters I can parse the data. So far I have tried FOPEN() and FREAD() one character at a time, which is pretty slow. I have also looked at doing the same type of process with FILETOSTR() and then scanning through the string one character at a time; once I hit the end-of-segment delimiter I do something with the resulting string. So the question is: has anyone else had to deal with this, and what is the best method of doing this in Fox? The files can be from 11-30 MB and, as I said, are binary. Speed is critical, and right now I am at about 10 minutes for an 11 MB file.

In the sample line below from the binary file, the * represents the data delimiter and < represents the segment delimiter:

************

ISA*00* *00* *02*UPSN *88*4838483848 *848384*1111*U*11111*122211212*0*P*><GS*IM*UPSN*9466284343*87994939*1111*222*X*443333<ST*444*009888888<B3**000000EA8484897**PP**20070553*1750****PPPP*24444321<C3*USD<ITD*01<N9*18*00003434KV

***********

The end result will be something like this (showing 3 segments truncated):

ISA*00.....<
GS*IM*....<
ST*444*...<

I would have code to handle each segment in steps. The biggest issue is that I need to process these large files very quickly, but so far I am at about 10 minutes using FREAD() and a bit less with the FILETOSTR() method.

Regards,

Rob
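For readers following along in another language, the "guess the delimiters from a sample" step Rob describes can be sketched like this in Python (a rough translation, not VFP; the candidate sets, sample size, and function name are my own assumptions):

```python
# Sketch: guess the field and segment delimiters by counting candidate
# characters in a sample prefix of the file. The candidate characters and
# the 1000-byte sample size are illustrative assumptions.

def guess_delimiters(path, sample_size=1000,
                     field_candidates="*|,", segment_candidates="<>~"):
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # The most frequent candidate in the sample is taken as the delimiter.
    field = max(field_candidates, key=lambda c: sample.count(c.encode()))
    segment = max(segment_candidates, key=lambda c: sample.count(c.encode()))
    return field, segment
```

On data shaped like the ISA sample above, this returns `("*", "<")`; a production version would also sanity-check the counts, as Brian's VFP code below does.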
 
Use FOPEN(), FCREATE(), FGETS() and FPUTS(). Take a less-than-full chunk (say 1000 chars) and determine the delimiters with that.

Then read the rest of the file with FGETS(), parse on the fly, and write to the new file.

If your files are *really* large, then you need to use something else since VFP can only handle 2 gig text files. See FAQ Bypassing Low Level File Functions 2GB Limits faq184-4732

VFP is faster though, so use it if you can.

Brian
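The chunk-and-carry loop Brian describes (read a buffer, emit the complete segments, carry the unfinished tail into the next buffer) looks roughly like this in Python; buffer size and names are illustrative, not from the thread:

```python
# Sketch of a chunked read with tail carry-over: split out complete
# segments from each buffer and keep the trailing partial segment for
# the next read. This is the same idea as Brian's FGETS() loop below.

def split_segments(path, segment_delim=b"<", bufsize=64 * 1024):
    segments = []
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            parts = (tail + chunk).split(segment_delim)
            tail = parts.pop()          # incomplete segment; keep for next chunk
            segments.extend(parts)
    if tail:
        segments.append(tail)           # trailing data with no final delimiter
    return segments
```

The carry-over is the crucial part: a segment boundary rarely lands exactly on a buffer boundary, so each read must be stitched to the leftover from the previous one.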
 
This 25 meg sample file takes 5 seconds on my box. I'm sure you could build out the delimiter logic as needed.

Brian

Code:
CLOSE ALL

m.cString = "ISA*00*          *00*          *02*UPSN           *88*4838483848     "
m.cString = m.cString + "*848384*1111*U*11111*122211212*0*P*>"
m.cString = m.cString + "<GS*IM*UPSN*9466284343*87994939*1111*222*X*443333<"
m.cString = m.cString + "ST*444*009888888<B3**000000EA8484897**PP**20070553*1750****"
m.cString = m.cString + "PPPP*24444321<C3*USD<ITD*01<N9*18*00003434KV<"

m.nNew= FCREATE("test.txt")
FOR m.i = 1 TO 100000
	FPUTS(m.nNew,m.cString)
ENDFOR 
FCLOSE(m.nNew)

m.nsec=SECONDS()

m.nOut = FCREATE("output.txt")
m.nIn = FOPEN("test.txt")

m.cString = FGETS(m.nIn,1000)

*test field delimiter
m.nDelim = 0
m.nStar = OCCURS("*",m.cString)
m.nPipe = OCCURS("|",m.cString)
m.nComma = OCCURS(",",m.cString)

DO CASE
	CASE m.nStar>m.nPipe AND m.nStar>m.nComma
		m.cDelim = "*"
		m.nDelim = m.nStar 
	CASE m.nPipe>m.nStar AND m.nPipe>m.nComma
		m.cDelim = "|"
		m.nDelim = m.nPipe
	CASE m.nComma>m.nStar AND m.nComma>m.nPipe
		m.cDelim = ","
		m.nDelim = m.nComma 
	CASE m.nComma+m.nStar+m.nPipe < 5 && arbitrary
		MESSAGEBOX("delimiter not found")
ENDCASE 

*test segment delimiter
m.nLineDelim = 0
m.nLT = OCCURS("<",m.cString)
m.nGT = OCCURS(">",m.cString)

DO CASE
	CASE m.nLT>m.nGT
		m.cLineDelim = "<"
		m.nLineDelim = m.nLT
	CASE m.nGT>m.nLT
		m.cLineDelim = ">"
		m.nLineDelim = m.nGT
	CASE m.nGT+m.nLT < 5 && arbitrary
		MESSAGEBOX("line delimiter not found")
ENDCASE 

IF m.nDelim <= m.nLineDelim OR m.nDelim<=5 OR m.nLineDelim<=5
	MESSAGEBOX("We have a delimiter issue.")
	RETURN 
ENDIF 
DO WHILE NOT FEOF(m.nIn)
	FOR m.i = 1 TO OCCURS(m.cLineDelim,m.cString)
		FPUTS(m.nOut,GETWORDNUM(m.cString, m.i, m.cLineDelim) + m.cLineDelim) &&if you want to keep line delimiter
	ENDFOR
	
	m.cStringTail = GETWORDNUM(m.cString, m.i, m.cLineDelim)
	
	m.cString = m.cStringTail  + FGETS(m.nIn,1000)
ENDDO 

CLOSE ALL
MODIFY FILE "output.txt" NOWAIT 
MODIFY FILE "test.txt" NOWAIT 

MESSAGEBOX("Done in "+ TRANSFORM(SECONDS()-m.nSec) +" seconds")
 
Thanks. The biggest issue is that I want this to be as dynamic and fast as possible. I have been testing using FGETS() returning only one character at a time, which lets me check each character to see if it's a delimiter. The problem is that FGETS()/FREAD() take forever (10 minutes) on an 11 MB file. Anyway, we currently have in place a somewhat involved process that takes the binary data and massages it into a text format that has CRLFs and can be imported into a cursor. That process takes about 2-3 minutes.

Regards,

Rob
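The gap Rob is seeing between one-character reads and chunked reads is easy to reproduce in any language. A minimal Python illustration (function names are mine) that scans the same file both ways:

```python
# Rough illustration of why byte-at-a-time reading is slow: the same
# delimiter count done with one read() call per byte versus one buffered
# read of the whole file. The results are identical; only the I/O
# pattern, and therefore the speed, differs.

def count_delims_bytewise(path, delim=b"<"):
    n = 0
    with open(path, "rb") as f:
        while True:
            ch = f.read(1)              # one read call per byte
            if not ch:
                break
            if ch == delim:
                n += 1
    return n

def count_delims_buffered(path, delim=b"<"):
    with open(path, "rb") as f:
        return f.read().count(delim)    # one big read, then scan in memory
```

On a multi-megabyte file the per-call overhead of the first version dominates, which matches the 10-minute FREAD() timing versus seconds for the chunked approach.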
 
Also note that cStringTail does nothing in this example, because FPUTS() inserts the CRLFs that FGETS() breaks on. If there were no CRLFs (e.g. creating the test file with the code below), the tail does indeed matter.

Brian

Code:
*m.nNew= FCREATE("test.txt")
STRTOFILE(m.cString,"test.txt",0)
FOR m.i = 1 TO 100
*	FPUTS(m.nNew,m.cString)
STRTOFILE(m.cString,"test.txt",1)
ENDFOR 
*FCLOSE(m.nNew)
 
The speed issue isn't due to the low-level file functions. Perhaps the "involved" code is doing more than it needs to in order to reach your goal.

You are aware of STRCONV()?

Brian
 
Thanks, I will work on these ideas when I get a break. I appreciate it.

Regards,

Rob
 
How about using FILETOSTR() and then ALINES() to break the data up at the segment delimiter? If you're not in VFP 9, you do then have to worry about the number of segments, but 65,000 is pretty substantial.

Tamar
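Tamar's whole-file approach corresponds to reading everything into memory and splitting once, rather than scanning character by character. A Python analogue (names and encoding choice are assumptions):

```python
# Analogue of the FILETOSTR() + ALINES() idea: load the whole file and
# split on the segment delimiter in a single pass. Reasonable for files
# that fit comfortably in RAM, as the 11-30 MB files in question do.

def segments_in_memory(path, segment_delim="<"):
    # latin-1 maps every byte to a character, so binary data passes
    # through a text split unchanged.
    with open(path, "r", encoding="latin-1", newline="") as f:
        data = f.read()
    return [s for s in data.split(segment_delim) if s]
```

This trades memory for speed: one allocation and one split instead of millions of per-character function calls.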
 