
UTF-8 validation 2

Status
Not open for further replies.

Olaf Doschke

Programmer
Oct 13, 2004
14,847
DE
I just wrote a FAQ about UTF-8 string encoding validation, see faq184-7900.

Bye, Olaf.
 
Here's a little showcase to explain the use of this validation. The idea to prevent double conversion is to convert only if a string does not validate as UTF-8. This somewhat works, but has its limits.

Code:
LOCAL lcSomestring

* the normal way to STRCONV
lcSomestring = ''+0he4f6fcdf
? "initially ANSI:"+lcSomestring 
lcSomestring = STRCONV(lcSomestring,9)
? "converted to UTF-8 once:"+lcSomestring
lcSomestring = STRCONV(lcSomestring,9)
? "converted to UTF-8 twice:"+lcSomestring
lcSomestring = STRCONV(lcSomestring,9)
? "converted to UTF-8 three times:"+lcSomestring
? "single reconverted to ANSI:"+STRCONV(lcSomeString,11)
? "double reconverted to ANSI:"+STRCONV(STRCONV(lcSomeString,11),11)
? "triple reconverted to ANSI:"+STRCONV(STRCONV(STRCONV(lcSomeString,11),11),11)
? "result: every conversion and reconversion changes the string"
? "-----"

* the 'definite' way to validate and only eventually STRCONV
lcSomestring = ''+0he4f6fcdf
? "initially ANSI:"+lcSomestring 
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 once:"+lcSomestring
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 twice:"+lcSomestring
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 three times:"+lcSomestring
? "result: the first conversion is stable"
? "-----"

* where the 'definite' conversion fails
lcSomestring = ''+0he4f6fcdf
? "initially ANSI:"+lcSomestring 
lcSomestring = STRCONV(lcSomestring,5)
? "converted to Unicode once:"+lcSomestring 
lcSomestring = STRCONV(lcSomestring,5)
? "converted to Unicode twice:"+lcSomestring 
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 once:"+lcSomestring 
? "result: the final UTF-8 conversion doesn't remove unnecessary Unicode 0 bytes, as they are valid UTF-8 chars, too."
? "-----"

lcSomestring = ''+0he4f6fcdf
? "initially ANSI:"+lcSomestring 
lcSomestring = STRCONV(lcSomestring,5)
? "converted to Unicode once:"+lcSomestring 
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 once:"+lcSomestring 
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 twice:"+lcSomestring 
? "result: at least the final UTF-8 is stable."

? "-----"
lcSomestring = ''+0hc3a4f6fcdf
? "initially partial UTF-8:"+lcSomestring
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 once:"+lcSomestring
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 twice:"+lcSomestring
lcSomestring = IIf(ValidateUtf8(lcSomestring )=0, lcSomestring ,StrConv(lcSomestring ,9))
? "converted to 'definite' UTF-8 three times:"+lcSomestring
? "result: the first conversion is stable, but the single already UTF-8 converted '"+0hc3a4+"' is still getting double converted."

We learn that this function neither mends ill-formed initial strings nor detects that the initial string encoding is Unicode (UTF-16). In VFP every string can contain any bytes and thus is essentially just a binary byte array, normally displaying fine with whatever your ANSI codepage setting is. VFP does not store any metadata about a string, such as its type or charset, i.e. its encoding.

STRCONV(x,9) does not only specify that the result is UTF-8, it also specifies that the initial x has to be 'DBCS', which in the case of STRCONV and VFP is equivalent to ANSI. Don't get confused by DBCS meaning double-byte character set; that's also true for ANSI, though only for very few Asian ANSI codepages. The VFP help means ANSI when it says DBCS; I also misunderstood that for a few years. If the initial string already is UTF-8, it's still treated as ANSI (DBCS). In the end you should know what you input to get the desired output. If you can't ensure the initial charset, the previous validation step at least can prevent double UTF-8 conversion where the initial string already is valid UTF-8. Though even simple Unicode strings like STRCONV('abc',5) are also valid UTF-8 strings, while they only get displayed without the extra spaces via a Unicode ActiveX control and not as UTF-8, if you visually expect 'abc' and not 'a b c '.

So finally this function is not there to mend or convert anything to UTF-8, but it helps to check where an invalid UTF-8 string could cause trouble. STRCONV(x,9) will turn any bytes into a string that validates with this function, though it might still not be the UTF-8 you would get knowing the real initial ANSI string. And of course it can't magically turn any ? back into the real character.

Last but not least, to point this out once more: The section commented [tt]* the normal way to STRCONV[/tt] shows the problem to solve, and the section commented with [tt]* the 'definite' way to validate and only eventually STRCONV[/tt] is the solution I thought of. If the input is either ASCII, or ANSI that is not UTF-8 compliant, or already UTF-8, then this will convert it to something that reconverts to ANSI with a single STRCONV(x,11).
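The guard can be sketched in any language; here's an illustrative Python version, assuming codepage 1252 stands in for the ANSI codepage (VFP would use the active OS codepage):

```python
def to_utf8_once(raw: bytes) -> bytes:
    """Convert ANSI bytes to UTF-8, but leave already-valid UTF-8 alone."""
    try:
        raw.decode("utf-8")      # validation step: does it parse as UTF-8?
        return raw               # yes: assume it was already converted
    except UnicodeDecodeError:
        return raw.decode("cp1252").encode("utf-8")  # no: convert once

ansi = bytes.fromhex("e4f6fcdf")           # 'äöüß' in codepage 1252
utf8 = to_utf8_once(ansi)                  # converted once
assert to_utf8_once(utf8) == utf8          # stable: a second call changes nothing

# the unguarded double conversion, in contrast, garbles the text:
twice = utf8.decode("cp1252").encode("utf-8")
assert twice != utf8
```

Note the same limits apply as above: a partially-UTF-8 string like 0hc3a4f6fcdf fails the validation as a whole and therefore still gets its already-converted part double-converted.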

Bye, Olaf.
 
Testing with "War and Peace", a text mostly containing ASCII letters, of course, the mainly executed code part is just:

Code:
For lnCounter = 1 To m.lnStringBytes
   lnFirstByte = Asc(Substr(m.tcString, m.lnCounter, 1))
   If m.lnFirstByte<0x80 && that's ok and identical to ASCII
      Loop
   Endif
   *...
EndFor

And it's really slow for longer texts; Substr() alone is the slow operation. Memory consumption doesn't rise. See the timing of the following experiment:

Code:
#Define cnStep 10000

Local lnLen, lnCounter, lcString, lcChar, lnT0, lnT
Create Cursor crsTiming (nLen I, nSeconds B)

For lnLen = cnStep To 30*cnStep Step cnStep
   lcString = Space(lnLen)
   lnT0 = Seconds()
   For lnCounter = 1 To m.lnLen
      lcChar = Substr(m.lcString, m.lnCounter, 1)
   Endfor
   lnT = Seconds()-lnT0
   Insert Into crsTiming Values (lnLen/cnStep, lnT)
Endfor

This takes about 30 seconds, with non-linear behaviour, as can be seen from graphing the recorded timings:

[Graph: substrtiming_xqzr7w.png]


Thus, for better performance, the bitshift or bitand operations play a minor role; we need to process the overall string in short portions.

Code:
#Define cnStep 10000

Local lnLen, lnCounter, lcString, lcChar, lnT0, lnT, lnRest, lcBuffer
Create Cursor crsTiming (nLen I, nSeconds B)

For lnLen = cnStep To 30*cnStep Step cnStep
   lcString = Space(lnLen)
   lnRest = lnLen
   lnCounter = 0
   lcBuffer = ''
   lnT0 = Seconds()
   Do While lnCounter < m.lnRest
      lnCounter = lnCounter + 1

      If lnCounter+5 > Len(m.lcBuffer) And Len(m.lcString)>0
         lcBuffer = Substr(m.lcBuffer, lnCounter) + Left(m.lcString, 1000)
         lcString = Substr(m.lcString, 1001)
         lnCounter = 1
         lnRest = Len(m.lcBuffer)+Len(m.lcString)
      Endif
      lcChar = Substr(m.lcBuffer, m.lnCounter, 1)
   Enddo
   lnT = Seconds()-lnT0
   Insert Into crsTiming Values (lnLen/cnStep, lnT)
Endfor

[Graph: substrtiming2_cij1k0.png]


This behaves linearly, as I'd rather expect SubStr() to behave even for long strings. It seems getting the nth byte is harder the longer the string is, but why? The long string remains unchanged; all VFP needs to do is copy 1 byte into lcChar from lcString's address plus offset.

By the way, this is already obvious at 300K; "War and Peace" is 3.5M bytes long, so you can guess how slowly it performs without introducing lcBuffer. After further experimenting with the ideal buffer size, I'll update the validation function to make use of this mechanism.

Bye, Olaf.
 
Got it. The VFP functions themselves, including SUBSTR(), receive the passed-in parameters by value and not by reference. What makes SUBSTR() slow is passing in the whole "War and Peace" text only to return one char of it, 3.5 million times. The by-value observation is also supported by the nature of the timing: the time necessary to iterate all single chars of a string rises quadratically. It's length times a SUBSTR() call, and each call copies length bytes, thus overall length^2 bytes are copied. That's why chopping the string into smaller pieces helps win the war on the pieces: 1024 bytes (2^10) copied 1024 times means 1MB (2^20) copied, while e.g. 256 bytes (2^8) copied 256 times each means 65536 bytes (2^16) copied; doing so 4 times you also iterate the full 1024 bytes, but only copy 4*2^16 = 2^18 bytes, that's 256KB, still only 1/4 of 1MB.
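This copy arithmetic can be checked with a tiny model; the two functions here are hypothetical illustrations of the cost model, not VFP code:

```python
def bytes_copied_naive(n: int) -> int:
    # n SUBSTR() calls, each copying the whole n-byte string by value
    return n * n

def bytes_copied_chunked(n: int, chunk: int) -> int:
    # n // chunk buffer refills; within each buffer, chunk calls copy chunk bytes
    return (n // chunk) * chunk * chunk   # = n * chunk

assert bytes_copied_naive(1024) == 2**20            # 1 MB copied in total
assert bytes_copied_chunked(1024, 256) == 2**18     # 256 KB: only 1/4 of that
```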

That sparks an idea I myself had already 12 years ago... For now put to a rest, maybe later this month...

I'm thinking of writing an FLL, but Ed Rauh's heap class comes in handy, too:

Code:
#Define cnStep 10000

Local lnLen, lnCounter, lnStringPtr, lcChar, lnT0, lnT
Create Cursor crsTiming (nLen I, nSeconds B)
apideclare()

For lnLen = cnStep To 30*cnStep Step cnStep
   lnStringPtr = HeapAlloc(gnHandle, 0, lnLen)
   * It's not necessary to initialize the allocated memory 
   * for this performance test, but we do so outside of
   * time measuring anyway:
   =RtlMoveMemory(lnStringPtr, Space(lnLen), lnLen) 
   
   lnT0 = Seconds()
   For lnCounter = 0 To m.lnLen-1
      lcChar = Sys(2600,lnStringPtr+m.lnCounter,1)
   Endfor
   lnT = Seconds()-lnT0
   HeapFree(gnHandle, 0, lnStringPtr)
   
   Insert Into crsTiming Values (lnLen/cnStep, lnT)
EndFor
HeapDestroy(gnHandle)


#DEFINE SwapFilePageSize  4096
#DEFINE BlockAllocSize    8192

Procedure apideclare()
   Public gnHandle

   DECLARE INTEGER HeapCreate IN WIN32API ;
      INTEGER dwOptions, ;
      INTEGER dwInitialSize, ;
      INTEGER dwMaxSize

   DECLARE HeapDestroy IN WIN32API ;
      INTEGER hHeap

   DECLARE INTEGER HeapAlloc IN WIN32API ;
      INTEGER hHeap, ;
      INTEGER dwFlags, ;
      INTEGER dwBytes
         
   DECLARE INTEGER HeapFree IN WIN32API ;
      INTEGER hHeap, ;
      INTEGER dwFlags, ;
      INTEGER lpMem

   DECLARE RtlMoveMemory IN WIN32API ;
      INTEGER nDestBuffer, ;
      STRING @pVoidSource, ;
      INTEGER nLength

   * usage
   gnHandle = HeapCreate(0, BlockAllocSize, 0)
   * nPtr = HeapAlloc(gnHandle, 0, nBytes)
   * HeapFree(gnHandle, 0, nPtr)
   * =RtlMoveMemory(nPtr, cSource, Len(cSource))  && take care len<=alloc size
   * HeapDestroy(gnHandle)

This runs fastest; it only copies the long string once, then addresses and reads single bytes from that memory block via SYS(2600):

[Graph: substrtiming3_qhk6uf.png]

As can be seen from the nSeconds values, the new line only goes up to about 1/3. It's always a bit riskier to make use of memory allocation and pointers, but here it pays off with another factor of 3. An FLL could even return the address of a VFP string variable and not need a single copy as done here with RtlMoveMemory, so the string bytes would only be iterated once and not twice; another factor of about 2 is to be expected from that.
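The underlying idea — read bytes through a view onto the buffer instead of copying per access — exists in most languages; Python's memoryview is one analogue (a sketch for comparison, not VFP):

```python
data = b"War and Peace " * 1000          # stand-in for a long text
view = memoryview(data)                  # no copy: a window onto the same bytes

# slicing the view still copies nothing, unlike slicing the bytes object
head = view[0:3]
assert head.obj is data                  # the slice still references the original buffer
assert bytes(head) == b"War"

# per-byte access through the view is O(1), independent of len(data)
assert view[0] == ord("W")
```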

Bye, Olaf.
 
Olaf, this is a remarkable investigation into the manipulation of strings in VFP. You gave us, as usual, lots of stuff to consider, and directions to follow. It's always a pleasure to see how you look into problem details, and how you reason about the VFP ways of doing things (and not limited to VFP, but regarding programming in general).

That said, and unless I'm failing to see something obvious, I think that you undervalued the capability of STRCONV() to be at the base of a UTF-8 validator. And, as it happens, a very efficient one.

Since UTF-8 is a representation of UNICODE, we can verify if a UTF-8 string is valid by changing it into UNICODE and then back to UTF-8 again. If we get back the first string, then it is valid UTF-8; if not, there is a problem with the encoding.

Code:
FUNCTION ValidateUTF8x (UTF8 AS String) AS Boolean
	RETURN STRCONV(STRCONV(m.UTF8,12),10) == m.UTF8
ENDFUNC
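For comparison, the same round-trip idea in Python — a sketch assuming Python's errors='replace' handler behaves like STRCONV()'s bad-character substitution:

```python
def validate_utf8_roundtrip(raw: bytes) -> bool:
    # decode with replacement, re-encode; only valid UTF-8 survives unchanged
    return raw.decode("utf-8", errors="replace").encode("utf-8") == raw

assert validate_utf8_roundtrip("déjà vu".encode("utf-8"))        # valid UTF-8
assert not validate_utf8_roundtrip(bytes.fromhex("e4f6fcdf"))    # ANSI bytes, not UTF-8
```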

I downloaded Flaubert's Madame Bovary from Project Gutenberg, in French, and passed it through both functions. I think it's more interesting to perform these tests with a text that holds lots of multi-byte sequences, which is not the case with an English one.

On my machine, the STRCONV()-based function analyzed it instantaneously (that is, in under 1 second), while your function took 214 seconds to complete, both reporting the file as valid, as they should.

Furthermore, STRCONV() can produce a correct UTF-8 text out of an incorrect one, by replacing an invalid sequence with a bad-character mark (�).

Of course, the STRCONV() function does not provide, like yours does, the reason why the encoding fails, and that, along with the indication of the error location, can be quite useful if one needs to correct a badly encoded text manually.

António
 
Yes, that's a remarkable and of course much simpler solution with regard to UTF-8 validation. Nice, I should have thought of that. It's simple and fast, as it just passes through the string twice.

Indeed, one of the remaining advantages of parsing a string char by char is to determine the reason for failure, and it would be easy to extend that to show the parsing error position, too. You might also get false positions from bad-char marks already present in the original. I still want to speed this up; the FAQ version did not finish on War and Peace at all on my PC.

Yes, there's not much to test in English texts, but as far as I tested the coverage of execution branches, some simple things like the typographic apostrophe have a multi-byte UTF-8 encoding, though that could also be put as CHR(39) in War and Peace. The use case I need this for is not really texts; it's identifiers for POS system registers and receipt identifiers, which in the easiest case could just be numbers, but are not limited to that in the general case. Actually, I could blindly STRCONV() any input and finally convert it back; the UTF-8 converted version of the identifier is never shown, it's just part of signature data. The bad case is that the signature could differ if a test routine does not first convert to UTF-8 when the string already is UTF-8.

And besides that, it was fun to do, making a dull job more interesting. In very short, the UTF-8 encoding has a simple schema: the number of 1 bits before the first 0 bit in the first byte of a character is the number of extra bytes belonging to the character. That also holds true for the lower half of the ASCII or ANSI 1252/1254 codepages up to CHR(127), where the first (most significant) bit is 0, so there are no extra bytes, as there is no 1 bit before the 0 bit. It's not surprising UTF-8 is more popular than simpler Unicode codepages needing 2 bytes for every character; most Latin-language text is simply compatible with ASCII and needs fewer bytes. Besides this bit rule there are a few corner cases, which were nicely covered in the Wikipedia article I used to write these tests.
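That bit schema can be sketched in Python; this checks only the length rule plus continuation bytes, not the corner cases (overlong encodings, surrogates, leads above 0xF4):

```python
def validate_utf8(data: bytes) -> int:
    """0 if data follows the UTF-8 length schema, else -(1-based position of bad byte)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:             # 0xxxxxxx: ASCII, no extra bytes
            extra = 0
        elif lead >> 5 == 0b110:    # 110xxxxx: 1 continuation byte
            extra = 1
        elif lead >> 4 == 0b1110:   # 1110xxxx: 2 continuation bytes
            extra = 2
        elif lead >> 3 == 0b11110:  # 11110xxx: 3 continuation bytes
            extra = 3
        else:                       # stray continuation byte or invalid lead
            return -(i + 1)
        for j in range(1, extra + 1):
            if i + j >= len(data) or data[i + j] >> 6 != 0b10:
                return -(i + 1)     # continuation byte missing or malformed
        i += extra + 1
    return 0

assert validate_utf8(b"abc") == 0
assert validate_utf8(bytes.fromhex("c3a4")) == 0       # valid two-byte char
assert validate_utf8(bytes.fromhex("e4f6fcdf")) == -1  # 0xE4 lead lacks continuations
```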

I just tested it, and so far this simple test holds true: the test cases I constructed to cover all the error cases of my function and to see whether all branches of code execute come out correctly, with .T. for any 0 result and .F. for any error < 0.

Bye, Olaf.





 
It was a much more attentive reading of the UTF-8 article, particularly the section concerning what the reaction of a UTF-8 processor in the face of an error might be, that led me into checking the STRCONV() behavior. And once the robustness of the converter was established, it was just one step further to use it as a validator.

When we have a bad and a corrected version of a UTF-8 document, we may think of a way to look for the corrections without the risk of being fooled by a false positive.

Code:
LOCAL BadUTF8 AS String
LOCAL CorrectedUTF8 AS String

LOCAL GoodSegment AS String
LOCAL SoFarSoGood AS String
LOCAL ErrorLocation AS Integer

LOCAL ARRAY GoodSegments[1]

LOCAL BadMark AS String

m.BadMark = CAST(0hefbfbd AS Char(3))

m.BadUTF8 = FILETOSTR(GETFILE())
m.CorrectedUTF8 = STRCONV(STRCONV(m.BadUTF8, 12), 10)

ALINES(m.GoodSegments, m.CorrectedUTF8, 2, m.BadMark)

m.SoFarSoGood = ""

FOR EACH m.GoodSegment IN m.GoodSegments

	m.SoFarSoGood = m.SoFarSoGood + m.GoodSegment + m.BadMark
	
	IF !LEFT(m.BadUTF8, LEN(m.SoFarSoGood)) == m.SoFarSoGood
		m.ErrorLocation = LEN(m.SoFarSoGood) - LEN(m.BadMark) + 1
		? m.ErrorLocation
		? SUBSTR(m.BadUTF8, m.ErrorLocation, 6), CAST(SUBSTR(m.BadUTF8, m.ErrorLocation, 6) AS W)
		RETURN
	ENDIF
ENDFOR

? -1

If the routine found an error location, then its SUBSTR() can be passed to your analyzer to check for the reason of the error.
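For contrast, Python's decoder reports the error position directly in the raised exception, so no segment matching is needed there; a sketch:

```python
def first_error_location(raw: bytes) -> int:
    """1-based offset of the first invalid UTF-8 byte, or -1 if the text is valid."""
    try:
        raw.decode("utf-8")
        return -1
    except UnicodeDecodeError as err:
        return err.start + 1   # err.start is the 0-based offset of the bad byte

good = "valide".encode("utf-8")
assert first_error_location(good) == -1
assert first_error_location(good + b"\xe9" + good) == 7  # stray 0xE9 at byte 7
```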
 
By the way, I have now made an FLL function that works quite like Sys(2600, lnStringPtr+m.lnCounter, 1), but via the variable name, namely [tt]SubByte("lcString",m.lnCounter)[/tt].

It is much slower than SYS(2600). I will spare you the source code at this point, but there are several steps to get from the variable name to the byte at position m.lnCounter: getting the name table index via the variable name (1), using that index to get a Locator for the variable (2), loading the variable struct (3), getting the pointer (address) via a handle of the variable struct (4), and then finally returning the byte at some offset off that pointer (5). To make this faster there would need to be a way to reuse the address we know from step (4), as the SYS(2600) function call can. Up to now the only info I can reuse is the Locator, so I can skip steps 1 and 2 in further calls, but it's still as slow as with those steps. It seems the functions MS provides in the winapims.lib you link into an FLL don't give direct access to the VFP variables, but only to copies of them. And that of course is again as slow as working with the normal VFP SUBSTR() function.

So actually maintaining your own memory allocation via the Heap API functions is the fastest way to process a string byte by byte, it seems. It still has the overhead of copying the string once.

What I surely could do, now that I'm at the point of having an FLL project (again), is write an FLL with a ValidateUtf8 function.

Bye, Olaf.



 
It's been a long time since I delved into FLL coding, but I believe that accessing variables and parameters requires locking, because they keep moving around, so there is no guarantee about their memory address except during the locking period. So I think an address wouldn't be reusable between function calls. In that respect, a ValidateUTF8 would have to be self-contained.

 
atlopes said:
accessing variables and parameters requires locking
Yes, you remember that correctly.

You have to use _Load() to get a memory handle and finally _FreeHand() to free the handle. From the handle you get an address via _HandToPtr(), and that address was constant:
Code:
stringptr = _HandToPtr(val.ev_handle);
address = (long) &stringptr;
_RetInt(address, 10);

Sys(2600) of that address doesn't return the string, though. Maybe I'm doing something very stupid here. Also, the val.ev_handle handle itself stays constant. You have to wait long or call SYS(1104) to force garbage collection and cause a rearrangement of RAM, or of MHandle-to-address mappings. Locking just ensures these handles and addresses are left intact by garbage collection.

I could try to lock a handle and not free it within the same function, keep it stored and only finally release it. Enough experimenting for today, though.

Bye, Olaf.



 
Iterating a 500,000 (five hundred thousand) byte string on my PC, I get about:

1) 10.3 seconds with SubStr(lcString, m.lnCounter, 1)
2) 0.111 seconds with Sys(2600, m.lnAddress+m.lnCounter, 1), using a string copy on a new memory heap under my own code's control.
3) 0.198 seconds with an FLL function only getting the next byte via a string pointer, incrementing it. An initialisation function loads the var, gets the start pointer address and locks the memory handle; the function I call 500,000 times takes no parameters and makes use of the known string pointer I store in a "public" DLL variable. A separate function frees the handle only once after the full iteration of the string bytes.

I expected a factor of 2 between 2) and 3), but measure the other way around. I only get comparable timings if I return 4 bytes (a long instead of a char) with each call of the FLL function, incrementing the pointer by 4 each time. So either calling an FLL function has an overhead in comparison to calling a native VFP function, or the _RetInt() function (which returns the byte or long value) has an overhead, or both the call time and _RetInt() contribute to the bad FLL function timing.

It's also not the duration of the VFP FOR lnCounter loop itself that has a major base time; it only consumes 0.007 seconds, as measured with a loop doing nothing.

Wait a moment, I got
4) 0.046 seconds with a combination of three new FLL functions, SUBINIT(), SUBBYTE() and FREEHANDLES().

What is finally successful is to not only load the string variable and permanently lock its memory handle, but also to load the variable which is used on the VFP side to retrieve the single byte at the m.lnCounter position. That function doesn't return this byte via _RetInt(), but by setting the ev_long member of the byte variable's struct. So the interface of that FLL now is to init with both the string variable name and the byte variable name, [tt]SubInit("lcString","lnByte")[/tt], to then make parameterless calls of [tt]SubByte()[/tt], which puts the next byte into lnByte and increments the counter. For longer strings it now takes about 50% of the time of SYS(2600):

Code:
#Define cnStep 100000

Local lnLen, lnCounter, lnByte, lcString, lnT0, lnT
Create Cursor crsTiming (nLen I, nSeconds B)
Set Library To string.fll

For lnLen = cnStep To 5*cnStep Step cnStep
   lcString = Space(lnLen)
   lnByte = 0

   lnT0 = Seconds()
   SubInit("lcString","lnByte")
   For lnCounter = 1 To m.lnLen/4
      SubByte()
   EndFor
   FreeHandles()
   lnT = Seconds()-lnT0

   Insert Into crsTiming Values (lnLen/cnStep, lnT)
EndFor

But that interface is weird: you pass in the two involved variable names, and further calls iterate the chars of the first (string) variable and put the byte values into the second (numeric) variable.

In the end the most sensible way would be to fully implement the ValidateUTF8 function in C++; it's not that much to do, but I'm now satisfied my idea holds true, though only in a complicated way. You don't want to mess with locks of handles you might forget to unlock, so to make full use of it you'd need to implement more of a full-blown string object as a shadow variable to a normal VFP string variable, pointing at the VFP string's address. And in the end the permanent lock on the lnByte variable also means you don't get its value on the VFP side; it's only accessible within the FLL. So while my FLL now demonstrates the concept, it is still far from really functioning.

Bye, Olaf.
 