Cpzero, different text (codepage) in one file

Status
Not open for further replies.

sashaDG

Programmer
Jun 20, 2022
112
BY
Good afternoon. Could you help me, please? There is a file in which, after running cpzero with 1251, part of the text looks as it should and the other part is unreadable. With cpzero 866, the situation is reversed. I tried setting code page 0 and then 866 (1251), but that did not help. What else can be done with the file, other than splitting it into several files?
 
The codepage associated with a file tells VFP how to interpret the content. A codepage change does not change the byte values in the file, it just changes the interpretation.
Edit: cpzero.prg does more than that, it actually converts (or tries to).
That information is not used by VFP to convert from DBF codepage to currently in-use OS codepage for display and back to DBF codepage when saving. (That's still true, though)

The codepages you specified are 1251 (Windows Cyrillic) and 866 (DOS Cyrillic, Russian). So the content is Cyrillic in both cases, but DOS and Windows encode it differently with their differing codepages. You have to convert the part of the data that's in the DOS encoding to the Windows encoding, or vice versa.

If this data is used by both a DOS and a Windows application, this could have happened and it will continue: there is no codepage that both DOS and Windows use for Cyrillic characters. Well, there is Unicode. You could convert both codepages to UTF-8, as one Unicode encoding, but then both the Windows and the DOS application have to be reprogrammed to convert to their own codepage before display and back to UTF-8 for storage in the DBF.

So there is no simple solution to this.

There also is no marker for each record, so it's not easy to come up with an automatic conversion per record.

What to do will surely depend on whether both the DOS and the Windows application are still used. If there was a move to Windows and only that is used now, the simpler solution is to convert everything into codepage 1251; there is STRCONV() to do that. But since you don't know which records are still codepage 866 and which are 1251, the data is already botched in a way that's not easy to fix, unless it's a manageable amount of data.

Edit: If cpzero.prg was used to convert, this contributes to botching the data, so it becomes even more difficult to know how to interpret each string. If cpzero converts to codepage 1251 under the assumption that the data in the DBF is codepage 866, but the data was already converted or mixed, this makes it worse, not better.

Chriss
 
Actually, I have to reaffirm what I said earlier: cpzero.prg does not change data bytes. The only thing it changes in the DBF file is the byte at offset 29, with these two lines of code:
Code:
=FSEEK(m.fp_in,29)
=FWRITE(m.fp_in,CHR(m.cpbyte))
I'll spare myself explaining what the values mean, but I can explain why I thought, against my better knowledge, that cpzero.prg did change data bytes. I stored a string made of bytes 0-255, put together with CHR(lnI), in a DBF with a given codepage, then changed the codepage with cpzero.prg, read the field back and determined ASC() of each character, and the values had changed. I can't really explain how that could be.
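For reference, the byte at offset 29 can also be inspected without changing it. The following is a minimal read-only sketch using the same low-level file functions. The file name mytable.dbf and the function name GetCodepageMark are placeholders, and the three mark values in the DO CASE (0x65 for 866, 0xC9 for 1251, 0x03 for 1252) are taken from the VFP documentation on supported code pages, not from this thread.

```foxpro
* Sketch: read the codepage mark of a DBF without modifying the file.
* "mytable.dbf" and GetCodepageMark are hypothetical names.
LOCAL lnMark
lnMark = GetCodepageMark("mytable.dbf")
DO CASE
CASE m.lnMark < 0
   ? "File could not be opened"
CASE m.lnMark = 0x00
   ? "No codepage mark (codepage 0)"
CASE m.lnMark = 0x65
   ? "Marked as 866 (Russian MS-DOS)"
CASE m.lnMark = 0xC9
   ? "Marked as 1251 (Russian Windows)"
CASE m.lnMark = 0x03
   ? "Marked as 1252 (Windows ANSI)"
OTHERWISE
   ? "Other mark byte:", m.lnMark
ENDCASE

FUNCTION GetCodepageMark
   LPARAMETERS tcDbfFile
   LOCAL lnHandle, lnMark
   lnHandle = FOPEN(m.tcDbfFile, 0)      && 0 = read-only, buffered
   IF m.lnHandle < 0
      RETURN -1
   ENDIF
   = FSEEK(m.lnHandle, 29)               && same offset cpzero.prg writes to
   lnMark = ASC(FREAD(m.lnHandle, 1))
   = FCLOSE(m.lnHandle)
   RETURN m.lnMark
ENDFUNC
```

The file must not be open exclusively elsewhere while this runs. With the table open in VFP, CPDBF() reports the same information as a codepage number.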

Well, the good news for you about my error is that you still have the original data and did not cause a further mess by using cpzero. The bad news is that whatever mix of encodings is used in the data, there's still no easy way to figure it out, because most ANSI codepages use single-byte characters. That means you can't detect the codepage of a string by a prefix, a suffix, or by specific bytes or byte combinations used in only one encoding and not the other: every byte value 0-255 has a meaning, even if only as a control code. So your only way of finding out which codepage each single field is in is to view it interpreted in the different codepages and decide visually, manually, which data is which codepage.
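The visual comparison can be made less tedious by putting both readings next to each other. This is only a sketch under assumptions that are not from the thread: a table mytable.dbf with a character field cText (both names are placeholders), CPCURRENT() being 1251, and the DBF being marked so that VFP hands over the bytes as stored.

```foxpro
* Sketch: side-by-side preview for deciding visually which rows are DOS text.
* Assumptions (not from the thread): table mytable.dbf with a character
* field cText, CPCURRENT() = 1251, and no automatic translation on read.
SELECT RECNO() AS nRecNo, ;
       cText AS cAsIs, ;
       CPCONVERT(866, 1251, cText) AS cFrom866 ;
   FROM mytable ;
   INTO CURSOR crsPreview
BROWSE
```

Rows that read correctly in cAsIs are already 1251; rows that only read correctly in cFrom866 are the ones still in the DOS codepage. Nothing is written back to the table, and nRecNo is only there to find the row again.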

Your only help would be a categorization of data being alphanumeric by ISALPHA(). Or you find letters working in one codepage but not the other and look for the specific bytes of these characters.

Chriss
 
Chris, thank you very much for the detailed answer. My question is: how is each string associated with a code page? Or is the answer to my question in this paragraph?
"The bad news is that whatever mix of encodings is used in the data, there's still no easy way to figure it out, because most ANSI codepages use single-byte characters. That means you can't detect the codepage of a string..."

How should I understand "further mess"?
"Well, the good news for you about my error is that you still have the original data and did not cause a further mess by using cpzero."

Where can I learn about the specific bytes in strings and how to change them?
"Your only help would be a categorization of data being alphanumeric by ISALPHA(). Or you find letters working in one codepage but not the other and look for the specific bytes of these characters."

Have a nice day!
 
About the "mess":

Well, you know and see that you can't get all the data to show up correctly in any one codepage; you only have some of it right in codepage 866 (Cyrillic DOS) and other data right in codepage 1251 (Cyrillic Windows).

CPZERO can only switch to one, not both, but cpzero does not change any data bytes. So the mess of mixed data was already there before and didn't get worse through the use of cpzero. You can get into bigger trouble with string conversion if it's not reversible. And most string encoding conversions from a bigger set of characters to a smaller set are not fully reversible. Even the two Cyrillic codepages don't have a mapping of all their characters to each other; they are not just the same 256 characters in a different order, so when converting from 1251 to 866 and back, or vice versa, you can have a loss, and characters turn into '?'.
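How much is lost in a round trip can be checked directly. The sketch below converts every byte from 128 to 255 from 866 to 1251 and back, and lists the byte values that do not come back unchanged. It assumes CPCONVERT() returns one byte per input byte for these single-byte codepages; the variable names are arbitrary.

```foxpro
* Sketch: which codepage 866 bytes do not survive 866 -> 1251 -> 866?
LOCAL lnI, lcByte, lcBack, lnLost
lnLost = 0
FOR lnI = 128 TO 255
   lcByte = CHR(m.lnI)
   lcBack = CPCONVERT(1251, 866, CPCONVERT(866, 1251, m.lcByte))
   IF NOT m.lcBack == m.lcByte
      lnLost = m.lnLost + 1
      ?? m.lnI, ','
   ENDIF
ENDFOR
?
? 'bytes not surviving the round trip:', m.lnLost
```

Swapping the two codepage numbers gives the same check for the opposite direction.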

I think your goal can only be to have all data in the Windows codepage 1251. So the task now is to find out what strings are codepage 866 and to convert them to 1251.

You ideally would know which codepage each field is, but that's not stored with the string. You only have byte values, in the end.

In VFP you have STRCONV(), which I mentioned, but also, and better for this conversion, CPCONVERT(). So to convert a DOS string to Windows, that's windowsstring = CPCONVERT(866,1251,dosstring). But you can't tell whether a field is in the DOS codepage 866. This would have been the simple way if all data were codepage 866; at the time of changing from DOS to Windows, this would have worked for all data. Now it's too late for a simple conversion like that.

If you do CPCONVERT(866,1251,field) on a field that's already in the Windows codepage 1251, nothing stops it from making changes: the bytes are interpreted as being from codepage 866, and so a string that was readable in Windows becomes unreadable, even in codepage 866.

Let me give an example, looking into the Wikipedia pages for the Cyrillic Windows codepage 1251 and the Cyrillic DOS codepage 866:
Code 0xE6 is ц in DOS and ж in Windows. So in both codepages it is a readable character. You can't decide from that byte being in the field, which codepage it is.

If you have a codepage 866 string with ц in it, CPCONVERT(866,1251,string) will convert this to the Windows encoding and it will be seen as 'ц' again in Windows, with the byte code 0xF6. That's fine.

But a string already in codepage 1251 would not stay as it is. It would be interpreted as codepage 866, so a valid 'ц' in a Windows string would convert to 'Ў', and a 'ж' in a Windows string would be interpreted as a DOS 'ц' and become a Windows 'ц'. That's your dilemma. In both cases you would also only detect a problem by reading the text; technically, all interpretations are alphabetical.
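The 'ц' example can be reproduced with byte values instead of glyphs, so the result does not depend on the display font. This is a small sketch; the variable names are arbitrary, and the byte values are the ones from the two code tables.

```foxpro
* Sketch of the 'ц' example, printing byte values instead of glyphs.
LOCAL lcDosTse, lcWinTse
lcDosTse = CHR(0xE6)      && 'ц' in codepage 866 ('ж' if read as 1251)
lcWinTse = CHR(0xF6)      && 'ц' in codepage 1251 ('Ў' if read as 866)

* Correct case: the input really is codepage 866
? "866 input converted:  ", ASC(CPCONVERT(866, 1251, m.lcDosTse))

* Wrong case: the input is already 1251 but gets converted anyway
? "1251 input converted: ", ASC(CPCONVERT(866, 1251, m.lcWinTse))
```

The first value should correspond to the 0xF6 (246) mentioned above. The second shows what the same call does to a byte that was already a correct Windows 'ц'.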

What is CPCURRENT() for you? Is it 1251? If so, then I think VFP will interpret the characters of a string as if they are in codepage 1251 and categorize them with ISALPHA() and ISDIGIT() on that assumption. If you then find even a few characters in the data that are neither in these categories nor punctuation characters, you could assume it's DOS, but you can never be sure about that. You could analyze data that should mostly be Cyrillic text this way:

Code:
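* Returns .T. if tcString contains a character that is neither a letter,
* a digit, nor allowed punctuation/whitespace, i.e. it is probably a DOS string.
Lparameters tcString
tcString = Alltrim(tcString)

Local lnI, lcChar, llDOSString
llDOSString = .F.
For lnI=1 To Len(tcString)
  lcChar = Substr(tcString,lnI,1)
  Do Case
     Case Isalpha(lcChar)
     Case IsDigit(lcChar)
     Case lcChar $ [.,!;() ']+Chr(9)+Chr(13)+Chr(10)
     Otherwise
        * unexpected character: judge the string as DOS and stop
        llDOSString = .T.
        Exit
  EndCase
EndFor
Return llDOSString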

This will judge any string that does not consist only of letters, digits, and some allowed punctuation or whitespace characters (tab, CR, LF, space) as a DOS string. That's a likely conclusion, as normal text won't need other characters. If a memo contains FoxPro code written in the DOS codepage, you may need to convert it too, but then you'll also need to allow many further characters besides whitespace and punctuation.

Chriss
 
One more simple thought: you might find a very characteristic letter that's only seen in fields that are still in the DOS codepage. Then you would only need to look for that specific byte within the character and memo fields. I doubt that'll catch all data needing conversion, but it might solve 80%-90% easily.
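A variation on the "characteristic byte" idea is to count bytes by range rather than look for one letter. In codepage 866 the letters А-Я and а-п sit at 0x80-0xAF, where 1251 has mostly symbols and non-Russian letters. In 1251 the letters А-Я sit at 0xC0-0xDF (box-drawing characters in 866) and р-я at 0xF0-0xFF (Ё, ё, № and a few symbols in 866). The sketch below is a heuristic only; the function name GuessCyrillicCodepage is hypothetical, and it assumes the string holds the bytes exactly as stored in the field.

```foxpro
* Sketch: guess the Cyrillic codepage of a raw byte string from byte ranges.
* GuessCyrillicCodepage is a hypothetical name.
* Returns 866, 1251, or 0 (undecided).
FUNCTION GuessCyrillicCodepage
   LPARAMETERS tcString
   LOCAL lnI, lnByte, lnDos, lnWin
   STORE 0 TO lnDos, lnWin
   FOR lnI = 1 TO LEN(m.tcString)
      lnByte = ASC(SUBSTR(m.tcString, m.lnI, 1))
      DO CASE
      CASE BETWEEN(m.lnByte, 0x80, 0xAF)
         * 866: letters; 1251: mostly symbols
         lnDos = m.lnDos + 1
      CASE BETWEEN(m.lnByte, 0xC0, 0xDF)
         * 1251: upper-case letters; 866: box drawing
         lnWin = m.lnWin + 1
      CASE BETWEEN(m.lnByte, 0xF0, 0xFF)
         * 1251: letters; 866: only a few letters and symbols
         lnWin = m.lnWin + 1
      ENDCASE
      * 0xE0-0xEF is a letter in both codepages and is ignored
   ENDFOR
   RETURN ICASE(m.lnDos > m.lnWin, 866, m.lnWin > m.lnDos, 1251, 0)
ENDFUNC
```

A DOS text containing box-drawing characters or many ё characters will shift the count toward 1251, so results for short strings should still be checked by eye.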

You also definitely need to set the codepage of the DBF to 1251 to work in Windows, so one more run of cpzero.prg with codepage 1251.

And, well, as already said, if the data is shared and used by both a DOS and a Windows application, you won't find a compromise: you can't have data in both codepages in the same field. So either you double each field and store the text in both codepages, no matter what codepage you set for the DBF file, or you use something like UTF-8, but then you definitely need to convert to either the DOS or the Windows codepage, or use ActiveX controls able to work on UTF-8 strings. I doubt that's a solution for the DOS part, if that's still used. In any case, it's much work to get data working for both DOS and Windows, and the simplest and cheapest solution is to use only either the DOS or the Windows version.

Chriss
 
Chris thanks for the detailed answer and help, I'll try later. Learned something new for myself.
It is on SUCH people that the whole foxpro community rests!
 
One more idea for the function: it could do the categorization of characters on the passed-in string and on the results of CPCONVERT(866, 1251, tcString) and of CPCONVERT(1251, 866, tcString), and choose the best option, for example the one with the fewest characters causing the OTHERWISE case of the DO CASE statement.
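A minimal sketch of that idea follows. It compares only two of the candidates, the string as it is and the string converted from 866 to 1251, on the assumption that CPCURRENT() is 1251 and 1251 is the target encoding. It also relies on ISALPHA() classifying Cyrillic letters the way the earlier function assumes. BestReading and CountOddChars are hypothetical names.

```foxpro
* Sketch: choose the reading with the fewest unexpected characters.
* Assumptions: CPCURRENT() is 1251 and tcString holds the bytes as stored.
* BestReading and CountOddChars are hypothetical names.
FUNCTION BestReading
   LPARAMETERS tcString
   LOCAL lcAsIs, lcFrom866
   lcAsIs    = ALLTRIM(m.tcString)
   lcFrom866 = CPCONVERT(866, 1251, m.lcAsIs)
   * On a tie the string is left as it is
   RETURN IIF(CountOddChars(m.lcFrom866) < CountOddChars(m.lcAsIs), ;
      m.lcFrom866, m.lcAsIs)
ENDFUNC

FUNCTION CountOddChars
   LPARAMETERS tcString
   LOCAL lnI, lcChar, lnOdd
   lnOdd = 0
   FOR lnI = 1 TO LEN(m.tcString)
      lcChar = SUBSTR(m.tcString, m.lnI, 1)
      * same categories as the DO CASE in the earlier function
      IF NOT (ISALPHA(m.lcChar) OR ISDIGIT(m.lcChar) ;
            OR m.lcChar $ [.,!;() '] + CHR(9) + CHR(13) + CHR(10))
         lnOdd = m.lnOdd + 1
      ENDIF
   ENDFOR
   RETURN m.lnOdd
ENDFUNC
```

Counting instead of exiting at the first odd character makes the two readings comparable, at the cost of always scanning the whole string.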

Chriss
 
Thanks for the extra tips. The main work is with files that are in 866, so I check for code page 0 and 1251 and use the cpzero function. This is how they did it before; I understand that it is not correct, but there were no comments about the encoding either. As a result, for each file I try the code pages until the result satisfies me. I will keep your advice in mind, thanks!
 
sashaDG said:
They did this before

I don't know who "they" are and assume you mean they changed DBF codepages.

I wanted to understand what I saw earlier and indeed cpzero - as can be seen by the code I posted - does not change data bytes, but a DBF with a codepage differing from CPCURRENT() makes VFP try to convert, both when reading from such a DBF and when writing into it.

My experiment code shows what's happening. It writes bytes 0-255 into two fields: cLow with bytes 0-127 and cHigh with bytes 128-255. For that purpose it creates a DBF in the CPCURRENT() OS codepage. Then it sets the codepage to something else with cpzero (I used 866), reads back cLow and cHigh, and displays all changes of byte values it detects. Though we know the data bytes stay untouched, such changes are detected, which turns out to come from VFP converting when reading. Then the experiment changes the DBF codepage back to CPCURRENT() to prove the data bytes actually didn't change.

I also add a 2nd record while the DBF is marked as codepage 866, though the OS codepage is a Windows codepage (1252 in my case, maybe 1251 in your case; what matters is that it differs from 866). Immediately after writing these bytes and reading them back, you get transcoded byte values. This 2nd record also keeps the same transcoded values after setting the codepage back to CPCURRENT(). So you can see that writing into a DBF marked with a codepage differing from the OS codepage makes VFP transcode the string into the DBF codepage.

A further conclusion you can draw from that is that VFP handles anything in a string variable as being encoded in CPCURRENT(), as reading a string from a DBF with another codepage has already made the necessary conversion (if it was convertible).
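Whether this automatic translation applies to a given table can be checked by comparing CPDBF() with CPCURRENT(). A minimal sketch, with mytable.dbf as a placeholder file name:

```foxpro
* Sketch: will VFP translate text when reading from or writing to this table?
* "mytable.dbf" is a hypothetical file name.
USE mytable.dbf SHARED
DO CASE
CASE CPDBF() = 0
   ? "No codepage mark on this DBF"
CASE CPDBF() = CPCURRENT()
   ? "DBF and OS codepage match: " + TRANSFORM(CPDBF())
OTHERWISE
   ? "DBF is marked " + TRANSFORM(CPDBF()) ;
      + ", OS codepage is " + TRANSFORM(CPCURRENT()) ;
      + ": expect translation on read and on write"
ENDCASE
USE
```

Only the last case corresponds to the situation described above, where byte values read from the table differ from the bytes stored in the file.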

Overall, even if someone were using a DBF that initially was in a Windows ANSI codepage, I assume the FP DOS versions would do the same and transcode the Windows ANSI text to DOS text, which should be sufficient. If they changed the codepage mark to the DOS codepage, then a DOS application would read the text from the DBF 1:1, and that text is in Windows ANSI encoding, which would look wrong in DOS. Also, when writing DOS strings, they don't get transcoded to Windows ANSI, and you could end up with a data mess that doesn't work in either codepage.

It could be that FP DOS software wasn't aware of codepages; the FREE table format, which you can use in DOS too, is not exactly the legacy DBF format of FP 2.6, though. Anyway, as "they" made a codepage change, I do assume a DOS application plays a role, or maybe a developer using FPDOS. Maybe "they" also didn't just change the codepage mark of the DBF file but also converted the data inside, which would be better in case they want to use these DBFs in DOS permanently.

The essence of all this is: even if you get a DBF to display text correctly just by changing its codepage, if you need to mark the DBF with a DOS codepage to read the data in Windows, then VFP will always convert from the DBF's DOS codepage to the Windows codepage when reading, and also convert Windows strings to DOS strings when writing to such a DBF. I guess that makes a codepage 866 table usable for both DOS and Windows, as long as you restrict yourself to the convertible letters that have a byte code in both the Windows and the DOS codepage.

I would always regard it as the best situation if the DBF codepage and the OS codepage match, even though such a transcoding mechanism makes a DBF compatible, as far as it can, with both Windows and DOS. I think that only works if the DBF codepage is the DOS codepage, based on the assumption that legacy FP wouldn't do these transcodings; why else would they have changed the codepage, if DOS software worked the same way, just the other way around? I guess DOS software should be served data in its own DOS codepage and not need transcoding, while Windows VFP can live with DOS-codepage DBFs. You'll just always be limited to characters that exist in both DOS and Windows. Some typical characters for drawing table cells in DOS don't exist in Windows, and some others are not convertible either.

It now also makes more sense, as that means you can actually share data in Windows and DOS as far as that's possible.

Chriss
 
I forgot to post my experiment code, if you're interested at all:

Code:
Clear

Local lcLow, lcHigh, lcDbf, lnI
lcLow = ''
lcHigh = ''

Cd GetEnv("TEMP")

Create Table currentOScp.dbf free Codepage=Cpcurrent() (cLow C(128), cHigh C(128))
lcDbf = Dbf()
For lnI = 0 to 127
    lcLow = lcLow +Chr(lnI)
    lcHigh = lcHigh + Chr(lnI+128)
EndFor lnI
Insert Into currentOScp values (lcLow, lcHigh)

? 'check cpcurrent data:'
CheckData()
Use
Do (Home()+"Tools\cpzero\cpzero.prg") with lcDBF, 866
Use (lcDBF)
?
? 'check cp866 data:'
CheckData()

* insert 2nd. record, while DBF is marked as cp 866:
Insert Into currentOScp values (lcLow, lcHigh)
? 'check cp866 data, new row:'
CheckData()

Use
Do (Home()+"Tools\cpzero\cpzero.prg") with lcDBF, Cpcurrent()
Use (lcDBF)
?
? 'check cpcurrent data, row1:'
CheckData()
Go 2
? 'check cpcurrent data, row2:'
CheckData()

Procedure CheckData()
* requires current workarea having cLow, cHigh fields c(128)
Local lnI
For lnI = 0 to 127
   If Asc(Substr(cLow,lnI+1,1))<>lnI
      ?? lnI,' becomes', Asc(Substr(cLow,lnI+1,1)), ','
   Endif
   If Asc(Substr(cHigh,lnI+1,1))<> lnI+128 
      ?? lnI+128,' becomes', Asc(Substr(cHigh,lnI+1,1)), ','
   Endif
EndFor lnI


Chriss
 