Use GPT-4 for translating text 1

Gerrit Broekhuis · May 28, 2024

Hi,

I use GPT-4 (omni) with VFP and OPAI’s API to translate a textfile from one language to another. So far this seems to work very good, but I read in another thread that translations may have some issues (without mentioning them).

Not knowing what to expect I’m very curious what unknown problems I might run into. So if you have personal experience with this subject, I would like to know what issues you ran into (and hints on how to solve them).

Regards, Gerrit

Chris Miller · May 31, 2024

I assume you do something thaqt works, that's not the problem. But you're not giving a concrete solution - on the topic of asynchropnous execution.

Chriss

vernpace · May 31, 2024

Let me be clear: We do not want asynchropnous processing for this - Do you understand?

I take offense by you saying we are not providing a concrete solution. Who do you think you are?

Chris Miller · May 31, 2024

vernspace,

I do agree with your points (1)-(3).

vernspace said:
(4) Finally, you should consider using an EXE COM Server (if you have not already) for the above tips.

I questioned this and asked how concretely you mean this, but you didn't answer that by now, despite saying this works for you.

Well, what concretely works? Running a secondary EXE is not answering that. You'd still want to provide requests and get responses, which in itself requires more than just telling "put this into a separate EXE".

I gave that advice, I was still interested in how cocnretely you do this, in the interest of Gerrit and any future reader. But if you don't want to answer that, okay. I already have my own solutions.

Chriss

Gerrit Broekhuis · May 31, 2024

Guys,

Well, I'm still working on my "own" solution. I can only say I very much appreciate your help and I'm sure others will benefit from this too, creating their solutions.

I'm still wondering about the UTF-8 conversion. I tried conversions from Hindi and Chinese to Dutch. The source text was read from a txt file.
With UF8-8 conversion I do not get a good translation: "It appears that the text you provided is garbled and contains encoding issues, making it difficult to accurately translate. The text seems to be in a corrupted format, which often happens when text is not properly encoded or decoded.".

Without the UTF-8 conversion the translation is working.

So what would be the right path and working order to convert the txt source file into a proper JSON message?

TEXT --> JSON ?
TEXT --> CODEPAGE --> JSON ?
TEXT --> CODEPAGE --> UTF-8 --> JSON ?
TEXT --> ???

And will the steps be the same for every language in OPENAI's language list?

Regards, Gerrit

vernpace · May 31, 2024

Chris,

Congratulations!! You have just drove me out of this forum for good with you blah, blah, blah.

I blame myself for having come here in the first place.

Chris Miller · Jun 1, 2024

vernspace,

I don't really know what makes you so agressive, what can I do to calm your anger?

Gerrit,

tried conversions from Hindi and Chinese to Dutch. The source text was read from a txt file....Without the UTF-8 conversion the translation is working.

Well, then this was already UTF-8 and as I said, converting UTF-8 to UTF-8 interprets each single source byte as a character, which is garbling things. So you can only convert ANSI to UTF-8, not UTF-8 to UTF-8,

If a part of your query, the text you want to translate, already is in UTF8, there's maybe nothing else in the whole JSON string you need to convert to UTF8, as it's all English latin letters and some characters like punctuation, which have the exact same encoding in ANSI/ASCII and UTF8, because the fisrt 128 characters in UTF8 are also just single byte characters, mopst of them identical to Asni codepage 1252.

In general, you can always take anything coming from your source code, i.e. strings of code or user input as ANSI encoded strings, anything that comes from a file will depend, files do not at all need to provide the information about what encoding they are in, a Hindi or Chinese text could be UTF-8 or other Unicode encodings (UCS16, for example), in case they are UTF-8 you would just need to embed them as you already do, in other cases you'd need to apply conversions.

Chriss

Gerrit Broekhuis · Jun 1, 2024

Hi Chris,

I think Olaf's function ValidateUtf8() from

https://www.tek-tips.com/faqs.cfm?&rat1=10&fid=7900

could be used here. When testing it prevented double encoding to UTF8.

In my testing I used

Code:

LOCAL lcUTF8String 
lcUTF8String = IIf(ValidateUtf8(lcString)=0,lcString,StrConv(lcString,9))

lcString is the variabele filled with FILETOSTR().

It's not a cure for all, but I think I will add it to my library.

Regard, Gerrit

Gerrit Broekhuis · Jun 1, 2024

Another option was presented by António on

https://www.tek-tips.com/viewthread.cfm?qid=1774266.

Code:

FUNCTION ValidateUTF8x (UTF8 AS String) AS Boolean
	RETURN STRCONV(STRCONV(m.UTF8,12),10) == m.UTF8
ENDFUNC

Regard, Gerrit

Chris Miller · Jun 2, 2024

Okay, that's converting from UTF8 to Unicode and back to UTF8, which would only work out fine, if the original string is UTF8 and doesn't contain a bad by<te combination, whcih an ANSI text could have with a character that's not part of the 128 common single byte characters, mainly.

What you don't get from this when it's .F. is at what position the string doesn't comply with the UTF8 encoding scchema, i.e. what is wrong with the string.

But in the thread you point out, atlopes posted a better validation routine that gets to the bad position. In its core it's still using the SSTRCONV() function, too:

Code:

LOCAL BadUTF8 AS String
LOCAL CorrectedUTF8 AS String

LOCAL GoodSegment AS String
LOCAL SoFarSoGood AS String
LOCAL ErrorLocation AS Integer

LOCAL ARRAY GoodSegments[1]

LOCAL BadMark AS String

m.BadMark = CAST(0hefbfbd AS Char(3))

m.BadUTF8 = FILETOSTR(GETFILE())
m.CorrectedUTF8 = STRCONV(STRCONV(m.BadUTF8, 12), 10)

ALINES(m.GoodSegments, m.CorrectedUTF8, 2, m.BadMark)

m.SoFarSoGood = ""

FOR EACH m.GoodSegment IN m.GoodSegments

	m.SoFarSoGood = m.SoFarSoGood + m.GoodSegment + m.BadMark
	
	IF !LEFT(m.BadUTF8, LEN(m.SoFarSoGood)) == m.SoFarSoGood
		m.ErrorLocation = LEN(m.SoFarSoGood) - LEN(m.BadMark) + 1
		? m.ErrorLocation
		? SUBSTR(m.BadUTF8, m.ErrorLocation, 6), CAST(SUBSTR(m.BadUTF8, m.ErrorLocation, 6) AS W)
		RETURN
	ENDIF
ENDFOR

? -1

Chriss

Gerrit Broekhuis · Jun 2, 2024

Hi Chris,

Yes, I saw that too. On the other hand, we’re automating translation. So if there is an error in the original posting I accept an error when translating. Perhaps I cannot read the original file, so I couldn’t correct the initial error anyway.

I prefer to try to automate the translation. If there is an error, I don’t want a translation with errors.
But that’s just my use case, I cannot decide for others.

Regards, Gerrit

Chris Miller · Jun 2, 2024

Okay, but if an error is not in the section of the text to translate, but in another portion of the request, yoiu could still record the error and continue with execution anyway.

Chriss

Gerrit Broekhuis · Jun 27, 2024

Hi Kristy,

Yes, it’s intriguing! I have added this to our business software suite and users can now choose between Desktop OCR (using Tesseract) and AI OCR with Gpt4-o.

I can confirm it’s working, but of course our Desktop OCR is working too. It’s sometimes hard to notice the difference between the two. With non-western texts AI OCR is the winner.

Regards, Gerrit

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Use GPT-4 for translating text 1

Gerrit Broekhuis

Programmer

Chris Miller

Programmer

vernpace

Programmer

Chris Miller

Programmer

Gerrit Broekhuis

Programmer

vernpace

Programmer

Chris Miller

Programmer

Gerrit Broekhuis

Programmer

Gerrit Broekhuis

Programmer

Chris Miller

Programmer

Gerrit Broekhuis

Programmer

Chris Miller

Programmer

Gerrit Broekhuis

Programmer

Similar threads

Part and Inventory Search

Sponsor