Use GPT-4 for translating text

Status
Not open for further replies.

Gerrit Broekhuis

Programmer
Aug 16, 2004
316
NL
Hi,

I use GPT-4 (omni) with VFP and OpenAI’s API to translate a text file from one language to another. So far this seems to work very well, but I read in another thread that translations may have some issues (without mentioning them).

Not knowing what to expect I’m very curious what unknown problems I might run into. So if you have personal experience with this subject, I would like to know what issues you ran into (and hints on how to solve them).

Regards, Gerrit
 
I've experienced that ChatGPT sometimes translates more "freely" than other engines, so if you want an exact translation, I suppose you have to be clear about this in the prompt to avoid problems.

You can compare results from a prompt like

"Translate the following French text into English:"

with

"Translate the following French text into English and ensure that the translation remains as close to the original as possible. Maintain the tone, structure, and choice of words."

and see the difference.

Sometimes the simple instruction leads to a better result; sometimes it slightly changes the meaning. The second prompt is too restrictive, so I guess the best result is somewhere in between.

Regards,
Manni

 
ManniB,

A couple of things here:

(1) Gerrit is talking about using the OpenAI API from VFP code for language translation "completion". This is not ChatGPT - they are two different animals. For example, with the API you can translate a few thousand words of Spanish > Thai > Korean > Japanese > Hindi > English > and then back to Spanish. ChatGPT cannot do this. So I think it's a good idea to get on the same page for this thread.

(2) The prompt you mentioned should not include "Translate the following French text into English and ensure that the translation remains as close to the original as possible. Maintain the tone, structure, and choice of words." Using the API, this is handled with the temperature and top_p parameters in the JSON request. They are defined as:

Temperature controls the "creativity" of the generated text, between 0 and 2. A higher temperature will result in more diverse and unexpected responses. A very high temperature can result in hallucinations. Lower temperatures will result in more deterministic responses. The default value for temperature is 1.0, but you can experiment with different values to see what works best for your use case.

Top_p - An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. 1 is the default value. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

Here is an example of a JSON request for language translation:

Code:
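* System message: the translation instruction
lcDirective = "Translate the following from English to Chinese (Traditional)"
* User message: the text to translate, read from disk as raw bytes
lcLangText = FILETOSTR("SomeTextFile")
* Note: lcLangText is merged into the JSON as-is, so any double quotes,
* backslashes or line breaks in it must be JSON-escaped first.

TEXT TO lcRequest TEXTMERGE NOSHOW
{"model": "gpt-4o",
 "messages": [{"role": "system", "content": "<<lcDirective>>"},
              {"role": "user", "content": "<<lcLangText>>"}],
 "temperature": 0.4,
 "top_p": 0.8}
ENDTEXT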
 
Gerrit,

How are you rendering translated text? You know that UTF-8 is required and that VFP does not natively support UTF-8. For example, you can capture a translation to a file and view it in Notepad, which does support UTF-8. See below for an English to Hindi translation:

[Screenshot: English to Hindi translation viewed in Notepad (Capture6_ykmzr3.png)]


But how can you translate the Hindi file to another language? You cannot use the API to send a file - you have to send a string. And you cannot FILETOSTR the Hindi file, because you get garbage - VFP does not support UTF-8.

The Latin-based languages are less problematic (e.g. French, German, Italian), but they do contain special characters which are rendered as question marks unless UTF-8 is supported.
 
Hi vernpace,

thank you for your explanation. I didn't know the API has different functions for translation.

Regarding UTF-8, isn't it possible to use VFP's STRCONV() function like

cUtf8Content = FILETOSTR("hindi.txt")
cDbcsContent = STRCONV(cUtf8Content, 11)

?

Regards,
Manni

 
Hi Manni,

I haven’t been able to try this yet, but I found a few possible other solutions online too for creating UTF-8 compatible files.

I guess that when you use FILETOSTR() with a UTF-8 compatible file and then STRTOFILE(), you should get a (new) file with the original content. That’s probably the easiest way to check if you’re on the right track.

OpenAI can handle quite a few languages. How can we determine for each language whether a conversion step is required when VFP uses the text content? Does someone have a list or online source with this information?

Regards, Gerrit
 
Manni, Vernpace,

Just tried the above scenario.

If I use STRCONV(lcString,9) I get the UTF-8 string; using STRCONV(lcString,11) on the UTF-8 string I get back the original string.
So after double conversions I get a 100% copy (according to
In this use case I don't need to show the translated (or original) text in a VFP form. Perhaps that makes things easier for me.
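
A minimal sketch of the round trip described above, assuming the source file is plain ANSI text in the current code page. The file names are hypothetical.

```foxpro
* Round trip: ANSI file -> UTF-8 -> back to ANSI, then compare.
* "source_ansi.txt" and "source_utf8.txt" are hypothetical file names.
LOCAL lcAnsi, lcUtf8, lcBack
lcAnsi = FILETOSTR("source_ansi.txt")   && raw bytes, treated as ANSI by VFP
lcUtf8 = STRCONV(lcAnsi, 9)             && 9 = DBCS/ANSI to UTF-8
STRTOFILE(lcUtf8, "source_utf8.txt")    && written without a byte order mark
lcBack = STRCONV(lcUtf8, 11)            && 11 = UTF-8 to DBCS/ANSI
? lcBack == lcAnsi                      && .T. means nothing was lost on the way
```

Starting from ANSI, the round trip should be lossless. Starting from UTF-8 text that contains characters outside the current code page (Hindi, for example), the conversion with 11 cannot preserve them, so the same check would fail.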

Regards, Gerrit
 
Another thing very cool with OpenAI language translation is that it can be used to correct grammar and spelling - just translate a language to itself:

"Translate the following text from English to English"

This can be done with any language.
 
vernspace said:
you cannot FILETOSTR the Hindi file, because you get garbage - VFP does not support UTF-8

That's a common misconception about what constitutes the problem and what doesn't. While VFP itself will always interpret the bytes you load from a file as ANSI in its currently configured codepage, the bytes are the exact bytes of a UTF-8 string, and you can forward them in a request URL or body without the need to convert them with STRCONV.

So VFP is only incapable of displaying UTF-8. STRCONV can convert a UTF-8 string into ANSI only as far as its characters are also available within the limited 256-character codepage, but that does not mean you can't forward UTF-8 to an API endpoint. The FILETOSTR() function does not convert the file content to ANSI; it reads the bytes 1:1. You only need to convert with STRCONV yourself if you want the ANSI representation, as good as ANSI can represent the given UTF-8.

What can hinder the 1:1 transport of bytes is the usage of COM objects and the automatic conversions involved in using COM, but how that is handled is under your control, too, and in usual cases you don't need to care about that. One commonly known way to prevent any autoconversion when passing from VFP to a COM object is to use binstring=CREATEBINARY(vfpstring). The resulting binstring variable has the same bytes, but when passing it as a parameter to a COM object or setting a COM object property to it, automatic conversions from ANSI to the encoding associated with COM are not done.

Also see
CREATEBINARY() is not always necessary, but it can't do harm, as it does nothing to the actual bytes. Only the display changes to a hex string, and only within VFP; the length and the byte composition of a binary string are the same as the original.

So the only problem you have in VFP is entering a UTF-8 string in order to forward it to an API. If you have a UTF-8 string, that's good to go, even though VFP won't display it correctly.

So the major problem VFP has is displaying other encodings, not reading, storing or forwarding them.
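
A small sketch of that point, assuming a UTF-8 file with the hypothetical name "hindi.txt" exists in the current directory. It only shows that the bytes survive reading, writing and CREATEBINARY() unchanged; it does not display the text.

```foxpro
* "hindi.txt" and "hindi_copy.txt" are hypothetical file names.
LOCAL lcBytes, lqBytes
lcBytes = FILETOSTR("hindi.txt")          && bytes are read 1:1, no conversion
STRTOFILE(lcBytes, "hindi_copy.txt")      && written back 1:1 as well
? FILETOSTR("hindi_copy.txt") == lcBytes  && the copy is byte-identical
lqBytes = CREATEBINARY(lcBytes)           && Varbinary value holding the same bytes
? LEN(lqBytes) = LEN(lcBytes)             && same length, only the data type differs
```

If the file was saved with a byte order mark, those three leading bytes are part of lcBytes too and would be sent along with the text.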

Chriss
 
Hi Chris,

Yes, that's how it works. Thanks for your additional euro. And that is what I tested too in the scenario mentioned earlier.
I don't use COM and I don't think I need CREATEBINARY() for my application's purpose.

Regards, Gerrit
 
You are using COM when you use any of the usual objects VFP needs to make HTTP requests. COM doesn't just mean having OLEPUBLIC classes in your VFP code; that's creating COM servers. Consuming, i.e. using, COM objects also involves COM and adds a barrier where automatic re-encoding can happen; see SYS(987) and SYS(3101), for example.

To mention another misconception: ANSI is not part of UTF-8. Only the lower 128 characters (bytes 0 to 127) are usually in common with ASCII and then also in common with the first 128 code points of UTF-8, which are encoded as single bytes. That also means many characters of the ANSI codepage you use, including things like accented letters, may need STRCONV conversion to UTF-8 before sending them out. You can - and should - make that transcoding from ANSI to UTF-8 the last step before sending a request string or body.
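
To make the byte difference concrete, here is a short sketch. It assumes the current code page is 1252; on other code pages CHR(233) is a different character.

```foxpro
* Assumes CPCURRENT() returns 1252 (Western European ANSI).
LOCAL lcAnsi, lcUtf8
lcAnsi = CHR(233)                 && e with acute accent: one byte in code page 1252
lcUtf8 = STRCONV(lcAnsi, 9)       && the same character encoded as UTF-8
? LEN(lcAnsi), LEN(lcUtf8)        && 1 byte versus 2 bytes
? ASC(LEFT(lcUtf8, 1)), ASC(RIGHT(lcUtf8, 1))   && 195 and 169 (hex C3 A9)
? STRCONV("abc", 9) == "abc"      && plain ASCII is unchanged by the conversion
```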

Chriss
 
Gerrit,

Here are a few tips which may help:

(1) It is highly recommended that you use the temperature and top_p parameters in the JSON request as shown in the above post to Manni. Otherwise, both will default to 1, which can result in bizarre or incomplete translations.

It's also important to note that the accuracy of a translation depends heavily on the amount of training data OpenAI has on a given language. The more that a language is spoken worldwide, the more training data.

(2) You should make a COM code page translation before you send the HTTP request:

Code:
CASE toParm.Type = 10
     lcOpenAIKey = toParm.OpenAIKey
     lcAIRequest = toParm.AIRequest

     liCodePage = SYS(3101)
     SYS(3101,65001)

     lcReturn = This.GetOpenAIResponseText(lcOpenAIkey, lcAIRequest)

     IF LEFT(lcReturn, 6) == "Error:"
        lvReturn = lcReturn
     ELSE
        loOpenAI = _Screen.Json.Parse(lcReturn)
        lvReturn = loOpenAI.choices[1].message.content
     ENDIF

     SYS(3101,liCodePage)

(3) You should also set timeouts on the HTTP request: the fourth one (receiveTimeout) needs to be bumped up because a translation could take a long time on the server and may time out, which will result in a partial translation.

Code:
PROTECTED PROCEDURE GetOpenAIResponseText(tcApiKey AS String, tcRequest AS String) AS String
   LOCAL lcURL, lcResponse, loHTTP

   lcURL  = "https://api.openai.com/v1/chat/completions"

   loHTTP = CREATEOBJECT("MSXML2.ServerXMLHTTP.6.0")
   loHTTP.setTimeouts(30000, 60000, 30000, 600000)
   loHTTP.Open("POST", lcURL, .F.)
   loHTTP.setRequestHeader("Content-Type", "application/json")
   loHTTP.setRequestHeader("Authorization", "Bearer " + tcApiKey)
   loHTTP.Send(tcRequest)

   IF loHTTP.Status = 200
      lcResponse = loHTTP.responseText
   ELSE
      lcResponse = "Error: " + TRANSFORM(loHTTP.Status)
   ENDIF

   RETURN lcResponse

ENDPROC

(4) Finally, you should consider using an EXE COM Server (if you have not already) for the above tips.
 
Vernspace,

why (4)?

Makes me wonder: Why didn't you incorporate your tip (2) into your GetOpenAIResponseText procedure?
This procedure is very specific to the one OpenAI endpoint you have in your code that needs UTF-8 input, so this is well known within this procedure and therefore should be done within it, shouldn't it? Think about the OOP encapsulation principle.

I can give a reason why not: You can only use this SYS(3101,65001) COM transcoding of strings when you know the source strings are ANSI; transcoding UTF-8 into UTF-8 would cause wrong strings, for example. It's not neutral, as you might think, because SYS(3101,targetcodepage) will always assume the source bytes are ANSI bytes in codepage CPCURRENT(), so a UTF-8 double or triple byte character would be interpreted as 2 or more ANSI characters.

So, what you need is knowing the encoding of tcRequest. At least whether tcRequest is already UTF8 or not. You could define it this way, with an optional tlRequestisUTF8 parameter:
Code:
PROTECTED PROCEDURE GetOpenAIResponseText(tcApiKey AS String, tcRequest AS String, tlRequestisUTF8 AS Logical) AS String
If .t. is passed in you'd take it for granted tcRequest is UTF8, if not you could cause the transcoding with SYS(3101,65001).

You could make it one step more precise by instead using a tiRequestCodepage parameter to specify the codepage of tcRequest.
Code:
PROTECTED PROCEDURE GetOpenAIResponseText(tcApiKey AS String, tcRequest AS String, tiRequestCodepage As Integer) AS String
SYS(3101) will always assume CPCURRENT() as the encoding of the string, and if tiRequestCodepage differs from that, SYS(3101) won't help before first encoding to CPCURRENT(). On the other hand, if tiRequestCodepage is not already 65001, it very likely is CPCURRENT(), so I'd perhaps go with the tlRequestisUTF8 variant.

Chriss
 
Hi,

I will have to read more about the codepage settings probably (and do some testing as well). For my initial testing I do get proper results with English and Dutch languages, so the API is working.

I do indeed have a COM EXE. I just added the code for translations. It's working fine and I apologize for any confusion I may have created. And yes, I do have (large) timeouts.

I get unwanted text in the AI's response: "Certainly! Here is the translated text:". What I'm asking is this: "Translate the following textfile from Dutch into English.". How do I tweak the question to get rid of the added text that was not in the original text? The rest of the response is quite OK.

Regards, Gerrit
 
English and Dutch mainly use the 26 Latin letters, which have the exact same code points (and bytes) in UTF-8, so that's what's saving you. In general, though, that's not the case: e.g. á é í ó ú à è ë ï ö ü are single-byte characters in codepage 1252 but require 2 bytes in UTF-8.

It would mainly render the text you want to translate wrong, but AI might fix that flaw by having seen wrong codepage letters in the data it was trained with. You're not the first and will not be the last one not thinking about codepages when programming. So ChatGPT-4 might make that adjustment when it interprets the original text with erratic characters.

About your last requirement: Well, GPT is not just a translation service but a chatbot, so you get a bit of small talk and politeness. You've already been given advice from vernspace to use parameters like temperature and top_p instead of giving the prompt "Translate the following French text into English and ensure that the translation remains as close to the original as possible. Maintain the tone, structure, and choice of words." I don't know if there are other parameters that would remove the chattiness of ChatGPT, or if there is a better AI model focused exactly on text translation only, without bells and whistles. I don't think so. But if there are no special parameters for that, a prompt including the instruction to respond only with the translated text and nothing else might work. If you watched any two or three videos about ChatGPT, you know that it would respond to such a request with "Yes, I will do that" and thereby already go back to being chatty; it's developed that way.
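
As an illustration of such an instruction, here is a hypothetical directive for the "system" message used in the JSON request earlier in this thread. The wording is an assumption, and there is no guarantee the model always obeys it.

```foxpro
* Hypothetical wording; the model may still add a preamble now and then.
lcDirective = "Translate the following text from Dutch into English. " + ;
              "Respond with the translated text only, " + ;
              "without any introduction, comment or explanation."
```

If a preamble still shows up occasionally, checking the first line of the response before storing it is a simple safety net.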

Chriss
 
Chris,

Why (4)? Have you ever seen the nasty "Visual Foxpro not responding" titlebar message when long processes are running? We use this COM EXE for many, many things. Did you notice the CASE statement? We have other CASEs where GetOpenAIResponseText is called (e.g. OpenAI Text Generation) where COM code page translation is not required.

 
Gerrit,

You can get away with not using COM code page translation with some Latin-based languages (e.g. Dutch), but what Chris was saying is correct:

"It would mainly render the text you want to translate wrong, but AI might fix that flaw by having seen wrong codepage letters in the data it was trained with. You're not the first and will not be the last one not thinking about codepages when programming. So ChatGPT-4 might make that adjustment when it interprets the original text with erratic characters."

This is exactly what gpt-4o does. I know this because we came across this in testing.

For laughs and giggles, try translating a non-Latin-based language and see what happens.

The temperature and top_p parameters are essential - it's no big deal to implement them.

 
vernspace said:
Have you ever seen the nasty "Visual Foxpro not responding" titlebar message

I know this, but what do you really suggest to avoid that? Just compiling a procedure as a separate EXE doesn't remove the wait state.

I would use async requests so the response can be processed later. That doesn't require your VFP code to be compiled as an OLEPUBLIC class, nor does using VFP code as a COM server by itself prevent this message.

So what exactly are you suggesting? Compiling as an EXE is quite a normal thing to do with a VFP application. That alone doesn't make it a COM server, nor does it prevent the responsiveness problems, which is what they actually are.
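
One possible reading of "async requests" is sketched below. It is a simple polling loop, not necessarily the approach from the threads referenced in this discussion. The variables lcURL, lcApiKey and lcRequest are assumptions, set up as in the synchronous procedure earlier in the thread, and error handling and an overall time limit are left out.

```foxpro
* Sketch only: lcURL, lcApiKey and lcRequest are assumed to exist already.
LOCAL loHTTP, lcResponse
loHTTP = CREATEOBJECT("MSXML2.ServerXMLHTTP.6.0")
loHTTP.Open("POST", lcURL, .T.)            && third parameter .T. = asynchronous
loHTTP.setRequestHeader("Content-Type", "application/json")
loHTTP.setRequestHeader("Authorization", "Bearer " + lcApiKey)
loHTTP.Send(lcRequest)                     && returns immediately

DO WHILE loHTTP.readyState <> 4            && 4 = request completed
   loHTTP.waitForResponse(1)               && wait up to one second
   DOEVENTS                                && let VFP process pending window events
ENDDO

lcResponse = IIF(loHTTP.Status = 200, loHTTP.responseText, ;
                 "Error: " + TRANSFORM(loHTTP.Status))
```

Because the loop yields to DOEVENTS about once a second, the application should keep processing window messages while the server works on the translation.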

Chriss
 
I gave an explainer for how async requests work one way or the other in thread184-1827183, by the way. You should also know thread184-1820019, as you responded in that thread.

Chriss
 
Chris,

Not buying it. This works well for us, so that's it. Period.
 