GPT-4 Turbo with Vision for OCR

Gerrit Broekhuis · Apr 25, 2024

Hi,

OPENAI’s “GPT-4 Turbo with Vision” should be able to perform OCR.

With help from VFP forum members here and elsewhere I’ve been able to create textproposals and use speechrecognition in our applications.

For OCR we’re now using Tesseract, but I think GPT-4 may be a good alternative.

Has anyone tried this already? Any hints or tips?

Regards, Gerrit

Gerrit Broekhuis · May 14, 2024

Hi Vernpace,

Very nice indeed.

I see online a lot of people trying to tweak this model to get all text from an image (true OCR).
I got a much better result for an image with lots of rows and columns using this code:

Code:

"type": "text",
"text": "Get all plain text from this image, allow repetitions, use all columns and all rows."

I still think it's strange having to tweak like this. Getting all plain text is not the problem for OPENAI, they just have a too strict policy on this feature for whatever reason.

Another annoying thing is that most OCR applications prefer TIFF files, where GPT-4 Vision uses another file format (PNG, GIF, JPG and WEBP). Of course we can convert all these files (I use ImageMagick with VFP), but it's still causing extra work.

Regards, Gerrit

vernpace · May 14, 2024

Gerrit,

Yes, the API's interpretation of the text prompt is tricky - must be precise with Vision.

Here is some good news: Yesterday, OpenAI released a new model: GPT-4o (“o” for “omni”). You can use it now - just replace "gpt-4-turbo" with "gpt-4o"

See here:

https://openai.com/index/hello-gpt-4o

See the YouTube video presentation here:

https://youtu.be/DQacCB9tDaw

This is truly amazing stuff. While the presentation is focused on ChatGPT, "gpt-4o" is already working for the APIs that use "gpt-4-turbo". I tested it yesterday with language translations and it was two to three times faster.

Gerrit Broekhuis · May 14, 2024

Hi Vernpace,

I get similar results with this new GPT-4o model, only faster.

GPT-4 (and GPT-4o) are not made for OCR, but to a certain degree it still seems possible to do this. When asking GPT it responds that the AI portion should be used to process the text from other OCR packages (like Tesseract and Adobe). I already use Tessaract in my VFP applications and I intend to add some level of AI support.

Regards, Gerrit

Scott24x7 · May 27, 2024

So with the new ChatGPT 4o I thought I'd ask ChatGPT how it would achieve this and this was the response:

Integrating ChatGPT-4 into Visual FoxPro (VFP) 9 to pass images for information identification involves several steps. Here's a high-level overview of the process:

1. API Access
First, you need access to the OpenAI API, which provides ChatGPT-4 capabilities.

2. Image Processing API
ChatGPT-4 primarily handles text. For image processing, you might need to use additional services such as OpenAI's DALL-E for image generation or other image recognition APIs like Google Vision API, Microsoft Azure Cognitive Services, or Amazon Rekognition.

3. Setting Up API Calls in VFP
Visual FoxPro can make HTTP requests to external APIs. You will need to use these capabilities to interact with the necessary APIs.

4. Handling JSON
APIs typically return data in JSON format. VFP needs to parse this JSON to handle the responses.

Steps to Implement
Step 1: Obtain API Keys
Get API keys from OpenAI and your chosen image processing service.

Step 2: Make HTTP Requests from VFP
Use VFP's XMLHTTP or WinHttp.WinHttpRequest to make HTTP requests. Here’s a basic example of making a GET request:

Code:

LOCAL loHttp, lcUrl, lcResponse
loHttp = CREATEOBJECT("MSXML2.ServerXMLHTTP.6.0")
lcUrl = "[URL unfurl="true"]https://api.example.com/endpoint"[/URL]
loHttp.Open("GET", lcUrl, .F.)
loHttp.setRequestHeader("Content-Type", "application/json")
loHttp.setRequestHeader("Authorization", "Bearer YOUR_API_KEY")
loHttp.Send()
lcResponse = loHttp.responseText
? lcResponse

Step 3: Uploading Images
For uploading images, you might use a POST request. Here's a basic example of a POST request:

Code:

LOCAL loHttp, lcUrl, lcImagePath, lcResponse, lcImageBase64
lcImagePath = "C:\path\to\your\image.jpg"
lcImageBase64 = FILETOSTR(lcImagePath) && You might need to encode this to base64 if required by the API

loHttp = CREATEOBJECT("MSXML2.ServerXMLHTTP.6.0")
lcUrl = "[URL unfurl="true"]https://api.example.com/upload"[/URL]
loHttp.Open("POST", lcUrl, .F.)
loHttp.setRequestHeader("Content-Type", "application/json")
loHttp.setRequestHeader("Authorization", "Bearer YOUR_API_KEY")

LOCAL lcRequestBody
lcRequestBody = '{"image": "' + lcImageBase64 + '"}'
loHttp.Send(lcRequestBody)
lcResponse = loHttp.responseText
? lcResponse

Step 4: Parsing JSON Response
VFP doesn’t have native JSON parsing capabilities, but you can use a third-party library like VFP-JSON or write your own parser.

Code:

* Assuming you have a JSON parser
LOCAL loJson, loResponse
loJson = CREATEOBJECT("YourJsonParser")
loResponse = loJson.Parse(lcResponse)

Step 5: Integrating with ChatGPT
You can send text obtained from the image processing API to ChatGPT for further analysis.

Code:

LOCAL loHttp, lcUrl, lcResponse
loHttp = CREATEOBJECT("MSXML2.ServerXMLHTTP.6.0")
lcUrl = "[URL unfurl="true"]https://api.openai.com/v1/engines/davinci-codex/completions"[/URL]
loHttp.Open("POST", lcUrl, .F.)
loHttp.setRequestHeader("Content-Type", "application/json")
loHttp.setRequestHeader("Authorization", "Bearer YOUR_OPENAI_API_KEY")

LOCAL lcRequestBody
lcRequestBody = '{"prompt": "Your text here", "max_tokens": 100}'
loHttp.Send(lcRequestBody)
lcResponse = loHttp.responseText
? lcResponse

Example Workflow
Upload Image: Use VFP to send an image to the image processing API.
Get Text Information: Parse the response to extract text or other information from the image.
Send to ChatGPT: Use the extracted text to formulate a query to ChatGPT.
Process Response: Parse and handle the response from ChatGPT in your VFP application.
Considerations
Error Handling: Implement robust error handling for API calls.
Security: Securely store and handle API keys.
Performance: Consider the latency of API calls and handle them asynchronously if needed.
This approach should give you a foundational framework to integrate ChatGPT-4 and image processing capabilities into your VFP 9 application.

I haven't tried it yet, but thought this might be an interesting point to start from.

Best Regards,
Scott
MSc ISM, MIET, MASHRAE, CDCAP, CDCP, CDCS, CDCE, CTDC, CTIA, ATS, ATD

"I try to be nice, but sometimes my mouth doesn't cooperate.

Gerrit Broekhuis · May 27, 2024

Hi Scott,

Gpt4-o’s OCR using the API is definitely working! In my document archiving software I have two options now for OCR: (1) desktop OCR with Tesseract and (2) using Gpt-4o.

Most of my source material is pdf. So first I determine the number of pages. If there are multiple pages I convert all pages to either TIFF (for Tesseract OCR) or PNG (Gpt’ OCR). The conversion itself is done in the background with ImageMagick. In a form I can see all images and select the pages I want for OCR. I run OCR for a single page each time, so for multiple pages I use a loop. All tasks and results are looked up and stored in a central MySQL database. Once the OCR is ready I show a form to add all text files to the archive.

My JSON message to GPT is quit different from your code. I use MSXML2.ServerXMLHTTP as well. However the actual communication with MSXML2.ServerXMLHTTP is done in a separate application that runs several AI-services. This application is allways in my systray. Right now I have 3 services (text gerenation, speech recognition and OCR); the next one will follow soon (translation).

The call for OCR is quite difficult for GPT. It may only give you part of the text. It’s more an image descriptor actually. But if you add tot the task that it should allow repetitions and some other things perhaps, you should be able to get quite good OCR results, even for handwritings and uncommon languages.

If you need anything in particular, just let me know.

Goot luck!

Regards, Gerrit

vernpace · May 27, 2024

Hi Gerrit,

When you are ready to tackle language translation, you should start another thread.

As of this writing, we have successfully implemented six OpenAI services for a client: (1) DALL-E-3 Image generation; (2) Text generation; (3) Speech-To-Text; (4) Text-To-Speech; (5) Vision; and (6) Language Translation. Language translation was by far the most difficult - for reasons that are not obvious. We can now use the OpenAI API to translate 139 languages:

We recently entered into an NDA for these services, so we can no longer share source code. However, we can offer tips which will be helpfull.

vernpace · May 27, 2024

Gerrit,

With respect to Vision, we learned in the OpenAI forums that it is best to resize a base64 image (or images) before it is sent. See this:

We can provide the code for this - You may want to modify if you don't need tcDetail, tcDimension, or tiBytes info

Code:

*!*    Resizes a image. Assumes "System.App" is already loaded.
FUNCTION ResizeImageGPT4Vision(tcFileSource AS String, tcFileDestination AS String, tcDetail AS String, tcDimension AS String, tiBytes AS Integer) AS VOID
   LOCAL lcFileExt, lcDetail, loImage, liMaxSize, liWidth, liHeight, liWidthNew, liHeightNew, liDetailThreshold
   LOCAL loSrcImage AS xfcBitmap 
   LOCAL loResized AS xfcBitmap 
   LOCAL loGfx AS xfcGraphics 

   liMaxSize = 1024
   liDetailThreshold = 700
   lcFileExt = UPPER(JUSTEXT(tcFileDestination))

   WITH _Screen.System.Drawing 
        loImage  = .Bitmap.FromFile(tcFileSource, .T.)
        liWidth  =  loImage.Size.Width
        liHeight =  loImage.Size.Height

        loImage.Dispose()

        IF liWidth > liMaxSize OR liHeight > liMaxSize
           IF liWidth > liHeight
              liWidthNew  = liMaxSize
              liHeightNew = INT(liHeight * (liMaxSize / liWidth))
           ELSE
              liHeightNew = liMaxSize
              liWidthNew  = INT(liWidth * (liMaxSize / liHeight))
           ENDIF
        ELSE
           liWidthNew  = liWidth
           liHeightNew = liHeight
        ENDIF

        IF liWidthNew > liDetailThreshold OR liHeightNew > liDetailThreshold
           lcDetail = "high"
        ELSE
           lcDetail = "low"
        ENDIF

*!*     Load the original Image 
        loSrcImage = .Bitmap.New(tcFileSource) 
*!*     Create a New Image with the desired size 
        loResized = .Bitmap.New(liWidthNew, liHeightNew, .Imaging.PixelFormat.Format32bppARGB)
*!*     Set the image resolution to be the same as the original 
        loResized.SetResolution(loSrcImage.HorizontalResolution, loSrcImage.VerticalResolution) 
*!*     Obtain a Graphics object to get the rights to draw on it 
        loGfx = .Graphics.FromImage(loResized) 
*!*     Set some properties to ensure to have a better quality of image 
        loGfx.CompositingQuality = .Drawing2D.CompositingQuality.HighQuality
        loGfx.InterpolationMode = .Drawing2D.InterpolationMode.HighQualityBicubic 
        loGfx.SmoothingMode = .Drawing2D.SmoothingMode.AntiAlias
*!*     Draw the source image on the new image at the desired dimensions 
        loGfx.DrawImage(loSrcImage, 0, 0, liWidthNew, liHeightNew) 
*!*     Save the resized image 
        DO CASE
           CASE lcFileExt == "GIF"
                loResized.Save(tcFileDestination, .Imaging.ImageFormat.Gif)

           CASE INLIST(lcFileExt, "JPG", "JPEG")
                loResized.Save(tcFileDestination, .Imaging.ImageFormat.Jpeg)

           CASE lcFileExt == "PNG"
                loResized.Save(tcFileDestination, .Imaging.ImageFormat.Png)
        
        ENDCASE

        loSrcImage.Dispose()
        loResized.Dispose()
        loGfx.Dispose()

   ENDWITH

   tcDetail = lcDetail
   tcDimension = TRANSFORM(liWidthNew) + " x " + TRANSFORM(liHeightNew)
   tiBytes = GetFileSizeX(tcFileDestination)

ENDFUNC

This is very usefull when sending multiple images. Each of the six images below originally were 1792 x 1024 and between 3.5 - 4.5 MB

Gerrit Broekhuis · May 28, 2024

Hi Vernpace,

I use Base64 coding for OCR.
I will create a new thread for tranlations via the API.

Regards, Gerrit

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

GPT-4 Turbo with Vision for OCR

Gerrit Broekhuis

Programmer

Gerrit Broekhuis

Programmer

vernpace

Programmer

Gerrit Broekhuis

Programmer

Scott24x7

Programmer

Gerrit Broekhuis

Programmer

vernpace

Programmer

vernpace

Programmer

Gerrit Broekhuis

Programmer

Similar threads

Part and Inventory Search

Sponsor