GPT-4 Turbo with Vision for OCR

Gerrit Broekhuis · Apr 25, 2024

Hi,

OpenAI’s “GPT-4 Turbo with Vision” should be able to perform OCR.

With help from VFP forum members here and elsewhere, I’ve been able to create text proposals and use speech recognition in our applications.

For OCR we’re now using Tesseract, but I think GPT-4 may be a good alternative.

Has anyone tried this already? Any hints or tips?

Regards, Gerrit

Scott24x7 · Apr 26, 2024

Not only does it perform OCR, it does so more effectively than any product or process I've ever encountered. It is also extremely capable of then translating the result. I ran a letter from 1798 through it (handwritten, extremely difficult to read), and it OCR'd it around 95% accurately. If the source is very clear, it is even easier.

Yes, there are some major applicable uses for this, and if it can be linked into VFP, it will be a mind-bogglingly useful capability.
I'll start looking at this as well.

Best Regards,
Scott
MSc ISM, MIET, MASHRAE, CDCAP, CDCP, CDCS, CDCE, CTDC, CTIA, ATS, ATD

"I try to be nice, but sometimes my mouth doesn't cooperate."

Mike Lewis · Apr 27, 2024

Scott24x7 said:
I ran a letter from 1798 through it (handwritten, extremely difficult to read), and it OCR'd it around 95% accurately.

That is very impressive.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads

TamarGranor · Apr 29, 2024

There was a great article in the April issue of Scientific American about one of the good uses for AI that I've heard of. (I'm highly skeptical of LLMs and other things in that vein, particularly for ethical reasons.)

Here's a link, though I suspect it'll only be accessible to subscribers:

https://www.scientificamerican.com/...-that-decoded-an-ancient-scroll-and-changed/.

The basic idea is that they used AI in a couple of ways to read ancient scrolls that were burned so badly, they couldn't be unrolled.

Tamar

Scott24x7 · May 1, 2024

So I sent it a business card that I picked up (one of the craziest card layouts I have), and this was the response that ChatGPT came back with:

Company Logo: IX Terenine
Name: Aaron Shelley
Title: Senior Solutions Architect
Email: aaron.shelley@terenine.com
Office Phone: 866-379-3581
Direct Phone: 423-702-7038
Mobile Phone: 423-664-7383
Fax: 423-643-8031

The prompt I gave it was: identify the text on in this image, as if it were a business card:

It's 100% spot on.
Here's another example with a difficult card, where I've also asked it to translate the card into English (without giving it a base language):

The response was:
Logo: K V H
Name: 榊原明秀 (Sakihara Akihide)
Position: データセンターサービス部シニアスペシャリスト (Data Center Services Department, Senior Specialist)
Company: KVH株式会社 (KVH Co., Ltd.)
Address: 東京都港区南青山1-1-1 新青山ビル西館 T107-0062 (1-1-1 Minami Aoyama, Minato-ku, Tokyo, New Aoyama Building West Wing, T107-0062)
Telephone: TEL 03-4560-8842
Fax: FAX 03-5772-5940
Email: akihide.saito@kvh.co.jp
Website: http://www.kvh.co.jp

(And these are very old cards; these contacts no longer exist, so I'm not "spamming" anyone.)

There is certainly merit in exploring connecting ChatGPT to interface with VFP.
I haven't had time to explore this in depth yet, due to other commitments, but I'll certainly be circling back on it soon.

Best Regards,
Scott

Mike Lewis · May 1, 2024

Tamar said:
Here's a link, though I suspect it'll only be accessible to subscribers:

It's OK. I managed to read it OK even though I am not a subscriber.

Mike


Gerrit Broekhuis · May 6, 2024

Hi,

Thanks everyone for your input. I haven't started this project yet, as I was in the US last week to visit New York.
Hopefully later this week I will start exploring my options to get this type of OCR working.

Does anyone know if the GPT-3.5 models will still work and be accessible when I upgrade from GPT-3.5 to GPT-4? If so, I guess I will need to create a new API key?

Regards, Gerrit

Scott24x7 · May 9, 2024

Hi Gerrit,
Yes, you can use either 3.5 or 4.0 if you have the 4.0 upgrade.
I use both.

Best Regards,
Scott

Gerrit Broekhuis · May 9, 2024

Hi Scott,

I’m still using my 3.5 prepaid subscription, and yesterday I made my first successful trial with gpt-4 vision.
So there doesn’t seem to be a reason to upgrade to the more expensive gpt-4 platform.

Regards, Gerrit

Scott24x7 · May 9, 2024

Gerrit,
The reason will depend on how many prompts you want to send in a day/hour.
You will hit a limit really fast (I think it's 25 in 3 hours) on the trial.

Best Regards,
Scott

Gerrit Broekhuis · May 9, 2024

Hi Scott,

Good to know that. I was advised by the community @ openai not to upgrade to GPT-4. As I may have said before I have a prepaid subscription for GPT-3.5 (not the “free” trial).

I may upgrade once I run into the limit you describe.

Regards, Gerrit

vernpace · May 10, 2024

I'm having a very hard time understanding what is being accomplished in this thread - I thought it was supposed to be about GPT-4 Turbo with Vision, not ChatGPT.

First of all, it appears that GPT-4 Turbo with Vision does not support OCR. See here:

https://community.openai.com/t/gpt4-ocr-image-recognition/439478

For more information on GPT-4 Turbo with Vision:

For API documentation:

https://platform.openai.com/docs/guides/vision

For the community forum:

https://community.openai.com/tags/c/api/7/gpt-4-vision

Scott24x7 · May 11, 2024

Vernpace,
I see this as an overall discussion, understanding the issue before we "dive into it". I am appreciative of what is presented and discussed here, though it might seem a little unorthodox compared to the way threads within this forum usually develop.
This is new to many of us. This is an evolution.

Best Regards,
Scott

Gerrit Broekhuis · May 11, 2024

Hi Vernpace,

I'm still working on this, but I already get good results using model "gpt-4-turbo-2024-04-09" for OCR. I have to compare this to the results I get with my VFP application's integrated OCR, but AI looks very promising. I'm sure there will be newer and even better models to come.

If you want you can join the party and get this working!

Regards, Gerrit

vernpace · May 11, 2024

Gerrit,

Since you are getting good results for OCR, could you please provide the VFP code for the JSON request? Since it's OCR, I'm assuming you are using a local base64 image and not a URL, right?

Gerrit Broekhuis · May 11, 2024

Hi Vernpace,

This works for an online image. Local images need base64 encoding.

{
  "model": "gpt-4-turbo",
  "max_tokens": 2000,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What text is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://www.something.org/imageforOCR.jpg"
          }
        }
      ]
    }
  ]
}

NOTE: the URL above is a placeholder and does not work; substitute your own URL for testing.
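
For anyone wondering how a payload like the one above might be sent from VFP, here is a minimal sketch. It assumes the MSXML2.ServerXMLHTTP.6.0 COM object is available on the machine and that the API key is kept in an environment variable called OPENAI_API_KEY; both are assumptions for illustration, not something stated in this thread. The image URL is the same non-working placeholder.

```foxpro
* Sketch only: POST a vision request to the Chat Completions endpoint.
LOCAL loHttp, lcApiKey, lcPayload

* Assumption: the key lives in an environment variable with this name
lcApiKey = GETENV("OPENAI_API_KEY")

TEXT TO lcPayload NOSHOW
{
  "model": "gpt-4-turbo",
  "max_tokens": 2000,
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What text is in this image?"},
        {"type": "image_url",
         "image_url": {"url": "https://www.something.org/imageforOCR.jpg"}}
      ]
    }
  ]
}
ENDTEXT

loHttp = CREATEOBJECT("MSXML2.ServerXMLHTTP.6.0")
loHttp.Open("POST", "https://api.openai.com/v1/chat/completions", .F.)  && .F. = synchronous
loHttp.setRequestHeader("Content-Type", "application/json")
loHttp.setRequestHeader("Authorization", "Bearer " + lcApiKey)
loHttp.Send(lcPayload)

IF loHttp.Status = 200
    ? loHttp.responseText        && raw JSON reply
ELSE
    ? "Request failed:", loHttp.Status, loHttp.statusText
ENDIF
```

The reply is itself JSON; the recognized text is in the "content" value of the first entry under "choices".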

I know this is just the beginning but we have to keep learning, don’t we?

Regards, Gerrit

Gerrit Broekhuis · May 13, 2024

This is how I added support for a local image (only PNG, GIF, JPG and WEBP are allowed).

Code:

LOCAL lcImage, lcBase64Image, lcPayload
lcImage = "c:\demo\ImageToTextOCR\demo.png"
* STRCONV(..., 13) base64-encodes the raw file contents
lcBase64Image = STRCONV(FILETOSTR(lcImage),13)

* Create the request payload for GPT-4 Vision with local image
TEXT TO lcPayload TEXTMERGE NOSHOW
{
  "model": "gpt-4-turbo",
  "max_tokens" : 2000,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What text is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,{<<lcBase64Image>>}"
          }
        }
      ]
    }
  ]
}
ENDTEXT
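
Since only PNG, GIF, JPG and WEBP are accepted, it may help to derive the MIME type in the data URL from the file extension instead of hard-coding it. The following is a minimal sketch; ImageToDataUrl is a hypothetical helper name made up for this example, and the file path is the demo path from the code above.

```foxpro
* Sketch only: ImageToDataUrl is a hypothetical helper, not an existing VFP function.
LOCAL lcImage
lcImage = "c:\demo\ImageToTextOCR\demo.png"
IF FILE(lcImage)
    ? LEFT(ImageToDataUrl(lcImage), 40)   && show only the start of the data URL
ENDIF

FUNCTION ImageToDataUrl(tcFile)
    LOCAL lcExt, lcMime
    lcExt = LOWER(JUSTEXT(tcFile))
    DO CASE
    CASE lcExt == "png"
        lcMime = "image/png"
    CASE lcExt == "jpg" OR lcExt == "jpeg"
        lcMime = "image/jpeg"
    CASE lcExt == "gif"
        lcMime = "image/gif"
    CASE lcExt == "webp"
        lcMime = "image/webp"
    OTHERWISE
        ERROR "Unsupported image type: " + lcExt
    ENDCASE
    * Base64-encode the raw file bytes and prefix the data URL header
    RETURN "data:" + lcMime + ";base64," + STRCONV(FILETOSTR(tcFile), 13)
ENDFUNC
```

The returned string can then be merged into the "url" value of the payload in place of the hard-coded prefix.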

The results are OK for plain text recognition. However, it doesn't work for a page with multiple columns (like an Excel screenshot). In that case I don't get the entire text, only part of it plus a page description.
For example I get this at the start of the "content" value in the JSON response:

"The text in the image is a list showing the record of backorders or outstanding purchase order details. Here are the headers and some data entries:"

I'm trying to find a way to get the entire text for these image files too, but haven't succeeded yet. Adequate documentation is hard to find.

Regards, Gerrit

vernpace · May 13, 2024

Here is what I came up with - similar to yours, but no curly brackets around <<lcBase64Image>>

Code:

TEXT TO lcRequest TEXTMERGE NOSHOW
{"model": "gpt-4-turbo",
 "messages": [{"role": "system","content": "<<lcDirective>>"},
              {"role": "user", "content": 
             [{"type": "text", "text": "<<lcQuestion>>"},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcBase64Image>>", "detail": "<<lcDetail>>"}}
]}]}
ENDTEXT
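
The point about the curly brackets can be checked in isolation. With the default text merge delimiters only the << >> pair is replaced, so any braces typed around it stay in the output as literal characters. A small sketch, using a short placeholder string instead of a real image:

```foxpro
* Sketch only: "QUJD" is a placeholder standing in for real base64 image data.
LOCAL lcValue, lcWith, lcWithout
lcValue = "QUJD"

TEXT TO lcWith TEXTMERGE NOSHOW
data:image/jpeg;base64,{<<lcValue>>}
ENDTEXT

TEXT TO lcWithout TEXTMERGE NOSHOW
data:image/jpeg;base64,<<lcValue>>
ENDTEXT

? "{QUJD}" $ lcWith       && .T. : the braces became part of the data URL
? "{" $ lcWithout         && .F. : only the merged value was inserted
```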

I also included a "system" prompt to give additional directives if needed. Also included is "detail", which can be "high" or "low" depending on the size of the image, or "auto" to let the model decide.
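
One thing to watch when merging prompts such as lcDirective and lcQuestion straight into the JSON: a double quote, backslash or line break inside the prompt would produce invalid JSON. A minimal sketch of an escaping helper follows; JsonEscape is a hypothetical name for this example and it covers only the most common characters, not every control character.

```foxpro
* Sketch only: JsonEscape is a hypothetical helper, not a built-in VFP function.
LOCAL lcQuestion
lcQuestion = 'Transcribe the line labelled "Total" exactly.'
? JsonEscape(lcQuestion)

FUNCTION JsonEscape(tcText)
    LOCAL lcOut
    lcOut = STRTRAN(tcText, "\", "\\")       && backslashes first
    lcOut = STRTRAN(lcOut, '"', '\"')        && then double quotes
    lcOut = STRTRAN(lcOut, CHR(13), "\r")
    lcOut = STRTRAN(lcOut, CHR(10), "\n")
    lcOut = STRTRAN(lcOut, CHR(9), "\t")
    RETURN lcOut
ENDFUNC
```

Inside a TEXT ... TEXTMERGE block this could be used as <<JsonEscape(lcQuestion)>>, since the delimiters accept any expression.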

I think an Excel screen shot may be problematic for you. See below:

vernpace · May 13, 2024

Almost forgot,

Last night, I ran a "face recognition" test sending 10 base64 images - the images were from my brother's wedding where there were hundreds of people:

lcDirective = "Please provide a response without using any asterisks (*) or pound signs (#). For headings, prepend only with large roman numerals. For subheadings, prepend only with small letters in parenthesis."

lcQuestion = "Is the woman in the first image shown in any of the other images based on her face only? If so, describe (in detail) the other images where the woman is shown."

Code:

TEXT TO lcRequest TEXTMERGE NOSHOW
{"model": "gpt-4-turbo",
 "messages": [{"role": "system","content": "<<lcDirective>>"},
              {"role": "user", "content": 
             [{"type": "text", "text": "<<lcQuestion>>"},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage1>>", "detail": "<<lcDetail1>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage2>>", "detail": "<<lcDetail2>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage3>>", "detail": "<<lcDetail3>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage4>>", "detail": "<<lcDetail4>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage5>>", "detail": "<<lcDetail5>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage6>>", "detail": "<<lcDetail6>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage7>>", "detail": "<<lcDetail7>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage8>>", "detail": "<<lcDetail8>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage9>>", "detail": "<<lcDetail9>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage10>>", "detail": "<<lcDetail10>>"}}
]}]}
ENDTEXT
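
The ten near-identical image entries above could also be generated in a loop. This is only a sketch under stated assumptions: the array holds already-encoded base64 strings (short placeholders here), every image uses the same "detail" value, and the variable names are made up for the example.

```foxpro
* Sketch only: placeholder strings stand in for STRCONV(FILETOSTR(file), 13) results.
LOCAL ARRAY laImages[3]
LOCAL lcParts, lcDetail, lnI
laImages[1] = "QUJD"
laImages[2] = "REVG"
laImages[3] = "R0hJ"
lcDetail = "low"

lcParts = ""
FOR lnI = 1 TO ALEN(laImages)
    * Separate entries with a comma and line break, but not before the first one
    lcParts = lcParts + IIF(lnI > 1, "," + CHR(13) + CHR(10), "") + ;
        '{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,' + ;
        laImages[lnI] + '", "detail": "' + lcDetail + '"}}'
ENDFOR

? lcParts
```

The resulting lcParts string can be merged after the "text" entry with <<lcParts>>, which keeps the request template the same size regardless of how many images are sent.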

The response was perfect:

The woman from the first image appears in several other images based on her facial features and attire.

1. In the third image, the woman is seen in the foreground, her face partially visible, observing a group of bridesmaids dressed in blue. This is during what seems to be a wedding ceremony set against a lush, mountainous backdrop.

2. The sixth image shows the woman engaging in a dance or friendly interaction at a social event. She is dressed in the same outfit, with a detailed beige dress and a similar hairstyle, confirming it's the same event and person. The setting appears to be an outdoor patio area with heaters and umbrellas, suggesting a lively evening reception or party.

These images illustrate her participation in the festivities of what seems to be a wedding event, depicted in various settings of the celebration.
