GPT-4 Turbo with Vision for OCR

Status
Not open for further replies.

Gerrit Broekhuis

Programmer
Aug 16, 2004
316
NL
Hi,

OpenAI’s “GPT-4 Turbo with Vision” should be able to perform OCR.

With help from VFP forum members here and elsewhere I’ve been able to create text proposals and use speech recognition in our applications.

For OCR we’re now using Tesseract, but I think GPT-4 may be a good alternative.

Has anyone tried this already? Any hints or tips?

Regards, Gerrit

 
Not only does it perform OCR, it does so more effectively than any product or process I've ever encountered. It is also extremely capable of then translating the result. I ran a letter from 1798 through it (handwritten, extremely difficult to read), and it OCR'd it with around 95% accuracy. If the source is very clear, it is even easier.

Yes, there are some major applicable uses for this, and if it can be linked into VFP, it will be a mind-bogglingly useful capability.
I'll start looking at this as well.


Best Regards,
Scott
MSc ISM, MIET, MASHRAE, CDCAP, CDCP, CDCS, CDCE, CTDC, CTIA, ATS, ATD

"I try to be nice, but sometimes my mouth doesn't cooperate."
 
There was a great article in the April issue of Scientific American about one of the good uses for AI that I've heard of. (I'm highly skeptical of LLMs and other things in that vein, particularly for ethical reasons.)

Here's a link, though I suspect it'll only be accessible to subscribers:
The basic idea is that they used AI in a couple of ways to read ancient scrolls that were burned so badly, they couldn't be unrolled.

Tamar
 
So I sent it this business card that I picked up (it has one of the craziest layouts of any card I have). This was the response that ChatGPT came back with:
aaronshelleyfront_p5wfr4.jpg


Company Logo: IX Terenine
Name: Aaron Shelley
Title: Senior Solutions Architect
Email: aaron.shelley@terenine.com
Office Phone: 866-379-3581
Direct Phone: 423-702-7038
Mobile Phone: 423-664-7383
Fax: 423-643-8031

The prompt I gave it was: identify the text on in this image, as if it were a business card:


It's 100% spot on.
Here's another example with a difficult card where I've also asked it to translate the card into English (without telling it the source language):
akihidesaitoback_uygvhl.jpg


The response was:
Logo: K V H
Name: 榊原 明秀 (Sakihara Akihide)
Position: データセンターサービス部 シニアスペシャリスト (Data Center Services Department, Senior Specialist)
Company: KVH株式会社 (KVH Co., Ltd.)
Address: 東京都港区南青山1-1-1 新青山ビル西館 T107-0062 (1-1-1 Minami Aoyama, Minato-ku, Tokyo, New Aoyama Building West Wing, T107-0062)
Telephone: TEL 03-4560-8842
Fax: FAX 03-5772-5940
Email: akihide.saito@kvh.co.jp
Website:
(And these are very old cards; these contacts no longer exist, so I'm not "spamming" anyone.)

There is certainly merit in exploring an interface between ChatGPT and VFP.
I haven't had time to explore this in depth yet, due to other commitments, but I'll certainly be circling back on it soon.



Best Regards,
Scott
 
Hi,

Thanks everyone for your input. I haven't started this project yet, as I was in the US last week to visit New York.
Hopefully later this week I will start exploring my options to get this type of OCR working.

Does anyone know whether the GPT-3.5 models will still work and be accessible after I upgrade from GPT-3.5 to GPT-4? And if they are, I guess I will need to create a new API key?

Regards, Gerrit
 
Hi Gerrit,
Yes, you can use either 3.5 or 4.0 if you have the 4.0 upgrade.
I use both.


Best Regards,
Scott
 
Hi Scott,

I’m still using my 3.5 prepaid subscription and yesterday I made my first successful trial with gpt-4 vision.
So there doesn’t seem to be a reason to upgrade to the more expensive gpt-4 platform.

Regards, Gerrit
 
Gerrit,
The reason will depend on how many prompts you want to send in a day/hour.
You will hit a limit really fast (I think it's 25 in 3 hours) on the trial.


Best Regards,
Scott
 
Hi Scott,

Good to know that. I was advised by the community @ openai not to upgrade to GPT-4. As I may have said before I have a prepaid subscription for GPT-3.5 (not the “free” trial).

I may upgrade once I run into the limit you describe.

Regards, Gerrit
 
I'm having a very hard time understanding what is being accomplished in this thread - I thought it was supposed to be about GPT-4 Turbo with Vision, not ChatGPT.

First of all, it appears that GPT-4 Turbo with Vision does not support OCR. See here:
For more information on GPT-4 Turbo with Vision:

For API documentation: For the community forum:
 
Vernpace,
I see this as an overall discussion, understanding the issue before we "dive into it". I am appreciative of what is presented and discussed here, though it might seem a little unorthodox compared to the way threads within this forum usually develop.
This is new to many of us. This is an evolution.


Best Regards,
Scott
 
Hi Vernpace,

I'm still working on this, but I already get good results using model "gpt-4-turbo-2024-04-09" for OCR. I have to compare this to the results I get with my VFP application's integrated OCR, but AI looks very promising. I'm sure there will be newer and even better models to come.

If you want you can join the party and get this working!

Regards, Gerrit
 
Gerrit,

Since you are getting good results for an OCR, could you please provide the VFP code for the JSON request? Since it's an OCR, I'm assuming you are using a local base64 image and not a URL, right?
 
Hi Vernpace,

This works for an online image. Local images need base64 encoding.

{
  "model": "gpt-4-turbo",
  "max_tokens": 2000,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What text is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "<your image URL>"
          }
        }
      ]
    }
  ]
}

NOTE: the URL I used is not shown above and was not working anyway, so try this with your own URL for testing.

I know this is just the beginning but we have to keep learning, don’t we?

Regards, Gerrit
 
This is how I added support for a local image (only PNG, GIF, JPG and WEBP are allowed).

Code:
LOCAL lcImage, lcBase64Image, lcPayload
lcImage = "c:\demo\ImageToTextOCR\demo.png"
* STRCONV() with conversion type 13 base64-encodes the file contents
lcBase64Image = STRCONV(FILETOSTR(lcImage),13)


* Create the request payload for GPT-4 Vision with local image
TEXT To lcPayload TEXTMERGE NOSHOW
{
  "model": "gpt-4-turbo",
  "max_tokens" : 2000,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What text is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,{<<lcBase64Image>>}"
          }
        }
      ]
    }
  ]
}
ENDTEXT
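A side note on the data URL in the payload above: it hard-codes image/jpeg even though the file is a PNG. The following is a minimal sketch in Python (not VFP), assuming the same four image types mentioned above, of how the MIME type could be derived from the file extension before building the data URL. The function name `image_to_data_url` and the extension table are illustrative only and not part of any API.

```python
import base64
from pathlib import Path

# Hypothetical lookup table covering the four formats named in the post.
MIME_BY_EXTENSION = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_to_data_url(path):
    """Return a data URL whose MIME type matches the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in MIME_BY_EXTENSION:
        raise ValueError(f"Unsupported image type: {suffix or '(none)'}")
    # Base64-encode the raw bytes, as STRCONV(..., 13) does in the VFP code.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{MIME_BY_EXTENSION[suffix]};base64,{encoded}"
```

In VFP the same idea would amount to checking JUSTEXT(lcImage) and merging the matching MIME string into the payload.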

The results are OK for plain text recognition. However, a page with multiple columns (like an Excel screenshot) doesn't work. In that case I don't get the entire text, but part of the text plus a page description.
For example I get this at the start of the "content" value in the JSON response:

"The text in the image is a list showing the record of backorders or outstanding purchase order details. Here are the headers and some data entries:"

I'm trying to find a way to get the entire text for these image files too, but haven't succeeded yet. Adequate documentation is hard to find.
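For anyone following along, the "content" value sits under choices, then message, in the Chat Completions response. Here is a minimal Python sketch of pulling it out, together with finish_reason, which reads "length" when the reply was cut off by max_tokens. Checking it would at least rule out truncation as the cause of missing text. The sample response below is hand-written for illustration and is not real API output; the function name `extract_reply` is likewise made up.

```python
import json


def extract_reply(response_body):
    """Return (content, finish_reason) from a Chat Completions response body."""
    data = json.loads(response_body)
    choice = data["choices"][0]
    return choice["message"]["content"], choice.get("finish_reason")


# Hand-written sample in the shape of a Chat Completions response.
sample_response = json.dumps({
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "The text in the image is a list showing ...",
            },
            "finish_reason": "stop",
        }
    ]
})

text, finish_reason = extract_reply(sample_response)
if finish_reason == "length":
    print("Reply was truncated; consider raising max_tokens.")
print(text)
```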

Regards, Gerrit
 
Here is what I came up with - similar to yours, but no curly brackets around <<lcBase64Image>>

Code:
TEXT TO lcRequest TEXTMERGE NOSHOW
{"model": "gpt-4-turbo",
 "messages": [{"role": "system","content": "<<lcDirective>>"},
              {"role": "user", "content": 
             [{"type": "text", "text": "<<lcQuestion>>"},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcBase64Image>>", "detail": "<<lcDetail>>"}}
]}]}
ENDTEXT

I also included a "system" prompt to give additional directives if needed. Also included is "detail", which can be set to "high" or "low" depending on the size of the image; note that "auto" leaves the choice to the API.
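One caveat with merging prompts straight into JSON text: a double quote, backslash or line break inside the directive or question would produce invalid JSON. The sketch below, in Python rather than VFP, builds the same request shape as a data structure and lets a JSON serializer do the escaping. The function name `build_vision_request` and its parameter names are made up for illustration, and the list of accepted "detail" values is taken from the post above.

```python
import json

VALID_DETAIL = ("low", "high", "auto")


def build_vision_request(directive, question, data_url,
                         detail="auto", model="gpt-4-turbo"):
    """Serialize a system prompt, a question and one image into request JSON."""
    if detail not in VALID_DETAIL:
        raise ValueError(f"detail must be one of {VALID_DETAIL}")
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": directive},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": data_url, "detail": detail},
                    },
                ],
            },
        ],
    }
    # json.dumps escapes quotes, backslashes and newlines in the prompts.
    return json.dumps(payload)
```

In VFP, the equivalent precaution would be to escape those characters in the prompt strings before the TEXTMERGE.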

I think an Excel screenshot may be problematic for you. See below:

Capture5_u0bcyw.png
 
Almost forgot,

Last night, I ran a "face recognition" test sending 10 base64 images - the images were from my brother's wedding where there were hundreds of people:

lcDirective = "Please provide a response without using any asterisks (*) or pound signs (#). For headings, prepend only with large roman numerals. For subheadings, prepend only with small letters in parenthesis."

lcQuestion = "Is the woman in the first image shown in any of the other images based on her face only? If so, describe (in detail) the other images where the woman is shown."

Code:
TEXT TO lcRequest TEXTMERGE NOSHOW
{"model": "gpt-4-turbo",
 "messages": [{"role": "system","content": "<<lcDirective>>"},
              {"role": "user", "content": 
             [{"type": "text", "text": "<<lcQuestion>>"},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage1>>", "detail": "<<lcDetail1>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage2>>", "detail": "<<lcDetail2>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage3>>", "detail": "<<lcDetail3>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage4>>", "detail": "<<lcDetail4>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage5>>", "detail": "<<lcDetail5>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage6>>", "detail": "<<lcDetail6>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage7>>", "detail": "<<lcDetail7>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage8>>", "detail": "<<lcDetail8>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage9>>", "detail": "<<lcDetail9>>"}},
              {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<<lcImage10>>", "detail": "<<lcDetail10>>"}}
]}]}
ENDTEXT

The response was perfect:

The woman from the first image appears in several other images based on her facial features and attire.

1. In the third image, the woman is seen in the foreground, her face partially visible, observing a group of bridesmaids dressed in blue. This is during what seems to be a wedding ceremony set against a lush, mountainous backdrop.

2. The sixth image shows the woman engaging in a dance or friendly interaction at a social event. She is dressed in the same outfit, with a detailed beige dress and a similar hairstyle, confirming it's the same event and person. The setting appears to be an outdoor patio area with heaters and umbrellas, suggesting a lively evening reception or party.

These images illustrate her participation in the festivities of what seems to be a wedding event, depicted in various settings of the celebration.
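The ten near-identical image_url entries in the request above could also be generated in a loop. This is a minimal Python sketch of that idea, assuming each image is already available as a data URL with its own "detail" setting; the function name `build_multi_image_content` is hypothetical.

```python
import json


def build_multi_image_content(question, images):
    """Build the user "content" list: one text part, then one part per image.

    `images` is a sequence of (data_url, detail) pairs.
    """
    content = [{"type": "text", "text": question}]
    for data_url, detail in images:
        content.append(
            {"type": "image_url", "image_url": {"url": data_url, "detail": detail}}
        )
    return content


# Usage with placeholder data URLs (not real image data).
images = [(f"data:image/jpeg;base64,IMG{i}", "low") for i in range(1, 11)]
content = build_multi_image_content("Is the woman in the first image shown elsewhere?", images)
print(json.dumps(content[:2], indent=2))
```

The same loop in VFP could append one line per image to a string and merge that single string into the TEXT block.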

 