
MEMO field Manipulation

Status
Not open for further replies.

Scott24x7

Programmer
Jul 12, 2001
2,826
JP
Hi All,
I just learned something odd about MEMO fields... in that there is a fixed "width" to their "line". That's all well and good, but I discovered that this width ignores CR/LF. So if I have a memo field that looks like this:

This is some text
in a memo field
that I just made up

If I interrogate that data using MEMLINES(myMemoField), it will return 1 (assuming the default MEMOWIDTH is set to, say, 80).

What I need to be able to tell, though, is: how many ACTUAL lines do I have, where CR/LF is, for lack of a better word, the delimiter?

I had hoped something like "SET MEMOWIDTH TO 0" would allow a variable length, but there's no such thing that I can find. I'm a bit rusty on some of the broader VFP string manipulation functions, and since I'm still flying only on my "HELP" file, it's hard to look through lots of functions.
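(A hedged sketch of the kind of count being asked about: VFP's ALINES() splits a string on CR, LF, and CR/LF pairs, independently of SET MEMOWIDTH. myMemoField here is a placeholder name.)

Code:
* Count physical (CR/LF-delimited) lines, not MEMOWIDTH-wrapped ones
LOCAL lnLines, laLines[1]
lnLines = ALINES(laLines, myMemoField)  && 3 for the three-line sample above
? lnLines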

Any suggestions?


Best Regards,
Scott
ATS, CDCE, CTIA, CTDC

"Everything should be made as simple as possible, and no simpler."[hammer]
 
But that gives me an array from the memo field, and that's neither what I want nor need...
And even in that case I need to specify a delimiter, and I tried CHR(13), but that doesn't seem to work.

Best Regards,
Scott
 
Another solution :
Code:
SELECT OCCURS(CHR(13),mymemo) + IIF(EMPTY(mymemo),0,1) FROM mycursor

Respectfully,
Vilhelm-Ion Praisach
Resita, Romania
 
VGULIELMUS,
Very interesting... and it has revealed something else to me. This is data generated from OCR output, which is saved into a MEMO field. Since each line from the OCR is on its own line in the memo, I assumed there must be a CR/LF at the end of each line... but this shows only 1 CR/LF at the end of each memo! I didn't think this was possible... is there some other delimiter I may be missing that allows text to be separated onto another line? I'm baffled.


Best Regards,
Scott
 
Please try CHR(10) instead of CHR(13), I mean :
Code:
SELECT OCCURS(CHR(10),mymemo) + IIF(EMPTY(mymemo),0,1) FROM mycursor

or CHR(11) (soft line break)
Code:
SELECT OCCURS(CHR(11),mymemo) + IIF(EMPTY(mymemo),0,1) FROM mycursor


Respectfully,
Vilhelm-Ion Praisach
 
VGULIELMUS,
Thanks! CHR(10) is it, but then it gets kind of tricky, because the last line just "ends". There is no CHR(10), CHR(13), or CHR(141) after it; the string simply terminates.
So if I do something like lnLineCount = OCCURS(CHR(10),myMemo) using my example in the first posting, it would have a result of 2.
This is ok, I can work with that. I can just append a CHR(10) to the end of the memo and now I have a "Line Delimiter" that I can work with.
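That append can be sketched as a small normalization step (assuming the memo text is in a variable myMemo, a hypothetical name):

Code:
* Ensure the memo ends with a line feed, so CHR(10) is a true line delimiter
IF !EMPTY(myMemo) AND RIGHT(myMemo, 1) <> CHR(10)
   myMemo = myMemo + CHR(10)
ENDIF
lnLineCount = OCCURS(CHR(10), myMemo)  && now counts every line, last one included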
Cheers!

Best Regards,
Scott
 
You're correct, OCCURS(CHR(10),myMemo) alone is inaccurate, and for that reason the expression includes an IIF():
OCCURS(CHR(10),mymemo) + IIF(EMPTY(mymemo),0,1)
The purpose of the IIF() is to add 1 (for the last line), but only for non-empty memos.

Respectfully,
Vilhelm-Ion Praisach
 
Right, that makes total sense.

Also, tbleken, I see the idea you have there, and I worked through part of it, but my "intention" (and maybe I'm nuts or stupid) is to slowly "tear away" at the memo field.

My rationale is: I'm trying to parse data from it, which is somewhat "unpredictable". So part of my intention in "discovering" a parsing method was to remove the stuff I can identify, leave the "rest" behind, and then see what that looks like.

I'm not a mathematician, nor do I have a background in data parsing, so I figure much of what I will need to do is refine a procedure over time, with some trial and error, looking at what's left.

For things like an email address (which is what I decided to start with, because of its somewhat near-unique identifier of @ in the middle), I want to pluck it out. The first thing I did was pull 150 cards' OCR data into a memo and just "look" at it to see what it would be like. I see a few things like:

email.address@someurl.com
E email.address@someurl.com
Email email.address@someurl.com
E: email.address@someurl.com

Those aren't too bad.

But then sometimes there are 2 lines together:

emil.address@someurl.com T:+65 9999 9999
E: emial.address@someurl.com 123 Any Street
e. email.address@someurl.com Company Name

Still, those aren't the worst; sometimes there's just bad data involved:

em#%l.addre70@som URl.eom
email.addressOsomeurl.com 365@#09CLwezx Hong Kong
236 EVoimewet Street E: email.addr ss@0jwe.com

So you see the challenge...
I figured email addresses would be a more easily identifiable target. I also looked to see whether they occur more frequently in one place or another, but they really don't. They occur at the top, at the bottom, standing alone, mixed with other text, in the middle... every possible location. So that part can't be used.

Then, as I am able to extract more and analyze the "leftover" (which is actually what I call that MEMO field), it should help filter down what remains, and what to throw away or ignore.
At least, this is my current thinking... even with the last line missing a CHR(10), I still know it's the last position in the text string, so that doesn't hurt me when I combine it with things like AT() and LEN().
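For instance, that final, undelimited line can be taken off the end with RAT(), which finds the last occurrence of a substring (variable names are hypothetical):

Code:
* Everything after the last CHR(10) is the final line of the memo
lnLastLf = RAT(CHR(10), myMemo)
lcLastLine = IIF(lnLastLf = 0, myMemo, SUBSTR(myMemo, lnLastLf + 1))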

I'm sure this will take several weeks or months to perfect, but this is officially the "starting" point.


Best Regards,
Scott
 
vgulielmus
I realized an elegant solution to this... since this field gets populated by an INSERT statement, I just append the CHR(10) to the data field at the time it gets loaded, if it is missing. Now all my lines of OCR data end with CHR(10), which spares me exception handling for the line at the end (since every memo's data now has a delimited last line).
So it's made things much easier. Still a long way to go on my PARSE data routine, but this has made it much more manageable.

Just as a note on tbleken's point as well: it may turn out that each line needs to be interrogated, so the ALINES() function may be useful. But there are times when I want groups of data within the memo (specifically address data, which may be 1, 2, 3, or in one case I've seen, 5 lines). For that reason I use a MEMO field for holding the "Address" element of the data. Maybe others will say that's a bad idea; I don't know yet. I do keep country, city, and postcode separate, but I get a lot of addresses, especially in places like Hong Kong (which has no post code), like:

Some Hirise Building
Level 17
222 Street Tseung Kwan O (Which is like a "township")
Kowloon New Territory
Hong Kong

Where the only part of that address that gets pulled into its own field is "Hong Kong", and that goes into the country field.
Singapore has a similar tendency; though they have post codes, they still list building and block numbers. So a typical address there might look like:

Some Building Name
2222 Some Street
Block 220 #07-02
123456 Singapore

So there the first 3 lines become the address, and by putting them into a memo field I don't have to muck with ADDRESSLN1, ADDRESSLN2, ADDRESSLN3. If they appear in order in the text (some do, some don't), I can just replace them into the ADDRESS memo field without doing anything else. Oh, it's my expectation that at the "end" of the parse, whatever is left over IS the address, rather than trying to parse the address out (that's my other approach), or it's Address + unusable info, in which case the unusable stuff gets deleted and the address gets replaced.

That theory could be wrong, but I'll continue to operate on it for now, as it's the approach I've decided on.


Best Regards,
Scott
 
Scott, Scott, Scott.

ALINES gives you an array of lines; what's wrong with that?

Instead of MLINE(memo,lineno) you say laLines[lineno] after the initial ALINES(laLines,memo).
That's it.

Or even simpler
Code:
=ALINES(laLines,memo)
FOR EACH lcLine IN laLines
  ? lcLine
ENDFOR

So what's so difficult about this?

You give up too early, you are impatient, and you don't thoroughly read through the answers you get.
I even sketched out for you how you could make use of _screen.TextWidth() to find the portion of the memo word-wrapping at a certain editbox width, by working backward to find the text portion exceeding that width.

Anyway, what you want is the core functionality of ALINES: it breaks text up at line feeds. Exactly what you want. It's even the simplest solution for that.

Bye, Olaf.
 
Olaf,
It's not a difficulty, and maybe you guys have the right answer, but I've been kind of trying this my way...
Because a line may have more than one thing that I need from it, like my earlier email example:

Email : some.address@someurl.com My Company Name

From that I need the email address (the easy part to get).
BUT I don't want the stuff in front of it ("Email : "); that gets dropped off, as it's useless data to me.

Then "My Company Name" is also useful, but I don't actually know that yet. So what I do is extract the part I want, throw away what I don't want, and leave behind the part I don't yet know whether I need.
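As a rough sketch of that "extract and leave the rest" idea (variable names are hypothetical, and it assumes one space-delimited email token per line):

Code:
* Pull the first token containing "@" out of a line; keep the rest as leftover
LOCAL lnWord, lcWord, lcEmail, lcLeftover
lcEmail = ""
lcLeftover = lcLine
FOR lnWord = 1 TO GETWORDCOUNT(lcLine)
   lcWord = GETWORDNUM(lcLine, lnWord)
   IF "@" $ lcWord
      lcEmail = lcWord
      * Remove only the first occurrence of the matched token
      lcLeftover = ALLTRIM(STRTRAN(lcLine, lcWord, "", 1, 1))
      EXIT
   ENDIF
ENDFOR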

The big issue comes with something like this:

Email : some.address.com My Company Name
Mobile: +8180 9999 9999 x 2345 221 West 3rd Street #403
My Building at My Block
Somecity, SomeTerritory, Some Country 999-9999

Now the address in this case is:

221 West 3rd Street #403
My Building at My Block

This, of course, could all be reversed as well.
So I have to create a "clever" way of processing that. My intention is to take away everything I can identify, with the hope that what is left over is the address, since I think trying to identify what is and is not the address is the hardest part. There is also some tricky work to get the "right" phone number matched with the right phone number field.

In some complex cases, I have seen cards that have 2 addresses on them (for various reasons), so they become particularly problematic.

I have now successfully finished the first step: accurately getting the email address out (and note that a lot of OCR garbage from images creates "erroneous @ signs", which is what we "key" on to identify email addresses). And then I have one card that has 3 email addresses on it, so I have to figure out what they "belong" to and why. But at least the first step is done. Now I start looking at more of the data.

The reason for not going "line by line" is that one of the things I read about parsing unstructured data said not to look at it granularly, but as a whole. So I was trying to do that without looking only line-by-line. Not sure if I will be able to manage it entirely that way, but time will tell. This one is a real "experiment" for me, as I've never tried to do anything like this before. If I could find a "Parse Business Card Data API" I would grab it! :)


Best Regards,
Scott
 
The ALINES issue is separate from your parsing ideas; you didn't use it right in the first place, even if it won't help you in cases where you have two "columns" on a card, like mail/phone/fax/website on the left side and postal address on the right side. Of course you then get lines with a phone number on the left and zipcode+city on the right, for example. What you identify this way is the need to partition the card scan into two OCR areas.

The idea to parse what can be identified and see whether the rest assembles a postal address is partially okay, but you forget about things like profession or job description, and anything else like a quote or statement. And what about curved text? Slanted text?

Address parsing is a hard thing, and there are services for that, of course, from USPS for example.
Bye, Olaf.
 
Well, that's kind of what I was hoping to avoid: address parsing, really. And the extent to which postal services do it is not needed for my part; there's no encoding to understand for their mail-handling systems. (On a strange note, I actually know an enormous amount about this area, as in my "other life" I'm a philatelic expert, particularly in US philately, and last year I did an entire study on the barcode system used in US mail sorting... but that's another story.) :)

Since I'm new to parsing this kind of data, where you have to look for tiny clues and lots of combinations, I figured I'd take the approach of "siphoning" out what I can; if I can only get 80%, it's still a big win. If we have to put in a few bits of data by hand, that's OK, but I want to eliminate as much work as I can.

I'm going to look a little deeper into the Transym API as well, as I've only scratched the surface of it, relying on EZTwain to just "do what comes naturally". The output is still impressive. About 95% of the cards are spot on, with a few not working well, and the biggest part of the remaining 5% are "misinterpretations" brought on mostly by the OCR trying to "OCR the logo". But again, since logos are all over the place on the cards, we can't tell it to "ignore" them; at least, not that I've discovered yet between Transym and EZTwain. EZTwain's support for VFP is limited, and Transym, as nice a product as it is, has zero idea about VFP and no VFP examples among its code samples. So I'm kind of "on my own" there.

Best Regards,
Scott
 
Oh, I'm also exploring a product called ABBYY FineReader Engine. I tested their business card scanning software, which was spectacular, but there is no API to it, because they sell/license the engine separately. I have a feeling from their website that it is pricey. I've asked for a trial of the engine, but they haven't gotten back to me yet (weekend, I guess...). But it is very promising, so maybe I won't have to put all this work into the PARSE routine if I can get a good engine to do it... pass it the OCR data from my system captures and let it pass me back a CSV or direct data.
That I can work with.


Best Regards,
Scott
 