Searching in a big memo-field for a special info.

Status
Not open for further replies.

german12

Programmer
Nov 12, 2001
563
DE
I haven't worked with memo fields in VFP for a very long time.
Every week I get a magazine with stock market information, and
before I throw them away after having read them, I can save their text-contents in a memo field.
Such a memo field can then hold the text of around 150 pages per magazine, which is then around 13,000 m-lines of content in a memo field.
Of course I don't want to read everything, I just want to place a keyword and then - let's say - read the next 5 lines when that keyword had been found in the memo-field.
Example:
Keyword = “IBM”
Then 5 lines that contain "IBM" and a little more information should be displayed in a text box, an edit box or a cursor.
Now, of course, "IBM" may be mentioned several times in the magazine.
Then it would be nice to see the next 5 lines in the cursor as well, each time the search term "IBM" is found again.
So the memo field has to be searched completely, but only
a definable number of lines is displayed.

Thank you for your help in advance.
Klaus

Peace worldwide - it starts here...
 
Simplest idea, copy over the text into notepad and search in there, it'll scroll to found places.

As code: AT() and ATC() are able to tell a position, and then you can use SUBSTR() from AT(C) position -200/+200, for example. ATC searches case insensitive, so I'd prefer that.

AT() and ATC() also have a parameter for finding later occurrences, not only the first one. So you can do something like this:

Code:
LOCAL lnOccurrence, lnFoundAt
lnOccurrence=1
Do While .T.
   lnFoundAt = ATC('searchterm',memo,lnOccurrence)
   If lnFoundAt=0 
      Exit && nothing found, leave the while loop
   Else
      ? 'result ',lnOccurrence,':'
      ? Substr(memo,Max(1,lnFoundAt-200),400) && found position -200 and (up to) 400 characters.
      ? '++++++++++++++++++++++++++++++++++++++'
      lnOccurrence=lnOccurrence+1
   Endif
Enddo
Of course, you can store these substrings in a cursor and display that in a grid whose textbox is replaced by an editbox. That way you list the snippets of text instead of just printing them to the screen.

Edit: Changed MAX(0,...) to MAX(1,...) for obvious reasons. This part of the SUBSTR() expression handles the case where the search term is found so near the start that position-200 would be zero or negative.
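
The cursor idea could look roughly like the following. This is only a minimal sketch: the sample text in lcText stands in for the memo field, and the cursor name crsHits and its field names are made up for illustration.

```foxpro
* Sketch: collect the text around each hit in a cursor instead of printing it.
* lcText stands in for the memo field; crsHits and its fields are hypothetical.
LOCAL lcText, lcTerm, lnOccurrence, lnFoundAt
lcTerm = "IBM"
lcText = "IBM raises its outlook." + CHR(13) + CHR(10) + ;
         "Other news of the week." + CHR(13) + CHR(10) + ;
         "Analysts now rate ibm a buy."

CREATE CURSOR crsHits (nHit I, nPos I, mSnippet M)

lnOccurrence = 1
DO WHILE .T.
   lnFoundAt = ATC(lcTerm, lcText, lnOccurrence)
   IF lnFoundAt = 0
      EXIT   && no further occurrence
   ENDIF
   * up to 200 characters before the hit and up to 400 characters in total
   INSERT INTO crsHits (nHit, nPos, mSnippet) ;
      VALUES (lnOccurrence, lnFoundAt, SUBSTR(lcText, MAX(1, lnFoundAt - 200), 400))
   lnOccurrence = lnOccurrence + 1
ENDDO

SELECT crsHits
GO TOP
BROWSE
```

BROWSE is just the quickest way to look at the result; the same cursor could serve as the RecordSource of a grid.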

Chriss
 
Chriss,
Thank you for your good explanation about finding fragments in a memo field.
I tested your program - and understood it.
It runs fine.
The problem that only now becomes clear to me is the following:
Selecting everything with Ctrl-A and copying the contents of a magazine only works properly if the magazine has single-column text.
Unfortunately that's not the case here: my magazine has one column on some pages, but 2 or 3 columns on others.
If you copy all the pages with Ctrl-A, the columns run into each other in the copy, and the text becomes incomprehensible because the articles in the magazine are laid out by column.
It's clear to me that this has nothing to do with VFP. I then
tried transferring the entire copied text into Notepad++, where each line is shown separated by CR and LF, so
the text is readable there.
But everything would be better in VFP, because you don't have to read the entire text, just what VFP filters.
Is there a way to get the copied text into a memo field column by column?

I hope I have made myself clear.
Klaus

 
Screenshot_2024-02-19_185749_fbb7km.jpg


 
This is a sample of the magazine.
In this case "Jahr der Entscheidungen" has nothing to do with "Renk Debüt wird Volltreffer".
So the copy should be line by line, with CR + LF.

While writing this, I get the idea to copy into Notepad first and then get the text from there into a memo field in VFP.
What do you think?


 
Is this HTML or some custom magazine application for the ePaper version?

If a simple solution like CTRL+A mangles lines, then of course the answer is either to copy article by article or to find an application that can separate the columns. A VFP memo isn't column oriented at all. It's just like Notepad, the simple MS version: one text, one font, one size. ANSI or UTF-8, but no formatting and no semantic way to cut out columns like Notepad++ offers.

If you have text like this in a memo:
This is one text        This is text
of column 1 of a multi  of another column
column text             of the memo

The only way to detect columns here is to notice that the only longer series of successive spaces leads to the same "tab" position (even though no tab characters are involved). I wouldn't want to write a program that separates the text into columns, even if you tell it how many columns to detect. The left edge of column 2 might not be at the same position in every line, as fonts are usually proportional, not monospaced. You might not even have line breaks, so there is not even a rough position available. And selecting columns is not something a VFP memo offers.

I wonder whether it's possible to get the articles' continuous text from the ePaper files instead of using copy & paste. It might be a binary or mixed format like PDF, but the column layout may be stored separately from a running text, and that's exactly what you need: the pure text in the right order without column layout.

german12 said:
because you don't have to read the entire text, just what VFP filters.
Well, if you search and Notepad++ scrolls to the found places and highlights the search term, what keeps you from reading just the portion around the highlight? The advantage of VFP's substrings is minimal to me. Of course it's only your opinion that counts, but when you want to read the full text belonging to one substring, the programming challenge is to scroll the text to that place. That's not easy to program, but it is simply how searching works in Notepad++, even in the simple Notepad, or in any browser text you search with CTRL+F.

Chriss
 
It's a relatively slow process, but a screenshot app will isolate say a column from a magazine page. I use SnagIt, but there are free ones. You can print it or save it in many formats.

Of course this would be impractical if you need many shots in a short time. Don't know of an app that would do that.

Steve
 
Thank you very much Chriss & Steve
It is a PDF which I can download from the Internet (in addition to the print edition, which I receive via Deutsche Bundespost).

I worked with Notepad++ again - and I think I'll stick with the method.

Advantages
The search is very fast, transferring the contents of the PDF with Ctrl-A is also quick, and Notepad++ gets the text separated by columns.

With Notepad++ you can also quickly navigate repetitive search words.

Notepad++ also has its own search mask, which you can set so that the search jumps back to the beginning of the file to be searched.

Since it is pure text that needs to be searched,
you can also accumulate many issues in one Notepad++ file
and then follow the comments on a company over time. The
files should probably be accumulated so that the most recent issue is at the beginning.

I realize that, due to the lack of separators, doing this in VFP with memo fields would be far too much effort and still inaccurate.

Originally I was thinking of setting up a file with the names of
listed companies and then, with a mouse click, having all the information from the memo field displayed and saved automatically in one file per company, but the multiple columns and the missing markers for the end of an article do not allow this.

I've learned something new again.

Thanks again
Klaus





 
Glad you became a bigger fan of Notepad++.

Of course, having data in VFP (or any database) always adds further ways of processing it. So I wonder: is it feasible to work on the text files that pasting the articles into Notepad++ creates, and get them into memos from there the way you need it, perhaps one record per article, together with issue number, publishing date, etc.?

You could build a word index by defining a table of keywords, such as the ticker symbols, and finding them with ATC(). Store the record IDs of the ticker symbol and the article together with the ATC() result, so any search for them is already "done" in one general indexing table. Index that on the ticker symbol ID, and finding all articles about a company is done.

Every time you add an article, you run through the new memo with all your keywords. Every time you add a keyword (anything, not only ticker symbols), you go through all articles and add to the cross table of article ID, keyword ID and ATC() position. That indexing could also run as a scheduled daily task in the background, so actual searches can be very fast.
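
A minimal sketch of such an indexing run, using cursors so that it can be tried without any files. All table and field names here (Keywords, Articles, Hits and their fields) are assumptions made up for illustration, as is the sample data.

```foxpro
* Sketch of the keyword index: one Hits record per keyword occurrence.
* All names and the sample data are hypothetical.
CREATE CURSOR Keywords (iKeyId I, cKeyword C(40))
CREATE CURSOR Articles (iArtId I, mText M)
CREATE CURSOR Hits (iKeyId I, iArtId I, nPos I)

INSERT INTO Keywords VALUES (1, "Tesla")
INSERT INTO Keywords VALUES (2, "Renk")
INSERT INTO Articles VALUES (1, "Tesla cuts prices again. Analysts expect Tesla to follow up.")
INSERT INTO Articles VALUES (2, "The Renk debut was a success.")

LOCAL lcKey, lnOcc, lnPos
SELECT Articles
SCAN
   SELECT Keywords
   SCAN
      lcKey = ALLTRIM(Keywords.cKeyword)
      lnOcc = 1
      DO WHILE .T.
         lnPos = ATC(lcKey, Articles.mText, lnOcc)
         IF lnPos = 0
            EXIT   && keyword does not occur (again) in this article
         ENDIF
         INSERT INTO Hits VALUES (Keywords.iKeyId, Articles.iArtId, lnPos)
         lnOcc = lnOcc + 1
      ENDDO
   ENDSCAN
ENDSCAN

SELECT Hits
INDEX ON iKeyId TAG iKeyId

* all articles and positions for keyword 1
SELECT iArtId, nPos FROM Hits WHERE iKeyId = 1 INTO CURSOR crsResult
BROWSE
```

With real tables instead of cursors, the two SCAN loops would only need to cover the newly added article or the newly added keyword.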

Navigating to the ATC() position within an article is still the problem to tackle, but it could be easier using the RTF control: set its Text property, then set SelStart to the ATC() position, and the RTF control should scroll there.

The EditBox actually also has SelStart and could scroll there, so it wouldn't even be as hard as I thought.
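
As a rough sketch of the EditBox variant: the class name HitViewer and the sample text are made up, and it is an assumption here that the selection is set in the form's Activate event after giving the EditBox the focus. SelStart counts from 0, so the ATC() position is reduced by 1.

```foxpro
* Sketch: show a text in an EditBox and select the first hit so that it
* can scroll into view. HitViewer and the sample text are hypothetical.
LOCAL loForm, lcText
lcText = REPLICATE("Some filler line of magazine text." + CHR(13) + CHR(10), 200) + ;
         "Renk Debüt wird Volltreffer"
loForm = CREATEOBJECT("HitViewer", lcText, "Renk")
loForm.Show(1)   && modal, so the form stays open until it is closed

DEFINE CLASS HitViewer AS Form
   Caption = "Hit viewer"
   cTerm = ""

   ADD OBJECT edtText AS EditBox WITH ;
      Left = 0, Top = 0, Width = 375, Height = 250, HideSelection = .F.

   PROCEDURE Init(tcText, tcTerm)
      This.cTerm = tcTerm
      This.edtText.Value = tcText
   ENDPROC

   PROCEDURE Activate
      LOCAL lnPos
      lnPos = ATC(This.cTerm, This.edtText.Value)
      IF lnPos > 0
         This.edtText.SetFocus()
         This.edtText.SelStart = lnPos - 1   && SelStart counts from 0
         This.edtText.SelLength = LEN(This.cTerm)
      ENDIF
   ENDPROC
ENDDEFINE
```

For further hits, the same two properties could be set again with the position of the next ATC() occurrence.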

Chriss
 
Hi Klaus,

german12 said:
It is a PDF which I can download additionally via Internet (beside a Print which I receive by Deutsche Bundespost.)

Why the detour? Did you know that PDF viewers also have a search function? Just use CTRL + F and type in what you are looking for.

hth

MarK
 
Klaus can surely answer that for himself, but some reasons I can think of are:

1. The goal he already mentioned: searching in all PDFs. Notepad++ can search in multiple files and build a result list that "beams" you into the file of each result.
2. Extending point 1 by extracting all articles of a search into one new txt (or PDF).
3. Smaller file size of the extracted texts.

There are always multiple ways to solve the same thing. You can surely also find old news about a company on the internet, which would even speak for keeping neither PDFs nor texts... And there are counterarguments against extracting/converting PDFs, as available disk space keeps rising, i.e. hard drive sizes grow faster than the PDF files do. Anyway, I'm also interested in Klaus's answer.



Chriss
 
Thank you, Chriss & MarK, for your questions and hints.
I want to try to explain what my goal is or was.

1. Every week I receive a magazine via the Internet with lots of pictures and text.

2. I can also download this magazine as a PDF - but with one restriction - it can only be read if I log in to the internet at the same time.

3. Trying to read the downloaded PDF directly doesn't work. The reason is obvious to me: the publisher certainly doesn't want the information to be passed on to others free of charge.
So I can only read the information with my password.

4. The information that I have access to is also limited in time.
When new issues of the magazine appear, the older ones are no longer accessible online.

5. Each issue contains a wealth of news about public companies that are traded on the stock exchange.

6. Reading on the screen is very tiring because the magazine has multiple columns and a lot of pictures.
When searching in the file, the cursor jumps back and forth irregularly between the columns (the topics) on the right and left until it reaches the last section. With a 150-page magazine, that makes it extremely hard to concentrate on a keyword search.

7. Saving the information I have read on my own computer is only possible by marking all pages with CTRL-A and then saving the text as plain text (i.e. without images).

8. That wouldn't be bad either, but this file would not be properly readable because the contents of the columns, and thus the reports, are mixed together.

9. The best option is to save this in Notepad++, because Notepad++ saves line by line, not in a confusing way but
vertically separated by reports.

10. Unfortunately, there is no visible end marker for a topic in Notepad++ either, and that of course makes
transferring the topics into a database program such as VFP more difficult.

11. That was also my original goal: VFP should display the information for a search word I enter (or for company names taken from a file yet to be created), i.e. display everything that fits the topic with a mouse click, for example.

A search using an ID wouldn't be that helpful, because very often
only the name of a company appears in the text, not its ID.
Example: "Tesla", but not its ID, or only sometimes.

Conclusion:

Mark's question is answered in point 6.

The good ideas for further processing in a database,
the way Chriss describes them, fail in my opinion
because of the problem stated in point 10.
Chriss, maybe I misunderstood something. Your idea that searching and saving improves with each run, and that even
separate files develop in the process through the power of VFP,
is really fascinating.

For a better explanation, I am attaching a text file that was obtained via Notepad++.

The text file also contains a table of contents showing the companies covered in the magazine.

I hope I made something clearer.
The topic is - as Chriss has already explained - very diverse in terms of possible solutions.

Notepad++ is currently a very quick way to search,
but, as explained above in point 10, I don't know how to detect the end
of a topic so that this information
can be transferred to a database program automatically.

Manual entry requires too much time, because
you then have to read through all the lines to find each end and
enter a marker into the text.
With 10,000 to 13,000 lines per issue that is not efficient.

Btw: every magazine contains a table of contents listing which companies are discussed on which pages.
I copy it here as a sample. The sample is included in the attached Euro1.txt.
Air Liquide 47
Airbus 8
Aixtron 21
Allianz 49
Arm Holdings 22
AXA 48
BASF 49
BHP Group 46
Bilfinger 36
Biogen 15
Booking Holdings 49
Bumble 23
Commerzbank 14, 24
Continental 14
Corticeira Amorim 51
CTS Eventim 8
Datagroup 47
Delivery Hero 56
Deutsche Pfandbriefbank 6
Deutsche Telekom 26, 49
DHL Group 24
Dürr 21
Eon 30
Electronic Arts 44
Endeavour Mining 39
Etsy 47
Evotec 21
FMC 36
Fresenius 47
Fresenius Medical Care 47
Gaztransport et Technigaz 17
Gerresheimer 49
Glencore 47
Heidelberg Materials 48
Heineken 22
Hensoldt 49
Hochtief 48
Home Depot 46
JetBlue Airways 22
JP Morgan 53
L’Oréal 22
Lucid Group 47
Lyft 14
Medtronic 46
Mercadolibre 49
Mercedes-Benz 48
Microsoft 44
Moderna 48
Mutares 21
Nestlé 48
Newmont 48
Nintendo 44
Nvidia 47
Occidental Petroleum 38
Palo Alto 46
ProSiebenSat.1 20
Realty Income 47
Renk 20
Rio Tinto 47
Roche 23
Sartorius 20
Schneider Electric 59
Sony 53
Take-Two Interactive 44
Telefónica Deutschland 47
Tesla 12
Thyssenkrupp 20
Tod’s 9
Torex Gold 39
TUI 7
Unicredit 23
Uniper 25
Vitesco 49
Walmart 46


Klaus

 
 https://files.engineering.com/getfile.aspx?folder=e169dc84-769b-4bce-ae3f-0d659f01bd2d&file=Euro1.txt
I'm struggling a bit to understand why the texts are more readable in Notepad++.

Is it perhaps using a monospaced font? Then an EditBox can do the same in VFP if you set the font to something like the one used for code: Courier New.
I can't imagine this helps, though, as texts in a magazine surely use the usual proportional fonts, even serif fonts. And that also makes a difference when reading on screen, besides the column formatting.

Anyway, thanks for pointing out the limited time you have reading while logged in only. That explains a lot.

Chriss
 
Chriss,
The "better readability" referred to the difference between the text obtained directly (via Ctrl-A from the screen pages) and the same text after it has been transferred to Notepad++.

In Notepad++ you can read in column format (without the text of an adjacent column protruding into it).
This had nothing to do with a text field or an edit box in VFP, or with the font in Notepad++.
Klaus

 
Hm,

so you say the resulting text differs depending on whether you paste into a memo or into a Notepad++ editor?

The source clipboard format only depends on the source PDF, so I wonder how that can even differ.

I experimented a bit with a PDF containing two-column text. Not every PDF is the same in that respect, but I don't think that's important. I do indeed see a difference in how the paste arrives in VFP vs. Notepad++.

If you analyze the clipboard text copied from the PDF before pasting, by counting line breaks with ? Occurs(Chr(13),_cliptext), the result is 0. Once I paste the text into Notepad and copy it again from there, it contains line breaks, i.e. ? Occurs(Chr(13),_cliptext) suddenly is a few hundred.

And if you paste that into a memo, it looks just as in Notepad++.

In essence, it's VFP's limited clipboard access that seems to cause this. It's not just about Chr(13) vs. Chr(10); neither character exists in _cliptext. But you can get the same result in memos as in Notepad++ if you copy once more: when the Notepad text is the source, VFP also sees Chr(13)+Chr(10) line breaks and formats the text exactly the same way as Notepad++.



Chriss
 
No. Maybe, due to language, I expressed it wrongly.
Again:
What I wanted to say has nothing to do with the text itself.
It has to do with how the text looks after it is copied and pasted, and that depends on where it is pasted.
If I copy the magazine and paste the result into Word, for example, instead of Notepad++, it is not easy to read; the text of the magazine only becomes easy to read when it ends up in Notepad++.
I haven't tried it in an edit box or text field in VFP yet.
But I can still do that.


We agree, I already wrote that above - but hadn't tested it yet.


The only difference is that you are better off using a text field.
This brings us one step further...

But the problem is when the topic of a text changes and the text
no longer has anything to do with the search term. How can VFP recognize that?

E.g. a few lines copied from the *.txt as an example:

Applied Materials in €
2023 A J O 24
90
110
130
150
170
190
seit
16.02.23
+76,5
%
KU R Z M E L D U N G E N
Bezos verkauft Amazon
Amazon-Chef Bezos hat zum dritten Mal
in diesem Monat Amazon-Aktien verkauft,
diesmal für rund zwei Milliarden Dollar. Das
geht aus einer Börsen-Pflichtmitteilung


...and more text about Amazon.

When I look for the company Applied Materials, how can I see
that the text leads on to Amazon?

How much (how many lines) should be shown in a scrollable edit box in VFP?



Klaus


 
Put your last question to rest for the moment.

What you need to understand is how the different paste results arise at all.

VFP is not good at handling the clipboard with _cliptext. Not only because you only get text from the clipboard, but also because, even when what's on the clipboard is just text, it's not necessarily ASCII or ANSI text; it usually is OEM text or Unicode. In short, even when copying just text from a PDF, in my example I get 6 clipboard formats populated. That can also be seen in VFP: instead of just using _cliptext or pasting interactively with CTRL+V, you can enumerate what's on the Windows clipboard with the help of foxtools.fll:

Code:
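Set Library To foxtools

? OpenClip(0)   && open the clipboard
nFormat = 0     && 0 starts the enumeration
Do while .t.
   nFormat = EnumClipFm(nFormat)   && next format present on the clipboard
   If nFormat =0
      Exit   && no more formats
   Endif
   ? nFormat, Occurs(Chr(10),GetClipDat(nFormat))   && format number, number of linefeeds
EndDo
? CloseClip()   && close the clipboard again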

What I do here is go through all formats that are currently on the clipboard and get the data in each format to count the number of linefeeds (Chr(10)) in it.
The different formats are described in the foxtools help:
foxtools help said:
nFormat  Description (define type)
1        cf_Text
2        cf_Bitmap
3        cf_MetaFilePict
4        cf_SYLK
5        cf_DIF
6        cf_TIFF
7        cf_OEMText
8        cf_DIB
9        cf_Palette

When I copy all text of a multi-column PDF and run the code, I see 6 formats on the clipboard! Linefeeds only occur in some clipboard formats, not in all.
If you experience bad readability of what you paste into VFP or Word, but the result looks fine in Notepad++, then this suggests Notepad++ is cleverer at picking the text from the best clipboard format.

You could also get multiple lines in VFP by using GetClipDat(nFormat) for a format that reports linefeeds. In my case some formats, including the simplest, 1 cf_Text, have 0 line breaks, and that's what you see in bad pastes. But there are about 300 in format 7, cf_OEMText. So that's what should be "pasted" or stored into a memo.

Then the text itself will look the same as in Notepad++.
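
Storing the better format into a memo could be sketched like this. It only uses the foxtools functions shown above; the cursor crsIssue and its field mText are made-up names, and restricting the comparison to formats 1 and 7 is an assumption based on the format list. OEM text may additionally need a code page conversion for umlauts, which this sketch leaves out.

```foxpro
* Sketch: take the plain-text clipboard format with the most linefeeds
* and store it in a memo field. crsIssue and mText are hypothetical names.
SET LIBRARY TO foxtools ADDITIVE

LOCAL lnFormat, lnCount, lnBestCount, lcData, lcBest
lnFormat = 0
lnBestCount = -1
lcBest = ""

= OpenClip(0)
DO WHILE .T.
   lnFormat = EnumClipFm(lnFormat)
   IF lnFormat = 0
      EXIT   && no more formats
   ENDIF
   IF INLIST(lnFormat, 1, 7)   && 1 = cf_Text, 7 = cf_OEMText
      lcData = GetClipDat(lnFormat)
      lnCount = OCCURS(CHR(10), lcData)
      IF lnCount > lnBestCount
         lnBestCount = lnCount
         lcBest = lcData
      ENDIF
   ENDIF
ENDDO
= CloseClip()

CREATE CURSOR crsIssue (mText M)
INSERT INTO crsIssue (mText) VALUES (lcBest)
```

Run it right after copying from the PDF, before anything else changes the clipboard.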

That Notepad++ is still better at finding results and scrolling to them, I agree. I was only asking myself how you could get such qualitatively different results from pasting the same copied text into different applications. It's because even text on the clipboard can be stored and used in many formats. And with _cliptext or CTRL+V in a VFP window, the paste likely does not come from the best format. That also explains why it differs if you copy something twice: copying from Notepad also puts line breaks into the format VFP processes.

What plays a role here is that VFP is not a Unicode-compliant application but uses ANSI code pages. That's already far better than plain ASCII, but still limited to 256 characters at a time. I would never have thought that the data formats could differ in line breaks, though; no matter how limited text formats are, they all have line break characters.

Chriss
 
Nice answer, Steve
That will take some time... but today nothing can be ruled out, even if AI is still in its infancy.
At the moment the problem could perhaps be solved something like this (but the coding is certainly too difficult for me).

What could be done meanwhile?
When you read information, you usually scroll the screen from top to bottom, line by line.

At some point there comes a line where you notice that the following lines no longer match the keyword you are looking for.

If a bar with the sentence numbers is also displayed when scrolling, you could set a marker (that could be the keyword + "*") behind the last line of the article, e.g. by clicking on the sentence number. The next time the program is called, this marker would make it jump to the point where the marker was set before searching again.
So to search for new information the next time you call the program, you could skip many lines and immediately display the next piece of information (if it still matches the search), or simply start the next search with a new keyword.
In the long run, the search would become more and more efficient as the file grows.

Just an idea.....

Klaus



 
Klaus,

that assumes the keyword you search for occurs many times from the beginning to the end of an article. That's probable; or, from another perspective, if a company name is mentioned only once, the article's main topic is likely something else.

I'd take anything unusual that is not text as a sign of a change, not necessarily the end of the article, likely just a page break, because most likely you'll also have the page number and maybe even more page header/footer lines in the copied text. It's not easy to find a sequence of numbers that you can be sure is the page number, because any number within the text could be misinterpreted, too. But that's one thing to look at: repeating patterns per page, ideally of course per article, though I doubt you'll find something universal.

Some magazines start all their articles with a big initial letter. I have no idea whether that letter ends up separate after copy & paste, or whether, once the formatting is stripped, the first word of the article becomes a normal word and is nothing special. Single letters are also no good marker for the start of an article; that's too weak. An article also often ends with the author's initials. There might be a list of them for a magazine, but they can coincidentally be words, too.

One quite strong marker in texts is the period, which makes it easy to separate sentences. Some abbreviations would botch that, but it's not a showstopper. You already mention sentence numbers; I'm not sure that's what you mean, as Notepad++ gives line numbers. If you turn on automatic wrapping of overly long lines, Notepad++ displays them on multiple lines, but the line number then does not increase per displayed line. That's a nice feature, but it's still line numbering, not sentence numbering.

Well, no matter whether you go by line number or sentence number, you could get the lines with ALINES() and look for keywords in each of them instead of in the whole text. Then you can determine the minimum and maximum line number (or sentence number) containing a keyword. If there are multiple articles about the same topic, the min/max range will span all the articles in between, too, though.

So one thing that's possible then is to store a weighting factor that you set to 1 for lines in which a keyword is found, or to N, the number of times it is found in the line, and that you reduce by 0.2 per line in the neighboring lines until you reach 0 again. Then the interesting lines are those with a weighting factor > 0.
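
The weighting idea could be sketched as follows. The sample text and all variable names are made up, and where the ranges of two hits overlap, this sketch simply keeps the larger weight, which is one possible reading of the idea rather than the only one.

```foxpro
* Sketch of the weighting idea; the sample text and all names are hypothetical.
LOCAL lcCRLF, lcText, lcTerm, lnLines, lnI, lnJ, lnHits, lnReach
LOCAL ARRAY laLines[1], laWeight[1]
lcCRLF = CHR(13) + CHR(10)
lcTerm = "Amazon"
lcText = "Applied Materials chart" + lcCRLF + ;
         "KURZMELDUNGEN" + lcCRLF + ;
         "Bezos verkauft Amazon" + lcCRLF + ;
         "Amazon-Chef Bezos hat zum dritten Mal" + lcCRLF + ;
         "in diesem Monat Amazon-Aktien verkauft," + lcCRLF + ;
         "diesmal für rund zwei Milliarden Dollar."

lnLines = ALINES(laLines, lcText)
DIMENSION laWeight[lnLines]
STORE 0 TO laWeight   && all lines start with weight 0

FOR lnI = 1 TO lnLines
   * case-insensitive count of the keyword in this line
   lnHits = OCCURS(UPPER(lcTerm), UPPER(laLines[lnI]))
   IF lnHits > 0
      * 0.2 less per line: a weight of N reaches 0 after 5*N lines
      lnReach = 5 * lnHits - 1
      FOR lnJ = MAX(1, lnI - lnReach) TO MIN(lnLines, lnI + lnReach)
         laWeight[lnJ] = MAX(laWeight[lnJ], lnHits - 0.2 * ABS(lnJ - lnI))
      ENDFOR
   ENDIF
ENDFOR

* the interesting lines are those with a weight > 0
FOR lnI = 1 TO lnLines
   IF laWeight[lnI] > 0
      ? lnI, laWeight[lnI], laLines[lnI]
   ENDIF
ENDFOR
```

A contiguous run of lines with weight > 0 is then a candidate for what to show in the edit box for one hit.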


Chriss
 