Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

VFPxWorkbookXLSX extracts quote marks as " and < > symbols as < and &a 4

Status
Not open for further replies.

Rajesh Karunakaran

Programmer
Sep 29, 2016
542
MU
Hi friends,

I am using VFPxWorkbookXLSX for reading XLSX file. It has columns with email addresses. When I read those columns, my values are coming as quote marks (") being converted as &quot; and the '<' and '>' characters being converted as &amp;lt; and &amp;gt; respectively.

Are these by design? Or, is there any setting in the class itself?
I know I can remove these or replace by CHRTRAN or STRTRAN after reading.
But, I want to know if this can be configured!

Thanks
Rajesh
 
Neither reading the documentation or code, I'd say it's by design. Anytime you have to do with SGML markup like HTML and also XML is, you'd escap special characters like <, >, & with this, so they can be stored within tag attributes or embedded as their values. It's obvious how < and > as part of XML would generally sabotage the XML, isn't it.

And Rajesh, you can can read the docs and the code yourself, too, plus just think a bit about the reasons, even if you don't know much XML you surely have seen it and noticed it's based on tags that use < and > brackets. So it's clear the XML tag strucutre is vulnearable if < and > would be allowed as part of values in attributes or nested within tags, isn't it?

Chriss
 
Looking into it for you:

Usage of HTML entities within XML is the norm. It's not the invention of Greg Green nor VFP.
See
Also see this example, VFP is only involved here in using msxml.domdocument, not making the replacements:

Code:
oDom = CREATEOBJECT('msxml.domdocument')
oNode = oDom.createTextNode( [special characters &,<,>] )
? oNode11.text
? oNode11.xml

The xml tells you it's normal to have these characters translated within the XML, but the text is still interpreted without them, so there actually is automatic translation. No need to do something like that. Look into Gregs code to see where it does such translations itself, might as well be, but there's nothing unusual or bad or hard to handle about this.

Chriss
 
Hi Chriss,

Yes, I know about XML tag structure characters and if '<', '>' etc are allowed within the content, as it is, they could sabotage the XML, as you said. But, from that 'GetCellValue' method, in the return value, I was expecting the pure text which is actually one must have keyed into the XLSX cell.

My XLSX cell A1 contains
Code:
"My Name" <myemailidname@mydomain.com>

So, when I call 'GetCellValue', I was expecting the same text value in return!

By the way, I created the object like
Code:
loExcel      = CREATEOBJECT("VFPxWorkbookXLSX", "VFPxWorkbookXLSX.vcx")

loExcel gives a methods called 'GetStringXML' and 'GetXMLString'. Both have parameter as TCSTRING.
But, when I run, it says, 'Property GETXMLSTRING' is not found'.
Maybe, I am using a version in which this was not implemented yet?

Any idea?

Rajesh

 
I have been using VFPxWorkbookXLSX for several years. I've found a half dozen bugs in it here and there. Minor things, and things I've corrected.

It's free and open source software. You can take what is there and tweak it as you desire. Add a SET STEP ON in there. Extract out the code into a PRG and test it / develop it as you see fit.

Look at this library as the free gift it is. Take what Greg Green has given you out of his kindness, hard work, tweaks and maintenance, and augment it as you see fit. And if you find bugs or incomplete features correct / add them. Then, create a PULL request on the GitHub page and send them upstream for others to receive in the future.

We are part of a community and all of our effort into these types of things should benefit ourselves and others.

--
Rick C. Hodgin
 
Dear Rick,

I have gone through your reply word by word. I feel like you're in the impression that I was complaining about the library rather than seeking suggestions or help. Maybe, it's my way of writing that created such a thought in your mind.

I am a stainless admirer of this forum and those great people who dedicate their valuable & precious time for creating such fantastic libraries and give out as free gifts to the community. I think, nobody could ever put up a complaint on things like that whatsoever, lack of feature, error, bugs or whatever as it's indeed a blessing to have such people and their products for a community like this.

So, all our friends here, who may be thinking that I am bringing out complaints about such open source libraries, I apologise and please erase those thoughts from your mind.

I feel great to be here amongst you guys!

Rajesh
 
Dear all,

To solve the basic problem mentioned in this thread, I wrote a method to replace those expressions with their actual characters.

Rajesh
 
Cellvalue is translated, but XML has the escape sequences for special characters. So GetXMLString or GetStringXML should give you the escape sequences, that's normal.

msxml.domdocument also gives you the escape sequences when you get the xml and only th retranslated original text, when you want the text or cell value of nodes.

It may be that as Rick has found several minor bugs, Greg isn't strict about this, but I don't see an error in the specific case you report. If you want the xml, that has to have the escaped special characters, if you want your original value back, you don't want the xml, nor the inner xml between tags, you want the inner text or the excel cell value, that is the unescaped string, so there has to be another method for that. The XML in the name of the method you use indicates, that you get back the xml or inner text of an xml tag, and that's not the original value, yes.

I've got version R32 (last non-beta version), by the way.


Chriss
 
Rajest--

Yes, the GetCellValue() method should return the original text contained in the cell. Please provide me a sample xlsx file showing the error and I will debug the problem.

When reading an existing xlsx file, I store the string Id in the xl_cells cursor and add the string value to the xl_strings cursor (both as plain text and as xml text). This string id is then used to return the string value in the GetCellValue() method. This is to mimic the way that string values are stored in the xlsx file. However, if you in code store a string value to the cell, I store the string value directly into the xl_cells cursor and -1 as the string Id. I do this for speed in creating and saving the xlsx file (the string values are written to the xlsx file as 'in-string' values).

The methods 'GetStringXML' and 'GetXMLString' are both protected and that is why you get the error message not found. These can only be accessed by the class itself. I did this as the 'protected' methods are not meant to be used by others outside of the class and as such, are not documented. The documented methods I work to maintain backward compatibility if I add additional functionality to the class. However, protected methods I can change and will not necessarily try to maintain backward compatibility (only insure is working in all calls as expected).

Greg Green
(I am the developer of the class)
 
Hi Greg. I wanted you to know that every time I use your code in my projects it makes me happy. Your gifts of talent and time are very much appreciated by me and my customers.

Thank you for having that heart.

--
Rick C. Hodgin
 
Hi Greg,

akes sense. I would expect a function like SetCellValue to do the translation to XML and GetCellValue to retranslate. It must be done non-recursive. And then you can, if you need to, even store &amp; as &amp;amp; and the non.recursive retranlsation gives back &amp; what has been stored in. And that's also the way the XMLDomDocument works. And there would be no need for a string id and making a difference of xml vs value, you simply do the opposite escape/unescape once in each direction of going from value to XML and XML to value.

If you work on given XML you'd also unescape the xml once, and escape anything that's written in. Protection of the method is of course correct, the user of your class should not be concerend with the XML details, only cell values, as the class abstracts usage of an excel sheet, not the xml behind it.

So Rajesh, there you have made another error in trying to use a protected method, it's intended for internal use only and you only use public methods. It could maybe made a step easier to see by intellisense in having a base class that has private methods, a subclass that is the level used by other developers that won't show this private methods and then public methods that should be able to use the private methods will have to call the base class with baseclass::privatemethod() or have another private method that does so. That way you overcome the unfortunate implementtion of protected in VFP. It shoudn't be necessary to do that, though. The developer using a class should be responsible for knowing or at least learning OOP principles once you don't understand why an obviously existing method is said to not exist.

There could also be an incentive to not just use some class, but subclass it for your own use, then you inherit all the protected stuff and con use it, but are guided to only use it internally. After all these visibility aspects are there to guide developers in the correct usage. If you get an error you use it wrongly.

Chriss
 
You could also ask for classes that are user-friendly for non-OOP developers that only need the sample usage of instanciating and calling methods. Fine, but then ou lose the guidance of this OOP feature of the different levels of visibility. A way to offer it would be going the indirection of OLE, making a class an OLE class you get more control about which things are shown by intellisense. But then users always go through a layer of OLE.

Which again speaks for an end level class you suggest to use for non-oop developers, perhaps.

And after all that, I second Rick, it's always nice of all developers to provide some of their work publicly. Not all of us working in corporate environments are entitled to do so, we're adding to the intellectual property of the companies we work in and it becomes tedious to redo something in private to be able to make it publicly available, not necessarily public domain but at least open source in some more or less easily and freely licensable form with GNU, creative commons or university licenses and the like.

It's a pity so many companies decide for working closed source, often only because our customers want that without knowing the damage done by it.

It's not rarely a talk of how many potential VFP knowledge is out there with developers in companies that even when not considering the legal issues can't devote time for it.

Chriss
 
Rajest--

I created a quick test with a xlsx file that has the following cells filled with string values:

2022-01-02_15-35-57_obgvyl.jpg


As you can see, I used your string given above in the first cell. When I read the xlsx file using the class, I get the following results:

2022-01-02_15-37-26_vgo4q8.jpg


This correctly displays the text as entered into Excel. Below is the code that I used (see the readxlsxfile.prg contained in the zip file for the class).

2022-01-02_15-39-29_kkyp9g.jpg


As far as I can see, there is not any problem with the value returned by GetCellValue() method. If you still have a problem, please send a sample xlsx file and code that you are using.

Greg
 
Dear Greg,

Thank you so much for taking from your valuable time to go through our entire discussion and to check it with a live example.
Great of you! ...and sorry for replying late. I was travelling and not accessing tek-tips since few days.

Yes, as you got my point, I would expect the result of GetCellValue from a cell to be simply what you see in the cell when you look at your excel sheet. From your explanation I see that the class design also was in the same way. Information about the methods 'GetStringXML' and 'GetXMLString' also are very much clear.

In fact, my previous implementations used to correctly give the expected results. But, this time with my client's file, I was getting this problem. Actually, contents of my clients xlsx file is generated by another program. Maybe that program explicitly adds these representations (like &amp) into the cell for some reasons. However, at present, I cannot share that file here as the data which belongs to the client, is real-life I think. Let me try to extract some portions, edit it and share here so that we can see how it works.

Will get back here soon!

Thanks a lot all,
especially that you so much for your time dear Greg!

Rajesh
 
I would suspect an error there somewhere, like double translating, so & becomes &amp; and then &amp;amp;

The job of GetCellValue is not to recursivly retranslate, as you should indeed be able to store any value, including the escape sequences.

So siply look at the xml itself, the representation of " by &quot; would be normal and valid and GetCellValue should return " for it, as Greg demos. If the XML contains $amp;quot; you would get out &quot; and that would be correct, the error would be storing values into the XML after double escaping them. If, on the other hand you find " within the XML that's simply invalid XML and even Excel should reject to load and display it.

I can also imagine &quot; to be in the excel cells intentionally, because it's data posted in HTML and then browsers make the last conversion. In HTML5 it's more common to have UTF-8 encoding and then not use HTML entities at all, but in XML this rule is strict, you can't pick what to allow in tag values by encoding only, the few XML notation relevant characters always need to be replaced b HTML entitites.

One possible reaon is someone using an XML library or class like DOMDocument and not realizing it converts this automatically. When the XML generation code escapes characters and then adds that to nodes, it becomes double escaped and, well, then it's also technically correct you only get it unescaped once, but you could correct the error as you already did by making the second unescaping yourself. I wouldn't change Gregs class for that nor a subclass, that could be decided to with the context of where the results are used.

Chriss
 
Hi all,

Sorry for this late reporting here! I was travelling.
I still didn't receive any info from the client as to how the XLSX file is generated.

By the way, opening the XLSX file with the class takes a bit of time. However, after opening, the reading of rows one by one is very fast. Not sure, the delay in getting it opened is because of any problem with the XLSX file or I am missing something.

Unfortunately, I cannot share my client's XLSX file here.

Rajesh
 
Can you just unzip the xlsx manually (after adding the extension zip or using any archiver that detects a zip archive from it's inner structure, like 7zip).

And then run this?

Code:
? '&amp;quot;' $ filetostr(getfile('workbook.xml'))

You can find the workbook.xml file inside the unzipped folder structure with the workbookname as folder name and a subfolder called xl, so:
[pre]somedrive:\xlsxcontainingfolder\workbookname\xl\workbook.xml[/pre]

As that must be valid XML, no matter if office XML or any other, the few characters we mentioned already must be escaped by HTML entities (like &quot; or &amp;). It also doesn't become wrong XML, if the characters are twice HTML encoded, therefore and as you get out '&quot;', for example, where you expect just ", you would find &amp;quot; within the XML file before it's read out by Greg's VFPxWorkbookXLSX class.

If that's the xml contains a double escaped ", which would not point out whether it's intentional or not. It also still is valid XML, but you'd finally know the source of this is not an error in Greg's reader, he escapes it once, not twice, and that's also correct, if you store a HTML entity into a cell, i.e. when you open up excel and explicitly write in &quot or write in a fully qualified mail address in HTML escaped format, then you do have &quot; within the excel cell and you will have &amp;quot; in the XML of the excel cell and you'd want to also get back the HTML entity, not let a reader unescape it until there are not HTML entities. That means it's not the job of an XML reader to un-escape such HTML entities until they are gone. Because XML must not only be capable to store ", but also to store &quot;, which means storing that as &amp;quot; and getting back &quot;, not ".

See? The HTML entities just as the more usual backslash escape mechanism also must allow to be able to encode escape characters or sequences themselves. Therefore it also isn't strictly pointing out an error, that you read out &quot;, that's only an error if there's no &amp;quot; in the xml file with the workbook data itself.

Caution, though, there's a chance the workbook.xml only contains reference values of content stored in another file of the xlsx archive. So a .f. also is not necessarily pointing out Greg's class has an error.

Besides all this, alone Greg's test proves well enough there is no wrong reading back of what's in the XML cells. So it would be a very special complex case like XML is stored in a Unicode format with double byte characters, instead of utf-8, which is the default for most XML and within normal letters and some more characters including & and ; is just single byte characters. Also, in that case I would expect the output to contain chr(0), too, if it's not decoded to ANSI.

As we also still have no clue what exact code you're using to read out a cell value, it's hard to say what could have gone wrong. Your posts indicated that you at least tried to use the protected methodds GetXMLString, which could work once you changed the mwethod to be public in the class, but then would correctly not unescape anything. You have to use the GetCellValue method. If you "hacked" the class to be able to use GetXMLString, you stepped on your own foot. GetXMLString, with the emphasis on XML, returns the escaped cell value, including &quot; for ", because " is reseved in XML to be the ddelimiter of attributes of XML tags. So while in itself the " character can of course be within an XML file, it's escaped in XML values. Otherwise " would end an attribute and cause all kinds of parsing errors.2

Overall, Rajesh, I think you understand the nature of escaping characters, you just haven't thought down to the bottom of it. I hope this time I explained it in enough detail so it became clear why and how such characters are escaped, and that a feature like unescaping HTML entities automaatically until no more are left is counterproductive. It would help your case, but not be generally correct XML parsing.

Chriss
 
To summarize the most important point:

Getting out &quot; in itself doesn't point out an error. Whether it could point out an error depends on
a) your usage of Gregs XLSX reader
b) the pure internal xml content within the xlsx file, which actually just is a zip archive.

The one line test code above will only have relevance, if it's .f., if it's .t. it points out the value you get back is corresponding to what you have to expect. And the reason for it could very wll be, that the original company stores its data in an encoding that already uses HTML entities before storing it to csv or excel files for you. Then it remains our task to decode it.

One reason to use HTML entities is obviously when that is a web app that ensures browsers display it correctly even after several layers of transport and also hinder HTML tag attributes to throw over a parsing of the HTML. So it's not just a XML issue to escape such characters, HTML tags have the same principles of attributes delimited with ".

Chriss
 
Chriss,

Even if I had heard of I had never tried exploding an XLSX file. Thanks for the explanations.

I tried finding &quot in ....\xl\workdbook.xml and even in .....\xl\worksheets\sheet1.xml.
For both, it returned .F. Then I even tried with only '&'. That also returned .F. !

I tried like
Code:
? '&quot' $ filetostr('E:\in_mails\mail_recs\xl\workbook.xml')

But, when I see the value through the XLSX class I am getting like
&quot;rajesh k&quot; &amp;lt;rajesh@mydomain.com&amp;gt;

The attached file is extracted from the original file. I have edited the contents without disturbing the quote, < and > etc marks. This file also gives the same output. You may check with this file.

Chriss, as you said, this XLSX might have been created by an application other than Microsoft Excel, maybe a Web application as you said. I have not asked my client about these kind of details.

Rajesh
 
 https://files.engineering.com/getfile.aspx?folder=e5436bc4-64b3-46c1-87e4-579eb40a2fdc&file=INBOUNDMAILS-2recs.xlsx
The text strings in a xlsx file are stored in the file sharedstrings.xml normally. The string value can also be stored directly in the sheet.xml file as in-line string values; however, Excel does not seem to use this ability.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top