Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

VFPxWorkbookXLSX extracts quote marks as " and < > symbols as < and &a 4

Status
Not open for further replies.

Rajesh Karunakaran

Programmer
Sep 29, 2016
545
MU
Hi friends,

I am using VFPxWorkbookXLSX for reading XLSX file. It has columns with email addresses. When I read those columns, my values are coming as quote marks (") being converted as &quot; and the '<' and '>' characters being converted as &amp;lt; and &amp;gt; respectively.

Are these by design? Or, is there any setting in the class itself?
I know I can remove these or replace by CHRTRAN or STRTRAN after reading.
But, I want to know if this can be configured!

Thanks
Rajesh
 
Hi Greg,

I once again unzipped the XLSX (attached in my previous post) and checked in the SharedStrings.xml for these expressions. Couldn't find any. I can see the texts as it is.

Earlier you were asking for a sample file.
If you have some time to spare, can you check with the file from my previous post ?

Thanks in advance.

Rajesh
 
I see, the mail addresses are in sharedStrings.xml, for example your mail (anonymized) <t>"rajesh k" &lt;***@domain.com&gt;</t> is in there without escaped ", which makes me wonder why you get out &quot;

Actually the XML string should escape the ", but it doesn't

Chriss
 
Chriss,

If I view SharedStrings.xml in my FireFox browser, I see as
<t>"rajesh k" <rajesh@mydomain.com></t>
I viewed in my MS Edge, MSIE and Chrome. All show same.

If you're seeing it as
<t>"rajesh k" &lt;rajesh@mydomain.com&gt;</t>
I am wondering, why?
Did you open that in a text viewer to look inside?

Rajesh



 
Use a text editor, not a browser. Browsers render HTML entities as they should be displayed.

Also, if you just use FILETOSTR, as I said, you also don't translate the encoding by interpreting the XML by a browser, you simply read the XML text itself, which we're now interested in specifically to see if the data is correctly encoded. It's not, as a quote must be HTML encoded. The riddle is why you get out &quot; when the XML file itself contains a simple ".

You also don't read a html file into a browser to check whether it has a certain tags of HTML entities, do you? It's called HTML entity as it's a halfways speaking abbreviation of what it is &amp; has amp for ampersand. And a Browser displays that as &. So if you want to see whether HTML contains &amp; or just &, you don't use a browser.

I know you finally want to read out a mail address as displayed in a browser or in Excel itself. But that depends on it being correct XML for a corrct XML reader to read it out correctly.

It seems to me Gregs class always translates " into &quot; and &quot; into ", so if the XML is wrong the cell value is converted, that conversion into HTML entities should only be done when writing a value into an XML node, not when reading it out as value. And to make the confusion complete, a GetXMLString shoudl read out the XML as is, just like a text editor will display it.

So in short perhaps this list of what to expect helps:

Excel Cell value or browser display:
"name" <user@domain.com>

XML file content of a tale cell tag <t> including the opening/closing tag:
<t>&quot;name&quot;&lt;user@domain.com&gt;</t>
Notice: ]XML with correctly escaped characters. As already discussed in length things like pointy brackets < and > would cause havoc in the XML parsing, an inner XML just like inner HTML does never contain < or >. Exceptions confirm the rule.

GetXMLString (private method, not intended for your use, but only for internal use):
&quot;name&quot;&lt;user@domain.com&gt;
That is just the innner XML of the <t></t> tag. It's still XML, with some characters escaped as HTML entities

GetCellValue (the method of Greg's class you should use to read a cell value):
"name" <user@domain.com>
Notice: Well, in short what Excel or a browser also displays

As you get out &quot; where I actually erroneously just set " stored within the XML, then there's something really weird about all this. At least the < and > are converted to &lt; and &gt;. So far, so good. I guess the XML creation then is partly okay, but not fully done. Makes me wonder. So is it actually a XSLT file created without Excel by some library or is the source Excel itself, ie data is put into sheets by Excel automation. The latter shoudl do it correctly, shouldn't it?

And you don't need a text editor, you have VFP, you can use FILETOSTR and you can use MODIFY FILE to look into any text files, like XML or HTML. Or Hexedit for any binary file. Not interpreted and rendered into a document object model or anything like that.

A browser is fine to look into XML if you want to use its features of allowing to expand or collapse nodes and subnodes, but it's not siplaying XML 1:1, that should be clear.

Chriss.
 
Chriss,

Fantastic explanation! Thank you.

Yes, when I look into XML file as text, I can see those characters and escape expressions that you have mentioned. Also, as you have assumed, I also think, the XLSX is created using some utility through code. Basically, I think it's an email content extractor which reads through the emails and writes the content into an XLSX file. Maybe, the escaping is not handled properly.

Anyway, I have a small function in my program which translates these escape expressions into their character representations. So, generally the problem is solved. Also, I assume that for an email address specification, the only escape sequences we can expect are for characters ", < and >.

By the way, even though I take is as my problem is solved, we shall continue with this discussion if someone has to contribute something, isn't it?

Thank you all dear friends for your time
Rajesh

NOTE: Another matter is, opening my XLSX with say around 700 rows through the class, takes considerable time. Not sure if it's because of any problem in the XLSX itself. Does the class load the full file content onto memory before setting it ready for further processing ? I'm posting this as separate thread.
 
Rajesh,

It's fair to say your own fix solves the problem, I just also hope Greg can use your attachment to see whether something is wrong about his class or rather just the XLSX sample. I also see our other thread. I'll answer it there.

I don't see the sequence you get out. And yes, for mails only ",<, and > play a role. Doing escaping twice, you could also get &amp;, the HTML entity for &, because that's part of any HTML entity, also those of ",<, or >. In a first pass these become &quot; &lt; and &gt;, in a second pass &amp;quot; &amp;lt; and &amp;gt;

And these double escaped sequences would explain why you get out &quot; when &amp;quot; is unescaped, but only unescaped once. In that case it would be correct behavior of Gregs class, as it's task is not to unescape as often as needed to get rid of all HTML entities, Because if someone actually wants to store HTML entities into Excel cells of an XLSX, the double escaping is the way this will end up in the XML file, that's not wrong in itself. In short the HTML entities like any good escaping alos offer a way to encode themselves in a way that unescaping gets them back. One more commonly used escaping is using backslashes as prefix of encoded characters, like \t for tabs, \n for newline and \r for carriage return in Javascript, and to encode a \it's doubled to \\. Now if someone wants to escape \\ that will lead to \\\\, which also is not unfortunate but just the consequence of wanting to store the escape sequences themselves.

Chriss
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top