Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

dealing with 'funny' chars in textarea input

Status
Not open for further replies.

1DMF

Programmer
Jan 18, 2005
8,795
GB
Hi,

I have a problem where our DB is displaying weird characters.

the data entered is....
App 1 – sum assured to include existing mortgage (£110k) plus new mortgage (£150k) – existing policy has been retained (also £110k) and new policy applied for is £260k therefore client will have total life cover of £370k?

The db is showing...
App 1 – sum assured to include existing mortgage (£110k) plus new mortgage (£150k) – existing policy has been retained (also £110k) and new policy applied for is £260k therefore client will have total life cover of £370k?

I've tried to use the unicode module like so...
Code:
    # use encoding for unicode characters
    use Unicode::String;
    Unicode::String->stringify_as( 'utf8' ); 
    # notes
    my $nt = $cgi->param('NT');
    $nt =~ s/\'/\'\'/g;
    $nt = Unicode::String::latin1( $nt );

but that's made things worse...
App 1 – sum assured to include existing mortgage (£110k) plus new mortgage (£150k) – existing policy has been retained (also £110k) and new policy applied for is £260k therefore client will have total life cover of £370k?

So what am I doing wrong and how do I deal with these dodgy chars?

Thanks,
1DMF



"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
It's difficult to answer this without the complete picture... what type of DB is it? What character set and encoding does the DB use? How does the data get into the database, and what encoding does the client that it is entered through use? And when you say "the db is showing"... where is it showing that? On a terminal of some kind? In a browser? If so we need to know what character set and/or encoding that is using, etc. There could be many links in that chain.

Annihilannic.
 
You can get this kind of effect when storing UTF-8 data in CHAR or VARCHAR columns, because the DB stores the double bytes as two separate ascii characters. Depending on how much control you have over the DB, you could try making them NCHAR or NVARCHAR instead. The long hyphens and the currency symbols seem to be the areas affected, which kind of confirms this. Try retrieving the DB rows and printing out the hex values of the data to confirm this diagnosis.

You are either going to have to clean the data before you store it as CHAR or VARCHAR, or change the columns to NCHAR or NVARCHAR (for Oracle, MySQL, or SQL Server at least; DB2 is either using Unicode to store data in CHAR or VARCHAR fields or not, depending on version and configuration)

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
stevexff, it's neither.

It is 'ntext' , I changed it to just 'text' but it's made no difference?

what type of DB is it?
MS SQL 2000 on SBS 2003.

What character set and encoding does the DB use?
How do I find that out?

How does the data get into the database, and what encoding does the client that it is entered through use?
it is entered into the DB by perl via an AJAX request... i'm assuming you meant the charset type on the web form, if so it's
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

And when you say "the db is showing"... where is it showing that? On a terminal of some kind? In a browser?
MS Access form.

Hope this helps to narrow down the cause.



"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
I checked and ntext is meant to mean unicode, so it should handle the chars.

I changed the charset to UTF-8 and it has made no difference.

Also If I use SQL enterprise manager to view the data it is showing the funny chars there so it has nothing to do with the MS Access front end!

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
I've tried using encodeURIComponent instead and changed the perl code to
Code:
    # use encoding for unicode characters
    use Unicode::String;
    Unicode::String->stringify_as( 'utf8' );
    # notes
    my $nt = $cgi->param('NT');
    $nt =~ s/\'/\'\'/g;
    $nt = Unicode::String::utf8( $nt );

Nothing seems to make a difference?

Also I've added a 'die' to see what perl is getting as data and it displays it like this...
Code:
App 1 – sum assured to include existing mortgage (£110k) plus new mortgage (£150k) – existing policy has been retained (also £110k) and new policy applied for is £260k therefore client will have total life cover of £370k

So is that telling me it's SQL corrupting the data, or is that test pointelss because the browser is just displaying the characters correctly?

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
View the source of the output in plain text to see if the browser is displaying it correctly or if it's coming correctly from the server.

Kirsle.net | My personal homepage
Code:
perl -e '$|=$i=1;print" oo\n<|>\n_|_";x:sleep$|;print"\b",$i++%2?"/":"_";goto x;'
 
I've had a problem with something similar before. From the point where the data is input on the form, to storing it on the DB, then getting it back off the DB and sending the data back to the browser, there are any number of points where the data may be incorrectly mapped as ascii rather than UTF-8. Do a binary chop on this pathway by starting in the middle. Use something crude to get the data off the DB in as raw a form as possible. This will let you see if it has been corrupted before it was written to the DB. Your columns are NCHAR or NVARCHAR, which means they should support UTF-8 correctly. Based on the outcome, you can at least focus your attentions on whether the corruption occurs on the way in or on the way out. From there, it is just a matter of stepping through the links to determine where the mapping goes awry.



Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Do a binary chop on this pathway by starting in the middle.
what does that mean?

Use something crude to get the data off the DB in as raw a form as possible. This will let you see if it has been corrupted before it was written to the DB.

How? how do I know if it's the DB corrupting it, if i'm reading it after the event?

I've tried using a Javascript alert but I don't know if that is showing me 'raw' data or not.

It does seem to escape the chars to some %xx but I don't understand the codes as there are onnly a few URL escape codes.

However, I've sent it via AJAX without using any of javascript encode/ecsape commnads and the DB shows the same problem.

The AJAX send the data as follows...
Code:
    http_request.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');

Where do you tell XMLHttpRequest what charset to use? or does it use the one from the current webpage?

Also is the only way I will get the raw data as perl sees by printing it out to a text file?



"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
OK, I think i've narrowed it down to being either AJAX or PERL.

In JavaScript I do the following...
Code:
var nt = encodeURIComponent(frm.notes.value);  
alert("notes :\n\n" + nt);
and get the attached screen shot.

So it has encoded the chars, then in perl I do the following...
Code:
use CGI;
my $cgi = new CGI;
my $nt = $cgi->param('NT');
$nt =~ s/\'/\'\'/g;
&create_test($nt);

=====================
###############
## data test ##
###############
sub create_test {

    # create remote file name
    my $filename = "test-" . time . ".txt"; 
    my $path = DIR_TO_DOCS . "/comp_reports/";


    # create file
    open(LOCAL, ">$path$filename") or die "test failed";
        flock(LOCAL, 2);
        binmode(LOCAL);
        truncate(LOCAL, length($_[0]));
        seek(LOCAL, 0, 0);
        print LOCAL "$_[0]";
    close(LOCAL);      



}

when I view the generated text file I get
App 1 – sum assured to include existing mortgage (£110k) plus new mortgage (£150k) – existing policy has been retained (also £110k) and new policy applied for is £260k therefore client will have total life cover of £370k?

So it cannot be anything to do with the DB end, wouldn't you say?

So it is JavaScript or Perl corrupting the data, i'm leaning towards JavaScript just because I'm assuming that the CGI module should handle URL encoded chars and is considered the standard module to use for obtaining URL encoded data from STDIN.

Can anyone confirm this?

Does this mean Javascript is not capable of handling / sending these chars correctly using AJAX?

What do I do now I've narrowed down the issue to pre DB insert?


"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
 http://www.homeloanpartnership.com/test.gif
App%201%20[red]%E2%80%93[/red]%20sum

What are these? Three URI escape codes for just one symbol? This might be part of the problem here, when Perl unescapes these it would get three symbols back instead of one.

Kirsle.net | My personal homepage
Code:
perl -e '$|=$i=1;print" oo\n<|>\n_|_";x:sleep$|;print"\b",$i++%2?"/":"_";goto x;'
 
This might be applicable to this somehow:


If the characters you're using aren't in the ASCII range (0-255) they'd need multiple bytes to display them, so if a unicode string is trying to be parsed as ASCII (one byte per character) that long dash might show up as three different characters, and therefore be encoded as three separate % encodings.

I have one idea that should get around this pretty well: base64-encode it before sending it over ajax.

See the source here:
Base64 encodes any data (often binary) as plain text (A-Z, a-z, 0-9 and a couple symbols), so you wouldn't need to worry about URI encoding any unicode characters. Then, in Perl,

Code:
use MIME::Base64 "decode_base64";

print decode_base64($data);

Kirsle.net | My personal homepage
Code:
perl -e '$|=$i=1;print" oo\n<|>\n_|_";x:sleep$|;print"\b",$i++%2?"/":"_";goto x;'
 
Do a binary chop on this pathway by starting in the middle.

what does that mean?

I am learning C, I just read about binary search method, which chops a list in half, searches, chops useful half in half again, searches it, etc

So a binary chop means cutting up your whole code process into little bits (1/2 then 1/2 again=1/4, etc
 
Sorry about the confusion on the binary chop. MrCBofBCinTX covered that pretty well, I think.

Do you know if you are setting the headers correctly on your messages? Most browsers should cope happily with UTF-8 if you set the charset on the Content-type: header - see for more details.

Don't know enough about JavaScript to comment on its UTF-8 handling, though...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
App%201%20%E2%80%93%20sumWhat are these? Three URI escape codes for just one symbol? This might be part of the problem here, when Perl unescapes these it would get three symbols back instead of one.

My thoughts exactly when I saw the JS alert, WTH has it escaped!!!

Which is why i'm strongly leaning towards JS as the problem.

Base64 is certainly an idea, i'll have a go and see how I get on.

steveff -> Not sure how you send the correct content-type via AJAX, I am assuming (rightly or wrongly) that AJAX uses the current doc content-type for charset before encoding and sending.

Admitedly I was using ISO originaly in my page but now have the X/HTML coded with
Code:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

I'll go post in the AJAX forum to see what the syntax is for sending the correct content-type via the AJAX transmission.

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
base64 gives exactly the same result, I encode with
Code:
var nt = Base64.encode(frm.notes.value);

I decode with the perl code you supplied and the resulting text file has
App 1 – sum assured to include existing mortgage (£110k) plus new mortgage (£150k) – existing policy has been retained (also £110k) and new policy applied for is £260k therefore client will have total life cover of £370k?

So what is going on????

even the base64 encode is encoding 3 chars for the single char, why on earth is this happening?

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
Seems like you can set the charset on the async request for AJAX on XMLHttpRequest JS object. See for details. Don't know if this helps any, though...

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Tried that also with no effect...
Code:
    http_request.open('POST', url, true);
    http_request.setRequestHeader('Content-type', 'application/x-[URL unfurl="true"]www-form-urlencoded');[/URL]
    http_request.setRequestHeader('Content-length', parameters.length);
    [b]http_request.setRequestHeader("Accept-Charset", "UTF-8");[/b]
    http_request.send(parameters);

I've become defeatest and so dealt with it in the simplest way without wasting anymore time on it....

In perl I simply do this...
Code:
    # notes
    my $nt = $cgi->param('NT');
    $nt =~ s/\'/\'\'/g;
    $nt =~ s/–/-/g;
    $nt =~ s/Â//g;

and it works fine!

I'll just have to deal with any other chars as they crop up, well unless someone has a eureka moment and comes up with a solution that works.



"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!"

Google Rank Extractor -> Perl beta with FusionCharts
 
After adding Encode and HTML::Entities, I found I still had to do a few hand-made substitutions myself on a different project.

I think that, overall, dealing with charsets still isn't right in many areas. Kind of sucks to have to battle like this, but if it works, it works.

I'm sure you will occasionally have to tweak something else :), I did.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top