web request via Socket issue

Status
Not open for further replies.

LucL

Programmer
Jan 23, 2006
US
Hey Guys,

I've got a weird problem. I'm using standard sockets to pull up web pages (I know Java has a built-in class to do this, but there is a special reason I'm using sockets).

Anyways, when the data starts flowing in, I get weird "8a", "40a", etc strings in between some of the HTML sometimes. I can't figure out where this is coming from.

For example...

I try to pull up a page. The headers come in fine, and so does all of the HTML, but the HTML data has a lot of junk in it...

</title>
<meta name="keywords" content="
32
<meta name="description" content="
13b
<meta name="robots" content="INDEX, FOLLOW">
<meta name="revisit-after" content="10">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

The 32 isn't supposed to be there, and I have no idea where it's coming from. Same with the 13b. There are extra line breaks in there as well.

Anyone know what this is?

It seems to persist across many domains and servers, and it only happens with socket-based web queries (the other web request class does not seem to have these in the result).

Thanks!
Luc
 
Do you know whether the web page is hosted on a Linux machine? It seems to be hosted on a Linux machine. The line break there is a little different from the one used on Windows machines.

You should show us some code that we can run, so we can tell you if anything is wrong.

What do you mean by <meta name="keywords" content="32 ? Did you
find a character with a value of 32 following the double quote?
 
Ok, it took me forever to figure this out and it's something really stupid, BUT I still need help with this.

The code is below. You can try it out. Basically, by changing the outgoing headers to use the HTTP 1.0 standard, the output is fine, but if you use the HTTP 1.1 standard, the output contains a lot of junk data (random numbers here and there).

Can anyone explain why this is?

Here is the code:

import java.io.*;
import java.net.*;

public class GetExample {

    public static void get_example() {

        Socket s = null;

        try {
            // The host name was lost from the original post; set it before running.
            String host = "";
            String file = "/";
            int port = 80;

            s = new Socket(host, port);

            OutputStream out = s.getOutputStream();
            PrintWriter outw = new PrintWriter(out, false);

            // Request line and headers; each line ends with CRLF.
            outw.print("GET " + file + " HTTP/1.1\r\n");
            outw.print("Accept: text/plain, text/html, text/*\r\n");
            outw.print("Host: " + host + "\r\n");
            outw.print("User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)\r\n");
            outw.print("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n");
            outw.print("Accept-Language: en-us\r\n");
            outw.print("Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n");
            outw.print("Accept-Encoding: compress32bit\r\n");
            outw.print("Connection: close\r\n");
            outw.print("\r\n"); // empty line ends the header block
            outw.flush();

            InputStream in = s.getInputStream();
            InputStreamReader inr = new InputStreamReader(in);
            BufferedReader br = new BufferedReader(inr);
            String line;

            // Print the raw response (headers and body) line by line.
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }

            // br.close(); // Q. Do I need this?
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            // Closing the socket also closes its input and output streams.
            if (s != null) {
                try {
                    s.close();
                } catch (IOException ignored) {
                }
            }
        }
    }
}

------------------

Now below is the output just as it came into the socket...

HTTP/1.1 200 OK
Date: Fri, 19 Dec 2008 08:40:19 GMT
Server: Apache/2.0.52 (CentOS)
Accept-Ranges: bytes
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

bd
<html>

<!--
THIS FILE IS AUTOMATICALLY GENERATED, DO NOT EDIT IT!!!
System configuration files on this machine are automatically
generated from a revision controlled repository.
-->



29
</title>
<meta name="keywords" content="
32
<meta name="description" content="
13b
<meta name="robots" content="INDEX, FOLLOW">
<meta name="revisit-after" content="10">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>

<frameset rows="100%,*" frameborder="no" border="0" framespacing="0">
<frame src="8a
</frameset>

<noframes>
<body bgcolor="#ffffff" text="#000000">
<a href="23
here to go to
30
</body>

</noframes>
</html>

----------------------------------

Notice all the random numbers separated by line breaks? If you change the header to 1.0 they disappear.
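
A note on the output above: the response headers include "Transfer-Encoding: chunked". With chunked transfer encoding in HTTP/1.1, the body is sent in pieces, and each piece is preceded by a line giving its length in hexadecimal (for example, bd is 189 bytes and 32 is 50 bytes). That would account for the stray values, and for why they go away with an HTTP/1.0 request, since a server must not send a chunked body to an HTTP/1.0 client. What follows is only a minimal sketch of decoding such a body by hand. The class and method names (ChunkedBodyDecoder, readLine, decode) are made up for illustration and are not from the original post, and trailer headers after the last chunk are not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
public class ChunkedBodyDecoder {

    // Reads bytes up to the next LF and returns them as text, without CR/LF.
    public static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                break;
            }
            if (b != '\r') {
                buf.write(b);
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Decodes a chunked body. The stream must be positioned just after the
    // blank line that ends the response headers.
    public static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';'); // drop optional chunk extensions
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi);
            }
            sizeLine = sizeLine.trim();
            if (sizeLine.isEmpty()) {
                throw new IOException("missing chunk size line");
            }
            int size = Integer.parseInt(sizeLine, 16); // chunk sizes are hexadecimal
            if (size == 0) {
                break; // a zero-length chunk marks the end of the body
            }
            byte[] chunk = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(chunk, off, size - off);
                if (n < 0) {
                    throw new IOException("stream ended inside a chunk");
                }
                off += n;
            }
            body.write(chunk, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two chunks (5 and 7 bytes) followed by the terminating zero chunk.
        String wire = "5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n";
        byte[] decoded = decode(
                new ByteArrayInputStream(wire.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(new String(decoded, StandardCharsets.ISO_8859_1));
    }
}
```

In the socket code from the post, this would mean reading the headers first and then handing the same InputStream to decode, rather than wrapping it in a BufferedReader, because a reader may buffer past the header block and swallow bytes that belong to the body.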
 
That makes perfect sense. Thank You!
 