Thoughts on PHP parsing of URL

theniteowl

Programmer
May 24, 2005
1,975
US
I need to create a PHP script to parse out the URL/Querystring.

I will use the info to redirect the request through my own template file but there are a few things I need to consider.

The URL may be a full URL, or it may be a relative path like myfile.htm or /subfolder/myfile.htm, so I have to look for both possibilities.

The URL may contain parameters like myfile.htm?someval=12&anotherval=Test
Or it may be like myfile.htm#something.
I want to preserve these so they can be passed on to the content page that will be loaded.
Are there other things on the URL I should be looking for as well? I do not want to block capabilities the pages would have if their links were loading the pages directly rather than going through my script.
Can a URL contain both ?myval=something and #something, and if so is there a requirement for the order in which they appear?

I am using header("Location:......") to redirect the page with the newly modified URL. If the original link is to a page outside of the local web site I would like that page to open in a new window rather than altering the current browsers path. Can this be done from PHP or do I need client-side code for it?

Any thoughts on other things I may need to look for when processing the URL for the redirection?

I have been looking around at bits of code for parsing and none so far seem to accommodate both whole and relative paths or cover all of the possible parameters I may have to look for, so I may have to go from scratch. Any suggestions on approach or which methods would be particularly useful would be appreciated, as I am unfamiliar with most PHP methods and functions.

Thanks.

It's hard to think outside the box when I'm trapped in a cubicle.
 
the query string of a clicked link is contained in the $_SERVER['REQUEST_URI'] array element, even after a redirect from an htaccess file.

if you're talking about "on the fly link rewriting" from an "included" template then first off you're going to have to adopt one of two strategies:

1. have a spider that goes through all of your directories looking for dodgy links and rewrites them back into the filesystem (you could set this spider to run once a day or whatever).
2. abandon "includes" and move to a file_get_contents method of templating. this will have an impact if the templates contain php code. once you have the file in a string you can then run your rewrite code.

of the two - i'd go for the former if performance is an issue, and a hybrid of 1 and 2 (primary=2 but the file gets rewritten as well as the string) if not.

as a start - the regex for picking up the links (turn case sensitivity off) is
Code:
<a\b[^>]*>(.*?)</a>

hth
Justin
 
I am not trying to rewrite the links from the page. I am picking up the $_SERVER['REQUEST_URI'] value and parsing it out to determine how I will alter it before doing the redirect.

Say the content page has a link pointing to /myweb/sub1/myfile.htm, I first determine that it is a relative path to this site and has to go into a sub folder called sub1 to load the myfile.htm file.
I use the subfolder and filename to set the parameters for my menu script so it knows which options to display/highlight, then pass just sub1/myfile.htm to my index.php script so it knows the path relative to its own position. The actual URL I would output for the redirect would be something like: /myweb/index.php?vT=1&M=2&view=sub1/myfile.htm
Or at least a URL encoded version of the above to account for illegal characters in the path.
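
A minimal sketch of building that URL-encoded redirect target, assuming PHP's http_build_query() does the encoding. The helper name build_template_url() is hypothetical; the parameter names vT, M and view are taken from the example URL above.

```php
<?php
// Hypothetical helper: build the index.php redirect target from the menu
// parameters and a content path relative to index.php.
function build_template_url(int $top, int $menu, string $view): string
{
    // http_build_query() URL-encodes each value, so "/" in $view becomes %2F.
    $query = http_build_query(['vT' => $top, 'M' => $menu, 'view' => $view], '', '&');
    return '/myweb/index.php?' . $query;
}

$target = build_template_url(1, 2, 'sub1/myfile.htm');
echo $target, "\n";
// In the real script this would be followed by:
// header('Location: ' . $target); exit;
```

PHP decodes the view value again when it fills $_GET, so index.php would still see sub1/myfile.htm.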

If the content page link was passing its own parameters for either bookmarks or their own code to use, I want to recognize and preserve those values when I call index.php.

My menu system is generated from an array where T= indicates the top level menu option that was selected so that it can be highlighted when the menu is generated, M= the sub menu option that was selected so that it knows which menu option to highlight and what the default content page to load would be if no view parameter is passed, and view= indicates which content page to load other than the default for that menu option. The parameters T and M tell me which folder the path is relative to so I only need to indicate the name of the file or any additional sub folders and then the file to load under the view parameter.

So when a link in a content page is clicked I need to intercept it and then parse out the address to determine if it is outside of this site or not and if it is local determine the path to the file and which menu options are appropriate to show/highlight for files under those folders.

It should be relatively easy apart from my inexperience, but there are two potential problems: whether parameters passed from the content page links persist so that the newly loaded content page can use them (this may not be an issue, I have just not been able to test yet), and whether I can cause a new window to open for links determined to be outside of the local web site using server-side code.

I have to test to see if this system may cause problems for form submissions also. It is tough testing dealing with FTPing files back and forth to the server.
Last night I setup Apache and PHP on my home PC so I have a better test environment to work with but I do not yet have FTP access to the home environment so I cannot test anything while at work.

 
So when a link in a content page is clicked I need to intercept it and then parse out the address to determine if it is outside of this site or not
you can't (with php) unless you rewrite the attribute before displaying it. a link that is outside the site will cause the browser to get the resource that corresponds to the link. it won't ever query your server (without url rewriting).

if it is local determine the path to the file and which menu options are appropriate to show/highlight for files under those folders.

i think you might want to check out the parse_url function.
 
you can't (with php) unless you rewrite the attribute before displaying it. a link that is outside the site will cause the browser to get the resource that corresponds to the link. it won't ever query your server (without url rewriting).

All of the subfolders for the site are using an htaccess file with RewriteRule to call my processing page in which I hope to pick up the URL and parse it out. So the link actually executes but the htaccess file bounces it to my processing page where I determine the path and push it through my index.php file.

i think you might want to check out the parse_url function.
I have been looking into that function but it is stated not to work with relative URLs which is primarily what I will be processing so I have to either use a combination of functions or write one function to do it all.


 
All of the subfolders for the site are using an htaccess file with RewriteRule to call my processing page in which I hope to pick up the URL and parse it out. So the link actually executes but the htaccess file bounces it to my processing page where I determine the path and push it through my index.php file.

it makes no difference.

consider three links:
Code:
<a href="http://www.somesite.com/page.php">Some Site</a>
<a href="/page.php">Check out this page</a>
<a href="http://www.yoursite.com/directory/page.php">A page on this site</a>
links 2 and 3 will be directed by a user's browser to your web site, so your mod_rewrite rules will work just fine. link 1 will never touch your website: the browser will have the OS look up the IP address of the host and will go directly to port 80 on that machine. so nothing that you do on your web site will affect this link UNLESS you re-write the link using php or js so it looks something like
Code:
<a href="sitelinkchecker.php?url=http://www.somesite.com/page.php">Some Site</a>
this means that all links would be delivered to your site for parsing. not very elegant.
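
A minimal sketch of how such a checker script might tell the three links apart, assuming parse_url() is used to pull out the host. The function name is_external_link() is hypothetical, and www.yoursite.com stands in for the local host name.

```php
<?php
// Hypothetical helper: true when $url names a host other than $localHost.
// Relative URLs have no host component, so they count as local.
function is_external_link(string $url, string $localHost): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        return false; // no host (relative path) or unparseable: treat as local
    }
    return strcasecmp($host, $localHost) !== 0;
}

$links = [
    'http://www.somesite.com/page.php',
    '/page.php',
    'http://www.yoursite.com/directory/page.php',
];
foreach ($links as $link) {
    $kind = is_external_link($link, 'www.yoursite.com') ? 'external' : 'local';
    echo $link, ' => ', $kind, "\n";
}
```

Treating an unparseable URL as local is only one possible choice; a stricter script might reject it instead.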

I have been looking into that function but it is stated not to work with relative URLs which is primarily what I will be processing so I have to either use a combination of functions or write one function to do it all.
parse_url works just fine with relative urls. you want to use parse_url($_SERVER['REQUEST_URI']);
try this
Code:
<?php
// parse_url() accepts a relative URL; here it returns the path and query parts.
$foo = parse_url("/path/to/file/file.php?param=value&param2=value2");
echo "<pre>";
print_r($foo);
echo "</pre>";
?>
 
I see what you mean now (embarrassed it did not occur to me) and that is not a problem, it just means that links going outside of the website are not processed. The only reason to do that would be to cause a new window to open for that link rather than redirecting the current window but that is not a big deal. I just like to give the client a new window for outside links so they do not lose the window they were just viewing for the local website or have to try to remember how to navigate back to where they were.
That is just fluff though. Properly formatted links would use target="_blank" anyway.
If I feel really strongly about it I could put in some javascript that would set actions to occur when an HREF is clicked and test the link client-side or even parse the links out client-side with an onload event and reformat them. Not perfect of course but it would work 90+% of the time.

I will play with parse_url. When I read the site it said this:
Note: This function doesn't work with relative URLs.

Note: parse_url() is intended specifically for the purpose of parsing URLs and not URIs. However, to comply with PHP's backwards compatibility requirements it makes an exception for the file:// scheme where triple slashes (file:///...) are allowed. For any other scheme this is invalid.

Can you combine # and ? parameters on a URL? If so do they need to be in specific order like myfile.php?val=3#something?


 
Can you combine # and ? parameters on a URL? If so do they need to be in specific order like myfile.php?val=3#something?
no reason you can't combine them. i don't know about order. i (think) always do #internalanchor?param=value
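
On the ordering question: in the generic URI syntax (RFC 3986) the query comes before the fragment, as in myfile.php?val=3#something. A minimal sketch of how parse_url() treats the two orderings, using the example names from the question:

```php
<?php
// Query first, then fragment: parse_url() separates the two parts.
$ok = parse_url('myfile.php?val=3#something');
print_r($ok);

// Fragment first: everything after "#" is treated as the fragment,
// so no query component is reported at all.
$swapped = parse_url('myfile.php#something?val=3');
print_r($swapped);
```

Browsers also do not send the fragment to the server, so with the swapped form the parameters would not arrive in the request.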

The only reason to do that would be to cause a new window to open for that link rather than redirecting the current window but that is not a big deal.

if it becomes a big deal i'd take another look at php content parsing if i were you!
 
if it becomes a big deal i'd take another look at php content parsing if i were you!

I may end up having to do that but will have to work out the details for parsing the URL anyway. There are advantages and disadvantages to both methods not the least of which is performance and it will benefit me to learn the content parsing in the long run anyway.

Right now the school is getting anxious about getting the new site up so I am working feverishly to get something running and can worry about reworking it later as long as it does not cause too much structure change. Dealing with the URLs is the biggest issue, then I have to deal with some style issues and getting rid of the old convoluted table structure.

Thanks.

 
jpadie, sorry but I have another snag.
Using your code from this and the previous thread I thought things were working. In fact they were to a point but here is the situation.

I have a .htaccess file in a sub folder with these lines:
RewriteEngine on
RewriteRule \.(html|htm|php|php4|php5)$ /jfk/newjfk/error.php?error [R,L]

So when a request for any of these types of files comes into the folder it sends to the error page instead.
Inside that error page I need to know what the address was that they were attempting to get to so I can redirect accordingly. At this point though $_SERVER['REQUEST_URI'] returns the newly rewritten URL of /jfk/newjfk/error.php?error

Am I missing a step in the htaccess file to send the original URI on to the error.php page so I can parse it out and redirect?
Or is there another way to get it from the error.php page?
Or is this approach not going to work the way we had hoped?




 
rats. i forgot this in the move from redirect to rewrite.

try the following rewrite rule instead
Code:
RewriteRule \.(html|htm|php|php4|php5)$ jfk/newjfk/error.php?error&url=%{REQUEST_URI}$qstr=%{QUERY_STRING} [R]
your originally requested file will be in $_GET['url'] and its query string in $_GET['qstr']
 
Great thanks, I will try it out this evening.

Just as I was getting my parsing code together I realized that the URI passed was not what I actually wanted. :)


 
Well futz!
The above method works for the most part. I now have problems passing on the fragment after a hash mark though.

I have searched around a bit and found others with the same problem but no solutions.

There may not be a reliable way to do it given the way browsers handle the bookmark info and that causes the loss of a feature they may require.
I can always do a work-around for bookmarks but then the client would have to take special measures on their end which is what I am trying to avoid.

If you know how to get around this I would be glad to hear it. Otherwise I will stick with the current method for now and perhaps later work on a page parsing setup. It's the most complex and probably the slowest but still most flexible.

Thanks.

 
The above method works for the most part. I now have problems passing on the fragment after a hash mark though.

i was being slack. i knew i'd forgotten to add something. add a [NE] to the end of the rewrite rule so it becomes
Code:
RewriteRule \.(html|htm|php|php4|php5)$ jfk/newjfk/error.php?error&url=%{REQUEST_URI}$qstr=%{QUERY_STRING} [R, NE]

NE stands for no escape. what is happening is that apache is escaping the hash as a special character.
 
This thing is really giving me fits.
Using this line:
RewriteRule \.(html|htm|php|php4|php5)$ jfk/newjfk/error.php?error&url=%{REQUEST_URI}$qstr=%{QUERY_STRING} [R, NE]
I was getting errors until I removed the space in [R, NE].

When I try to retrieve qstr I get an undefined error so I replaced the $ with a & so now it all comes out on the address line but the returned REQUEST_URI looks like this:

/jfk/newjfk/error.php?error&url=/jfk2/teams/test.php&qstr=V=1&T=Test
Trying to pull qstr out will only return V=1 so I lose the rest of the line.

I could lose the qstr= from the line in the htaccess file but then I cannot get just that section so that it does not lose the #Testing on the end.

Thoughts?
I need to be able to pull back the original requested path in REQUEST_URI and still be able to retrieve more than one parameter and the #whatever.

Any way to convert the = and & signs in this: qstr=V=1&T=Test#Testing so that I can get the entire string and then parse it out in PHP?

I do not really need the error in there any longer so what I need is to be able to split and pass a url like this:
/jfk/newjfk/error.php?V=1&T=Test#Testing
I think I could do it if there would never be more than one parameter passed.




 
/jfk/newjfk/error.php?error&url=/jfk2/teams/test.php&qstr=V=1&T=Test
Trying to pull qstr out will only return V=1 so I lose the rest of the line.

let's not worry about the $_SERVER['REQUEST_URI'] for a second.

can you tell me what the results are for
$_GET['url'] and $_GET['qstr']?

these should contain the right info for you but if not there are ways to parse the $_SERVER['REQUEST_URI'] that are not troublesome.

you were quite right that there should not have been a space before NE and the dollar sign should, of course, have been an ampersand. sorry.


 

Here are the results of two of the tests.
All tests with URI: /jfk/newjfk/test/test.htm?A=1&B=2#Anchor

RewriteRule \.(html|htm|php|php4|php5)$ /jfk/newjfk/error.php?error&url=%{REQUEST_URI}$qstr=%{QUERY_STRING} [R,NE]
Address = /jfk/newjfk/error.php?error&url=/jfk/newjfk/test/test.htm$qstr=A=1&B=2
REQUEST_URI = /jfk/newjfk/error.php?error&url=/jfk/newjfk/test/test.htm$qstr=A=1&B=2
$_GET['url'] = /jfk/newjfk/test/test.htm$qstr=A=1
$_GET['qstr'] = nothing displayed

I believe nothing is displayed for qstr above because of the $ in the string so I replace it with an & in the string below for the next test.

RewriteRule \.(html|htm|php|php4|php5)$ /jfk/newjfk/error.php?error&url=%{REQUEST_URI}&qstr=%{QUERY_STRING} [R,NE]
Address = /jfk/newjfk/error.php?error&url=/jfk/newjfk/test/test.htm&qstr=A=1&B=2
REQUEST_URI = /jfk/newjfk/error.php?error&url=/jfk/newjfk/test/test.htm&qstr=A=1&B=2
$_GET['url'] = /jfk/newjfk/test/test.htm
$_GET['qstr'] = A=1

I believe qstr only shows the first value instead of the whole string because it runs into the & and thinks there are multiple values rather than just one string and only pulls out the value it thinks belongs to qstr.
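
A minimal sketch that reproduces this outside the server, assuming parse_str() splits a query string the same way PHP does when it fills $_GET. The second half shows that the value survives whole if it is URL-encoded before being placed in the query string.

```php
<?php
// The query string PHP received in the second test above.
$query = 'error&url=/jfk/newjfk/test/test.htm&qstr=A=1&B=2';

// parse_str() splits on "&" first, then on the first "=" of each pair.
parse_str($query, $params);
print_r($params);

// If the original query string is URL-encoded first, "&" and "=" no longer
// act as separators and qstr comes back as one string.
parse_str('qstr=' . urlencode('A=1&B=2'), $encoded);
print_r($encoded);
```

In the first result B becomes a parameter of its own, which is why qstr stops at A=1.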



 
it's becoming obvious that i tend to use IIS, i'm sure. i'm learning with you here....

...so logically you could fix the lost parameters with just this tweak
Code:
RewriteRule \.(html|htm|php|php4|php5)$ /jfk/newjfk/error.php?error&url=%{REQUEST_URI}&%{QUERY_STRING} [R,NE]

Then everything in the $_GET superglobal will be an old query parameter other than $_GET['error'] and $_GET['url']
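
A minimal sketch of putting the original request back together from those values. The array below is a stand-in for $_GET, filled with the values from the test URL earlier in the thread; the variable names are illustrative only.

```php
<?php
// Stand-in for $_GET after the rewrite (values from the test URL above).
$get = ['error' => '', 'url' => '/jfk/newjfk/test/test.htm', 'A' => '1', 'B' => '2'];

// Pull out the bookkeeping keys added by the RewriteRule...
$requested = $get['url'];
unset($get['error'], $get['url']);

// ...and whatever is left is the original query, re-encoded for reuse.
$original = $requested;
if ($get) {
    $original .= '?' . http_build_query($get, '', '&');
}
echo $original, "\n";
```

One caveat with this approach: a content page that uses its own parameter named url or error would collide with the keys added by the rule.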

i will give some thought to the document anchor. could you try changing the test urls to put the anchor first?
Code:
/jfk/newjfk/test/test.htm#Anchor?A=1&B=2
 
This works to get all of the variables but still loses the anchor.

If the anchor is positioned first then all variables are lost. I tested in IE and Firefox with the same results.

Firefox will continue to display the anchor in the address bar while IE does not but using parse_url will not display the anchor field.

Any way to have the # converted to another character or variable name so it can be picked out of the URL after?

 
Any way to have the # converted to another character or variable name so it can be picked out of the URL after?

you could take the NE off and then parse the result. apache would then turn the # into an escaped character.

or you could use regex but i really am lousy at regex.
 