Code to read and parse html file

theniteowl · Aug 25, 2006

Hi All,
I am looking to read in a .htm file and parse the contents so they can be inserted into a template page and sent to the browser.

Primarily I need to strip out the head and body tags so the remaining HTML can be added directly into the existing page.
The pages would be on the same server.

Can PHP read in a .htm or even .php file as text information directly? I want to keep the code all server side.
The purpose is to allow content creators to create their HTML pages however they wish, but have each page appear within a specified area of the templated page. I do not want them to have to worry about using a specific template of their own, or to remember to remove the html, head and body tags, so that we do not end up with those tags embedded within the body of another page.

If anyone has any code to read an html file into a string or code that will parse out the string to return all code between specific tags it would be a great start.

At my age I still learn something new every day, but I forget two others.

jpadie · Aug 25, 2006

i assume you want the php in the html to be processed. if not, this is as simple as file_get_contents($filename) plus some regex (see below)

if yes:

Code:

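// Buffer the output so the included file's result can be captured in a string
ob_start();
include $filename;           // any PHP inside the file is executed here
$contents = ob_get_clean();  // return the buffered output and discard the buffer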

then you need a regex

Code:

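// i = case-insensitive, s = "." also matches newlines; the "/" in </body> must be escaped
preg_match('/<body\b[^>]*>(.*?)<\/body>/is', $contents, $matches);
// fall back to the whole string when no body tags are found
$excisedcontents = isset($matches[1]) ? $matches[1] : $contents;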
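
For the static case (no PHP to process), a minimal sketch might look like the following. The function name extract_body and the sample markup are made up for illustration; in real use the string would come from file_get_contents($filename).

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or the whole input when no body element is found.
function extract_body($html)
{
    // i = case-insensitive (<BODY>), s = let "." match newlines
    if (preg_match('/<body\b[^>]*>(.*?)<\/body\s*>/is', $html, $matches)) {
        return trim($matches[1]);
    }
    return $html;
}

// Stand-in for: $html = file_get_contents($filename);
$html = "<html><head><title>Demo</title></head>\n"
      . "<BODY bgcolor=\"#ffffff\">\n<p>Hello</p>\n</BODY></html>";

echo extract_body($html), "\n"; // prints <p>Hello</p>
```

Note that [^>]* assumes no attribute value in the body tag contains a ">" character.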

theniteowl · Aug 25, 2006

Does this mean that any PHP code within the file is processed and the results of the page are returned into the $contents variable?
That sounds like what I would want. I had been thinking in terms of static content and just parsing out what I needed but you are right I do need to consider the resultant content after any server side code is processed.

I would then take the resultant string and search for any potentially important pieces such as styles, code blocks and eventually the HTML and then insert those pieces into appropriate locations of my template page. I may even parse out hyperlinks and modify them to fit the template system prior to rendering.
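
Pulling out the style and script blocks could be sketched roughly like this. The helper name extract_blocks and the sample page are assumptions for illustration only, and the pattern expects reasonably well-formed markup.

```php
<?php
// Hypothetical helper: collects the contents of every <tag>...</tag> pair.
function extract_blocks($html, $tag)
{
    $tag = preg_quote($tag, '/');
    // i = case-insensitive, s = "." also matches newlines
    preg_match_all("/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}\\s*>/is", $html, $matches);
    return array_map('trim', $matches[1]);
}

$html = "<html><head><style type=\"text/css\">p { color: red; }</style>"
      . "<script>var a = 1;</script></head>"
      . "<body><p>Hi</p><style>b { color: blue; }</style></body></html>";

$styles  = extract_blocks($html, 'style');   // both style blocks, head and body
$scripts = extract_blocks($html, 'script');  // the single script block
print_r($styles);
print_r($scripts);
```

The collected pieces could then be placed into the matching locations of the template page.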

jpadie · Aug 25, 2006

with include/require the php is processed - hence the need to capture the content in the output buffer (you cannot otherwise return the 'included' content to a string).

with the other file functions the php is not processed (although it would be if you called it via an http wrapper but you might then lose control over the input variables).
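
To make the difference concrete, here is a small sketch that writes a throwaway file and reads it both ways. The temp file and its contents are invented for the demonstration.

```php
<?php
// Write a small throwaway file that mixes HTML and PHP.
$filename = tempnam(sys_get_temp_dir(), 'tpl');
file_put_contents($filename, '<p><?php echo 1 + 2; ?></p>');

// file_get_contents() returns the source text; the PHP is not executed.
$raw = file_get_contents($filename);

// include executes the PHP; the output buffer captures what it prints.
ob_start();
include $filename;
$processed = ob_get_clean();

unlink($filename);

echo $raw, "\n";        // the source, PHP tags and all
echo $processed, "\n";  // <p>3</p>
```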

for the parsing: the code i gave just dumps the stuff within the body tags. it's pretty easy to parse the file generally for style blocks or whatever. similarly easy to modify most links PROVIDED they are all correctly coded to start with. the complexity comes in from trying to protect the users from their own cock-ups. a good site on regex is http://www.regular-expressions.info

i reckon, however, that regex is enough of an arcane skill that a forum on tek-tips dedicated to regex might be worthwhile.

theniteowl · Aug 25, 2006

I have been using regular expressions quite a bit but do not really know how to write them. I can often modify them enough to get the job done but have not spent the time to truly learn them yet.

Is your expression above looking for a tag beginning with "<body" and then optionally having other parameters prior to the closing ">"?

I would either replicate the expression for each type of tag I want to find or, better, make it a function that I can pass the search patterns into. I did something similar in JavaScript with:

Code:

      var oRE = new RegExp(sStart + "[^>]*?" + sMiddle + ".*?" + sEnd, "i");
      mydiv = mydiv.replace(oRE, sNewString);

sStart and sEnd would be the patterns for the start and end tags I had to search for and sMiddle would be any specific pattern inside that tag that made it uniquely identifiable.
In this case though I was searching and modifying form fields so knowing which one I was searching for was more important.
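
A PHP counterpart of that JavaScript might be sketched as below. The function name replace_tag and the sample form are hypothetical; the three pattern arguments are regex fragments, as in the JavaScript version.

```php
<?php
// Hypothetical PHP counterpart of the JavaScript snippet above.
// $start, $middle and $end are regex fragments, as in the original.
function replace_tag($html, $start, $middle, $end, $new)
{
    // "~" is the delimiter so fragments may contain "/" unescaped
    // (a fragment containing "~" itself would need escaping).
    $pattern = '~' . $start . '[^>]*?' . $middle . '.*?' . $end . '~is';
    // A callback avoids "$1"-style surprises in the replacement text;
    // the limit of 1 mirrors JavaScript's replace() without the "g" flag.
    return preg_replace_callback($pattern, function () use ($new) {
        return $new;
    }, $html, 1);
}

$form = '<input type="text" name="first"><input type="text" name="last">';
$result = replace_tag($form, '<input', 'name="last"', '>', '<input type="hidden" name="last">');
echo $result, "\n";
```

Unlike the JavaScript original, the s modifier here lets ".*?" span line breaks.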

In the PHP project though I will have to drill into the body tag to check for values also. The main reason being that if the user set a background color or image I have to move that setting up to my own body tag rather than losing it.
I do not think the task will be extremely difficult but I am sure I will run into plenty of issues I have not yet thought of.
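
Reading the settings off the body tag could be sketched along these lines. The helper name body_attributes is made up, and the sketch assumes attribute values are quoted.

```php
<?php
// Hypothetical helper: returns the attributes of the first <body> tag
// as a name => value array (names lower-cased). Assumes quoted values.
function body_attributes($html)
{
    $attributes = array();
    if (!preg_match('/<body\b([^>]*)>/i', $html, $tag)) {
        return $attributes; // no body tag found
    }
    // name = "value" or name = 'value'; \2 matches the same quote that opened the value
    preg_match_all('/([\w-]+)\s*=\s*(["\'])(.*?)\2/s', $tag[1], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[strtolower($pair[1])] = $pair[3];
    }
    return $attributes;
}

$html = '<html><body BGCOLOR="#ffffcc" background="images/paper.gif"><p>Hi</p></body></html>';
$attrs = body_attributes($html);
print_r($attrs);
```

Values such as bgcolor or background could then be copied onto the template's own body tag.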

theniteowl · Aug 25, 2006

Oh, off the subject but I do not get responses in the Apache forum.
Do you know if/how an .htaccess file can be placed in the root folder so that it affects all subfolders but does not apply to the files in the root itself?

My .htaccess file applies a RewriteRule when requests go to the subfolders, but I do not want this to occur in the main folder. Consequently I have to either keep a copy of the .htaccess file in each folder off the root or add another folder level under the root and place all subfolders within it. That just opens up the potential for them to put files in the wrong location, so that the rewrite rule never fires and the template breaks.

Perhaps the .htaccess file can test whether the requested URL is in the root and, if so, stop processing so that the rewrite never occurs?
