Parsing web data

Status
Not open for further replies.

dagoat10

Programmer
Jun 3, 2010
74
US
I have gotten very close to parsing a page that I read from another website, but I can't get the output to stop breaking at the wrong place. If the number is equal to the rank, it should emit a line break; otherwise it should keep reading the line.

Code:
<?php

$file = file("http://www.nfl.com/stats/categorystats?tabSeq=1&statisticPositionCategory=QUARTERBACK&season=2009&seasonType=REG");
$player_count = 1;

foreach ($file as $file_line => $files)
{
    $files = strip_tags($files);
    $count = $player_count+1;
    $chars = htmlspecialchars($files);

    $chars2 = explode(",", $chars);
    $chars3 = $chars2[0] . $chars2[1];

    if($file_line >= 1459)
    {
        
    if($chars3 == $count)
    {
        echo "Char:" . $chars . "Chars2 " . $chars3 . " Numb: " . $count;
        echo "<br>";
        $player_count++;
    }
        echo  htmlspecialchars($files);
    }

   
}

?>

so i get this :

8 Tony Romo DAL QB 347 550 63.1 34.4 4,483 8.2 280.2 26 Char: 9 Chars2 9 Numb: 9
9 203 36.9 80T 61 17 34 97.6 9 Tom Brady NE QB 371 565 65.7 35.3 4,398 7.8 274.9 28 13 214 37.9 81T 43 12 16 96.2 Char:10 Chars2 10 Numb: 10

instead of this:

8 Tony Romo DAL QB 347 550 63.1 34.4 4,483 8.2 280.2 26 9 203 36.9 80T 61 17 34 97.6 Char: 9 Chars2 9 Numb: 9

9 Tom Brady NE QB 371 565 65.7 35.3 4,398 7.8 274.9 28 13 214 37.9 81T 43 12 16 96.2 Char:10 Chars2 10 Numb: 10

what am i doing wrong?
 
You are assuming that line break characters exist between the records. On a web page this is normally not the case.

Rather, line breaks are denoted by <br> tags, or by other tags, such as divs, surrounding the text.

file() splits on traditional line break characters (\n or \r\n), while a web page can display a line break by several other methods. The source may contain no such characters at the points where the rendered page shows a break.

For instance:
Code:
line1<br>line2<br>line3<div>text1</div><div>text2</div>
This HTML would render as 5 lines of text with a "line break" between each, but reading it through file() would find no line breaks at all, so only one line is produced.
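To make that concrete, here is a minimal sketch using the sample markup above. It assumes the string holds no newline characters, and it uses explode("\n") as a stand-in for the way file() splits its input; splitting on the tags instead is only one possible approach.

```php
<?php
// The sample markup from the post above: five visual lines, no newline characters.
$html = "line1<br>line2<br>line3<div>text1</div><div>text2</div>";

// file() splits on newline characters, which is roughly what explode("\n") does here.
$byNewline = explode("\n", $html);

// Splitting on the tags instead recovers the five pieces of text.
$byTag = preg_split('/<[^>]+>/', $html, -1, PREG_SPLIT_NO_EMPTY);

echo count($byNewline) . "\n"; // 1
echo count($byTag) . "\n";     // 5
print_r($byTag);
```

So the tag boundaries, not the newline characters, are what separate the values in markup like this.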

How exactly is the website outputting that text? Perhaps an example of the output would help decide how to parse it.

----------------------------------
Phil AKA Vacunita
----------------------------------
 
this looks more like a job for preg_match_all

 
Line #1459 : 1
Line #1461 : Drew Brees
Line #1466 : NO
Line #1472 : QB
Line #1484 : 363
Line #1502 : 514
Line #1520 : 70.6
Line #1538 : 34.3
Line #1556 : 4,388
Line #1574 : 8.5
Line #1592 : 292.5
Line #1610 : 34
Line #1628 : 11
Line #1646 : 210
Line #1664 : 40.9
Line #1682 : 75T
Line #1700 : 58
Line #1718 : 11
Line #1736 : 20
Line #1754 : 109.6

(Lines #1459 to #1762 that are not listed above are empty.)


This is actually one entire record (one player's row); it only has breaks in it because of the way I parsed it.
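If the line-by-line approach is kept, one way to rebuild a record is to skip the empty lines and start a new output line after a fixed number of values. This is only a sketch: $lines is a made-up stand-in for the stripped lines, and $fieldsPerRecord is set to 5 to keep the sample short (the dump above suggests 20 values per player on the real page).

```php
<?php
// Hypothetical stand-in for the stripped lines: mostly empty, one value per line.
$lines = array("1", "", "Drew Brees", "", "", "NO", "", "QB", "", "363");
$fieldsPerRecord = 5; // assumption for this sample; the real table appears to have 20

$record  = array();
$records = array();
foreach ($lines as $line) {
    $value = trim($line);
    if ($value === '') continue;      // skip the blank lines between values
    $record[] = $value;
    if (count($record) == $fieldsPerRecord) {
        $records[] = $record;         // record complete: store it and start a new one
        $record = array();
    }
}

foreach ($records as $r) {
    echo implode(" ", $r) . "<br>\n"; // 1 Drew Brees NO QB 363<br>
}
```

Counting values is fragile if the page ever adds or drops a column, which is part of why matching the table markup directly may be the safer route.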
 
i assume you just want the tabular values from the middle of the page?

 
i have tried the following regular expressions and have had no luck

Code:
preg_match_all("|<[^>]+>(.*)</[^>]+>|", $chars, $out);

eregi("<(.+)>(.+)<(.+)>", $chars, $out);

$chars is my string that contains the html tags

what is wrong here?
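One possible culprit is greediness: (.*) grabs as much as it can, so several cells collapse into a single match. The sketch below compares the greedy pattern with a lazy one on a made-up two-cell string (the real page's markup may differ). Note also that in the original script $chars is built after strip_tags(), so it may no longer contain any tags to match at all.

```php
<?php
// Hypothetical two-cell sample; the real page's markup may differ.
$chars = "<td>1</td><td>Drew Brees</td>";

// Greedy: (.*) runs to the last closing tag, so the whole string is one match.
preg_match_all("|<[^>]+>(.*)</[^>]+>|", $chars, $greedy);

// Lazy: (.*?) stops at the first closing tag, giving one match per cell.
// The s modifier lets . match newlines, in case a cell spans several lines.
preg_match_all("|<[^>]+>(.*?)</[^>]+>|s", $chars, $lazy);

print_r($greedy[1]); // one element: "1</td><td>Drew Brees"
print_r($lazy[1]);   // two elements: "1" and "Drew Brees"
```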

 
I cheated a bit.
Code:
<?php

$file = file_get_contents("http://www.nfl.com/stats/categorystats?tabSeq=1&statisticPositionCategory=QUARTERBACK&season=2009&seasonType=REG");
// str_replace() returns the modified string, so the result has to be assigned
$file = str_replace("  ", "", $file);

// walk through the page and echo every <tr class=...> ... </tr> block
$end = 0;
while (($start = strpos($file, "<tr class=", $end)) !== false) {
    $end = strpos($file, "</tr>", $start);
    if ($end === false) break; // unterminated row: stop
    $row = substr($file, $start, $end - $start);
    echo $row . "<br>";
}

?>


Jim
 
even when i use just file and that code i get no results
 
i think that this code should work for you. as said, i think that preg_match is a better tool in this case.

Code:
<?php
//get the page contents
$file = file_get_contents("http://www.nfl.com/stats/categorystats?tabSeq=1&statisticPositionCategory=QUARTERBACK&season=2009&seasonType=REG");

//extract the table rows
preg_match_all('/(<tr.*?>.*?<\/tr>)/ims', $file, $matches);
$results = array();

//iterate the table rows and extract the table data
foreach ($matches[1] as $row){
	preg_match_all('/<td.*?>(.*?)<\/td>/ims', $row, $data);
	//clean the data up
	$holding = array_map('trim', $data[1]);
	//skip malformed rows (e.g. header rows without 20 <td> cells)
	if (count($holding) != 20) continue;
	//get rid of hyperlinks
	$holding[1] = trimLink($holding[1]);
	$holding[2] = trimLink($holding[2]);
	$results[] = $holding;
}
print_r($results);

/**
 * function to remove hyperlinks from text
 * @param string $string
 * @return string
 */
function trimLink($string){
	$pattern = '/<a.*?>(.*?)<\/a>/ims';
	//fall back to the plain text if the cell contains no link
	if (!preg_match($pattern, $string, $match)) return trim(strip_tags($string));
	return trim($match[1]);
}
?>
 
It does not work because I have PHP 4.1.2, and file_get_contents does not exist there (it was added in PHP 4.3.0).
 
just change it to

Code:
$cH = curl_init("http://www.nfl.com/stats/categorystats?tabSeq=1&statisticPositionCategory=QUARTERBACK&season=2009&seasonType=REG");
curl_setopt($cH, CURLOPT_RETURNTRANSFER, true);
$file = curl_exec($cH);
curl_close($cH);
... as for the original script
 
Wait, so do I use this before the previous code you provided, or before my code?
 
you use the above in place of the file_get_contents line.

Code:
<?php
//get the page contents
$cH = curl_init("http://www.nfl.com/stats/categorystats?tabSeq=1&statisticPositionCategory=QUARTERBACK&season=2009&seasonType=REG");
curl_setopt($cH, CURLOPT_RETURNTRANSFER, true);
$file = curl_exec($cH);
curl_close($cH);

//extract the table rows
preg_match_all('/(<tr.*?>.*?<\/tr>)/ims', $file, $matches);
$results = array();

//iterate the table rows and extract the table data
foreach ($matches[1] as $row){
    preg_match_all('/<td.*?>(.*?)<\/td>/ims', $row, $data);
    //clean the data up
    $holding = array_map('trim', $data[1]);
    //skip malformed rows (e.g. header rows without 20 <td> cells)
    if (count($holding) != 20) continue;
    //get rid of hyperlinks
    $holding[1] = trimLink($holding[1]);
    $holding[2] = trimLink($holding[2]);
    $results[] = $holding;
}
print_r($results);

/**
 * function to remove hyperlinks from text
 * @param string $string
 * @return string
 */
function trimLink($string){
    $pattern = '/<a.*?>(.*?)<\/a>/ims';
    //fall back to the plain text if the cell contains no link
    if (!preg_match($pattern, $string, $match)) return trim(strip_tags($string));
    return trim($match[1]);
}
?>
 
It does not work because I do not have cURL enabled. I could enable it if I were root, but I connect to the server through an SSH client and only have permission to install things under my own directory. So what do I do now?
 
php4 is no longer supported. it is really not a good idea to remain with a host that does not provide a maintained code base.

Code:
<?php
//get the page contents
$fH = fopen("http://www.nfl.com/stats/categorystats?tabSeq=1&statisticPositionCategory=QUARTERBACK&season=2009&seasonType=REG", "r");
$file = '';
while(!feof($fH)):
 $file .= fread($fH, 2048);
endwhile;
fclose($fH);

//extract the table rows
preg_match_all('/(<tr.*?>.*?<\/tr>)/ims', $file, $matches);
$results = array();

//iterate the table rows and extract the table data
foreach ($matches[1] as $row){
    preg_match_all('/<td.*?>(.*?)<\/td>/ims', $row, $data);
    //clean the data up
    $holding = array_map('trim', $data[1]);
    //skip malformed rows (e.g. header rows without 20 <td> cells)
    if (count($holding) != 20) continue;
    //get rid of hyperlinks
    $holding[1] = trimLink($holding[1]);
    $holding[2] = trimLink($holding[2]);
    $results[] = $holding;
}
print_r($results);

/**
 * function to remove hyperlinks from text
 * @param string $string
 * @return string
 */
function trimLink($string){
    $pattern = '/<a.*?>(.*?)<\/a>/ims';
    //fall back to the plain text if the cell contains no link
    if (!preg_match($pattern, $string, $match)) return trim(strip_tags($string));
    return trim($match[1]);
}
?>
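Since the thread went through three ways of fetching the page, here is a rough sketch that picks whichever one the installed PHP offers. pickFetchMethod() and fetchPage() are made-up helper names, not built-in functions. Keep in mind that file_get_contents() and fopen() can only read URLs when allow_url_fopen is enabled, and that file_get_contents() needs PHP 4.3.0 or later.

```php
<?php
// Hypothetical helper: report which fetch method this PHP build offers.
function pickFetchMethod() {
    if (function_exists('file_get_contents')) return 'file_get_contents';
    if (function_exists('curl_init')) return 'curl';
    return 'fopen';
}

// Hypothetical helper: fetch a URL with whichever method is available.
// Returns the page as a string, or false on failure.
function fetchPage($url) {
    $method = pickFetchMethod();
    if ($method == 'file_get_contents') {
        return file_get_contents($url);
    }
    if ($method == 'curl') {
        $cH = curl_init($url);
        curl_setopt($cH, CURLOPT_RETURNTRANSFER, true);
        $page = curl_exec($cH);
        curl_close($cH);
        return $page;
    }
    // last resort: read the URL as a stream (needs allow_url_fopen)
    $fH = fopen($url, "r");
    if (!$fH) return false;
    $page = '';
    while (!feof($fH)) {
        $page .= fread($fH, 2048);
    }
    fclose($fH);
    return $page;
}

// Show which method would be used on this server.
echo pickFetchMethod() . "\n";
```

With a helper like this, the line that fills $file in the scripts above would become $file = fetchPage($url); and the rest of the parsing stays the same.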
 