Lightweight Semantic Tags in HTML????

Status
Not open for further replies.

lichtjiang

Programmer
Feb 26, 2007
39
US
Calling it "lightweight semantic html tag" may be misleading. But this is what I mean.

In news article analysis applications, it is extremely important to extract only the story text, since template content such as advertisements can hurt the results (e.g. celebrity names that appear in a poll sidebar are irrelevant to the news the page reports, but they are hard to discard automatically without errors). But there is NO universal standard for marking the beginning and end of the body of a news story.

In Blog post analysis applications, it is obviously important to extract a single post from a Blog. But there is NO consistent tag or label for telling one post from another on a page, even for pages from the same blog hosting server.

In both cases, some heuristics or machine learning algorithms may be applied. BUT what if we had some standardized tags or labels for doing this? That is what I call "lightweight semantic html tags". Or are both problems too trivial? Any thoughts? Thanks a lot!
 
That would be called a different service. In fact, a lot of existing news web sites just put links to their html news pages in their RSS instead of using xml to describe the story text (if that's what I understood from both replies:). But the problem is, w/o an extra service, how do we label the different parts that can be seen on so many websites and that are important for web/text mining? Or maybe XHTML is powerful in this way? (I have not read the xhtml standard yet...)
 
RSS is XML
XHTML is XML

XML is just a way to mark something up to give context.

The RSS standard uses an XML format to mark up data to say, for instance...
"this is the story title"
"this is a short description"
"this is the link to the story"
etc

The long and the short of it is that it's just data with no presentational information. But the data is marked up to give it context.

This 'feed' is then published and made available for other sites to read.

When the other sites read the data they, themselves, wrap the data in HTML tags to present on their own pages.
This might be done by using a server side language to read the XML data and output HTML, or you might take the XML data and transform it into XHTML using XSLT.
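
As a rough illustration of the "server side language" route, here is a minimal Python sketch that reads the title, link and description of each item in a feed and wraps them in an HTML list. The feed content and the function names (read_items, items_to_html) are made up for the example, and the feed is inlined as a string so it runs without a network request; a real site would fetch the feed first.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without a network request.
RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First story</title>
      <link>http://example.com/stories/1</link>
      <description>Short summary of the first story.</description>
    </item>
    <item>
      <title>Second &amp; last story</title>
      <link>http://example.com/stories/2</link>
      <description>Short summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return the title, link and description of every <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def items_to_html(items):
    """Wrap the plain feed data in presentational HTML (an unordered list)."""
    rows = []
    for it in items:
        rows.append('<li><a href="%s">%s</a>: %s</li>' % (
            escape(it["link"], quote=True),
            escape(it["title"]),
            escape(it["description"]),
        ))
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


items = read_items(RSS)
html_out = items_to_html(items)
print(html_out)
```

The point is that the feed itself carries no layout: the consuming site decides how to present it, and the markup in the feed is what tells it which piece is the title and which is the link.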

<honk>*:O)</honk>

Earl & Thompson Marketing - Marketing Agency Services in Gloucestershire
 
I totally appreciate the idea of representing all data in XML, since HTML is for view layout while xml is for data description. But here is the question. We, programmers or users, can only see webpages in html (e.g. news stories on NYTimes). Without any XML-like label, it is difficult for us to restore the raw data (like the story text only or a single Blog post), a difficulty that seems unnecessary. The situation is: data is stored somewhere in XML, and news websites read it and wrap it in html without labeling it. The bottom line of my curiosity is then, how to 100% correctly extract a single post from a Blog or a news story text from a story page?
 
lichtjiang said:
I totally appreciate the idea of representing all data in XML, since HTML is for view layout while xml is for data description. But here is the question. We, programmers or users, can only see webpages in html (e.g. news stories on NYTimes). Without any XML-like label, it is difficult for us to restore the raw data (like the story text only or a single Blog post), a difficulty that seems unnecessary. The situation is: data is stored somewhere in XML, and news websites read it and wrap it in html without labeling it. The bottom line of my curiosity is then, how to 100% correctly extract a single post from a Blog or a news story text from a story page?
Bottom Line: XSL

Never be afraid to share your dreams with the world.
There's nothing the world loves more than the taste of really sweet dreams.

Enable Apps
 
Thanks. But I am new to XSL. I agree that such a technique is good, but it didn't answer my problem. Or maybe I didn't make my question clear.

I found it is not trivial to extract a single post from a Blog web page or a news story from a news website. It is common for a Blog web page to contain more than one post. On a news page, there are advertisements and poll bars, among other things. In either case, there is no consistent label to mark the boundary of what we are interested in: a single post updated today, or the text written by a journalist on a news event.

So, with such difficulties, I want to know how to do this kind of extraction nicely. And I wonder why there is no standard way of labeling the different parts of a webpage across websites in the same category (news, or blog hosting). Or maybe I am not aware of good solutions. Thanks!
 
I found it is not trivial to extract a single post from a Blog web page or a news story from a news website.
That really depends on the website in question. If they provide an RSS feed then it is easy. If they don't then perhaps they don't want you to be able to extract data from their page. You could of course create a web request from your server-side language and parse the results, but this is prone to error (especially if the website changes its output). You would also need permission from the website to use this method.
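
To show why the parsing route is fragile, here is a minimal Python sketch that pulls the text out of one container on a page. Everything about the page is an assumption: the markup is invented, and the id "story" stands in for whatever one particular site's template happens to use. The page is inlined as a string; in practice it would come from a web request, subject to the permission point above.

```python
from html.parser import HTMLParser

# Hypothetical page; the id "story" is an assumption about one site's template.
PAGE = """<html><body>
<div id="poll">Who is your favourite celebrity?</div>
<div id="story"><h1>Headline</h1><p>First paragraph.</p>
<div class="inner"><p>Second paragraph.</p></div></div>
<div id="ads">Buy now!</div>
</body></html>"""


class StoryExtractor(HTMLParser):
    """Collect the text found inside <div id="..."> and ignore everything else."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0    # <div> nesting depth inside the container; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1            # a nested <div> inside the story
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1             # entering the story container

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_story(page, container_id="story"):
    parser = StoryExtractor(container_id)
    parser.feed(page)
    parser.close()
    return "\n".join(parser.chunks)


story = extract_story(PAGE)
print(story)
```

The poll and the advert are skipped only because the code knows the container's id in advance. If the site renames that id or restructures the template, the function quietly returns an empty string, which is the kind of error described above.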


____________________________________________________________

Need help finding an answer?

Try the Search Facility or read FAQ222-2244 on how to get better results.

 
Thanks for all helpful answers.

Yes, my doubt originates from what is usually called "Screen Scraping" or "Web Scraping" or "web page scraping"...

Most (is there an exception?) websites consider that the content they provide should be proprietary and thus mix it with other stuff in a web page without any explicit markup for the content. In fact, on tek-tips.com, if we do not consider the possibility of querying the database for each thread or post, is it possible to nicely extract a single post without errors? (I have no intention of doing this, so I didn't read this page's html ^_^) Even worse, some websites are so anti-web-scraping that they may change their page layout from time to time or block any automated access to protect their valuable business.

So, "Semantic web" may be of help to programmers. But I doubt it will be welcomed by business people. So, if this world were technically oriented, there would be fewer unnecessary headaches, such as finding methods to do web page scraping:)
 
What you're actually doing is proposing a new microformat.

Maybe one day a microformat for news stories can be worked out. Maybe on another day (probably many years after the first one) a significant number of news sources will use it. Good luck with that.
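
For what it's worth, the hAtom draft microformat already proposes class names along these lines for blog-style entries (hentry, entry-title, entry-content). Assuming a page used them, and assuming the markup is well-formed XHTML, extraction could be as simple as this Python sketch. The page fragment and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML fragment marked up with hAtom class names.
XHTML = """<div class="hfeed">
  <div class="sidebar">Poll: pick a celebrity</div>
  <div class="hentry">
    <h2 class="entry-title">First post</h2>
    <div class="entry-content"><p>Body of the <em>first</em> post.</p></div>
  </div>
  <div class="hentry featured">
    <h2 class="entry-title">Second post</h2>
    <div class="entry-content"><p>Body of the second post.</p></div>
  </div>
</div>"""


def has_class(element, name):
    """True if the element's space-separated class attribute includes name."""
    return name in element.get("class", "").split()


def text_of(element):
    """All text inside the element, with runs of whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def extract_entries(xhtml_text):
    """Return one {title, content} dict per element carrying the hentry class."""
    root = ET.fromstring(xhtml_text)
    entries = []
    for node in root.iter():
        if not has_class(node, "hentry"):
            continue
        title = next((text_of(e) for e in node.iter()
                      if has_class(e, "entry-title")), "")
        content = next((text_of(e) for e in node.iter()
                        if has_class(e, "entry-content")), "")
        entries.append({"title": title, "content": content})
    return entries


entries = extract_entries(XHTML)
for entry in entries:
    print(entry["title"], "->", entry["content"])
```

Nothing here depends on one site's template, only on the agreed class names, which is exactly why it only helps once publishers actually adopt them.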

Till then, RSS is your best bet.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
 
Most (is there an exception?) websites consider that the content they provide should be proprietary and thus mix it with other stuff in a web page without any explicit markup for the content
And so they should.
Just because content is made public via a web interface does NOT mean it is public domain to be scraped or otherwise stolen by anyone who thinks they have a right to use it.



Chris.

Indifference will be the downfall of mankind, but who cares?
Woo Hoo! the cobblers kids get new shoes.
People Counting Systems

So long, and thanks for all the fish.
 
I tend to agree with ChrisHunt.

Web content isn't mixed up with other stuff to stop you scraping, it's just how it is.

If I wanted someone to be able to use data from my site then I'd publish an XML feed of some sort to allow them to use the data.

I ask why, if you were aware of the term 'screen scraping' you didn't just use that term?

<honk>*:O)</honk>

Earl & Thompson Marketing - Marketing Agency Services in Gloucestershire
 
Foamcow, I didn't intend to hide "screen scraping" from previous discussion of my question. It is from your reply that I realized that there is something commonly known as "screen scraping" or in my problem better called "web page scraping":)

But anyway, my question comes from it but goes beyond it, and I think the "semantic web" is not always bad for business people in terms of proprietary data protection. I agree there are some privacy or even legal issues (?) regarding extracting data from a web site. But for research purposes this cannot be avoided, and any step towards helping researchers obtain clean data nicely is appreciated, in my opinion. A look at what google and blogpulse bring to our daily lives may throw some light on this. Any kind of information extraction application would be pushed forward accordingly ...

 