lichtjiang
Programmer
Calling it a "lightweight semantic HTML tag" may be misleading, but here is what I mean.
In news article analysis applications, it is extremely important to extract only the story text, because template content such as advertisements can hurt the analysis (e.g. celebrity names that appear in a poll sidebar are irrelevant to the news the page reports, but they are hard to discard automatically without errors). But there is NO universally standard way to label the beginning and end of a news story's body.
In blog post analysis applications, it is obviously important to extract a single post from a blog. But there is NO consistent tag or label for telling one post from another on a page, even for pages from the same blog hosting server.
In both cases, heuristics or machine learning algorithms can be applied. BUT what if we had standardized tags or labels for doing this? That is what I would call "lightweight semantic HTML tags". Or are both problems too trivial? Any thoughts? Thanks a lot!
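To make the idea concrete, here is a minimal sketch of how simple extraction could become if such a label existed. The `<story>` tag, the `TaggedTextExtractor` class, the `extract_blocks` function, and the sample page are all hypothetical, made up purely for illustration; only Python's standard `html.parser` module is real.

```python
from html.parser import HTMLParser


class TaggedTextExtractor(HTMLParser):
    """Collect the text found inside a given content tag.

    The tag name is an assumption: no such standard label is implied here.
    """

    def __init__(self, content_tag):
        super().__init__()
        self.content_tag = content_tag
        self.depth = 0       # nesting depth inside content_tag
        self.blocks = []     # one cleaned string per tagged block
        self._current = []   # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == self.content_tag:
            if self.depth == 0:
                self._current = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.content_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._current).split())
                self.blocks.append(text)

    def handle_data(self, data):
        # Text outside the content tag (ads, sidebars) is ignored.
        if self.depth > 0:
            self._current.append(data)


def extract_blocks(html, content_tag="story"):
    """Return the text of every content_tag block in the page."""
    parser = TaggedTextExtractor(content_tag)
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical page mixing template content with two labeled posts.
page = """
<html><body>
<div class="sidebar">Poll: which celebrity do you like best?</div>
<story><p>The city council approved the new budget on Monday.</p></story>
<div class="ad">Buy one, get one free!</div>
<story><p>A second post, on the same page.</p></story>
</body></html>
"""

blocks = extract_blocks(page)
for block in blocks:
    print(block)
```

With a label like this, the poll sidebar and the advertisement are dropped without any heuristics, and each post on the page comes back as its own block.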