Offset Plain- and HTML-text indices

Kirsle · Aug 24, 2007

Hey.

I'm working a little more on my Tk::HyperText module, and decided that it was high time I fixed some of the limitations of it, specifically dealing with indices for the insert() and delete() methods.

e.g. on Tk::Text and Tk::ROText you can do

Code:

$widget->insert ('0', 'something'); # insert 'something' at the very beginning of the contents of the text

$widget->insert (10, 'something else'); # insert 'something else' before the 10th character in the textbox

$widget->insert ('end', 'another thing'); # insert 'another thing' at the end of the text, wherever that is

All that Tk::HyperText has been able to do thus far is use 'end', but now I'm working on trying to be able to use any index the programmer wants.

The Problem

The module internally stores all of the HTML code of the text that's displayed in the widget. The widget itself only displays plain text (formatted with the use of text tags). Because of this, the index of, say, the 10th character in the displayed text, will not line up with the 10th character in the HTML code.

Example:

Code:

0123456789012345678901234567890
<b><em>Hello, world!</em></b>
Hello, world!

The plain text that would be seen in the textbox doesn't line up with the HTML text behind it. So, while indices 4 and 5 are "o," in the text widget, the same indices in the actual HTML are "em" (the inside of the <em> tag).

And since adding or removing HTML code relies on the modification of the actual HTML code within, there has to be some way to line up the HTML with the plain text.
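To make the mismatch concrete, here is a minimal sketch using the same "Hello, world!" string. The variable names are only for illustration and are not part of Tk::HyperText.

```perl
use strict;
use warnings;

my $html  = '<b><em>Hello, world!</em></b>';
my $plain = 'Hello, world!';

# The same numeric index points at different characters in each string.
my $plain_char = substr($plain, 0, 1);    # first visible character
my $html_char  = substr($html,  0, 1);    # first character of the markup

# Where the first visible character really lives in the HTML.
my $html_offset = index($html, $plain_char);

print "plain[0] = '$plain_char', html[0] = '$html_char'\n";
print "'$plain_char' sits at HTML offset $html_offset\n";
```

Index 0 is "H" in the plain text but "<" in the HTML; the "H" is actually seven characters in, after the two opening tags.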

A Messy Solution

The best solution I could come up with is to use arrays and loop through, trying to keep a running map of how the indices relate to each other. Basically:

1. We get the index in the plain text that we want.
2. Split all of the HTML up into an array of individual characters.
3. Loop through the HTML array. Anytime we're in a tag, skip this character; else, map this character's literal array position in the HTML text with a running total that will line it up with the plain text.
4. Convert the plain text index into the corresponding HTML index by using this map we generated.

Here's the demonstration code I made for this:

Code:

#!/usr/bin/perl -w

use strict;
use warnings;
use Data::Dumper;

my $html = "<b>Abcdef, <font color=\"blue\">ghijk</font></b>";
#my $plain = "Abcdef, ghijk";
my $plain = $html;
$plain =~ s/<(.|\n)+?>//ig;

# Ask the user to pick a character index from the plain text.
print "HTML String: $html\n";
print "Plain String: $plain\n"
	. "Select index (0.." . ((length $plain) - 1) . ")> ";
chomp (my $rnd = <STDIN>);
my $char = substr($plain,$rnd,1);

print "Selected character: $char (index: $rnd)\n";

# Try to match up this index in the HTML.
my @chars = split(//,$html);
my @new = ();
my $map = {};   # plain-text index (0-based) => HTML position (1-based)
my $inTag = 0;
my $i = 0;      # running index into the plain text
my $j = 0;      # running position in the HTML; incremented before use, so 1-based
foreach (@chars) {
	$j++;
	if ($_ eq '<') {
		$inTag++;
		next;
	}
	elsif ($_ eq '>') {
		$inTag--;
		$inTag = 0 if $inTag < 0;
		next;
	}

	if ($inTag == 0) {
		#print "DBG: inTag=0; chr=$_\n";
		push (@new,$_);
		$map->{$i} = $j;
		$i++;
	}
}

#print "HTML-Stripped Array: " . join ("; ", @new) . "\n";

print "Corresponding HTML index for our index $rnd: $map->{$rnd}\n";

# print "Dump of \$map: " . Dumper($map) . "\n\n";

# See if we get the same character (subtract 1 because the map is 1-based).
print "Character from HTML index $rnd ($map->{$rnd}): " . substr($html,$map->{$rnd} - 1,1);
print "\n";


And here's some example output:

Code:

[kirsle@upsilon ~]$ perl indextest.pl
HTML String: <b>Abcdef, <font color="blue">ghijk</font></b>
Plain String: Abcdef, ghijk
Select index (0..12)> 1
Selected character: b (index: 1)
Corresponding HTML index for our index 1: 5
Character from HTML index 1 (5): b

[kirsle@upsilon ~]$ perl indextest.pl
HTML String: <b>Abcdef, <font color="blue">ghijk</font></b>
Plain String: Abcdef, ghijk
Select index (0..12)> 0
Selected character: A (index: 0)
Corresponding HTML index for our index 0: 4
Character from HTML index 0 (4): A

[kirsle@upsilon ~]$ perl indextest.pl
HTML String: <b>Abcdef, <font color="blue">ghijk</font></b>
Plain String: Abcdef, ghijk
Select index (0..12)> 12
Selected character: k (index: 12)
Corresponding HTML index for our index 12: 35
Character from HTML index 12 (35): k

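So, this solution seems to work, but it seems kinda messy. Every lookup splits the whole HTML string into characters and walks all of them, so the cost grows with the size of the document. Imagine using it when you already have a whole ton of HTML code: every insert() or delete() would slow down significantly.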
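For comparison, one variation that avoids the per-character array is to let the regex engine skip whole tags and whole runs of text in a single pass, and to keep the resulting map around until the HTML changes. This is only a sketch under assumptions: `build_index_map` is a hypothetical helper (not part of Tk::HyperText), it returns 0-based HTML offsets (so one less than the numbers printed by the script above), it treats a stray ">" in text as a visible character, and it stops at an unterminated "<".

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref whose element N is the
# 0-based HTML offset of plain-text character N.
sub build_index_map {
    my ($html) = @_;
    my @map;
    # Walk the string once, matching either a whole tag or a run of text.
    while ($html =~ /\G(?:(<[^>]*>)|([^<]+))/gc) {
        next if defined $1;             # a whole tag: nothing visible
        # @- and @+ hold the start and end offsets of capture group 2.
        push @map, $-[2] .. $+[2] - 1;
    }
    return \@map;
}

my $html = '<b>Abcdef, <font color="blue">ghijk</font></b>';
my $map  = build_index_map($html);

# Look up a few plain-text indices and show the character found there.
for my $plain_index (0, 1, 12) {
    my $html_index = $map->[$plain_index];
    printf "plain %2d -> html %2d (%s)\n",
        $plain_index, $html_index, substr($html, $html_index, 1);
}
```

Building the map is still linear in the length of the HTML, but each lookup afterwards is a plain array access, so the map only needs rebuilding when the stored HTML is modified.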

The Question

Is there a more efficient way to do this?

-------------
Cuvou.com | My personal homepage
Project Fearless | My web blog

MillerH · Aug 24, 2007

Actually, the first question I would ask is why are you trying to do this? I can definitely see the benefit of having HTML formatting for a widget. Most people know HTML to a good degree of proficiency, so it makes sense to create this functionality as an alternate means of creating formatting.

However, why would someone need to edit the formatting in such a way? Especially by inserting based off of character position? What issues did you come by that made you think that this would actually be a need?

I haven't used Tk yet. However, I have to imagine that most formatting and text is simply static, especially if there is going to be more complex formatting. If the goal is to allow some of the text to be dynamic, there are other ways to go about it than string positions.

1) Allow the inclusion of div tags that have id attributes. Then allow the replacement of the text of the div tags just like is done with ajax style programming.

2) Allow the inclusion of a plain Tk::Text object in your HyperText widget. Whenever the text of the Tk::Text object is changed, then obviously the HyperText widget should also update.

Either of these methods seems more intuitive and useful than inserting text at a relative position in the string.

Regards,
- Miller

Kirsle · Aug 24, 2007

The reason for doing it is that all the other Tk::Text widgets can do it. In practice, I've only ever used "0.0" and "end" when inserting, but it's possible to insert text anywhere you want.

For instance, getting the current position of the insertion cursor and inserting something at that position.

So, getting Tk::HyperText to conform to the standards of the other text widgets will also put it one step closer to becoming useful for an editable HTML text widget, like a simple WYSIWYG editor.

One of the eventual goals behind creating this widget to begin with is a Perl/Tk clone of AOL Instant Messenger 5.x. I liked that version of AIM, AOL doesn't seem too interested in Linux support, Gaim/Pidgin changes everything drastically with every minor version release, and I didn't care much for Kopete either.

And if an IM client uses HTML text to show the conversation history, it would also make sense to have an editable HTML box for sending those messages.

(Also, embedding of text widgets inside of text widgets seems to be outside the scope of what Tk::Text allows you to do, as far as I've seen in my experimentation; I tried once to add support for <table> by inserting a Pane, and inside the Pane insert more Tk::HyperText widgets in a gridded geometry. Didn't work out too well. I could get the Pane to work, but couldn't do anything inside the Pane. Also, in an attempt to add <iframe> support, I tried a direct Tk::HyperText-within-a-Tk::HyperText method, which also failed miserably).


MillerH · Aug 25, 2007

Well, I can understand that somewhat. But I think that your attempt to support all the functions of Tk::Text is misguided. Ultimately the only similarity between plain text and HTML is that they are both file formats. Some things that have meaning in plain text just do not have the same meaning in HTML. Relative textual position is one of those less meaningful concepts, both because of formatting tags and because all whitespace is collapsed except within specific elements. Trying to give that some adapted meaning is a waste of effort IMO, unless there is a specific goal that you're attempting to achieve.

If you truly want to add support for this just for backward compatibility, I would suggest that you be literal. Insert directly into the raw HTML. Do not give it some ad hoc meaning related only to the visible text within the HTML. Ultimately, I believe that is putting a lot of effort into something that is not all that useful. That is simply not how people use HTML.
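As a rough illustration of the literal approach, the sketch below splices text into the raw HTML at a given offset using Perl's four-argument substr. The `insert_raw` helper and its handling of 'end' are assumptions made for this example, not anything Tk::HyperText provides.

```perl
use strict;
use warnings;

# Hypothetical helper: splice $text into the raw HTML at a literal offset.
# 'end' (or an offset past the end) appends, as an assumed convenience.
sub insert_raw {
    my ($html, $index, $text) = @_;
    $index = length $html if $index eq 'end' || $index > length $html;
    substr($html, $index, 0, $text);   # 4-arg substr with length 0 inserts
    return $html;
}

my $html = '<b>Hello</b>';
print insert_raw($html, 3, 'Oh, '), "\n";          # offset 3 is just after <b>
print insert_raw($html, 'end', '<i>!</i>'), "\n";  # append to the markup
```

Note that a literal offset can just as easily land in the middle of a tag, so the caller is responsible for choosing an offset that keeps the markup well-formed.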

What would make more sense is if you could somehow access the elements inside your HyperText widget by an identifier (<div id="row1">). Accessing that identifier would return another HyperText widget that contained only the HTML within that div tag. Then implement the code so that any change to this returned widget would automatically update the parent HyperText widget.

Alternatively, support the inclusion of Tk::Text widgets within your HyperText widget. I realize that Tk::Text widgets do not allow inclusion, but they do allow embedded images and windows. So the idea of embedding is not completely foreign.

Anyway, feel free to take this advice with a grain of salt. As I said before, I have no experience with Tk. However, I simply feel that the more you treat your HyperText widget like actual hypertext, the better off you'll be and the more useful your code will become.

- Miller

PS,
I also strongly advise you to look into using HTML::Parser for your parsing instead of hard coding it yourself. Even better would be to create a DOM that you're attempting to support that you could document at the bottom of your module. One thing at a time though, I know.
