Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

query for match in text with "interruptions" (regex related) 1

Status
Not open for further replies.

OsakaWebbie

Programmer
Feb 11, 2003
628
0
0
JP
Yeah, I know the subject line is cryptic... I have a database that keeps song lyrics with chords interspersed within square brackets, like this:
Code:
[D]Thank you, [D/F#]O my [G]Fa[D]ther, for [G]giving us [A]Your [D  A]Son
And [D]leav[A/E]ing [D/F#]Your [G]Spirit [D/F#]till the [Em]work on [A]earth is [D]done
I allow the user to search for text in the lyrics, so I have to somehow get past the chords. The code in use has been there for quite a while, and I thought it worked, but I recently discovered that it doesn't always. Since MySQL doesn't have a way to remove the bracketed content on the fly using regex, I was attacking it from the opposite direction, using PHP to create a regex expression from the search string, like this (search string is "leaving" in this example):
Code:
WHERE LOWER(Lyrics) REGEXP '(\\[[^\\]]*\\])?l(\\[[^\\]]*\\])?e(\\[[^\\]]*\\])?a(\\[[^\\]]*\\])?v(\\[[^\\]]*\\])?i(\\[[^\\]]*\\])?n(\\[[^\\]]*\\])?g'
That technique seems to work if the chord(s) within the matching section of lyrics are only a single letter (e.g. "o my father", "and leav", and "work on earth" work fine), but if there are multiple characters inside the brackets, the matching record is not returned. I thought the "*" in my regex would take care of that, but it doesn't seem to. Does someone see what I'm missing, or have a better approach?

It would be cleaner with something like "WHERE LOWER(REGEXP_REPLACE(Lyrics,'(\\[[^\\]]*\\])','')) LIKE '%leaving%'", but although there seem to be two offerings of a UDF for regex replace out there (here and here), both are very old and not maintained, and I don't know if they will even work with newer versions of MySQL. Plus, one requires recompiling MySQL and the other slightly less invasive but still requires compiling itself and has dependencies. If you know that one or both of those do work okay, or know of something else, let me know. I would prefer to use what is already a part of the engine, though, if I can figure out how.
 
possibly not what you want to hear, but the solution is to store the lyrics in two columns, one with chords and one pure text

r937.com | rudy.ca
Buy my new book Simply SQL from Amazon
 
Hi

Indeed, regular expression based replace would be ideal.

Even regular expression based matching would be helpful, if it would provide either the matched substring or the match position.

But this can also be solved with the regular string functions, just abit lengthier :
Code:
delimiter $$
[b]create[/b] [b]function[/b] stripchord[teal]([/teal]what [maroon]text[/maroon][teal])[/teal]
returns [maroon]text[/maroon]
deterministic
no sql
[b]begin[/b]
  [b]declare[/b] ch [maroon]varchar[/maroon][teal]([/teal][purple]1[/purple][teal]);[/teal]
  [b]declare[/b] i[teal],[/teal]start [maroon]integer[/maroon] [b]default[/b] [teal]-[/teal][purple]1[/purple][teal];[/teal]

  [b]if[/b] what [b]is[/b] [b]null[/b] [b]then[/b]
    [b]return[/b] [b]null[/b][teal];[/teal]
  [b]end[/b] [b]if[/b][teal];[/teal]

  [b]set[/b] i[teal]=[/teal][purple]1[/purple][teal];[/teal]
  l00p[teal]:[/teal] [b]loop[/b]
    [b]set[/b] ch[teal]=[/teal]mid[teal]([/teal]what[teal],[/teal]i[teal],[/teal][purple]1[/purple][teal]);[/teal]
    [b]if[/b] ch[teal]=[/teal][green][i]'[/i][/green][lime][i]\\[/i][/lime][green][i]'[/i][/green] [b]then[/b]
      [b]set[/b] what[teal]=[/teal]concat[teal]([/teal]left[teal]([/teal]what[teal],[/teal]i[teal]-[/teal][purple]1[/purple][teal]),[/teal]mid[teal]([/teal]what[teal],[/teal]i[teal]+[/teal][purple]1[/purple][teal]));[/teal]
      [b]set[/b] i[teal]=[/teal]i[teal]+[/teal][purple]1[/purple][teal];[/teal]
    [b]elseif[/b] ch[teal]=[/teal][green][i]'['[/i][/green] [b]then[/b]
      [b]set[/b] start[teal]=[/teal]i[teal];[/teal]
    [b]elseif[/b] ch[teal]=[/teal][green][i]']'[/i][/green] [b]and[/b] start[teal]!=-[/teal][purple]1[/purple] [b]then[/b]
      [b]set[/b] what[teal]=[/teal]concat[teal]([/teal]left[teal]([/teal]what[teal],[/teal]start[teal]-[/teal][purple]1[/purple][teal]),[/teal]mid[teal]([/teal]what[teal],[/teal]i[teal]+[/teal][purple]1[/purple][teal]));[/teal]
      [b]set[/b] i[teal]=[/teal]start[teal];[/teal]
      [b]set[/b] start[teal]=-[/teal][purple]1[/purple][teal];[/teal]
    [b]end[/b] [b]if[/b][teal];[/teal]

    [b]set[/b] i[teal]=[/teal]i[teal]+[/teal][purple]1[/purple][teal];[/teal]
    [b]if[/b] i[teal]>[/teal]length[teal]([/teal]what[teal])[/teal] [b]then[/b]
      [b]leave[/b] l00p[teal];[/teal]
    [b]end[/b] [b]if[/b][teal];[/teal]
  [b]end[/b] [b]loop[/b][teal];[/teal]

  [b]return[/b] what[teal];[/teal]
[b]end[/b]
$$
delimiter [teal];[/teal]

[b]select[/b]
[teal]*[/teal]

[b]from[/b] osakawebbie

[b]where[/b] lower[teal]([/teal]stripchord[teal]([/teal]lyrics[teal]))[/teal] [b]like[/b] [green][i]'%father%'[/i][/green][teal];[/teal]
Tested with MySQL 5.5.25.

Some notes :
[ul]
[li]removes substrings between [ and ][/li]
[li]from enclosed [ ] pairs removes only the inner one[/li]
[li]leaves single and unbalanced [ and ] unchanged[/li]
[li]leaves backslash escaped \[ and \] unchanged[/li]
[li]removes backslashes[/li]
[/ul]
Code:
[small]mysql> select * from osakawebbie;
+------------------------------------------------------------------------+
| lyrics                                                                 |
+------------------------------------------------------------------------+
| zero] one[two[three]four]five [six] \[seven] [eight\] \[nine\] \\ [ten |
+------------------------------------------------------------------------+
1 row in set (0.04 sec)

mysql> select stripchord(lyrics) from osakawebbie;
+-------------------------------------------------------+
| stripchord(lyrics)                                    |
+-------------------------------------------------------+
| zero] one[twofour]five  [seven] [eight] [nine] \ [ten |
+-------------------------------------------------------+
1 row in set (0.00 sec)[/small]

Feherke.
[link feherke.github.com/][/url]
 
That was beautiful! I had some confusion at first because when I copy/pasted from the forum, the linefeeds disappeared, and I didn't realize that they matter (in regular queries linefeeds are not necessary) - I was getting no errors, but no function either. But when I put the linefeeds back in, it worked like a charm. I haven't tested it completely, but so far it's doing exactly what it should.

Your caveats are fine - there are not supposed to ever be nested brackets, unbalanced brackets, etc., and if there are by mistake, it's fine that it doesn't strip anything. I also see uses for this function in other places in my code where currently I'm doing the stripping in PHP - it will get executed a lot!
 
Upon further inspection, I'm not as elated as I was. It works if you are patient, but it's really slow. Is there anything I can do to speed it up?

I tried adding a couple indexes that probably should have been there all along, but it didn't help much if at all, and that's not surprising - the bottleneck is dismantling the Lyrics text, which has nothing to do with indexing.

FYI, there are 449 rows in the table, and the Lyrics column totals less than 250kB of text. A query like "SELECT * FROM pw_song WHERE LOWER(stripchord(Lyrics)) LIKE '%foundation%'" (which will return no rows) takes about 10 seconds.
 
Hi

OsakaWebbie said:
It works if you are patient, but it's really slow. Is there anything I can do to speed it up?
Sorry to hear that. I am afraid there is no way to considerably speed it up. However, given that "there are not supposed to ever be nested brackets, unbalanced brackets, etc.", we can cut off some of the operations and use [tt]locate()[/tt] instead of [tt]loop[/tt]ing character by character. This should improve the performance abit, but probably not enough :
Code:
delimiter $$
[b]create[/b] [b]function[/b] stripchord[teal]([/teal]what [maroon]text[/maroon][teal])[/teal]
returns [maroon]text[/maroon]
deterministic
no sql
[b]begin[/b]
  [b]declare[/b] start[teal],[/teal]stop [maroon]integer[/maroon] [b]default[/b] [purple]1[/purple][teal];[/teal]

  [b]if[/b] what [b]is[/b] [b]null[/b] [b]then[/b]
    [b]return[/b] [b]null[/b][teal];[/teal]
  [b]end[/b] [b]if[/b][teal];[/teal]

  l00p[teal]:[/teal] [b]loop[/b]

    [b]set[/b] start[teal]=[/teal]locate[teal]([/teal][green][i]'['[/i][/green][teal],[/teal]what[teal],[/teal]start[teal]);[/teal]
    [b]if[/b] start[teal]=[/teal][purple]0[/purple] [b]then[/b]
       [b]leave[/b] l00p[teal];[/teal]
    [b]end[/b] [b]if[/b][teal];[/teal]

    [b]set[/b] stop[teal]=[/teal]locate[teal]([/teal][green][i]']'[/i][/green][teal],[/teal]what[teal],[/teal]start[teal]);[/teal]
    [b]if[/b] stop[teal]=[/teal][purple]0[/purple] [b]then[/b]
       [b]leave[/b] l00p[teal];[/teal]
    [b]end[/b] [b]if[/b][teal];[/teal]

    [b]set[/b] what[teal]=[/teal]concat[teal]([/teal]left[teal]([/teal]what[teal],[/teal]start[teal]-[/teal][purple]1[/purple][teal]),[/teal]mid[teal]([/teal]what[teal],[/teal]stop[teal]+[/teal][purple]1[/purple][teal]));[/teal]

  [b]end[/b] [b]loop[/b][teal];[/teal]

  [b]return[/b] what[teal];[/teal]
[b]end[/b]
$$
delimiter [teal];[/teal]

Feherke.
[link feherke.github.com/][/url]
 
You were skeptical, but actually that made a lot of difference - my simple test query (single test, so numbers are rough) went from 9.89sec to 0.56sec, and my full interface (more complex queries) responds well also. You get another star for that - thanks!
 
Hi

Grr ! Actually the later function was the one I wrote first, but considered it too rigid and then I wrote the "improved" one I posted first. Sorry for making you run extra rounds.

In meantime I was thinking more on this. For the simple case of [tt]lower[teal]([/teal]stripchord[teal]([/teal]lyrics[teal]))[/teal] like [green]'%father%'[/green][/tt] this approach is an overkill : stripchord() removes all chords ( 15 in your example ) before attempting to [tt]like[/tt], while would be enough to remove until the first match ( 4 in your example, when searching for "father" ). This could be solved with a function which does the [tt]like[/tt]ing itself after each n chord removals, like [tt]stripchord[teal]([/teal]lower[teal]([/teal]lyrics[teal]),[/teal][green]'%father%'[/green][teal])[/teal][/tt]. But this is affected by the way your PHP script is building up the [tt]where[/tt] clause from the search expression entered by the used. So if you would like to try such further improving attempt, I need to know more about the building of the query condition. ( Note that the above described approach may be a dead end due to MySQL's poor language features. )

Feherke.
[link feherke.github.com/][/url]
 
Thanks, but that's okay. I do have some complex criteria that can get added to the query depending on the user's situation - it might or might not have any conflict with your idea of bowing out at the first match, but I don't mind the current performance, so it's probably not worth tweaking it further. Plus, I have other uses for a function that can return a whole song's lyrics only (currently doing that in PHP, but this is nice and clean).

So you can save your extra effort for the next problem I have! [wink]
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top