How to convert a cvc structure to an onset-nucleus-coda structure?

Status
Not open for further replies.

LovecraftHP

Programmer
Dec 3, 2002
15
CN
I have an input file in which each line holds a lemma, the CVC structure of that lemma, and its pronunciation, e.g.

abide,[V][CVVC],[@][baId]
abolish,[V][CV][CVC],[@][bO][lIS]


What I would like to do is convert this to output that gives the original lemma together with an onset-nucleus-coda structure, e.g.

abide,=,=,=,=,@,=,b,aI,d
abolish,=,@,=,b,O,=,l,I,S


I'm only interested in the final three syllables of each lemma, so the output for, e.g.,

accomodate,[V][CV][CV][CVVC],[@][kO][m@][deIt]

should only read

accomodate,k,O,=,m,@,=,d,eI,t

If there are only one or two syllables, the output should have ='s for the empty spaces, as in the example of "abide" above.

I've tried to get it to work, but I can't work out how to relate the CVC structure to the pronunciation, or how to get the program to disregard the square brackets.

Any help would be greatly appreciated.
 
convert:
abide,[V][CVVC],[@][baId]
abolish,[V][CV][CVC],[@][bO][lIS]
to:
abide,=,=,=,=,@,=,b,aI,d
abolish,=,@,=,b,O,=,l,I,S

VERY, VERY INTERESTING.

Try a 3128-line awk script,
or a 2567-line sed script (pardon, sed has a 200-line limit,
but you can use pipes (|)).
Perl does it in 1998.3 lines.
 
This is not quite right, but may get you started

BEGIN { FS="," }
{
  a = $3
  gsub(/\[/," ",a)
  gsub(/\]/," ",a)
  n = split(a,b," ")
  printf $1
  k = n-2
  if (k<1) {
    for (j1=1;j1<=3-n;j1++) printf ",=,="
    k=1
  }
  for (j1=k;j1<=n;j1++) {
    printf ",="
    l = length(b[j1])
    for (j2=1;j2<=l;j2++) printf "," substr(b[j1],j2,1)
  }
  print ""
}

It produces too many or too few "=" fields, and it splits the vowel pairs:

abide,=,=,=,@,=,b,a,I,d
abolish,=,@,=,b,O,=,l,I,S
accomodate,=,k,O,=,m,@,=,d,e,I,t



CaKiwi
 
Try..
Code:
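BEGIN { FS="," ; fs="[" ; d="=" }   # fs splits syllables at "[", d fills empty slots
{
  printf "%s", $1
  n = split($2, a, fs)              # a[1] is empty; a[2..n] look like "CVVC]"
  split($3, b, fs)                  # b[] holds the matching pronunciations
  for (x=(n-2); x<=n ;x++) {        # last three syllables; missing ones are empty
    f=0                             # f becomes 1 once a vowel has been seen
    k=j=l=""                        # j=onset, k=nucleus, l=coda
    for (y=1; y<length(a[x]); y++) {   # "<" skips the trailing "]"
      p=substr(a[x],y,1)
      q=substr(b[x],y,1)
      if (p=="C"&&f==0) j=j q
      if (p=="V") { k=k q ; f=1 }
      if (p=="C"&&f==1) l=l q
    }
    printf "%s", FS (j?j:d) FS (k?k:d) FS (l?l:d)
  }
  printf "%s", RS
}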
Tested on the sample data:

abide,=,=,=,=,@,=,b,aI,d
abolish,=,@,=,b,O,=,l,I,S
accomodate,k,O,=,m,@,=,d,eI,t
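
For anyone who finds the one-letter variables hard to follow, here is a minimal sketch of the same idea. It turns the brackets into spaces first (as in CaKiwi's post) rather than splitting on "[", then walks each syllable's C/V pattern and its pronunciation in step. The wrapper name onc and the helpers strip() and fill() are made up for illustration. The sketch assumes the pattern field holds only upper-case C and V, and it treats any other character as a consonant.

```shell
# onc: hypothetical wrapper; reads lines like "abide,[V][CVVC],[@][baId]"
onc() {
  awk -F, '
  # Turn "[V][CVVC]" into "V CVVC " so split() can use whitespace.
  function strip(s) { gsub(/\[/, "", s); gsub(/\]/, " ", s); return s }
  # Empty slots are shown as "=".
  function fill(s) { return s == "" ? "=" : s }
  {
    n = split(strip($2), pat, " ")    # C/V pattern, one entry per syllable
    split(strip($3), pron, " ")       # pronunciation, one entry per syllable
    out = $1
    for (x = n - 2; x <= n; x++) {    # last three syllables only
      onset = nucleus = coda = ""
      if (x >= 1) {                   # x < 1 means the syllable is missing
        seen = 0                      # set once the first vowel is reached
        for (y = 1; y <= length(pat[x]); y++) {
          p = substr(pat[x], y, 1)
          q = substr(pron[x], y, 1)
          if (p == "V") { nucleus = nucleus q; seen = 1 }
          else if (seen) coda = coda q
          else onset = onset q
        }
      }
      out = out "," fill(onset) "," fill(nucleus) "," fill(coda)
    }
    print out
  }' "$@"
}
```

Usage would be something like onc input.txt, or piping the lines in on standard input. On Windows, where the command shell does not handle single quotes, the awk program between the quotes would probably need to be saved to a file and run with awk -F, -f.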
 
Thank you all for your replies. Please excuse me for taking this long to reply myself.

Ygor, I tried your solution but all I got was

abide,=,=,=,=,=,=,=,=,=
abolish,=,=,=,=,=,=,=,=,=
accomodate,=,=,=,=,=,=,=,=,=

I'm using GNU Awk 3.0.3 under WinXP. Any ideas?
 
My code expects the CVC structure to be in upper case, so that could be the problem. If that's not it, then perhaps you could post a sample of your input file?
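
One way to check the upper-case theory is to list every input line whose CVC field contains anything besides C, V and square brackets. This is only a rough sketch: the function name check_cvc is hypothetical, and it assumes the same comma-separated layout as the sample data.

```shell
# check_cvc: hypothetical helper; prints "line number: line" for suspect input
check_cvc() {
  awk -F, '
  {
    s = $2
    gsub(/\[/, "", s); gsub(/\]/, "", s)   # drop the square brackets
    if (s !~ /^[CV]+$/) print NR ": " $0   # anything left must be C or V
  }' "$@"
}
```

If this prints nothing, the CVC field is probably not the culprit and the problem lies elsewhere.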
 