Merging two files together

Status
Not open for further replies.

Guest_imported

New member
Jan 1, 1970
0
Hello,

I've got two kinds of text files. The first kind contains English sentences and phrases (one per line). Every word is followed by its part-of-speech tag. The second kind contains the French translations of these English sentences (again one sentence per line). The first French sentence is the translation of the first sentence in the English file, and so on. For instance:

File 1:

He/3PS is/BEZ eating/VBG cheese/NN with/IN a/Art\ yellow/Adj spoon/NN
John/NN took/VBD his/Pr dog/NN with/IN him/Pr
She/Pr was/BED a/Art girl/NN with/IN a/Art big/Adj and/Conj lovely/Adj smile/NN

File 2:

Il/3PS mange/VBS du/Art fromage/NN avec/IN une/Art\ cuillère/NN jaune/Adj
Jean/NN prennait/VBD son/Pr chien/NN avec/IN lui/Pr
Elle/Pr était/BED une/Art fille/NN avec/IN un/Art grand/Adj et/Conj beau/Adj sourire/NN

(The backslashes "\" are used to indicate that the lines do not stop there! On each line there's only one sentence. So, the "\"'s do not belong to the input!)

I need to write a script that combines these two files into the following third file:

is,BEZ,eating,VBG,cheese,NN,with,IN,a,Art,yellow,Adj,\ spoon,NN,avec
took,VBD,his,Pr,dog,NN,with,IN,him,Pr,=,=,=,=,avec
was,BED,a,Art,girl,NN,with,IN,a,Art,big,Adj,and,Conj,avec

So from the English file, I'm only interested in the preposition (in this case "with") and the three words (if any) that precede it and the three words (if any) that follow it. If fewer than three words precede or follow the preposition, the empty fields should be filled with "=".
Words and tags should be separated by commas. From the French file, I'm only interested in the prepositions. These should be placed at the end of each line in the new file. In this way it is easy to see how each English preposition is translated into French. Every line should eventually contain 15 fields.

Is it possible to do this with gawk?

Febri





 
IMO you would have to pay for this one.
But still, try comp.lang.awk; there are some old-time
meisters out there looking for a challenge.
 
Whoa! French lessons and awk lessons in the same forum! I don't have time to work on it now but I'll try to spend some time on it over the weekend (le weekend, one of the few French words I know). However, I won't be offended if someone else jumps in and solves it for you.

CaKiwi
 
Hi Febri,

Try this script.

-Input files : French.Txt and English.txt
-No error checking (Tag not found, not same # of lines in the two files...)


awk -f EnFr.awk -v TAG=IN -v FRENCH=French.txt English.txt

--- EnFr.awk ---
BEGIN {
    TagPattern = "^.*/" TAG "$" ;
}

#
# Print French word for the tag
#
function FrenchTag( f, fwt) {
    getline < FRENCH ;
    for (f=1; f<=NF; f++) {
        if ($f ~ TagPattern) {
            split($f, fwt, "/") ;
            print fwt[1] ;
            break ;
        }
    }
}

#
# Print English words around the tag
#
function EnglishTag( f,ftag,ewt) {
    for (f=1; f<=NF; f++) {
        if ($f ~ TagPattern) {
            ftag = f ;
            break ;
        }
    }
    for (f=ftag-3; f<=ftag+3; f++) {
        if (f<0 || f>NF) {
            printf "=,=,"
        } else {
            split($f, ewt, "/") ;
            printf "%s,%s,",ewt[1],ewt[2]
        }
    }
}

#
# Come on, let's get to work ...
#
{
    EnglishTag() ;
    FrenchTag() ;
}
--- End EnFr.awk ---
Jean Pierre.
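
The script above does no error checking. As a rough sketch only, here is one way the return value of getline could be tested so that a shorter French file, or a French line without the tag, yields a placeholder instead of silently reusing the previous record. The function name french_prep and the "?" placeholder are assumptions made for this illustration; they are not part of the original script.

```sh
# Hypothetical helper: print the French word carrying TAG for each English line.
# Usage: french_prep TAG french_file english_file
french_prep() {
    awk -v TAG="$1" -v FRENCH="$2" '
    {
        word = "?"                          # placeholder when nothing is found
        # getline returns 1 on success, 0 at end of file, -1 on error
        if ((getline line < FRENCH) > 0) {
            n = split(line, tok, " ")
            for (i = 1; i <= n; i++) {
                # a token looks like word/TAG
                if (split(tok[i], wt, "/") == 2 && wt[2] == TAG) {
                    word = wt[1]
                    break
                }
            }
        }
        print word
    }' "$3"
}
```

It would be called as `french_prep IN French.txt English.txt`. Reading into a variable (`getline line < FRENCH`) rather than into `$0` also leaves the current English record and NF untouched.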
 
Thanks Jean Pierre, but unfortunately...

I'm getting the wrong output. The whole text file is printed as one single block. Focusing only on the task of printing the specified preposition together with the three fields that precede it and the three fields that follow it, the algorithm should actually be something like:

given a line $0,
focus on the three fields that precede pattern, on the
pattern itself, as well as on the three fields that
follow the pattern
If fewer than three words/fields precede or follow the
preposition, the empty fields should be filled with
a "=".
print this substring (one on each line!)

Secondary options (using gsub):
- white spaces (" ") between the fields of the substrings
should be replaced with commas (",");
- slashes ("/") should be replaced with commas as well

This eventually should result in:

is,BEZ,eating,VBG,cheese,NN,with,IN,a,Art,yellow,Adj,\ spoon,NN
took,VBD,his,Pr,dog,NN,with,IN,him,Pr,=,=,=,=
was,BED,a,Art,girl,NN,with,IN,a,Art,big,Adj,and,Conj

Once I get this output, it would suffice to write a script that takes the French translation of the English preposition out of the second file and places it at the end of the substring (as was shown in the previous message).

--Febri

 
Hi febri,

I'm using bash/gawk (Cygwin), and for me the script works fine.
With your example files the script gives the result you are expecting.

Look at this execution trace (with debugging displays):


BEGIN: French file: French.txt
BEGIN: Speech tag : IN
BEGIN: TagPatter : ^.*/IN$
NR==1: English file; English.txt

EnglishTag: Input: He/3PS is/BEZ eating/VBG cheese/NN with/IN a/Art yellow/Adj spoon/NN
EnglishTag: Tag found in field 5
EnglishTag: Print words from field 2 to 8
EnglishTag: Field: 2
EnglishTag: Output: is,BEZ,
EnglishTag: Field: 3
EnglishTag: Output: eating,VBG,
EnglishTag: Field: 4
EnglishTag: Output: cheese,NN,
EnglishTag: Field: 5
EnglishTag: Output: with,IN,
EnglishTag: Field: 6
EnglishTag: Output: a,Art,
EnglishTag: Field: 7
EnglishTag: Output: yellow,Adj,
EnglishTag: Field: 8
EnglishTag: Output: spoon,NN,
FrenchTag: getline from French.txt
FrenchTag: Input: il/3PS mange/VBS du/Art fromage/NN avec/IN une/Art cuillère/NN jaune/adj
FrenchTag: Tag found in field 5
FrenchTag: Output: avec


EnglishTag: Input: John/NN took/VBD his/Pr dog/NN with/IN him/Pr
EnglishTag: Tag found in field 5
EnglishTag: Print words from field 2 to 8
EnglishTag: Field: 2
EnglishTag: Output: took,VBD,
EnglishTag: Field: 3
EnglishTag: Output: his,Pr,
EnglishTag: Field: 4
EnglishTag: Output: dog,NN,
EnglishTag: Field: 5
EnglishTag: Output: with,IN,
EnglishTag: Field: 6
EnglishTag: Output: him,Pr,
EnglishTag: Dummy field: 7
EnglishTag: Output: =,=,
EnglishTag: Dummy field: 8
EnglishTag: Output: =,=,
FrenchTag: getline from French.txt
FrenchTag: Input: Jean/NN prennait/VBD son/pr chien/NN avec/IN lui/Pr
FrenchTag: Tag found in field 5
FrenchTag: Output: avec


EnglishTag: Input: She/pr was/BED a/Art girl/NN with/IN a/Art big/Adv and/Conj lovely/Adj smile/NN
EnglishTag: Tag found in field 5
EnglishTag: Print words from field 2 to 8
EnglishTag: Field: 2
EnglishTag: Output: was,BED,
EnglishTag: Field: 3
EnglishTag: Output: a,Art,
EnglishTag: Field: 4
EnglishTag: Output: girl,NN,
EnglishTag: Field: 5
EnglishTag: Output: with,IN,
EnglishTag: Field: 6
EnglishTag: Output: a,Art,
EnglishTag: Field: 7
EnglishTag: Output: big,Adv,
EnglishTag: Field: 8
EnglishTag: Output: and,Conj,
FrenchTag: getline from French.txt
FrenchTag: Input: Elle/pr était/BED une/Art fille/NN avec/IN un/Art grand/Ad et/Conj beau/Adj sourire/NN
FrenchTag: Tag found in field 5
FrenchTag: Output: avec
Jean Pierre.
 
Hi,

I have found a bug in my script.
If the speech tag is in field 1, 2 or 3, then the whole line is printed.
To correct the problem, in the EnglishTag function replace
if (f<0 || f>NF) {
by
if (f<=0 || f>NF) {
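
Why the one-character change matters: awk treats field 0 as the whole record, so with `f<0` an index of exactly 0 passes the test and the entire line is handled as if it were a single token. A minimal sketch of the corrected window, assuming the word/TAG input format used in this thread; the function name context is made up for illustration, and unlike the script above it skips lines that contain no tagged word.

```sh
# Hypothetical helper: print the tagged word plus up to three neighbours per side.
# Usage: context TAG < english_file
context() {
    awk -v TAG="$1" '
    {
        ftag = 0
        for (f = 1; f <= NF; f++) {
            if (split($f, wt, "/") == 2 && wt[2] == TAG) { ftag = f; break }
        }
        if (ftag == 0) next                 # no tagged word on this line: skip it
        out = ""
        for (f = ftag - 3; f <= ftag + 3; f++) {
            if (f <= 0 || f > NF) {         # f == 0 would be $0, the whole record
                out = out "=,=,"
            } else {
                split($f, wt, "/")
                out = out wt[1] "," wt[2] ","
            }
        }
        sub(/,$/, "", out)                  # drop the trailing comma
        print out
    }'
}
```

For example, `printf 'with/IN him/Pr\n' | context IN` prints `=,=,=,=,=,=,with,IN,him,Pr,=,=,=,=`: three padded pairs before the preposition and two after it.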


I have written another version of my script.
In that version I don't use a pattern or split.
All "/" are replaced by a space.


awk -f EnFr2.awk -v TAG=IN -v FRENCH=French.txt English.txt


--- EnFr2.awk ---
#
# Find speech Tag
#
function FindTag( f) {
    for (f=2; f<=NF; f=f+2) {
        if ($f == TAG)
            break ;
    }
    return f
}
#
# Print French word for the tag
#
function FrenchTag() {
    getline < FRENCH ;
    gsub("/"," ") ;
    print $(FindTag()-1) ;
}
#
# Print English words around the tag
#
function EnglishTag( f,ftag) {
    gsub("/"," ") ;
    ftag = FindTag() ;
    for (f=ftag-7; f<=ftag+6; f++) {
        if (f<=0 || f>NF)
            printf "=," ;
        else
            printf "%s,",$f ;
    }
}
#
# Back to work again ...
#
{
    EnglishTag() ;
    FrenchTag() ;
}
--- End of EnFr2.awk ---
Jean Pierre.
 
Jean Pierre (or any other expert),

The example files looked more difficult than I thought. There are certain lines where more than one preposition can be found!

English:

John/NN is/VBZ staying/VBG in/Prep school/NN at/Prep the/Art moment/NN

French:

En/prep ce/Det moment/NN Jean/NN reste/VB dans/Prep l'/Art école/NN

(for the French translation of "in", Awk should return "dans" and not "en")


So, I've looked at your script and made a few modifications. Only now it no longer works, and I can't figure out what is wrong. Could you have a look at the following script? I've inserted some comments so that you know what's going on:

#
# Find preposition in source language
#
function FindPrep(f) {
    # I'm no longer looking for the tag but for the
    # preposition (in this case: "in")
    for (f=1; f<=NF; f=f+2) {
        if ($f~/\<in\>/) {
            # I will need this variable in FindTransl(f)
            a = f-4
            break ;
        }
    }
    return f
}
#
# Find other preposition
#
function FindTransl(f) {
    # Following the hypothesis that the French
    # preposition will almost be located in the same
    # field as the English one, Awk should only look
    # at a small part of the French sentence.
    for (f=a; f<=a+8; f=f+2)
    # if, for instance, $f does not exist in the French
    # sentence, Awk should look at value "f+2" (in the case
    # of f<=0) and "f-2" (in the case of f > NF)
    # until it matches a field. I know that I've made a
    # mistake here!
    if (f<=0)
        f = f+2
    else if (f>NF)
        f = f-2
    # two possible French translations of "in"
    else if (($f~/\<dans\>/||($f~\<en\>/))
        break ;
    else {
        # if Awk doesn't find one of these prepositions in the
        # specified context, it should return "0"
        f = "0"
        break ;
    }
    }
    return f
}
#
# Print translation
#
function GetTransl() {
    # name of the French file is "F_in.tag"
    getline < "F_in.tag" ;
    gsub("/"," ") ;
    # this can only print "en", "dans" or "0"
    print $(FindTransl()) ;
}
#
# Print context
#
function GetContext(f,fprep) {
    gsub("/"," ") ;
    fprep = FindPrep() ;
    # creates a context around the English prep "in"
    for (f=fprep-6; f<=fprep+7; f++) {
        if (f<=0 || f>NF)
            printf "=," ;
        else
            printf "%s,",$f ;
    }
}
#
# And eventually....
#
{
    GetContext() ;
    GetTransl() ;
}

I hope I'm not asking too much. But I can't get any further with the work I'm doing as long as this script does not work.

Thanks,

Febri
 
Hi Febri,

I think your problem comes from the FindTransl function.
Try this new version:

function FindTransl( f,ft) {
    ft = 0 ;
    for (f=a; f<=a+8; f=f+2) {
        if (f <= 0) continue ;
        if (f > NF) break ;
        if ($f != "dans" && $f != "en") continue ;
        ft = f ;
        break ;
    }
    return ft ;
}
Jean Pierre.
 
Hi Aigles,

There's still a problem with the FindTransl function. I only get the right output when AWK finds the translated preposition in the context 'a' to 'a+8'. When the preposition is not found, because the English preposition is not translated by one of the French prepositions specified in the FindTransl function, the entire French sentence is placed after the English context, instead of the value "0".

Any idea how this is possible?

Febri
 
FindTransl returns a field number.
When the translated preposition is not found, 0 is returned, and field 0 is the entire line.

FindTransl must return the translated preposition instead of the field number. So, modify the two functions FindTransl and GetTransl:


function FindTransl( f,ft) {
    ft = "0" ;
    for (f=a; f<=a+8; f=f+2) {
        if (f <= 0) continue ;
        if (f > NF) break ;
        if ($f != "dans" && $f != "en") continue ;
        ft = $f ;
        break ;
    }
    return ft ;
}

function GetTransl() {
    getline < "F_in.tag" ;
    gsub("/"," ") ;
    print FindTransl() ;
}
Jean Pierre.
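
For anyone who wants to try the position-window idea in isolation, here is a self-contained sketch. It is not the script from this thread: it works on word positions directly instead of replacing the slashes, and the function name find_translation and the variables PREP and CANDS are assumptions made for this example. The window of two words on either side mirrors the `a` to `a+8` range used above (five word positions in steps of two fields).

```sh
# Hypothetical helper: print the French translation found near the English
# preposition, or "0" when none of the candidates is within the window.
# Usage: find_translation PREP "CAND1 CAND2 ..." french_file english_file
find_translation() {
    awk -v PREP="$1" -v CANDS="$2" -v FRENCH="$3" '
    # Return the word part of a word/TAG token
    function word(tok,    p) { split(tok, p, "/"); return p[1] }
    BEGIN {
        n = split(CANDS, c, " ")
        for (i = 1; i <= n; i++) cand[c[i]] = 1   # set of accepted translations
    }
    {
        pos = 0                                    # position of PREP in the English line
        for (f = 1; f <= NF; f++)
            if (word($f) == PREP) { pos = f; break }
        result = "0"                               # returned as a word, never as a field number
        got = (getline line < FRENCH)              # always read, to keep both files in step
        if (got > 0 && pos > 0) {
            m = split(line, ft, " ")
            for (f = pos - 2; f <= pos + 2; f++) { # two words either side
                if (f < 1 || f > m) continue
                w = word(ft[f])
                if (w in cand) { result = w; break }
            }
        }
        print result
    }' "$4"
}
```

With the example sentence pair above, `find_translation in "dans en" F_in.tag English.txt` should pick "dans", because the sentence-initial "En" lies outside the window (and is capitalised, so it would not match "en" anyway).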
 