How to transform 1000 xml to 1 file readable for SPSS?

amdx64bt

Technical User
Mar 30, 2009
9
CH
-- DESCRIPTION -----------------------

I would like to write a script with awk or vim to process lab blood tests in XML format so that they can be imported into SPSS.

Each blood test is an xml file.

If I have (for example):
1000 Lab Blood Tests (1000 xml files)
250 patients
4 blood tests/patient

Each Blood Test file has the name format: rapport_33405954.xml

Each blood test has a variable number of components, but I am only interested in analyzing 3 elements: K, Na and Ca. (These elements are not included in all the blood tests.)


-- PATIENT -----------------------

-- SOURCE:

<Patient>
<lbpa_Npa>1234</lbpa_Npa>
<lbpa_Nai>02-Oct-1923 00:00:00</lbpa_Nai>
<Entree>15-Oct-1582 01:00:00</Entree>
<Pid>0</Pid>
<Ncas>0</Ncas>
<pre10 />
<lbpa_Pre>Peter</lbpa_Pre>
<lbpa_Num_Npat>1234567</lbpa_Num_Npat>
<lbrq_Nom1 />
<lbpa_Adr2>Paris</lbpa_Adr2>
<lbrq_Nom2 />
<lbpa_Sexe>M</lbpa_Sexe>
<nom10 />
<lbrq_Rid>0</lbrq_Rid>
<Actif />
<lbpa_Adr />
<lbpa_Nom>Smith</lbpa_Nom>
<Adm />
</Patient>

-- RESULT:

(last_name first_name, date_of_birth)
lbpa_Nom lbpa_Pre, lbpa_Nai
Smith Peter, 1923.10.02


-- DATE TAKEN BLOOD -----------------------

--SOURCE:

<Demande>
<Entree>15-Oct-1582 01:00:00</Entree>
<lbde_Rid>12345</lbde_Rid>
<lbde_Nlab>12345</lbde_Nlab>
<Sortie>15-Oct-1582 01:00:00</Sortie>
<NarunaFile />
<Ncas>0</Ncas>
<Etabl />
<lbde_Num_Npat>12345</lbde_Num_Npat>
<Naruna />
<Date_Mod>01-Jan-1900 00:00:00</Date_Mod>
<Taille>0</Taille>
<lbde_pid>12345/111</lbde_pid>
<TCollection>0</TCollection>
<Semgr>0</Semgr>
<lbrq_nom1 />
<lbde_Dtprv>02-Mar-2011 06:00:00</lbde_Dtprv>
<Pathologique>FALSE</Pathologique>
<lbrq_nom2 />
<Bacterio>FALSE</Bacterio>
<Volume>0</Volume>
<Type_ <Poids>0</Poids>
<lbde_Dtdem>02-Mar-2011 07:18:32</lbde_Dtdem>
<PasVue>FALSE</PasVue>
<par />
<Domaine />
</Demande>

-- RESULT:

(Date_taken_blood)
lbde_Dtprv
2011.03.02


-- ELEMENT -----------------------

-- SOURCE:

<Analyse>
<OrdreImpression>12345</OrdreImpression>
<CodeMateriel />
<TypeLigne>0</TypeLigne>
<Formulaire>21</Formulaire>
<Norme>136 - 145 mmol/l</Norme>
<Code>2039</Code>
<Commentaire />
<Anterieur />
<TypeResultat>0</TypeResultat>
<Resultat>136</Resultat>
<Unite>mmol/l</Unite>
<Remarque />
<Clos>O</Clos>
<Libelle>Sodium</Libelle>
</Analyse>

-- RESULT:

(Element number)
Libelle Resultat
Sodium 136


-- SORT ELEMENT BY DATE -----------------------

Sodium 05.01.2011 --> Na1
Sodium 08.01.2011 --> Na3
Sodium 06.01.2011 --> Na2


-- FINAL RESULT -----------------------

From the 1000 files I want to obtain a single file in the following format, so that I can import it into SPSS:

                        Na1  Na2  K1   K2
Smith Peter   19231002  136  133  4    3.5
Gates Edward  19801204  145  166  3.1  3.4

(In this case the date of Na1 for Smith and for Gates could be different, but the variable Na1 is the same.)


Any advice is appreciated
 
Here's something to get you started
Code:
BEGIN { ma["Jan"]="01";ma["Feb"]="02";ma["Mar"]="03";ma["Apr"]="04";
        ma["May"]="05";ma["Jun"]="06";ma["Jul"]="07";ma["Aug"]="08";
        ma["Sep"]="09";ma["Oct"]="10";ma["Nov"]="11";ma["Dec"]="12";
}
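# At the start of each file after the first, print the previous file's
# record and clear the fields so values do not carry over between files
FNR==1 && NR>1 {
  print snam, pnam, dob, dot, nav
  snam=pnam=dob=dot=nav=""
}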
# Patient name and date of birth
/<Patient>/{pflg=1;next}
/<\/Patient>/{pflg=0;next}
pflg && /<lbpa_Nom>/ {gsub(/ *<[^>]*>/, ""); snam=$0;  next} 
pflg && /<lbpa_Pre>/ {gsub(/ *<[^>]*>/, ""); pnam=$0; next} 
pflg && /<lbpa_Nai>/ {
  gsub(/ *<[^>]*>/, "")
  split($1,a,"-")
  dob=a[3] ma[a[2]] a[1]
  next
} 
# Date of test
/<Demande>/{dflg=1;next}
/<\/Demande>/{dflg=0;next}
dflg && /<lbde_Dtprv>/ {
  gsub(/ *<[^>]*>/, "")
  split($1,a,"-")
  dot=a[3] ma[a[2]] a[1]
  next
} 
# Sodium result
/<Analyse>/{aflg=1; sflg=0; rslt=""; next}
aflg && /<Libelle>/ {gsub(/ *<[^>]*>/, "");if ($0 ~ "Sodium")sflg=1; next} 
aflg && /<Resultat>/ {gsub(/ *<[^>]*>/, ""); rslt=$0; next} 
/<\/Analyse>/{aflg=0; if (sflg==1) nav=rslt; next}

END {
  print snam, pnam, dob, dot, nav
}
It prints one line for each test, containing the patient name, date of birth, test date and Na result. You will need to add the Ca and K results and combine the results for each patient. You may also want to add some error checking. Save the script as file.awk and run it by entering
awk -f file.awk *.xml
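
The "combine the results for each patient" step could be sketched roughly as below. This is only an illustration under several assumptions that are not in the thread: the script above has been extended to print tab-separated lines with seven fields (surname, first name, birth date, test date as YYYYMMDD, Na, K, Ca), an empty field means the element was not measured, and the function name combine_tests, the header names and the default of 4 columns per element are all made up for the example.

```shell
# Hypothetical helper, not part of the script above.
# stdin:  one tab-separated line per test:
#         surname, first name, birth date, test date (YYYYMMDD), Na, K, Ca
#         (an empty field means the element was not measured in that test)
# stdout: a header line, then one tab-separated line per patient
# $1:     number of columns to reserve per element (assumed default: 4)
combine_tests() {
  tab=$(printf '\t')
  # Sorting by patient and then by test date puts the tests of each patient
  # in chronological order, because YYYYMMDD dates sort correctly as text.
  sort -t "$tab" -k1,1 -k2,2 -k3,3 -k4,4 |
  awk -F '\t' -v OFS='\t' -v max="${1:-4}" '
    # Print the collected values of the current patient, then reset them
    function emit(   i, line) {
      if (key == "") return
      line = key
      for (i = 1; i <= max; i++) line = line OFS na[i]
      for (i = 1; i <= max; i++) line = line OFS k[i]
      for (i = 1; i <= max; i++) line = line OFS ca[i]
      print line
      split("", na); split("", k); split("", ca)   # portable array clear
      nn = nk = nc = 0
    }
    BEGIN {
      line = "surname" OFS "first_name" OFS "birth_date"
      for (i = 1; i <= max; i++) line = line OFS "Na" i
      for (i = 1; i <= max; i++) line = line OFS "K" i
      for (i = 1; i <= max; i++) line = line OFS "Ca" i
      print line
    }
    {
      id = $1 OFS $2 OFS $3
      if (id != key) { emit(); key = id }
      # Number each element separately: Na1 is the earliest test that has Na
      if ($5 != "") na[++nn] = $5
      if ($6 != "") k[++nk]  = $6
      if ($7 != "") ca[++nc] = $7
    }
    END { emit() }
  '
}
```

With those assumptions it might be used as `awk -f file.awk *.xml | combine_tests 4 > result.tsv` (result.tsv is an arbitrary name). Results beyond the reserved number of columns are silently dropped, and patients are identified only by name and birth date, so both points may need adjusting for real data.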

CaKiwi