A C prog to do pattern searching 2

Status
Not open for further replies.

nevj

Technical User
Sep 21, 2002
14
US
I have to solve a problem for my wife, who is engaged in breast cancer research.

1. She frequently has to search a list of alphabetic characters for an exact match of a pattern.
This list can be pasted into a file. Sometimes the characters are presented to her as a single line, but at other times as several columns which are actually intended to be a single line.
The case can also vary.

e.g. mwaaagglwrsraglralfrsrdaalfpgcerglhcsavscknwlkkfasktkkkvwyespslgshstykpskleflmrstskktrkedharlralngllykaltdllctpevsqelydlnvelskvsltpdfsacraywkttlsaeqnahmeavlqrsaahmslisywqsqtldpgmkettlykmisgtlmphnpaapqsrpqapvcvgsimrrstsrlwstkggkikgsgawcgrgrwls

OR

mwaaagglwrsraglralfrsrdaalfpgc
erglhcsavscknwlkkfasktkkkvwyes
pslgshstykpskleflmrstskktrkedh
arlralngllykaltdllctpevsqelydl
nvelskvsltpdfsacraywkttlsaeqna
hmeavlqrsaahmslisywqsqtldpgmke
ttlykmisgtlmphnpaapqsrpqapvcvg
simrrstsrlwstkggkikgsgawcgrgrwls

OR

MWAAAGGLWRSRAGLRALFRSRDAALFPGC
ERGLHCSAVSCKNWLKKFASKTKKKVWYES
PSLGSHSTYPKSKLEFLMRSTSKKTRKEDH
ARLRALNGLLYKALTDLLCTPEVSQELYDL
NVELSKVSLTPDFSACRAYWKTTLSAEQNA
HMEAVLQRSAAHMSLISYWQSQTLDPGMKE
TTLYKMISGTLMPHNPAAPQSRPQAPVCVG
SIMRRSTSRLWSTKGGKIKGSGAWCGRGRWLS

OR

MWAAAGGLWRSRAGLRALFRSRDAALFPGCERGLHCSAVSCKNWLKKFASKTKKKVWYEPSLGSHSTYPKSKLEFLMRSTSKKTRKEDHARLRALNGLLYKALTDLLCTPEVSQELYDLNVELSKVSLTPDFSACRAYWKTTLSAEQNAHMEAVLQRSAAHMSLISYWQSQTLDPGMKETTLYKMISGTLMPHNPAAPQSRPQAPVCVGSIMRRSTSRLWSTKGGKIKGSGAWCGRGRWLS

2. There are ONLY two patterns to be searched for -


r?r??s
r?r??t

The ? can be any of the following characters

acdefghiklmnpqrstvxy

3. Once one or more exact matches have been found, it is essential to know the number of characters from the start of the line to each match. It is possible that there is more than one match.

Can anyone suggest a program in ANSI C that will compile, in the first instance, on Solaris (SunOS 5.9)?

It should also be portable (copy the source and recompile) to HP-UX, AIX and XP.

It is urgent.

Thanks
 
cluelessNewbie

This is the result on Solaris 5.9

$ cc -o ten ten.c
"ten.c", line 24: warning: improper pointer/integer combination: op "<"

$ gcc -o ten ten.c
$

Unfortunately, although gcc succeeds, I get a "segmentation fault" when I execute ten.

Any thoughts?
 
icrf

Here's the latest result on Solaris 5.9.

$ cc -o eleven eleven.c
"eleven.c", line 27: newline in string literal
"eleven.c", line 28: syntax error before or at: yespslgshstykpskleflmrstskktrkel
"eleven.c", line 31: newline in string literal
"eleven.c", line 43: undefined symbol: buffer
"eleven.c", line 44: warning: improper pointer/integer combination: arg #1
"eleven.c", line 50: cannot dereference non-pointer type
"eleven.c", line 65: undefined symbol: temp
"eleven.c", line 66: warning: improper pointer/integer combination: arg #1
"eleven.c", line 67: cannot dereference non-pointer type
"eleven.c", line 70: newline in string literal
"eleven.c", line 71: syntax error before or at: occured
"eleven.c", line 71: invalid source character: '\'
"eleven.c", line 71: newline in string literal
"eleven.c", line 84: cannot recover from previous errors
cc: acomp failed for eleven.c



Better results with gcc -

$ gcc -o eleven eleven.c
$eleven

How do I feed in the variable long string (the protein sequence)?

The situation is that I always have to search for the strings r?r??s OR r?r??t in a variable protein chain.
This chain can be in upper or lower case and can be on one single line or on many lines.
This variable protein chain will contain random alphabetic characters from a to y (except b, j, o, u, w).
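
For illustration only, here is a minimal sketch of the search just described, written in plain ANSI C with no regex library. It assumes that '?' may stand for any letter and that the sequence has already been joined into a single string. The names find_motifs and motif_at are made up for this sketch and do not come from any program in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the six characters starting at s match R?R??[ST],
   ignoring case. Assumption: '?' stands for any letter. */
static int motif_at(const char *s)
{
    int i;
    for (i = 0; i < 6; i++)
        if (!isalpha((unsigned char)s[i]))
            return 0;   /* also stops safely at the terminating NUL */
    return toupper((unsigned char)s[0]) == 'R'
        && toupper((unsigned char)s[2]) == 'R'
        && (toupper((unsigned char)s[5]) == 'S'
         || toupper((unsigned char)s[5]) == 'T');
}

/* Stores the 1-based start of each match (overlapping ones included)
   in pos[0..max-1] and returns the total number of matches found. */
size_t find_motifs(const char *seq, size_t *pos, size_t max)
{
    size_t i, n = 0;

    assert(seq != NULL);
    for (i = 0; seq[i] != '\0'; i++) {
        if (motif_at(seq + i)) {
            if (n < max)
                pos[n] = i + 1;
            n++;
        }
    }
    return n;
}
```

Because it only needs ctype.h, a scan like this should compile unchanged on any of the platforms mentioned, at the cost of hard-coding the one pattern.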
 
Well, the proggie worked for me on a linux box. I can try on FreeBSD or OpenBSD but not Solaris. Sorry.

Let's give it a try.

OK I'm compiling with:

gcc -o ten ten.c

Let's reduce the possible source of errors & shrink this proggie a bit. We can add back functionality later. Try...


/*##########################################################################*/
/* BEWARE ONLY TESTED ON SuSe/LINUX... DUMB & DIRTY CODE FOLLOWS */
/*##########################################################################*/
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main(int argc, char** argv){
    char* pos;
    int off;
    regex_t rx;
    regmatch_t match[1];
    char pat[256] = "R.R..[RS]";
    char* pep = "mwaaagglwrsraglralfrsrdaalfpgc"
                "erglhcsavscknwlkkfasktkkkvwyes"
                "pslgshstykpskleflmrstskktrkedh"
                "arlralngllykaltdllctpevsqelydl"
                "nvelskvsltpdfsacraywkttlsaeqna"
                "hmeavlqrsaahmslisywqsqtldpgmke"
                "ttlykmisgtlmphnpaapqsrpqapvcvg"
                "simrrstsrlwstkggkikgsgawcgrgrwls";

    /*######################################################################*/
    /* Build regular expression */
    /*######################################################################*/
    if ((regcomp(&rx, pat, REG_EXTENDED | REG_ICASE))){
        fprintf(stderr, "Could not compile regular expression\n");
        exit(EXIT_FAILURE);
    }

    /*######################################################################*/
    /* Now look for the pattern in memory */
    /*######################################################################*/
    pos = pep;
    off = 0;
    while (!(regexec(&rx, pos, 1, match, 0))){
        /* regoff_t is not always int, so cast before printing with %i */
        printf("beg:%i end:%i\n", (int)(off+match[0].rm_so), (int)(off+match[0].rm_eo));
        pos += match[0].rm_so + 1;
        off += match[0].rm_so + 1;
        printf("%s\n",pep+off-1);
    }

    /*######################################################################*/
    /* Cleanup & exit */
    /*######################################################################*/
    regfree(&rx);
    return 0;
}
 
cluelessNewbie

Okay result on Solaris 5.9 and it is good progress.

$ cc -o twelve twelve.c
$ twelve
beg:236 end:242
rgrwls

pep is a variable string and can be upper or lower case and one line or many lines.

Can we present it as a file in the form of an argument?
e.g.
twelve file

Also, it is only important to know the character count, starting from 1, at the start of each pattern that is matched.
Naturally there can be more than one pattern match for any given string, and all possible occurrences must be reported.
 
nevj,

OK some success, good... & I'm learning what I need to... even better!

1) Realize that I'm taking you down a *nixie path. I know some Windows libraries but I'm severely out of practice. In particular, the libraries I use for file i/o may be *nixie only. I'll try & steer clear of this but be warned.

2) We still haven't dealt with the whitespace issue. That is, EOLs will cause the pattern match to fail... BUT we have 2 obvious options there.

a) Strip the whitespace out of your string in a preprocessing phase. Something like the following untested code snippet.
Code:
/* This may not be quite right yet */
i = j = -1;
while (pep[++i])
  if (pep[i] >= 'A') pep[++j] = pep[i];
pep[++j] = 0; /* Trailing null needs to be handled */
b) Handle the spaces in the regular expression itself. I'm currently better at Perl-like regexes... but the current lib can't be too far off from sed & awk so...

/* pattern is something like... (Note: POSIX regex has no Perl-style \s, so the [[:space:]] class stands in for it) */
char pat[256] = "R[[:space:]]*.[[:space:]]*R[[:space:]]*.[[:space:]]*.[[:space:]]*[RS]";

3) Getting the file name... should be in argv. Easy.

4) Reading the file in a way that will cross platforms... not too sure. In particular I really want to get the file length for the malloc in one fell swoop.

5) You know, back to my original point...
grep -ib "R.R..[RS]" myFile
gets you most of the way there. & awk, sed, or perl, all available on Windows, should make it fairly easy to get you the rest of the way. That wheel is round too (-;
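
One possible way to cover points 2a and 4 together using only standard C stdio, so that nothing depends on open/lseek: read the file a character at a time, keep only the letters, and grow the buffer as needed instead of asking for the file length first. This is a sketch under those assumptions, and read_residues is a made-up name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads fp to end-of-file and returns a malloc'd, NUL-terminated string
   holding only the letters. Returns NULL if memory runs out.
   The caller frees the result. */
char *read_residues(FILE *fp)
{
    size_t len = 0, cap = 256;
    char *buf;
    int c;

    assert(fp != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    while ((c = getc(fp)) != EOF) {
        if (!isalpha(c))
            continue;               /* drops spaces, tabs, CR, LF, digits */
        if (len + 1 >= cap) {       /* keep one byte for the terminator */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return buf;
}
```

It would be called with something like fp = fopen(argv[1], "r"). Since end-of-line characters never reach the buffer, positions counted in the result are the same whether the file uses LF or CR/LF.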
 
I'm not sure if you want a file position or an amino acid position in the peptide so I've given you both options. The amino acid position option has been commented out & the file position option is active.
Code:
/*##########################################################################*/
/* BEWARE ONLY TESTED ON LINUX... DUMB & DIRTY CODE FOLLOWS */
/*##########################################################################*/
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
#include <sys/types.h>
#include <unistd.h>     /* read, lseek, close */
#include <fcntl.h>

int main(int argc, char** argv){
    int        file; /**/
    char*      pos; /**/
    int        off; /**/
    regex_t    rx; /**/
    regmatch_t match[1]; /**/
    char       pat[256] = "R[ \t\n\r]*.[ \t\n\r]*R[ \t\n\r]*.[ \t\n\r]*.[ \t\n\r]*[RS]";
    /*char       pat[256] = "R.R..[RS]"; */
    char*      pep; /**/
    int        len; /**/
    int        k, j; /**/

    /*######################################################################*/
    /* Open the input file */
    /*######################################################################*/
    if (argc < 2){
        fprintf(stderr, "No file given\n");
        exit(EXIT_FAILURE);
    }
    if ((file = open(argv[1], O_RDONLY)) == -1){
        fprintf(stderr, "Could not open file: %s\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    /*######################################################################*/
    /* Read the file into memory */
    /*######################################################################*/
    len = lseek(file, 0, SEEK_END); /* Get file size */
    lseek(file, 0, SEEK_SET);       /* Get back to start of file */
    if (!(pep = calloc(len + 1, 1))){ /* +1 zeroed byte terminates the string */
        fprintf(stderr, "Ran out of memory\n");
        exit(EXIT_FAILURE);
    }
    pos = pep;
    while ((k = read(file, pos, 4096)) > 0) pos += k; /* stop on EOF or error */
    close(file);

    /*######################################################################*/
    /* Now strip whitespace from the input data */
    /*######################################################################*/
    /*
    k = j = -1;
    while (pep[++k])
        if (pep[k] >= 'A') pep[++j] = pep[k];
    pep[++j] = 0;  //Trailing null needs to be handled
    */

    /*######################################################################*/
    /* Build regular expression */
    /*######################################################################*/
    if ((regcomp(&rx, pat, REG_EXTENDED | REG_ICASE))){
        fprintf(stderr, "Could not compile regular expression\n");
        exit(EXIT_FAILURE);
    }

    /*######################################################################*/
    /* Now look for the pattern in memory */
    /*######################################################################*/
    pos = pep;
    off = 0;
    while (!(regexec(&rx, pos, 1, match, 0))){
        printf("beg:%i end:%i\n", (int)(off+match[0].rm_so), (int)(off+match[0].rm_eo));
        pos += match[0].rm_so + 1;
        off += match[0].rm_so + 1;
        /*printf("%s\n",pep+off-1);*/
    }

    /*######################################################################*/
    /* Cleanup & exit */
    /*######################################################################*/
    free(pep);
    regfree(&rx);
    return 0;
}
 
cluelessNewbie

Things are progressing.

I have first dealt with the piece of code which is your last posting.

It compiled no problem.

I did, however, make one change to line 18 of the code for accuracy of pattern matching.
I changed -
/*char pat[256] = "R.R..[RS]"; */
to
/*char pat[256] = "R.R..[ST]"; */

I give a complete typescript of the four tests I have made so far; the filenames are self-explanatory. The four files contained the same characters and differed only in case and number of lines.

For good measure I did a wc -m on each of them to ensure consistency.

Script started on Fri 27 Sep 2002 03:45:58 PM EDT
$ uname -a
SunOS XXX 5.9 Generic sun4u sparc SUNW,Ultra-Enterprise
$ fourteen oneline_lc
beg:233 end:239
$ fourteen oneline_uc
beg:233 end:239
$ fourteen manylines_lc
beg:243 end:249
$ fourteen manylines_uc
beg:243 end:249
$ wc -m oneline_lc
240 oneline_lc
$ wc -m oneline_uc
240 oneline_uc
$ wc -m manylines_lc
250 manylines_lc
$ wc -m manylines_uc
250 manylines_uc
script done on Fri 27 Sep 2002 03:48:22 PM EDT

Now for some refinements if possible please.

It would be good to present the entire file searched on the line after the actual results of the pattern search.
 
cluelessNewbie

I have dealt with your comments from the earlier posting -

> OK some success, good... & I'm learning what I need to... even better!

Okay, so my wife gets a nifty C program and you learn C and something about bioinformatics. It sounds good.

> 1) Realize that I'm taking you down a *nixie path. I know some Windows libraries but I'm severely out of practice. In particular, the libraries I use for file i/o may be *nixie only. I'll try & steer clear of this but be warned.

Stuff Windows for now; if this can be run on any decent UNIX system, Windows users can use a browser to get to it.

> 2) We still haven't dealt with the whitespace issue. That is, EOLs will cause the pattern match to fail... BUT we have 2 obvious options there.
>
> a) Strip the whitespace out of your string in a preprocessing phase. Something like the following untested code snippet.
>
> /* This may not be quite right yet */
> i = j = -1;
> while (pep[++i])
>   if (pep[i] >= 'A') pep[++j] = pep[i];
> pep[++j] = 0; /* Trailing null needs to be handled */

In the run-time results of fourteen (in my previous posting) you will see the way it handled EOLs in the files with many lines (8 in reality) compared with the single-line files.
I did not present it with files containing actual whitespace on the single lines or many lines.
I will report the results next.

> b) Handle the spaces in the regular expression itself. I'm currently better at Perl-like regexes... but the current lib can't be too far off from sed & awk so...
>
> /* pattern is something like... (Note: POSIX regex has no Perl-style \s, so the [[:space:]] class stands in for it) */
> char pat[256] = "R[[:space:]]*.[[:space:]]*R[[:space:]]*.[[:space:]]*.[[:space:]]*[RS]";

(I altered line 18 - RS to be ST as already posted.)

> 3) Getting the file name... should be in argv. Easy.

It works.

> 4) Reading the file in a way that will cross platforms... not too sure. In particular I really want to get the file length for the malloc in one fell swoop.

I will try it on as many UNIX systems as I can access.

> 5) You know, back to my original point...
> grep -ib "R.R..[RS]" myFile
> gets you most of the way there. & awk, sed, or perl, all available on Windows, should make it fairly easy to get you the rest of the way. That wheel is round too (-;

Sure, but it means that someone has to tweak the protein sequence file to make it one line, as grep or agrep (Windows) does not work across EOLs, so there will be pattern match errors.

Anyhow we have made a lot of progress
 
Line 18 is currently commented out & we're using the whitespace-in-the-pattern method. Change line 17 to read...
Code:
    char       pat[256] = "R[ \t\n\r]*.[ \t\n\r]*R[ \t\n\r]*.[ \t\n\r]*.[ \t\n\r]*[ST]";
or rather just change the last [RS] to [ST].

The REG_ICASE flag should handle the upper/lower case situations for free in the library.

> It would be good to present the entire file
> searched on the line after the actual
> results of the pattern search.

Insert the following line just after line #76
Code:
printf("%s\n", pep);

Lucky you're "Windows be d4mn3d for now", because I realized a potential porting problem with the currently enabled whitespace-in-the-pattern method. That is, Windows uses 2 characters at the end of every line (CR/LF) while Unix only uses one (LF), & this will make the position results differ. The amino acid offset method, currently disabled by being commented out, should give the same position in Unix & Windows provided it works at all.
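
To make the CR/LF point concrete, here is a small sketch that converts a byte offset in the raw file text into an amino acid position by counting letters only. The function name residue_position is hypothetical, and it assumes every letter in the text is a residue.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns the 1-based residue number of the letter at byte `offset`
   in `text`, counting letters only. Returns 0 if that byte is not a
   letter or lies beyond the end of the string. */
size_t residue_position(const char *text, size_t offset)
{
    size_t i, count = 0;

    assert(text != NULL);
    for (i = 0; text[i] != '\0'; i++) {
        if (isalpha((unsigned char)text[i]))
            count++;
        if (i == offset)
            return isalpha((unsigned char)text[i]) ? count : 0;
    }
    return 0;
}
```

With Unix line ends, the 'c' in "ab\ncd" sits at byte offset 3. With Windows line ends, in "ab\r\ncd", it sits at offset 4. In both cases it is residue 3, which is why counting residues gives the same answer on both systems.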
 
I should be perfectly clear here...
we are currently counting the whitespace,
including the end-of-line characters. To
change this to only count amino acid positions...

comment out line #17
uncomment line #18
and uncomment lines #51-56

The printf statement added in my previous
post will now look a little funky because it
will have all of the whitespace stripped out.
If this is a problem it can be fixed. The price
to easily fix this is a doubling of the
memory requirements, which shouldn't be too
bad in this case. Proteins tend to be rather
small oligos.
 
cluelessNewbie

One other essential that my wife wishes for, which (my apologies) I forgot when I posted the possible refinements earlier, is that when the result is presented, for each pattern found -
1. At least the character count from character 1 to start of each pattern found be shown
2. The actual pattern found be shown below its character/position.
3. The complete pattern (one line/many lines) searched be shown at the end of all the above.

See below -
$ fourteen oneline_lc
beg:233 end:239
rgrwls
aagglwrsraglralfrsrdaalfpgcerglhcsavscknwlkkfasktkkkvwyespslgshstykpskleflmrstskktrkedharlralngllykaltdllctpevsqelydlnvelskvsltpdfsacraywkttlsaeqnahmeavlqrsaahmslisywqsqtldpgmkettlykmisgtlmphnpaapqsrpqapvcvgsimrrstsrlwstkggkikgsgawcgrgrwls

Interestingly, in the above example the pattern found is the last 6 characters of the protein sequence.

I am going to try it on other sequences with more than one pattern in them.

 
have a good weekend... & the Dr. Wife too.
Code:
/*##########################################################################*/
/* BEWARE ONLY TESTED ON LINUX... DUMB & DIRTY CODE FOLLOWS */
/*##########################################################################*/
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
#include <sys/types.h>
#include <unistd.h>     /* read, lseek, close */
#include <fcntl.h>

int main(int argc, char** argv){
    char       pat[256] = "R.R..[ST]";
    int        file;        /**/
    char*      ppep;        /**/
    char*      preport;     /**/
    int        off;         /**/
    regex_t    rx;          /**/
    regmatch_t match[1];    /**/
    char       buff[1024];  /* Should be big enuf for now */
    char*      pep;         /**/
    char*      report;      /**/
    int        len;         /**/
    int        k;

    /*######################################################################*/
    /* Open the input file */
    /*######################################################################*/
    if (argc < 2){
        fprintf(stderr, "No file given\n");
        exit(EXIT_FAILURE);
    }
    if ((file = open(argv[1], O_RDONLY)) == -1){
        fprintf(stderr, "Could not open file: %s\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    /*######################################################################*/
    /* Read the file into memory */
    /*######################################################################*/
    len = lseek(file, 0, SEEK_END); /* Get file size */
    lseek(file, 0, SEEK_SET);       /* Get back to start of file */
    if (!(pep = calloc(len + 1, 1))){    /* +1 leaves room for the terminator */
        fprintf(stderr, "Ran out of memory\n");
        exit(EXIT_FAILURE);
    }
    if (!(report = calloc(len + 1, 1))){ /* zero-filled, so already terminated */
        fprintf(stderr, "Ran out of memory\n");
        exit(EXIT_FAILURE);
    }
    preport = report;
    while ((k = read(file, preport, 4096)) > 0) preport += k; /* append each chunk */
    close(file);

    /*######################################################################*/
    /* Now strip whitespace from the input data */
    /*######################################################################*/
    preport = report;
    ppep    = pep;
    for (; *preport; preport++)
        if (*preport >= 'A') *ppep++ = *preport;
    *ppep = 0;  //Trailing null needs to be handled

    /*######################################################################*/
    /* Build regular expression */
    /*######################################################################*/
    if ((regcomp(&rx, pat, REG_EXTENDED | REG_ICASE))){
        fprintf(stderr, "Could not compile regular expression\n");
        exit(EXIT_FAILURE);
    }

    /*######################################################################*/
    /* Now look for the pattern in memory */
    /*######################################################################*/
    ppep = pep;
    off = 0;
    while (!(regexec(&rx, ppep, 1, match, 0))){
        printf("pos:%i\n", (int)(off+match[0].rm_so+1));
        ppep += match[0].rm_so + 1;
        off  += match[0].rm_so + 1;
        /* sprintf terminates the copy, which strncpy did not */
        sprintf(buff, "%.*s", (int)(match[0].rm_eo - match[0].rm_so), ppep - 1);
        printf("%s\n",buff);
    }
    printf("%s\n", report);

    /*######################################################################*/
    /* Cleanup & exit */
    /*######################################################################*/
    free(pep);
    free(report);
    regfree(&rx);
    return 0;
}
 
cluelessNewbie

This is starting to look very good, but of course I will continue testing and keep you posted.

I made one change on line 61, as it failed; all it required was comment markers around the comment.

$ sixteen sample
pos:107 RSRHSS pos:131 RGRSRS mgtpkqpslapahalglrksdpgirslgsdaggrrwrpaaqsmfqipefepseqedasatdrglgpsltedqpgpylapgllgsnihqqgraatnshhggagametRSRHSSypagteedegmeeelspfRGRSRSappnlwaaqrygrelrrmsdefegsfkglprpksagtatqmrqsagwtriiqswwdrnlgkggstpsq

I hope you have a good weekend too.

Nevj
 
cluelessNewbie

The above post should not look the way it does and should read -

This is starting to look very good, but of course I will continue testing and keep you posted.

I made one change on line 61, as it failed; all it required was comment markers around the comment.

$ sixteen sample
pos:107
RSRHSS
pos:131
RGRSRS
mgtpkqpslapahalglrksdpgirslgsdaggrrwrpaaqsmfqipefepseqedasatdrglgpsltedqpgpylapgllgsnihqqgraatnshhggagametRSRHSSypagteedegmeeelspfRGRSRSappnlwaaqrygrelrrmsdefegsfkglprpksagtatqmrqsagwtriiqswwdrnlgkggstpsq

I hope you have a good weekend too.

Nevj
 