Hi Guys,
I am hoping for an awk solution; I am tired of opening the file and filtering the sequences one million lines at a time. I want to discard a sequence if it has more than two "." characters at the beginning (meaning characters 1-10).
Data:
GCGGAA.GATCATTA
GCGA.GGCA.GCCG.CC.
GCTCCGGGA.GGCTCGGG
CTCC...A.GGCTGGGA
GCT.....A.GGT....A
GCAGGA.GGTGGCCA
GCAGGA.GGTGGCCA
CGTGGA.GGTGTGAG
GGAGGGTCA.GTAGTGAG
GCT.CGCGA.GTCCCAGA
GCGGCG.A.GTGGTGAG
CTCGTA.GA.T..TAGC.
GCTCCGTGA.T.CTGGCA
GAGGGAGTA.TTTTTTTT
GGCG.GCTAA.ACGTACG
GAGCGCTTAA.TC.AA.G
CGGTTGGGAAAAAAAAAA
CGGTTGGGAAAAAAAAAA
For example, I need only the last two lines from the above data. Keep in mind that I have 64 million lines with the same problem. I did the majority of the filtering with awk regexes, but I can't handle this case.
awk '$1 !~ /AAAAAAAAAAAAAAAA|TTTTTTTTTTTTTT|\.\.\.\.\.\.\.\.\.\.\.\.\./ {print $1}' filein.txt
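A minimal sketch of the leading-dot rule, assuming one sequence per line in the first field. The function name `filter_head_dots` and the `max` threshold are hypothetical and not part of the existing filter; `gsub` returns the number of replacements it made, which is used here as a dot counter on a copy of characters 1-10.

```shell
# Hypothetical helper: keep a sequence only if characters 1-10
# contain at most `max` dots (default 2). Reads stdin, writes stdout.
filter_head_dots() {
    awk -v max="${1:-2}" 'NF {
        head = substr($1, 1, 10)      # copy of characters 1-10
        dots = gsub(/\./, "", head)   # gsub returns how many dots it removed
        if (dots <= max) print $1
    }'
}

# Usage (file names are assumptions):
# filter_head_dots 2 < filein.txt > kept.txt
```

The threshold is passed as an argument, so a stricter cut only needs a smaller number.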
In Excel I am doing the following:
1. left(A1, 10)
2. sort
3. manually remove the low-complexity sequences at the top
Then I repeat the same for the right end:
1. right(A1, 10)
2. sort
3. remove manually
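The two Excel passes could perhaps be folded into a single awk pass that counts dots at each end instead of sorting and removing by hand. This is only a sketch: the function name `filter_both_ends` and the `max` threshold are hypothetical, and it assumes one sequence per line in the first field.

```shell
# Hypothetical helper: drop a sequence if either its first 10 or its
# last 10 characters contain more than `max` dots (default 2).
filter_both_ends() {
    awk -v max="${1:-2}" 'NF {
        len   = length($1)
        start = (len > 10) ? len - 9 : 1   # first character of the last 10
        head  = substr($1, 1, 10)          # like left(A1, 10)
        tail  = substr($1, start)          # like right(A1, 10)
        if (gsub(/\./, "", head) <= max && gsub(/\./, "", tail) <= max)
            print $1
    }'
}

# Usage (file names are assumptions):
# filter_both_ends 0 < filein.txt > kept.txt
```

With a threshold of 0, this leaves only the two CGGTTGGGAAAAAAAAAA lines from the data above.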