bash or tcsh - Extracting a string from a string for spam removal

Status
Not open for further replies.

reclspeak

IS-IT--Management
Dec 6, 2002
57
GB
Hi!

I use a tool called mailfilter on Mac OS X to filter spam mail whilst it is still on the POP3 server.

However my service provider also supports a "reject list" so that mail from selected domains is rejected before you see it in your POP3 mailbox (useful, as I use WebMail, and would like to see an uncluttered mailbox).

When mailfilter executes, it generates a log file, in amongst which is the sending address. I want to extract that address, strip off everything before the @ symbol, retaining just the domain name which then goes into a text file that I can review and add selectively to the Reject List.

Here's an example of such a record;

mailfilter: Deleted "Hazel Numbers" <nqactl89a@prplacements.biz>: RE:Stop maintenance fees, Tue, 25 Nov 03 01:15:04 GMT. [Applied filter: '^Content-Type:\ multipart/alternative;' to 'Content-Type: multipart/alternative; boundary="0F._3B..B16FC4CB1A_7D2B8"']

The bit I want to extract is the "prplacements.biz" domain.

Here's what I have at present, which on the above string, does the trick (a bit kacky, I know);

_______________________________________________________

#!/bin/bash
echo "Starting mailfilter..."
mailfilter
# After mailfilter runs, extract the domain from the addresses we don't want.
LOG=/Users/<my account>/Library/Logs/mailfilter.log

cat $LOG | grep Deleted | awk '{ print $5 }' | grep \@ | cut -d\@ -f2 | sed -e 's/>://g' | sort -u >>/Users/<my account>/Desktop/domains_to_be_removed.txt

# Wipe the log file for the next iteration

cat /dev/null>$LOG

___________________________________________________________

Couldn't be simpler (I thought). But the command depends on the position of the email address, which the awk '{ print $5 }' step assumes is the fifth field. The routine works on the record above, but it doesn't catch every address: sometimes there is no sender name before the address, so instead of a string between the quotes (i.e. "Hazel Numbers") there is just "", as in the example below;

mailfilter: Deleted "" <lynda.odonnell@erie.net>: Get what you always wanted miilqt ivzyhys, Sun, 17 Mar 02 17:38:11 GMT. [Applied filter: '^Content-Type:\ multipart/alternative;' to 'Content-Type: multipart/alternative; boundary="7A...8_8AC8"']
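
To make the positional dependence concrete, here is a minimal sketch that runs the same awk '{ print $5 }' step on both records. The records are shortened after the subject for readability, and the variable names are only illustrative.

```shell
#!/bin/bash
# The two sample records, cut short after the subject.
named='mailfilter: Deleted "Hazel Numbers" <nqactl89a@prplacements.biz>: RE:Stop maintenance fees'
empty='mailfilter: Deleted "" <lynda.odonnell@erie.net>: Get what you always wanted'

# awk splits on whitespace, so a two-word name puts the address in field 5...
field_named=$(printf '%s\n' "$named" | awk '{ print $5 }')

# ...but with an empty name the address is field 4, and field 5 is a subject word.
field_empty=$(printf '%s\n' "$empty" | awk '{ print $5 }')

echo "$field_named"
echo "$field_empty"
```

In the second case the later grep \@ finds no "@" in the field, so the record is silently dropped. A one-word or three-word name would shift the address in the same way.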

So the query is; anyone know of a foolproof means to extract the domain from the email address in the strings, without the positional dependence I have at present, even if it means abandoning what I have?

Thanks in anticipation

recl
 
Try something like this:
Code:
sed -n -e '/Deleted/s!.*@\([^>]*\)>.*!\1!p' $LOG | sort -u

Hope this helps
PH.
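
For anyone reading along, here is a hedged breakdown of how that sed expression appears to work, run against a throwaway file instead of the real log. The temporary file and the non-matching third line are made up for illustration, and the two records are shortened after the subject.

```shell
#!/bin/bash
# Stand-in for the real log (hypothetical): the thread's two sample records
# plus one line that should be ignored.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
mailfilter: Deleted "Hazel Numbers" <nqactl89a@prplacements.biz>: RE:Stop maintenance fees
mailfilter: Deleted "" <lynda.odonnell@erie.net>: Get what you always wanted
this line is not a deletion record and is skipped
EOF

# -n           print nothing unless told to
# /Deleted/    only act on lines containing "Deleted"
# s!A!B!p      substitute using "!" as the separator, then print the result
# .*@          everything up to an "@"
# \([^>]*\)    capture the characters after it that are not ">" (the domain)
# >.*          the closing ">" and the rest of the line
domains=$(sed -n -e '/Deleted/s!.*@\([^>]*\)>.*!\1!p' "$LOG" | sort -u)

echo "$domains"
rm -f "$LOG"
```

Because the pattern anchors on the "@" and the ">" that follows it, the number of words in the display name no longer matters.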
 
Hi reclspeak,

Instead of awk '{ print $5 }', you can try:
Code:
cut -f2 -d"<" | cut -f1 -d">"
Hope it helps

Theophilos.
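
As a rough sketch of how the two cut calls might slot into the original pipeline: the first keeps what follows the "<", the second keeps what precedes the ">", which leaves the bare address. The temporary file is a hypothetical stand-in for the log, and the sed 's/>://g' step is dropped on the assumption that cut has already removed the ">".

```shell
#!/bin/bash
# Stand-in for the real log (hypothetical), using the thread's two records,
# shortened after the subject.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
mailfilter: Deleted "Hazel Numbers" <nqactl89a@prplacements.biz>: RE:Stop maintenance fees
mailfilter: Deleted "" <lynda.odonnell@erie.net>: Get what you always wanted
EOF

# cut -f2 -d"<"   text after the first "<"
# cut -f1 -d">"   text before the first ">" that remains, i.e. the address
# cut -d@ -f2     the part of the address after the "@"
domains=$(grep Deleted "$LOG" | cut -f2 -d"<" | cut -f1 -d">" | grep @ | cut -d@ -f2 | sort -u)

echo "$domains"
rm -f "$LOG"
```

This assumes the display name itself never contains a "<"; if it did, the field numbering would shift again.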
 
Thanks PH & Theophilos.

The more comprehensive sed routine worked first time so I've run with that. It has picked up the domain names I was missing with my routine.

Regards


recl
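
One possible way to fold the sed approach back into the script is sketched below. The function name append_domains and its two arguments are hypothetical, and the real log and output paths are left for the caller to supply. It also re-sorts the output file after appending, on the assumption that a domain seen on an earlier run should not be listed twice.

```shell
#!/bin/bash
# append_domains LOGFILE OUTFILE  (hypothetical helper)
# Append the sender domains from "Deleted" records in LOGFILE to OUTFILE,
# de-duplicate OUTFILE across runs, then empty the log for the next run.
append_domains() {
    local log=$1 out=$2
    sed -n -e '/Deleted/s!.*@\([^>]*\)>.*!\1!p' "$log" >> "$out"
    sort -u -o "$out" "$out"    # sort may safely write back to its own input
    : > "$log"                  # truncate the log
}

# Possible usage, with LOG and OUT set to your own paths:
#   mailfilter
#   append_domains "$LOG" "$OUT"
```

Sorting the whole file means the review list stays free of repeats, at the cost of losing the order in which domains were first seen.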
 
You could use awk

awk 'match($0,"@.*>") {print substr($0,RSTART+1,RLENGTH-2)}' $LOG
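
One thing to watch with match(): ".*" is greedy, so a pattern such as "@.*>" runs to the last ">" on the line, not the first. The sketch below uses a made-up record whose subject contains a ">" to show the difference, and a bounded "[^>]*" variant as one possible alternative. The /Deleted/ guard is an addition here, not part of the suggestion.

```shell
#!/bin/bash
# Hypothetical record: the subject contains a ">" character.
line='mailfilter: Deleted "" <lynda.odonnell@erie.net>: Re: a -> b'

# Greedy: the match extends to the LAST ">" on the line.
greedy=$(printf '%s\n' "$line" |
    awk 'match($0, /@.*>/) { print substr($0, RSTART+1, RLENGTH-2) }')

# Bounded: [^>]* stops at the FIRST ">" after the "@".
bounded=$(printf '%s\n' "$line" |
    awk '/Deleted/ && match($0, /@[^>]*>/) { print substr($0, RSTART+1, RLENGTH-2) }')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

RSTART+1 skips the "@" and RLENGTH-2 drops both the "@" and the closing ">", so the bounded form yields just the domain.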
 