Tek-Tips is the largest IT community on the Internet today!

grep + sed/awk ==> extract specific text from surrounding text...

Status
Not open for further replies.

shadedecho

Programmer
Oct 4, 2002
336
US
So, the basic concept here is that I have some text output that comes from another command (which one matters not here)... so it might have something like:

blah192.168.0.2blahblah

on a single line output'd to the screen. using the regexp

[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}

I want to have the output from a single command line (piped together in whatever way necessary) be JUST the matching IP address.

so, something like

Code:
$ ./myscript | sed -e "s/[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}//"
                                                                                ^
                                                                          right here
is where i would need to put something in which would actually display the matched string. sed's documentation states that & would do the trick, BUT, the sed command above actually prints the rest of the line too, which I want to discard.

So, I have two thoughts on how to do this:

1. find a way to tell the regular expression to match the negation, in other words, everything that does NOT match the above regular expressions which find the IP. Then, everything that didn't match (actually, matched the negation of the expression) would be replaced by empty string. This would leave only the IP address.

2. find a way using & in the replace and also how to make a new line character in there, so it would print the match on one line and the rest that didn't match on another line. My thinking with this is that this could then be piped to grep and grep would only return the line with the match'd sequence again.
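
Idea 2 can be sketched roughly like this. It assumes GNU sed (which accepts `\n` in the replacement) and a grep that supports `-E` and `-x`; the helper name `extract_ip_via_grep` is made up for illustration, and `-E` is used only to avoid the backslashes that basic regular expressions need.

```shell
# Hypothetical helper name; assumes GNU sed and a grep with -E and -x.
ip_re='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

extract_ip_via_grep() {
    # Put the first match on a line of its own, then keep only lines that
    # consist entirely of an IP-shaped string. head guards against the rare
    # case where the leftover text is itself IP-shaped.
    sed -E "s/$ip_re/\n&\n/" | grep -E -x "$ip_re" | head -n 1
}

echo 'blah192.168.0.2blahblah' | extract_ip_via_grep
```

Lines with no IP-shaped text produce no output at all, because grep drops them.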

Maybe there's another way to do this. I can't seem to make either of the above approaches work. Please, can someone help me write a single command line using grep, awk, and/or sed that will take a single line of any kind of text and extract JUST a single IP address from that line?

 
This worked for a simple test case, but I can't help thinking there must be an easier way.

sed -e "s/.*[^0-9]\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\).*/\1/"

CaKiwi

"I love mankind, it's people I can't stand" - Linus Van Pelt
 
THANKS! that's exactly what I needed... I am straining my brain trying to figure out what it is you did with SED to make that work... I see the .*[^0-9] at the beginning of the expression, which is obviously looking for a non-numeric character preceding the IP address... then I see the pattern for the IP enclosed in \( \), then I see the .* at the end, matching whatever characters come after the IP as well.

But, you replace it with \1...

is this basically using the concept that you match everything on the entire line, but you enclose the IP address match in a hold buffer, and then you replace the entire line with only what's in the hold buffer? If I have interpreted that correctly, then GREAT idea!

you get a STAR!
 
You interpreted it correctly. The .*[^0-9] means search for any number of any character followed by a character that is not numeric. If you just use .* it will match the first characters of the IP address.
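
A small side-by-side run may make the greedy behaviour easier to see. This is only a sketch using the sample line from the first post; the variable names `octets` and `line` are introduced here for readability.

```shell
# Variable names are illustrative; the sample line comes from the thread.
octets='[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}'
line='blah192.168.0.2blahblah'

# Greedy .* swallows as many leading digits as the pattern will allow,
# leaving only one digit for the first octet.
echo "$line" | sed -e "s/.*\($octets\).*/\1/"          # prints 2.168.0.2

# Forcing a non-digit just before the group keeps the first octet whole.
echo "$line" | sed -e "s/.*[^0-9]\($octets\).*/\1/"    # prints 192.168.0.2
```

Note that the second form needs at least one non-digit character in front of the address, so it does not match when the IP starts the line.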

CaKiwi

"I love mankind, it's people I can't stand" - Linus Van Pelt
 
one quick note that I found when in testing of this solution:

if i had the text:

Code:
sdnsakdfn192.168.0.1sdfl asdfkladf192.168.4.2asjdn45.56.67.78lkdf lskadf;l

and i parsed that with the above statement, it parses through it and finds 3 distinct matching strings (IP's) and returns the LAST one on the line, so in this case, 45.56.67.78

I am not understanding why this occurs? I would think based on using the \1 it would return the first one.

This happens to work out to be just what I need though, cause there will only ever be 1 IP on the line of text, and if there were more, I'd always want the last one.

But I can foresee the need to have an algorithm return the FIRST matching IP on the line, even if there are no characters at all before it. so with my test text, it should return 192.168.0.1

Do you know of a way to modify your regular expression to return the first occurrence, instead of the last?
 
Make the first .* non-greedy. I'm not sure how to do this with sed, but with PCRE engines, it would be to add a question mark to it, like .*?

//Daniel
 
"non-greedy"??? haven't heard that term before in relation to SED or regular expressions. I'll look into that more.

When I try putting .*?[^0-9] or .*[^0-9]? into the above SED expression, no match is made, so that is not how to do it.

anyone?
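
Since sed's regular expressions have no non-greedy quantifier, one possible workaround is to let the `s` command find the leftmost match on its own, mark it, and then throw away everything outside the markers. The sketch below assumes GNU sed (`-E`, and `\n` usable in both pattern and replacement); `first_ip` is a hypothetical helper name.

```shell
# Hypothetical helper; assumes GNU sed.
first_ip() {
    sed -n -E '
        # Wrap the leftmost IP-shaped match in newline markers...
        s/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/\n&\n/
        # ...then keep only the text between the two markers and print it.
        s/.*\n(.*)\n.*/\1/p
    '
}

echo 'sdnsakdfn192.168.0.1sdfl asdfkladf192.168.4.2asjdn45.56.67.78lkdf lskadf;l' | first_ip
```

Because an input line never contains a newline itself, the two markers are unambiguous, and lines with no match print nothing.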
 

Try this to get the first occurrence:

sed -e "s/^[^0-9]*\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\).*$/\1/"

Good luck!

 
edemiere-
that's better, but it's still not quite correct, because if any numeric character (that does not belong to an IP pattern) shows up on the line before the first valid IP, it croaks and spits the whole line back. If the IP pattern contains the first numbers on the line, then it works just fine. any ideas?

try:
echo "blahblah1blahblah192.168.0.2sknsaj" | sed -e "s/^[^0-9]*\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\).*$/\1/"

and it spits out the whole line, but try:
echo "blahblah192.168.0.2ajsnd" | sed -e "s/^[^0-9]*\([0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\).*$/\1/"

and it correctly reports the 192.168.0.2

thanks for everyone's help!
 

No problem, glad you got it the way you want it.

I was operating under the assumption that it was alpha characters only mixed with IP's as in your examples at the beginning of this thread.

Had you thrown in that curve I'd have given a different solution. :)

Cheers!

 
no i think you misunderstood me, edemiere... I was showing two examples, first one with input that broke your solution, and the second with input that worked with your solution. I still don't have a solution to the first example.

To be clear, what I would like is to have a sed regexp that will clear out ANY character, alphanumeric, etc that shows up before or after an IP address pattern, and also that the IP address may have no characters in front of it, or possibly none behind it. I have the algorithm from above which will match the last IP pattern on the line, so I need one that will also match the first IP pattern on the line, given the above requirements. any ideas?

Thanks!
 

Ok, gimme a few to get ya the answer, I dont have any *nix systems and therefore have to do it from memory on pen and paper and trace my results.

Get back to you shortly.

Cheers!

 

I worked on this over the weekend and it boils down to this:

In the example you gave, as far as I can see you are trying to seek the impossible to ascertain (at least based on what I know) and here's why:

You said there could be a few numbers in the string of characters, well here is the hitch:

When I am scanning the line for the IP information, let's just say for example that we come to an IP 10.xx.xx.xxx or something. Well, who is to say that there isn't a number 1 in front of the first octet? i.e. 110.xx.xx.xxx. Both are valid IP addresses and unless you are reading a "validation" type document that lists the rules for valid IP addresses within the network or whatever, I am not sure how to pluck these out correctly.

If I do think of something, I will let you know...

Cheers!

 
seems to me that the logic isn't tough to say that a number character on the line should only be included in the IP pattern match if it is immediately adjacent to (left of) the rest of the pattern.

say for instance this string

f123g456h789i012192.168.0.1lkdfaslkmfasd

notice how the IP pattern is immediately adjacent to (right of) f123g456h789i012. the way the pattern match knows to include the 1 in 192 is that the pattern says up to 3 numeric characters. so, the IP might have been intended as just 92, but the pattern would have to include the 1 to make it 192....

wouldn't matching (0 or more) characters (of any type) which were NOT in the pattern but appeared to the left of the pattern match be done with something like:

^.{0,}

then even that could be surrounded in () so that they would be \1 and then the IP pattern could be \2 right? and if there were no characters who matched at the beginning of the line, then \1 would just be blank, right?
 

Well, that is what I was saying is that you would have to tell the sed to pluck up to and no more than 3 DIGITS to the left of the first . it finds, which can get you some bogus IP addresses that you will have to go through and pull out on your own or correct or whatever.

What I was saying is that there is no logic for it to decide on what is, and is NOT a valid IP address from the file of text, all it could do is generically pull out the data with the ruleset of no more than 3 to the left of the . and of course no more than 3 to the right of the last . and exclude letters and such.

Just gets a bit fuzzy when you try to put more and more constraints into the equation.

Cheers!

 
When I say "valid" i just mean the 4 octets (of 3 or less digits) separated by .'s. i am not concerned with the sed equation validating the IP address it extracts... in fact, I know the output will always include valid IP's, BUT, even if it didn't I have another script which could be run on the output of this equation which would validate it against Class A,B,C rules, IPv4, etc.
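
A separate validation pass of that kind might look roughly like the following. This is not the poster's script, only a minimal sketch that checks each octet is a decimal number in the range 0 to 255; `valid_ipv4` is a hypothetical name and no class-based rules are applied.

```shell
# Hypothetical helper; passes through only lines that are four
# dot-separated decimal numbers, each between 0 and 255.
valid_ipv4() {
    awk -F. '
        NF == 4 {
            ok = 1
            for (i = 1; i <= 4; i++)
                if ($i !~ /^[0-9]+$/ || $i > 255) ok = 0
            if (ok) print
        }
    '
}

printf '%s\n' 192.168.0.1 999.1.1.1 | valid_ipv4
```

Run on the output of the extraction step, it would drop pattern matches such as 999.1.1.1 that fit the 1-to-3-digit shape but are not real addresses.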

The problem really just is a theoretical one at this point, given ANY set of text, numbers, letters, spaces, whatever, if an IP is found on that line, return it.

As I saw above, we have a way to do that if there is only one IP or if the IP you want to extract is the last matched IP on the line (which solves the issue that prompted me to post this question in the first place).

The extension to that question though is I could foresee needing the ability to extract the first valid IP from a line, given the same set of other parameters.

BTW, thank you so much for all your help thus far and for sticking with this, I really do appreciate it.

I think what a previous post'r might have meant by "non-greedy" is this - if i had:

blahblah12345192.168.0.1blah

on a line, and we were matching all the characters to the left of the IP pattern, the pattern states that it should have 1-3 digits in the first octet, so in this case, it might return 2.168.0.1 (since that is a valid pattern match), because the first part of the equation which matches everything to the left is by default "greedy" in that it matches all the way up to the minimum required by the IP pattern, taking away the 19 from the pattern match.

so, what comes to mind in that sense is to say that the stuff to the left of the IP address CANNOT end with a number (using something like [^0-9]{0,n}$ ), so the \1 in that above example might be "blahblah", the IP pattern would be 192.168.0.1 and the "12345" would simply not match anything and would thus be discarded. does this sound logical?
 
echo "blahblah12345192.168.0.1blah" | awk '{ if (match($0, /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/))
print substr($0,RSTART,RLENGTH) }'

Returns 192.168.0.1
 
Tried it with the other examples given...
blah192.168.0.2blahblah
sdnsakdfn192.168.0.1sdfl asdfkladf192.168.4.2asjdn45.56.67.78lkdf lskadf;l
blahblah1blahblah192.168.0.2sknsaj
blahblah192.168.0.2ajsnd
f123g456h789i012192.168.0.1lkdfaslkmfasd
blahblah12345192.168.0.1blah

And it returned...
192.168.0.2
192.168.0.1
192.168.0.2
192.168.0.2
192.168.0.1
192.168.0.1
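
Since the thread title also mentions grep, here is a grep-only sketch for comparison. It assumes a grep that supports `-o` (print only the matched text), which GNU and BSD grep do but POSIX does not require; the variable names are illustrative.

```shell
# Assumes grep -o is available (GNU or BSD grep); names are illustrative.
ip_re='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
line='sdnsakdfn192.168.0.1sdfl asdfkladf192.168.4.2asjdn45.56.67.78lkdf lskadf;l'

# -o prints every match on its own line, so head and tail pick the ends.
echo "$line" | grep -oE "$ip_re" | head -n 1   # first match: 192.168.0.1
echo "$line" | grep -oE "$ip_re" | tail -n 1   # last match:  45.56.67.78
```

Keep in mind that head and tail here act on the whole input, so this form suits a single line of text rather than a multi-line stream.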
 