I'm in the midst of writing a quick and dirty URL encoding function in AWK (replacing all special characters with their hex values, e.g. 'http://www.url.com/test text.html' -> 'http://www.url.com/test%20text.html') and am struggling with this rather simple task.
It's easy to achieve with printf in the shell (notice the apostrophe in front of the ASCII character, in this case the [):
printf "%02x" "'["
> 5b
But I'm having trouble replicating the same effect with AWK. I've already tried everything with AWK's internal printf and sprintf. I'm not even sure whether the apostrophe is necessary. I also had trouble concatenating it to the character to be converted, since even when 'escaped' with a backslash I couldn't add the darn ' without errors. I solved this in a rather ugly way by using 'awk -v' to bring the apostrophe in from "outside" of awk.
But still, I can't get awk's printf function to do the same as the "regular" shell printf.

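For reference, the shell side of the trick can be wrapped in a small helper. This is only a sketch: the name char_to_hex is made up for this illustration, and it relies on the POSIX printf rule that an argument with a leading quote is read as the numeric value of the character after it.

```shell
# char_to_hex is a hypothetical helper name, not something from the thread.
# The leading apostrophe makes the shell's printf use the numeric value
# of the character that follows it.
char_to_hex() {
    printf '%02x' "'$1"
}

char_to_hex '['; echo    # prints 5b
char_to_hex ' '; echo    # prints 20
```

The echo calls only add a newline, since the helper itself prints the two hex digits and nothing else.
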
I forgot to add my small testing script (tested with OS X):
CODE
#!/bin/sh
teststr=$1
echo "$teststr" | awk -v dummy="'" '
function url_encode(rawURL) {
    cleanURL=""
    nonURLPos=match(rawURL,/[^[:alnum:]]/)
    while( nonURLPos > 0 ) {
        rawChar=substr(rawURL,nonURLPos,1)
        replaceChar=sprintf("%02x", dummy rawChar)
        cleanURL = cleanURL substr(rawURL,1,nonURLPos-1) "%" replaceChar
        rawURL = substr(rawURL,nonURLPos+1)
        nonURLPos=match(rawURL,/[^[:alnum:]]/)
    }
    cleanURL = cleanURL rawURL
    return cleanURL
}
{ print(url_encode($0))}
'
Running it gives:
urlencode.sh "abc ABC[123]456"
>abc%00ABC%00123%00456
Replacing the sprintf in 'replaceChar=sprintf("%02x", dummy rawChar)' with the printf function results in this error:
awk: syntax error at source line 7 in function url_encode
 context is
    >>> replaceChar=printf <<< ("%02x", dummy rawChar)
awk: illegal statement at source line 8 in function url_encode

I've never encountered that apostrophe character constant syntax before... is it a C thing? As you say, awk certainly doesn't support it. awk has a very simple "typeless" variable system which tries to intelligently DWIM (do what I mean) rather than handling everything literally, which unfortunately is stumping you here. I can't think of a way around it for now... does it need to be awk?
If it's going to be a pure awk script, you could avoid the shell part entirely and use a #!/usr/bin/awk -f shebang line, which lets you use the apostrophe directly in the script rather than through dummy variables. I tried that too, though, and it didn't help the printf situation.
Annihilannic

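To make the symptom in the test script concrete: awk has no counterpart to the shell printf's leading-quote rule, so a string such as '[ is simply converted to a number, and a string without a leading numeric prefix converts to 0. That is where the %00 in the output comes from. The syntax error has a separate cause: printf is a statement in awk and returns nothing, so only sprintf can appear on the right-hand side of an assignment. A minimal sketch (the function name is made up for this illustration):

```shell
# awk_quote_demo is a hypothetical name used only for this illustration.
# "\047" is an apostrophe, so the argument is the two-character string '[
awk_quote_demo() {
    awk 'BEGIN { printf "%02x\n", "\047[" }'
}

awk_quote_demo    # the string has no numeric prefix, so awk formats 0: prints 00
```
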
Quote (Annihilannic): I've never encountered that apostrophe character constant syntax before... is it a C thing?
I'm not sure myself where that weird apostrophe syntax comes from. I was just searching the web for a solution to my small problem and found this neat little trick, which sadly doesn't work with AWK. But it seems it is indeed part of the specification of the "classic" printf, just not that well documented: http://pubs.opengroup.org/onlinepubs/009695399/utilities/printf.html
The important part is:
Quote: If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
So it's not just the apostrophe; either a single-quote or a double-quote will do.
Quote (Annihilannic): If it's going to be a pure awk script, you could avoid the shell part entirely and use a #!/usr/bin/awk -f shebang line to allow you to use the apostrophe directly in the script rather than dummy variables, but I tried that too and it didn't help the printf situation.
The reason I mix shell parts with AWK is that I just needed the URL encode function for a half-written shell script (something with functionality similar to GetRight or Firefox's DownThemAll, but as a shell script). I didn't want to rewrite the rest in AWK too, but still wanted to use AWK for the trickier parts with intense string operations, which would otherwise require countless 'expr' calls from the shell, or at least would be more complicated (and thus less performant) using the shell only.
And BTW, I found another solution that does the trick, but isn't that pretty either:
CODE --> AWK
BEGIN {
    # Build a lookup table from each character to its hex code.
    for (i = 0 ; i <= 255 ; i++) {
        t = sprintf("%c", i)
        # %02x zero-pads, so codes below 16 still give two digits
        _ord_[t] = sprintf("%02x", i)
    }
}
function ord(str, c) {
    c = substr(str, 1, 1)
    return _ord_[c]
}

Oh, and BTW, I forgot to add the working URL encoding function. The script I want to use it for isn't finished yet, though. Why isn't there an edit feature for already sent posts, so you don't have to answer your own posts to add something later?
CODE --> AWK
BEGIN {
    # Build a lookup table from each character to its hex code.
    for (i = 0 ; i <= 255 ; i++) {
        t = sprintf("%c", i)
        # %02x zero-pads, so codes below 16 still give two digits
        _ord_[t] = sprintf("%02x", i)
    }
}
function ord(str, c) {
    c = substr(str, 1, 1)
    return _ord_[c]
}
function url_encode(rawURL) {
    cleanURL = ""
    do {
        # Find the next character outside the set that is left as-is.
        nonURLPos = match(rawURL, /[^a-zA-Z0-9_.:\/]/)
        if (nonURLPos > 0) {
            rawChar = substr(rawURL, nonURLPos, 1)
            replaceChar = ord(rawChar)
            cleanURL = cleanURL substr(rawURL, 1, nonURLPos-1) "%" replaceChar
            rawURL = substr(rawURL, nonURLPos+1)
        }
    } while (nonURLPos > 0)
    cleanURL = cleanURL rawURL
    return cleanURL
}

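Put together, the lookup-table approach can be tried straight from a shell script. The following is a rough sketch rather than the exact function above: the wrapper name url_encode_sh and the names hex and safe are made up, it scans character by character instead of using match(), and it forces LC_ALL=C on the assumption that the input should be treated byte-wise.

```shell
# url_encode_sh is a hypothetical wrapper name, not from the thread.
url_encode_sh() {
    printf '%s\n' "$1" | LC_ALL=C awk '
    BEGIN {
        # Character -> two-digit hex lookup table, built once.
        for (i = 1; i <= 255; i++)
            hex[sprintf("%c", i)] = sprintf("%02x", i)
        # Same characters the thread leaves unencoded; a string regex
        # avoids having to escape the slash.
        safe = "[a-zA-Z0-9_.:/]"
    }
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c ~ safe)
                out = out c
            else
                out = out "%" hex[c]
        }
        print out
    }'
}

url_encode_sh 'http://www.url.com/test text.html'
# prints http://www.url.com/test%20text.html
```

Keeping the safe set in one variable makes it easy to adjust, for example to also leave the hyphen unencoded.
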
I think that's a good solution. Any aversion to Perl?
CODE --> Perl
echo "$teststr" | perl -nwe '
    foreach my $char (split //) {
        if ($char =~ /([[:alnum:]\/.:\n])/) {
            print $char;
        } else {
            printf "%%%02x",ord($char);
        }
    }
'
Or shorter, but more cryptic:
CODE --> Perl
echo "$teststr" | perl -nwe '
    print join "", map { $_ =~ /([[:alnum:]\/.:\n])/ ? $_ : sprintf("%%%02x",ord($_)) } split //;
'
Annihilannic

From Shelldorado:
CODE
##########################################################################
# Title      : urlencode - encode URL data
# Author     : Heiner Steven (heiner.steven@odn.de)
# Date       : 2000-03-15
# Requires   : awk
# Categories : File Conversion, WWW, CGI
# SCCS-Id.   : @(#) urlencode 1.4 06/10/29
##########################################################################
# Description
#   Encode data according to
#       RFC 1738: "Uniform Resource Locators (URL)" and
#       RFC 1866: "Hypertext Markup Language - 2.0" (HTML)
#
#   This encoding is used i.e. for the MIME type
#   "application/x-www-form-urlencoded"
#
# Notes
#    o  The default behaviour is not to encode the line endings. This
#       may not be what was intended, because the result will be
#       multiple lines of output (which cannot be used in an URL or a
#       HTTP "POST" request). If the desired output should be one
#       line, use the "-l" option.
#
#    o  The "-l" option assumes, that the end-of-line is denoted by
#       the character LF (ASCII 10). This is not true for Windows or
#       Mac systems, where the end of a line is denoted by the two
#       characters CR LF (ASCII 13 10).
#       We use this for symmetry; data processed in the following way:
#           cat | urlencode -l | urldecode -l
#       should (and will) result in the original data
#
#    o  Large lines (or binary files) will break many AWK
#       implementations. If you get the message
#           awk: record `...' too long
#            record number xxx
#       consider using GNU AWK (gawk).
#
#    o  urlencode will always terminate it's output with an EOL
#       character
#
# Thanks to Stefan Brozinski for pointing out a bug related to non-standard
# locales.
#
# See also
#   urldecode
##########################################################################

PN=`basename "$0"`          # Program name
VER='1.4'

: ${AWK=awk}

Usage () {
    echo >&2 "$PN - encode URL data, $VER
usage: $PN [-l] [file ...]
    -l:  encode line endings (result will be one line of output)

The default is to encode each input line on its own."
    exit 1
}

Msg () {
    for MsgLine
    do echo "$PN: $MsgLine" >&2
    done
}

Fatal () { Msg "$@"; exit 1; }

set -- `getopt hl "$@" 2>/dev/null` || Usage
[ $# -lt 1 ] && Usage       # "getopt" detected an error

EncodeEOL=no
while [ $# -gt 0 ]
do
    case "$1" in
        -l) EncodeEOL=yes;;
        --) shift; break;;
        -h) Usage;;
        -*) Usage;;
        *)  break;;         # First file name
    esac
    shift
done

LANG=C
export LANG
$AWK '
    BEGIN {
        # We assume an awk implementation that is just plain dumb.
        # We will convert an character to its ASCII value with the
        # table ord[], and produce two-digit hexadecimal output
        # without the printf("%02X") feature.

        EOL = "%0A"     # "end of line" string (encoded)
        split ("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
        hextab [0] = 0
        for ( i=1; i<=255; ++i ) ord [ sprintf ("%c", i) "" ] = i + 0
        if ("'"$EncodeEOL"'" == "yes") EncodeEOL = 1; else EncodeEOL = 0
    }
    {
        encoded = ""
        for ( i=1; i<=length ($0); ++i ) {
            c = substr ($0, i, 1)
            if ( c ~ /[a-zA-Z0-9.-]/ ) {
                encoded = encoded c         # safe character
            } else if ( c == " " ) {
                encoded = encoded "+"       # special handling
            } else {
                # unsafe character, encode it as a two-digit hex-number
                lo = ord [c] % 16
                hi = int (ord [c] / 16);
                encoded = encoded "%" hextab [hi] hextab [lo]
            }
        }
        if ( EncodeEOL ) {
            printf ("%s", encoded EOL)
        } else {
            print encoded
        }
    }
    END {
        #if ( EncodeEOL ) print ""
    }
' "$@"
Mike
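The hextab arithmetic in the script above, which avoids printf("%02X") altogether, can be looked at in isolation. This is a small sketch with a made-up helper name (hex_byte) that takes a character code as a decimal number: the high hex digit is the integer part of code / 16, the low digit is code % 16, and both are looked up in the same hextab table the script builds.

```shell
# hex_byte is a hypothetical helper name used only for this illustration.
hex_byte() {
    awk -v n="$1" 'BEGIN {
        # Same table as in the script: indexes 1..15 from split(), 0 by hand.
        split("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
        hextab[0] = 0
        hi = int(n / 16)    # high nibble
        lo = n % 16         # low nibble
        printf "%s%s\n", hextab[hi], hextab[lo]
    }'
}

hex_byte 91    # "[" is 91 = 5 * 16 + 11, so this prints 5B
hex_byte 10    # LF is 10 = 0 * 16 + 10, so this prints 0A
```
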