
ASCII to Hex Conversion?

Status
Not open for further replies.

PumpgunMessiah

Technical User
Mar 17, 2012
5
DE
I'm in the midst of writing a rather quick and dirty URL encoding function (replacing all special characters with their hex value, i.e. ' text.html' -> ') via AWK and am struggling with this rather simple task.

It's rather easy to achieve this with the shell's printf (notice the apostrophe in front of the ASCII character, in this case the [):
printf "%02x" "'["
> 5b

But I'm having trouble replicating the same effect with AWK. I've already tried everything with AWK's built-in printf and sprintf. I'm not even sure whether the apostrophe is necessary. I also had trouble concatenating it to the character to be converted: even 'escaped' with a backslash, I couldn't add the darn ' without errors. I worked around this in a rather ugly way by using 'awk -v' to bring the apostrophe in from "outside" of awk.

But still, I can't get awk's printf function to do the same as the "regular" shell printf.
 
I forgot to add my small testing script (tested with OS X):
Code:
#!/bin/sh

teststr=$1

echo "$teststr" | awk -v dummy="'" '
function url_encode(rawURL) {
	cleanURL=""
	nonURLPos=match(rawURL,/[^[:alnum:]]/)
	while( nonURLPos > 0 ) {
		rawChar=substr(rawURL,nonURLPos,1)
		replaceChar=sprintf("%02x", dummy rawChar)
		cleanURL = cleanURL substr(rawURL,1,nonURLPos-1) "%" replaceChar
		rawURL = substr(rawURL,nonURLPos+1)
		nonURLPos=match(rawURL,/[^[:alnum:]]/)
	}
	cleanURL = cleanURL rawURL
	return cleanURL
}
{ print(url_encode($0))}
'
urlencode.sh "abc ABC[123]456"
>abc%00ABC%00123%00456

Replacing the sprintf in 'replaceChar=sprintf("%02x", dummy rawChar)' with the printf function results in this error:
awk: syntax error at source line 7 in function url_encode
context is
>>> replaceChar=printf <<< ("%02x", dummy rawChar)
awk: illegal statement at source line 8 in function url_encode
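
For what it's worth, both symptoms above look like standard awk behaviour rather than a quirk of one implementation. A string that doesn't look like a number is converted to 0 when a numeric format such as %x asks for its value, which would explain the %00 in the output. And printf is a statement in awk, so it cannot appear on the right-hand side of an assignment the way sprintf can. A minimal sketch (the helper name awk_hex is made up for illustration and is not part of the script above):

```shell
#!/bin/sh
# awk_hex: hypothetical helper that feeds its argument to awk's %02x format.
awk_hex() {
    # -v passes the value in as an awk variable; printf is used as a
    # statement here, which is the only form awk accepts.
    awk -v s="$1" 'BEGIN { printf "%02x\n", s }'
}

awk_hex "'["   # non-numeric string: awk uses the numeric value 0
awk_hex "91"   # numeric-looking string: 91 is the decimal code of [
```

The first call prints 00 and the second prints 5b, so awk needs the character's numeric code from somewhere else before %x can do anything useful.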
 
I've never encountered that apostrophe character constant syntax before... is it a C thing?

As you say, awk certainly doesn't support it... awk has a very simple "typeless" variable system which tries to intelligently DWIM (do what I mean) rather than handling everything literally, which unfortunately is stumping you here. I can't think of a way around it for now... does it need to be awk?

If it's going to be a pure awk script, you could avoid the shell part entirely and use a #!/usr/bin/awk -f shebang line to allow you to use the apostrophe directly in the script rather than dummy variables, but I tried that too and it didn't help the printf situation.

Annihilannic
 
Annihilannic said:
I've never encountered that apostrophe character constant syntax before... is it a C thing?
I'm not sure myself where that weird apostrophe syntax comes from. I was just searching the web for a solution to my small problem and found this neat little trick, which sadly doesn't work with AWK.
But it does seem to be part of the specification of the "classic" printf, though it isn't that well documented:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/printf.html
Where the important part is:
If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
So it's not just the apostrophe: either a leading single quote or a leading double quote will do.
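
The rule quoted above can be tried directly in the shell. This is only a small sketch, assuming a printf that follows the quoted specification; char_to_hex is a made-up helper name:

```shell
#!/bin/sh
# char_to_hex: hypothetical helper. Prints the hex code of a single
# character by prefixing it with a quote, as the printf spec describes.
char_to_hex() {
    printf '%02x\n' "'$1"
}

char_to_hex '['          # leading single quote form
printf '%02x\n' '"['     # leading double quote form, same result
```

Both lines print 5b here.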

Annihilannic said:
If it's going to be a pure awk script, you could avoid the shell part entirely and use a #!/usr/bin/awk -f shebang line to allow you to use the apostrophe directly in the script rather than dummy variables, but I tried that too and it didn't help the printf situation.

The reason I mix shell with AWK is that I just needed the URL encode function for a half-written shell script (something with functionality similar to GetRight or Firefox's DownThemAll, but as a shell script). I didn't want to rewrite the rest in AWK too, but still wanted to use AWK for the trickier, string-heavy parts, which would otherwise require countless 'expr' calls from the shell, or at least be more complicated (and thus slower) in shell alone.

And BTW I found another solution, that does the trick, but isn't that pretty either:

Code:
BEGIN {
	# Lookup table: character -> hex code, zero-padded to two digits
	# like the shell printf "%02x" above
	for (i = 0; i <= 255; i++) {
		t = sprintf("%c", i)
		_ord_[t] = sprintf("%02x", i)
	}
}
function ord(str,    c)
{
	c = substr(str, 1, 1)
	return _ord_[c]
}
 
Oh, and BTW I forgot to add the working URL Encoding function. But the mentioned script I want to use it for isn't finished yet.

Why isn't there an edit feature for already sent posts, so you don't have to reply to your own posts to add something later:

Code:
BEGIN {
	# Lookup table: character -> two-digit hex code (zero-padded so that
	# characters below 0x10 still give a valid %XX escape)
	for (i = 0; i <= 255; i++) {
		t = sprintf("%c", i)
		_ord_[t] = sprintf("%02x", i)
	}
}

function ord(str,    c)
{
	c = substr(str, 1, 1)
	return _ord_[c]
}

# The extra parameters after rawURL are local variables
function url_encode(rawURL,    cleanURL, nonURLPos, rawChar, replaceChar) {
	cleanURL = ""
	do {
		# Find the next character that must be encoded
		nonURLPos = match(rawURL, /[^a-zA-Z0-9_.:\/]/)
		if (nonURLPos > 0) {
			rawChar = substr(rawURL, nonURLPos, 1)
			replaceChar = ord(rawChar)
			cleanURL = cleanURL substr(rawURL, 1, nonURLPos-1) "%" replaceChar
			rawURL = substr(rawURL, nonURLPos+1)
		}
	} while (nonURLPos > 0)

	cleanURL = cleanURL rawURL
	return cleanURL
}
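
For anyone who wants to try the idea from the shell, here is a compact sketch of the same lookup-table approach. It walks the input character by character instead of using match(), and the names url_encode_str and hex are made up for this example. Forcing LC_ALL=C is an assumption on my part, so that awk treats the input as single bytes:

```shell
#!/bin/sh
# url_encode_str: hypothetical wrapper, not part of the script above.
url_encode_str() {
    printf '%s\n' "$1" | LC_ALL=C awk '
    BEGIN {
        # Map every single-byte character to its two-digit hex code.
        for (i = 1; i <= 255; i++)
            hex[sprintf("%c", i)] = sprintf("%02x", i)
    }
    {
        out = ""
        n = length($0)
        for (p = 1; p <= n; p++) {
            c = substr($0, p, 1)
            # Same safe set as the function above: letters, digits, _ . : /
            if (c ~ /[a-zA-Z0-9_.:\/]/)
                out = out c
            else
                out = out "%" hex[c]
        }
        print out
    }'
}

url_encode_str "abc ABC[123]456"
```

With the test string from the first post this prints abc%20ABC%5b123%5d456.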
 
I think that's a good solution.

Any aversion to perl?

Perl:
echo "$teststr" | perl -nwe '
        foreach my $char (split //) {
                if ($char =~ /([[:alnum:]\/.:\n])/) {
                        print $char;
                } else {
                        printf "%%%02x",ord($char);
                }
        }
'

Or shorter, but more cryptic:

Perl:
echo "$teststr" | perl -nwe '
        print join "", map { $_ =~ /([[:alnum:]\/.:\n])/ ? $_ : sprintf("%%%02x",ord($_)) } split //;
'

Annihilannic
 
From Shelldorado

Code:
:
##########################################################################
# Title      :	urlencode - encode URL data
# Author     :	Heiner Steven (heiner.steven@odn.de)
# Date       :	2000-03-15
# Requires   :	awk
# Categories :	File Conversion, WWW, CGI
# SCCS-Id.   :	@(#) urlencode	1.4 06/10/29
##########################################################################
# Description
#	Encode data according to
#	    RFC 1738: "Uniform Resource Locators (URL)" and
#	    RFC 1866: "Hypertext Markup Language - 2.0" (HTML)
#
#	This encoding is used i.e. for the MIME type
#	"application/x-www-form-urlencoded"
#
# Notes
#    o	The default behaviour is not to encode the line endings. This
#	may not be what was intended, because the result will be
#	multiple lines of output (which cannot be used in an URL or a
#	HTTP "POST" request). If the desired output should be one
#	line, use the "-l" option.
#
#    o	The "-l" option assumes, that the end-of-line is denoted by
#	the character LF (ASCII 10). This is not true for Windows or
#	Mac systems, where the end of a line is denoted by the two
#	characters CR LF (ASCII 13 10).
#	We use this for symmetry; data processed in the following way:
#		cat | urlencode -l | urldecode -l
#	should (and will) result in the original data
#
#    o	Large lines (or binary files) will break many AWK
#    	implementations. If you get the message
#		awk: record `...' too long
#		 record number xxx
#	consider using GNU AWK (gawk).
#
#    o	urlencode will always terminate its output with an EOL
#    	character
#
# Thanks to Stefan Brozinski for pointing out a bug related to non-standard
# locales.
#
# See also
#	urldecode
##########################################################################

PN=`basename "$0"`			# Program name
VER='1.4'

: ${AWK=awk}

Usage () {
    echo >&2 "$PN - encode URL data, $VER
usage: $PN [-l] [file ...]
    -l:  encode line endings (result will be one line of output)

The default is to encode each input line on its own."
    exit 1
}

Msg () {
    for MsgLine
    do echo "$PN: $MsgLine" >&2
    done
}

Fatal () { Msg "$@"; exit 1; }

set -- `getopt hl "$@" 2>/dev/null` || Usage
[ $# -lt 1 ] && Usage			# "getopt" detected an error

EncodeEOL=no
while [ $# -gt 0 ]
do
    case "$1" in
    	-l)	EncodeEOL=yes;;
	--)	shift; break;;
	-h)	Usage;;
	-*)	Usage;;
	*)	break;;			# First file name
    esac
    shift
done

LANG=C	export LANG
$AWK '
    BEGIN {
	# We assume an awk implementation that is just plain dumb.
	# We will convert an character to its ASCII value with the
	# table ord[], and produce two-digit hexadecimal output
	# without the printf("%02X") feature.

	EOL = "%0A"		# "end of line" string (encoded)
	split ("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
	hextab [0] = 0
	for ( i=1; i<=255; ++i ) ord [ sprintf ("%c", i) "" ] = i + 0
	if ("'"$EncodeEOL"'" == "yes") EncodeEOL = 1; else EncodeEOL = 0
    }
    {
	encoded = ""
	for ( i=1; i<=length ($0); ++i ) {
	    c = substr ($0, i, 1)
	    if ( c ~ /[a-zA-Z0-9.-]/ ) {
		encoded = encoded c		# safe character
	    } else if ( c == " " ) {
		encoded = encoded "+"	# special handling
	    } else {
		# unsafe character, encode it as a two-digit hex-number
		lo = ord [c] % 16
		hi = int (ord [c] / 16);
		encoded = encoded "%" hextab [hi] hextab [lo]
	    }
	}
	if ( EncodeEOL ) {
	    printf ("%s", encoded EOL)
	} else {
	    print encoded
	}
    }
    END {
    	#if ( EncodeEOL ) print ""
    }
' "$@"
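
The hex conversion in the script above avoids printf's %X entirely: it splits the character code into a high and a low nibble and looks each one up in hextab. A minimal sketch of just that step, with byte_to_hex as a made-up helper name that takes a decimal code from 0 to 255:

```shell
#!/bin/sh
# byte_to_hex: hypothetical helper showing the hextab lookup in isolation.
byte_to_hex() {
    awk -v n="$1" 'BEGIN {
        # hextab[1..15] come from split(); hextab[0] is set by hand.
        split("1 2 3 4 5 6 7 8 9 A B C D E F", hextab, " ")
        hextab[0] = 0
        hi = int(n / 16)   # high nibble
        lo = n % 16        # low nibble
        print hextab[hi] hextab[lo]
    }'
}

byte_to_hex 91   # code of [
byte_to_hex 10   # code of LF, matches the EOL string "%0A"
```

This prints 5B and 0A, so the approach should work even on an awk whose printf lacks hex formats.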

Mike

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."
 