
scramble the order of records in a file


mpramods (Technical User)
This is a continuation of my previous post, 'create file with possible combinations of numbers from 3 files'.

I have a file with 10,000,000 numbers. Sample data (18 records) from this file, file1, is as follows:
100333505
100333606
100333707
100444505
100444606
100444707
100555505
100555606
100555707
200333505
200333606
200333707
200444505
200444606
200444707
200555505
200555606
200555707

I would like to scramble the order of the records in file1 and put them into a new file. The scrambled_file would look something like this:
200444505
100333505
100333606
200555505
100333707
200555707
100444505
200333606
100444606
100555606
100555707
200333505
200333707
200444606
100444707
200444707
200555606
100555505

Please note that the records in the scrambled_file remain the same; only the order is changed (the order is random). How can I do this?

Thanks,
Pramod
 
Just curious, why do you want to do that?


--------------------------------------------------------------------------
I never set a goal because you never know what's going to happen tomorrow.
 
What you require is far from trivial and, I would guess, near impossible in shell scripting. My first thought was to use a language such as Perl: read in the records, shuffle them and print them out again. However, 10,000,000 records is an enormous amount and the memory overhead would be high - the data file is 100,000,000 bytes before any overhead from reading it into memory.

The next problem you hit is how you define random. I work for a company that runs a national lottery with a large number of entries. We need to provide, on a monthly basis, something along the lines of what you require, and it takes a significant amount of computing to do it - and the random number generator is a separate, distinct piece of hardware, custom designed and audited.

Ceci n'est pas une signature
Columb Healy
 
Hi

Until now, [tt]sort[/tt] has handled huge amounts of data for me. I assume it can deal with those 10,000,000 lines too.
Code:
sort -R /input/file > /scrambled/file
Tested with GNU coreutils [tt]sort[/tt] - and, of course, with far less data.
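Worth noting: [tt]sort -R[/tt] sorts by a random hash of the keys, so identical lines end up grouped together. If your coreutils is new enough, [tt]shuf[/tt] gives a true random permutation instead - a minimal sketch with the same file names as above:
Code:
shuf /input/file > /scrambled/file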

Feherke.
 
feherke

My RedHat ES4 [tt]sort[/tt] doesn't support the -R flag, even though I thought it would be GNU coreutils. I've checked the GNU documentation and they don't mention it. Where did your version come from?

Ceci n'est pas une signature
Columb Healy
 
Hi

Code:
[blue]master #[/blue] seq 100000000 110000000 > hugefile.txt

[blue]master #[/blue] wc -l hugefile.txt
10000001 hugefile.txt

[blue]master #[/blue] ls -lh hugefile.txt
-rw-r--r-- 1 master None 96M May 18 14:01 hugefile.txt

[blue]master #[/blue] time sort -R hugefile.txt > scrambledfile.txt

real    4m11.061s
user    3m56.765s
sys     0m7.093s

[blue]master #[/blue] head -$RANDOM scrambledfile.txt | tail -10
108197787
101050327
108051105
106260414
104965658
108832288
104934445
105478192
104684364
102577666

[blue]master #[/blue] sort --version | head -2
sort (GNU coreutils) 6.7
Copyright (C) 2006 Free Software Foundation, Inc.

[blue]master #[/blue] sort --help | grep -- -R
  -R, --random-sort           sort by random hash of keys

[blue]master #[/blue] uname -a
CYGWIN_NT-5.1 master 1.5.24(0.156/4/2) 2007-01-31 10:57 i686 Cygwin

Feherke.
 
Hi

But I have to admit that [tt]sort[/tt] did not perform exactly as I expected: it used approximately 335 MB of memory. But I still hope it is able to make do with less memory when that much is not available.
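If memory does become the bottleneck, GNU [tt]sort[/tt] can be capped explicitly and told where to spill its temporary files - a minimal sketch, where the 64M buffer size is just an example value:
Code:
sort -R -S 64M -T /tmp /input/file > /scrambled/file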

Feherke.
 
Hi

Columb, indeed, that document does not mention the -R parameter, but if you download the current [link ftp://ftp.gnu.org/gnu/coreutils/coreutils-6.9.tar.gz]coreutils[/link] package you can see it in the coreutils-6.9.tar/coreutils-6.9/man/sort.1 file.

Yet another example that my CygWin is ages newer than all the regular Linux distributions I used to work on...

Feherke.
 
I tried out the following, but it certainly fails on both points made by Columb (memory and the use of a pseudo-random generator).

Code:
perl -e '@a = <>; print splice @a, rand @a, 1 while @a' /input/file > /scrambled/file

I'd run this by the C or possibly the Perl people. You'll probably need some hefty I/O and a first-rate rand library.
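For what it is worth, a Fisher-Yates shuffle would avoid the O(n^2) cost of the repeated [tt]splice[/tt]. A minimal in-memory sketch in awk follows; it still slurps the whole file, so Columb's memory caveat applies just the same:
Code:
awk 'BEGIN{srand()}
{a[NR]=$0}                      # read every line into an array
END{
  for(i=NR;i>1;i--){            # Fisher-Yates: walk backwards,
    j=int(rand()*i)+1           # pick a random index in 1..i,
    t=a[i]; a[i]=a[j]; a[j]=t   # and swap
  }
  for(i=1;i<=NR;i++) print a[i]
}' /input/file > /scrambled/file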

Cheers,
ND [smile]
 
Hi

If it fails because of a lack of memory, then I do not think that just choosing another programming language will solve it.

Better to change the approach (see the sketch after this list):
[ol]
[li]split the huge file into smaller chunks[/li]
[li]sort each chunk separately[/li]
[li]merge the resulting chunks back together, passing through them line by line and comparing only one line from each at a time[/li]
[/ol]
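A minimal sketch of that approach in shell, prefixing a random key up front so the final numeric merge yields a random order (chunk size and file names are just examples):
Code:
awk 'BEGIN{srand()}{print rand(), $0}' /input/file > keyed.txt  # prefix each line with a random key
split -l 1000000 keyed.txt chunk.                               # 1. split into 1,000,000-line chunks
for f in chunk.*; do sort -n -o "$f" "$f"; done                 # 2. sort each chunk separately
sort -n -m chunk.* | cut -d' ' -f2- > /scrambled/file           # 3. merge chunks line by line, drop the keys
rm keyed.txt chunk.*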

Feherke.
 
And what about this (legacy unix ;-)):
Code:
awk 'BEGIN{srand()}{print rand(),$1}' /path/to/input | sort -n | awk '{print $2}' > output

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PHV beat me to it...

But combining the two tasks (assuming files named f1, f2 and f3):

Code:
awk 'BEGIN{
srand()
i=j=k=0
# slurp each input file into its own array
while ((getline f1 <"f1") > 0) f1a[i++]=f1
while ((getline f2 <"f2") > 0) f2a[j++]=f2
while ((getline f3 <"f3") > 0) f3a[k++]=f3
# print every combination, prefixed with a zero-padded random key
# (zero padding keeps the key a single field for the cut below)
for (a in f3a)
 for (b in f2a)
  for (c in f1a)
   printf "%07d %s-%s-%s\n", rand()*10000000, f1a[a], f2a[b], f3a[c]
}'|sort -n|cut -d" " -f2


HTH,

p5wizard
 
spookie,
To answer your question, we need to change some sensitive data in a file. So I am creating a scrambled lookup file, the data from which will be used to replace the real data.
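For illustration, a minimal sketch of that lookup step, assuming the real and scrambled files pair up line by line (the file names are hypothetical):
Code:
paste -d' ' file1 scrambled_file > lookup.txt   # each line: real_value scrambled_value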

feherke,
I do not have GNU coreutils on the system I am working on, so 'sort -R' does not work for me. I was going to try your other approach of splitting into a number of files. I may have to use that approach in the future if we run into issues.

PHV and p5wizard,
Your solutions worked perfectly. Thanks for the help.

Cheers,
Pramod
 
Here's a quick-and-dirty I wrote in Korn shell a while back. It bogs down a bit if the file gets really massive.
Code:
typeset -RZ6 KEY                 # ksh: right-justified, zero-filled, 6 chars wide

INFILE=original.dat
OUTFILE=unsorted.dat

# prefix each record with a random key, sort on it, then strip it off
while read RECORD
do
        KEY=${RANDOM}
        print "${KEY}:${RECORD}"
done < ${INFILE} | sort -n | sed 's/^[0-9]*://1' > ${OUTFILE}
It also assumes there's no ":" in the data.
 
Hi

Hmm... I wondered if I could tease Sam a bit by suggesting [tt]cut[/tt] as a speed optimization. My theory was that [tt]cut[/tt] should beat [tt]sed[/tt] because it does not use a slow regular expression.
Code:
[blue]master #[/blue] time sed 's/^[0-9]*://1' seq2.txt > /dev/null

real    0m8.629s
user    0m5.093s
sys     0m0.937s

[blue]master #[/blue] time cut -d: -f2 seq2.txt > /dev/null

real    0m9.318s
user    0m6.905s
sys     0m0.421s

[blue]master #[/blue] time awk -F: '{print$2}' seq2.txt > /dev/null

real    0m6.179s
user    0m4.515s
sys     0m0.624s
Sorry for posting these outputs, but I find the results interesting.

Feherke.
 
No problem. Like I said, it was a quick-and-dirty. I'm kind of surprised that the [tt]cut[/tt] wasn't the fastest.

Since the random key is always 6 characters, can you try it with the same data file with this?
Code:
time sed 's/^......://1' seq2.txt > /dev/null
Also how about...
Code:
time cut -c1-7 seq2.txt > /dev/null
That simplifies or eliminates the pattern match.

Actually, that "1" at the end of the search and replace in the sed does make it allow for colons (":") in the data. It will still only chop off the prepended key up to the first colon it sees.
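A quick illustration of that behaviour, using a made-up sample line:
Code:
$ echo '012345:keep:these:colons' | sed 's/^[0-9]*://1'
keep:these:colons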
 
Hi

Hmm... A minor problem: I did not use fixed-width prefixes... You know, my login shell is [tt]bash[/tt] and that [tt]typeset[/tt] is [tt]ksh[/tt] specific... Anyway, solved now. Another minor problem, see in red.
Code:
[blue]master #[/blue] time sed 's/^......://1' seq2b.txt > /dev/null

real    0m4.641s
user    0m4.061s
sys     0m0.531s

[blue]master #[/blue] time cut -c[red]8-[/red] seq2b.txt > /dev/null

real    0m5.899s
user    0m5.156s
sys     0m0.796s
Anyway, your approach of cutting characters instead of fields seems to be useful. [medal]

Feherke.
 