Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

HELP: Correlation in SAS (Beginner)

Status
Not open for further replies.

SASHelp

Technical User
Feb 15, 2007
7
US
Hello. I don't have much experience with SAS. I just downloaded a series of stock data from CRSPS database. The data contains daily stock prices from 1985-1989 from the softdrink industry. I need to run a correlation on ALL of the pairs of stocks' prices and see which have a correlation of .9 or greater. I was told I need to "sort" and "transpose" the data in SAS.

Please help.
 
we need to know how your current data is structured.

another way (as opposed to transposing) is to create a cross product of stock quotes by day then use 'PROC CORR' with a 'BY' statement to produce the correlations e.g.

i assumed your data looks something like:
[tt]
stock date quote
AAA 1JAN2007 1.20
BBB 1JAN2007 1.23
CCC 1JAN2007 2.21
...etc
[/tt]

i've outputted the correlations to a table so that its easy to filter for correlations > 0.9

Code:
* sample data;
data stock_quote ;
  format stock $3. date date9. quote 8.2;
  informat date date9.;
  input stock $ date quote;
datalines;
AAA 1JAN2007 1.20
BBB 1JAN2007 1.23
CCC 1JAN2007 2.21
AAA 2JAN2007 1.22
BBB 2JAN2007 1.24
CCC 2JAN2007 2.20
AAA 3JAN2007 1.29
BBB 3JAN2007 1.28
CCC 3JAN2007 2.18
AAA 4JAN2007 1.31
BBB 4JAN2007 1.27
CCC 4JAN2007 2.16
;
run;

* create cross product of stocks by date;
proc sql;
  create table quotes as
  select a.date, a.stock as stock1, b.stock as stock2,  a.quote as quote1, b.quote as quote2 
  from stock_quote a, stock_quote b
  where a.date = b.date
    and a.stock > b.stock
  order by a.stock, b.stock;
quit;

* compute pearson's correlation;
proc corr data = quotes outp=correlations noprint;
  by stock1 stock2;
  var quote1 quote2;
quit;

* display correlations > 0.9;
proc sql;
  select stock1, stock2, quote1 as correlation
  from correlations
  where _type_='CORR' 
    and _name_='quote2'
    and quote1 > 0.9; /* select only correlations > 0.9 */
quit;

 
Hello, my data is like this

PERMNO PERMCO DATE TICKER COMNAM PRC
11308 20468 1985010 KO COCA COLA CO 61.875
11308 20468 19850103 KO COCA COLA CO 62.5
11308 20468 19850104 KO COCA COLA CO 62.375
11308 20468 19850107 KO COCA COLA CO 62.75
11308 20468 19850108 KO COCA COLA CO 62.875
11308 20468 19850109 KO COCA COLA CO 63.75
...until 1989
Followed by the next company's data listed until 1989...and so on and so forth...

Would that work with the way you explained above? Sorry, I have very little SAS experience. I greatly appreciate all your help!
 
yes, it should work on your data - you just need to change the variable names in the code i gave you.

in my code 'stock' is the variable that uniquely identifies a company - so that is 'COMNAM' for you.

I've also assumed that one record uniquely identifies one stock quote for a company on one date.

i've also assumed the name of your dataset is called "stock_quote" - so you need to change any references of "stock quote" in my code to whatever your dataset is called.


so, change the first 'proc sql' code fragment to:
Code:
* create cross product of stocks by date;
proc sql;
  create table quotes as
  select a.date, a.comnam as stock1, b.comnam as stock2,  a.prc as quote1, b.prc as quote2 
  /*WARNING: change dataset name below */
  from [red]stock_quote[/red] a, [red]stock_quote[/red] b 
  where a.date = b.date
    and a.comnam > b.comnam
  order by a.comnam, b.comnam;
quit;

the rest of the code will run ok.

if the stock quotes are freely available then you could provide a link and i'll download it and check the code.

oh - i accidently put >0.9 instead of >=0.9 in the last 'proc sql'

cheers,
dan.

 
DanJR,

Thank you so much for your help! Unfortunately, the data is from WRDS database which requires private membership. So assuming I change the write variables you mentioned above, I just copy and paste all of that code into the SAS Editor window?
 
libname datasets 'd:\research\sas_datasets';

data all;
set datasets.Soft_Drink;
prc=abs(prc);
run;

proc sort data=all;
by date permno;
run;


proc transpose data=all out=tran_all prefix=ret;
by date;
id permno;
var ret;
run;
** check this print output to make sure that your data have been
properly transposed where each column is the return on a particular stock;
proc print data=tran_all;
run;
** the COLON is used as a wildcard. This will produce all pairwise correlations;
proc corr data=tran_all outp=correlations;
var ret:;
run;


^ DanJR, that is my code above. What do I need to enter to just look at the correlations of >= 0.9
 
i rearly use 'PROC TRANSPOSE' but did your first proc transpose work? i think you need to change it to:
Code:
proc transpose data=all out=tran_all prefix=ret;
by date;
id permno;
/* i don't think the next line is neccessary*/
[red]/*var ret;*/[/red]
run;

then change your next proc transpose to (suppress output):
Code:
proc corr data=tran_all outp=correlations [red]noprint[/red];
var ret:;
run;
i'm guessing this will give you a matrix of correlations with the permno prefixed with 'ret'.

so, just transpose it back and filter:
Code:
proc transpose data=correlations(where =(_type_='CORR') rename=(_name_=name)) out=corr;
    var ret:;
    by _type_ name;
quit;

data corr90;
  set corr;
  where name > _name_ 
    and [blue]col1 >= .9[/blue];
run;

alternative, you could have used my first approach :) (but the proc transpose will be faster)

cheers,
dan.


 
DanJr,

ONCE AGAIN - thank you for all your help. My next "problem" is the stock data I'm downloading from the CRSP database is "too large" for SAS to handle. How can I get SAS to randomly pick 50 stocks from the hundreds-thousands of stocks in my data...?

Basically what I'm trying to say is, since the data is outputted in alphabetical order... can you have SAS just rearrange it 'randomly'? I will then just work with the top 50 so you don't have to worry about isolating it.

I really appreciate all your help.
 
1)how many records? how many different companies?

2)can you show us the code that you are using and tell us which bit of the code is problematic ie "too slow"?

3) are you getting any errors, if so can you show us the log.

there are some things you can do to improve performance. e.g. using a different approach, using indexes, increasing buffer size to reduce IO, compressing...

cheers,
dan.
 
Dan,

I'm collecting data from CRSP database of stocks from 1985-1986 of daily stocks throughout the entire database. So as you can imagine, it is a HUGE file ~ 200 MB. I usually get an error from says saying OUT OF VIRTUAL MEMORY when trying to run the correlation of all the stocks. I am doing a project on "pair stocks," so I need to find stocks with a correlation of .9 or greater. If I can manipulate the arrangement of stocks (as opposed to alphabetical) then I can choose the first 50 "randomly assorted" stocks and run a correlation on them. To give you an idea, there are 1,944,898 rows of data in this output. Since it's daily stocks, it gives the stock name and its daily return for every single day throughout the entire year. Then moves onto the next stock...

I'm still using the correlation code I posted above which works... then once I can pick 50 random stocks, I will run your code to weed out the ones that have correlations of .9 or greater.

As always, thank you very much.
 
ok, so the problem is memory not IO. i had a look at proc corr using "options fullstimer" & saw that its memory requirement seems quadratic to the number of varibles.

in order to select a unique list of companyies:
Code:
proc sql;
  create table company as
  select distinct permno from mydataset;
quit;
in order to randomly select 50 of them:
Code:
proc surveyselect data=company
      method=srs n=50
      out=company_sample;
run;

another way could be to iterate over each stock and compare it to all others (create a sas macro) using the "WITH" statement in the PROC CORR. These will solve the memory problem, but it will take a long time...perhaps someone has a better solution??

cheers,
dan





 
Dan,

I have a new plan. I'd like to isolate the entire 5 years of data by SIC code (4 digit number). Now, instead of running a correlation on ALL of the stocks, I'd like to isolate the entire data set by specific industry. Since the SIC code for TOBACCO PRODUCTS is 21, I tried to formulate my code as follows:

libname datasets 'd:\research\sas_datasets';

data all;
set datasets.whole8589;
prc=abs(prc);
run;

proc means data=all;
where SIC>=2100 and SIC<2200;
run;

proc sort data=all;
by date permno;
run;

The ERROR I receive is "ERROR: Variable SIC is not on file WORK.ALL."

Any help is greatly appreciated!
 
The error message is pretty self explanatory I think. It's saying that the dataset ALL does not contain a variable named SIC. You need to check that. You should be able to browse the dataset through your SAS session, or add in a proc contents step to list the variables in the dataset.
Also, as a suggestion, I'd avoid using the libname "datasets" as there is a procedure using that name, it might cause problems further down the line...

Chris
Business Analyst, Code Monkey, Data Wrangler.
SAS Guru.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top