Regression Model for categorical data

Status
Not open for further replies.

smile4life

Programmer
Apr 15, 2012
1
CA
Hi;

I have a very large dataset in a CSV file (1,700,000 rows and 300 sparse features).
- It has a lot of missing values.
- The data is a mix of numeric and categorical values.
- The dependent variable (the class) is binary (either 1 or 0).
- The data is highly skewed: the number of positive responses is low.
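For reference, the properties listed above (missing values per column, class balance) can be measured without loading the whole file into memory. This is only a minimal sketch using Python's standard `csv` module; the column names, the `profile_csv` helper, and the set of missing-value markers are assumptions for illustration, not details of the real file.

```python
import csv
import io
from collections import Counter

MISSING = {"", "NA", "?"}  # assumed missing-value markers; adjust to the real file


def profile_csv(handle, target):
    """Stream rows once; return (row count, missing count per column, class counts)."""
    reader = csv.DictReader(handle)
    missing = Counter()
    classes = Counter()
    n_rows = 0
    for row in reader:
        n_rows += 1
        classes[row[target]] += 1
        for name, value in row.items():
            if name is None:
                continue  # extra cells beyond the header; ignored in this sketch
            if value is None or value.strip() in MISSING:
                missing[name] += 1
    return n_rows, missing, classes


# Tiny in-memory stand-in for the real file (hypothetical column names).
sample = io.StringIO("age,color,label\n34,red,0\n,blue,0\n51,,1\n")
n_rows, missing, classes = profile_csv(sample, "label")
```

With the real data, the `io.StringIO` stand-in would be replaced by something like `open(path, newline="")`, so only one row is held in memory at a time.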

Now I am required to apply a regression model and any other machine learning algorithm to this data.

I'm new to this and I need help.
- How do I deal with categorical data in a regression model? And do the missing values affect it much?
- What is the best prediction model I can try for large, sparse, skewed data like this?
- What program do you advise me to work with? I tried Weka, but it can't even open that much data (memory failure). I know that MATLAB can open either a numeric CSV or a categorical CSV, but not a mixed one; besides, the missing values have to be imputed before it will open the file. I know a little bit of R.

- I'm trying to manipulate the data using Excel, Access, and a Perl script, and that's really hard with this amount of data. Excel can't open more than about 1M records and Access can't open more than 255 columns. Any suggestions?
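To make the categorical-data question concrete: one common approach (not necessarily the best one for this dataset) is one-hot (dummy) encoding, where each category becomes its own 0/1 feature and a missing cell gets its own indicator instead of being imputed. The sketch below is plain Python with a sparse dict per row; the `encode_row` helper, the column names, and the missing-value markers are all hypothetical.

```python
MISSING = {"", "NA", "?"}  # assumed missing-value markers


def encode_row(row, categorical):
    """Turn one record into a sparse {feature: value} dict.

    Categorical columns become 0/1 indicators named "col=value";
    missing cells become a "col=<missing>" indicator instead of being imputed.
    """
    features = {}
    for name, value in row.items():
        if value is None or value.strip() in MISSING:
            features[f"{name}=<missing>"] = 1.0
        elif name in categorical:
            features[f"{name}={value.strip()}"] = 1.0
        else:
            features[name] = float(value)  # numeric column, kept as a number
    return features


row = {"age": "34", "color": "red", "plan": ""}  # hypothetical columns
encoded = encode_row(row, categorical={"color", "plan"})
```

Only the features that are actually present are stored, which suits sparse data; whether a missing indicator or imputation works better would have to be checked on the real data.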

Thank you in advance for your help.
 