How can I use categorical (class) inputs with my neural network?

by Predictor
There are quite a few ways to present categorical inputs to neural networks (specifically, multilayer perceptrons). To be perfectly clear, a categorical variable takes on a finite number of distinct values that have no natural order. An example is "Country of Origin" for automobiles, with values like "USA", "Germany" and "Japan". In contrast, ordinal variables, such as "Size", have an inherent ordering: "Small", "Medium" and "Large".

The simplest way to present categorical data to neural networks is using dummy variables. One 0/1 flag is created for each possible value, like this:

"Country of Origin"   "USAFlag"   "GermanyFlag"   "JapanFlag"
"USA"                 1           0               0
"Germany"             0           1               0
"Japan"               0           0               1

This permits categorical data to be input to a neural network (or any mathematical model) and effectively localizes the information about each categorical value. It does, however, expand the number of input variables, dramatically so when the variable has many distinct values. In most situations, it is possible to use one fewer flag than the total number of values: when "USAFlag" and "GermanyFlag" are both zero, "JapanFlag" is implied to be one. To reduce the number of dummy variables further, classes with small representation can be collected under a single "Other" flag. Although this discards the information distinguishing those small categories from one another, it can reduce the number of inputs substantially.
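
To make the dummy-variable scheme above concrete, here is a minimal Python sketch using only the standard library. The function name dummy_encode and its parameters min_count and drop_last are made up for this illustration and do not come from any particular package. It builds one 0/1 flag per value, optionally pools rare values under "Other", and optionally drops the last flag, since that one is implied by the rest.

```python
from collections import Counter


def dummy_encode(values, min_count=1, drop_last=False):
    """Return (flag_names, rows) of 0/1 dummy variables for one categorical column.

    min_count and drop_last are illustrative parameters, not a standard API.
    """
    counts = Counter(values)
    # Pool values seen fewer than min_count times under a single "Other" label.
    pooled = [v if counts[v] >= min_count else "Other" for v in values]
    # Keep categories in order of first appearance, with "Other" moved to the end.
    categories = list(dict.fromkeys(pooled))
    if "Other" in categories:
        categories.remove("Other")
        categories.append("Other")
    if drop_last:
        # The dropped category is implied when all remaining flags are zero.
        categories = categories[:-1]
    names = [c + "Flag" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in pooled]
    return names, rows


names, rows = dummy_encode(["USA", "Germany", "Japan"])
for value, row in zip(["USA", "Germany", "Japan"], rows):
    print(value, row)
```

Calling it with drop_last=True gives the "one fewer flag" variant, and raising min_count pools the thinly represented values. Note that this sketch assumes no genuine category is already named "Other".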

A warning: For those familiar with binary numbering, it may be tempting to reduce the number of inputs by encoding the category values as binary integers. This is not likely to work well. Consider an expanded "Country of Origin" example, which adds two values: "Britain" and "Italy". One might "compress" the representation of these five values by using only 3 bits, like this:

"Country of Origin"   "FlagA"   "FlagB"   "FlagC"
"USA"                 0         0         1
"Germany"             0         1         0
"Japan"               0         1         1
"Britain"             1         0         0
"Italy"               1         0         1

In this representation, note that different pairs of values have varying numbers of bits in common. For instance, "Japan" differs from "Britain" in all three bits, while differing from "Germany" in only one. This implies an artificial similarity between "Japan" and "Germany". Keep in mind that there is no implicit ordering of these values, so any such similarity is purely an accident of the coding. Stick to dummy variables and avoid binary representations.
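
The uneven similarity is easy to check by counting differing bits (the Hamming distance) between every pair of codes. The following Python sketch does this for the 3-bit table above and, for comparison, for plain dummy variables. The helper names hamming and pairwise_distances are just illustrative.

```python
from itertools import combinations

# The 3-bit codes from the table above: (FlagA, FlagB, FlagC).
binary_codes = {
    "USA": (0, 0, 1),
    "Germany": (0, 1, 0),
    "Japan": (0, 1, 1),
    "Britain": (1, 0, 0),
    "Italy": (1, 0, 1),
}


def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(codes):
    """Hamming distance for every unordered pair of category values."""
    return {(p, q): hamming(codes[p], codes[q]) for p, q in combinations(codes, 2)}


# Dummy-variable codes for the same five values: one flag per value.
dummy_codes = {
    name: tuple(1 if i == j else 0 for j in range(len(binary_codes)))
    for i, name in enumerate(binary_codes)
}

binary_d = pairwise_distances(binary_codes)
dummy_d = pairwise_distances(dummy_codes)

for pair, dist in binary_d.items():
    print(pair, "binary:", dist, "dummy:", dummy_d[pair])
```

With the binary codes the distances vary from pair to pair, whereas with dummy variables every pair of values differs in exactly two flags, so no pair is made to look closer than any other.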

The literature also records some success representing class inputs by the sample probability of the target class in classification problems. For instance, if the neural network is intended to classify cars as being "fuel efficient" (say, MPG >= 35) or "fuel inefficient" (MPG < 35), one can represent the categorical variable "Country of Origin" by the probability that examples from each value are "fuel efficient", like this:

"Country of Origin"   "CountryClassSummary"
"USA"                 0.45
"Germany"             0.67
"Japan"               0.83
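
A summary column like this can be computed directly from the data. Below is a minimal Python sketch, assuming the MPG >= 35 threshold from the example above; the function name class_probability_encoding and the small sample of cars are invented purely for illustration and are not the data behind the table. In practice one would probably want to compute the probabilities from the training examples only, so that the input does not leak target information from held-out data.

```python
from collections import defaultdict


def class_probability_encoding(categories, targets):
    """Map each category value to the fraction of its examples in the target class."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for category, is_target in zip(categories, targets):
        totals[category] += 1
        positives[category] += int(is_target)
    return {c: positives[c] / totals[c] for c in totals}


# Made-up sample: country of origin and MPG for seven cars.
countries = ["USA", "USA", "Germany", "Germany", "Germany", "Japan", "Japan"]
mpg = [38, 24, 36, 41, 30, 39, 35]

# "Fuel efficient" means MPG >= 35, as in the example above.
fuel_efficient = [m >= 35 for m in mpg]

encoding = class_probability_encoding(countries, fuel_efficient)

# Replace each categorical value with its single numeric summary.
country_class_summary = [encoding[c] for c in countries]
print(encoding)
```

The network then receives one numeric input per categorical variable instead of one flag per value.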



