There are quite a few ways to present categorical inputs to neural networks (specifically, multilayer perceptrons). To be perfectly clear, categorical variables take on a finite number of distinct values that are not ordered. An example of a categorical variable is "Country of Manufacture" for automobiles, which assumes values like "USA", "Germany" and "Japan" that have no natural order. In contrast, ordinal variables, such as "Size", have some inherent ordering, like "Small", "Medium" and "Large".
The simplest way to present categorical data to neural networks is with dummy variables. One 0/1 flag is created for each possible value, like this:

    Country of Manufacture   USAFlag   GermanyFlag   JapanFlag
    USA                      1         0             0
    Germany                  0         1             0
    Japan                    0         0             1
This permits categorical data to be input to a neural network (or any mathematical model) and effectively localizes the information about each categorical value. It does, however, expand the number of input variables dramatically. In most situations it is possible to use one fewer flag than the total number of values: when "USAFlag" and "GermanyFlag" are both zero, the value of "JapanFlag" is implied. To further reduce the number of dummy variables, categories with small representation can be collected under a single "Other" flag. Although this discards the information distinguishing those small categories from one another, it can reduce the number of inputs substantially.
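As a minimal sketch, assuming the data live in a pandas DataFrame (the column and value names are simply those from the example above):

    import pandas as pd

    # Hypothetical sample of the categorical variable.
    cars = pd.DataFrame({"Country": ["USA", "Germany", "Japan", "USA", "Japan"]})

    # One 0/1 flag per value; drop_first=True keeps one fewer flag than the
    # number of values, since the dropped value is implied whenever all of
    # the remaining flags are zero.
    dummies = pd.get_dummies(cars["Country"], prefix="Country", drop_first=True)
    print(dummies)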
A warning: For those familiar with binary numbering, it may be tempting to reduce the number of inputs by storing the category as a binary integer. This is not likely to work well. Consider an expanded "Country of Origin" example that adds two more values, "Britain" and "Italy". One might "compress" the representation into only 3 bits, assigning codes something like this:

    Country of Origin   Bit2   Bit1   Bit0
    USA                 0      0      0
    Germany             0      0      1
    Japan               0      1      1
    Britain             1      0      0
    Italy               1      0      1
In this representation, note that different pairs of values have varying numbers of bits in common. For instance, "Japan" differs from "Britain" in all three bits, while differing from "Germany" in only one, which implies an artificial similarity between "Japan" and "Germany". Keep in mind that there is no implicit ordering of these values. Stick to dummy variables and avoid binary representations.
The literature also reports some success representing categorical inputs by the sample probability of the target class in classification problems. For instance, if the neural network is intended to classify cars as "fuel efficient" (say, MPG >= 35) or "fuel inefficient" (MPG < 35), one can represent the categorical variable "Country of Origin" by the probability that examples with each value are "fuel efficient", like this:
"Country of Origin" "CountryClassSummary"
"USA" 0.45
"Germany" 0.67
"Japan" 0.83