I have a dataset and i want to decide on which ML algorithm to apply to my given problem.
Customers are to fill out an assessment questionnaire of 50 questions. Examples of the questions are, what is your job, previous job history, how much do you earn, have you been rejected for a loan etc, and the end goal is to decide whether they should be rejected or not.
I have circa 500 entries for my algorithm to learn from and have pre-processed my dataset and converted the inputs into a numpy array and wondering what would be the best algorithm to use? Should i use a classification algorithm or a neural network in tensorflow and if the latter, what would be the layers I should use?
How about beginning with
random forest? – So plain “old” ML?
The advantage would be that you could visualize the decision tree of the model once trained.
If using a NN in tensorflow (or even easier: keras with tensorflow backend), you could go with a MLP (multi layer perceptron), since the questions answers have fixed position in the input. You don’t need many layers.
Important is that you normalize your input data columnwise, so that the input numbers are not much bigger/smaller than +1/-1, respectively. Introductory books often miss this point, though important.
Since your target labeling is “accept” or “reject”, binary classifier will do it (also in the machine learning approach). (You use 0 and 1 as labels).
For NN, you don’t need for such kind of classification that many layers or neurons. Try the smallest network first. let’s say 10 neurons in first layer, then 7 neurons in the next layer (probably even less) and then 1 output neuron for the binary decision.
With Keras this would be:
from keras.models import Sequential from keras.layers import Dense def create_mlp(n_input = 500): # number of columns of input data 500 here model = Sequential() model.add(Dense(10, input_dim=n_input, kernel_initializer='normal', activation='relu')) # init = kernel_initializer model.add(Dense(7, kernel_initializer='normal', activation='relu')) model.add(Dense(1, activation='sigmoid')) model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc']) return model model = create_mlp(500) # this will generate the correct NN compiled.
Your data frame (or Numpy input array must have as rows the samples,
the columns are for each answer for a question 1 column.
the answers you have to encode in a numeric form. Numbers should be small – the best between -1 and 1. NNs don’t like big numbers. Thus column-wise normalization can help.)
That’s it. I learned all this stuff last year. Good luck for learning. It will be tons of fun!