I’m trying to predict personality types from the Myers-Briggs test.
I created my own CSV file with 16 rows and 2 columns, which looks like this:
    Personality Type | Description
 1  INTJ             | This personality type is (...)
 2  ENTP             | (...)
 ...
 16 (...)            | (...)
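For context, this is roughly how I load the file (just a sketch; the file name `mbti_descriptions.csv` and the exact column names are placeholders for whatever is actually in my CSV):

    import pandas as pd

    # Load the 16-row dataset; "mbti_descriptions.csv" is a placeholder name
    df = pd.read_csv('mbti_descriptions.csv')

    # Features are the free-text descriptions, labels are the 16 type codes
    x_train_data_array = df['Description'].values
    y_labels = df['Personality Type'].values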
What have I already tried?
Preprocessing: I tried to preprocess the description column by removing stopwords and applying tokenization:
    from nltk.corpus import stopwords

    # \u200b is a zero-width space that would otherwise go unrecognised
    filter_char = '“”,.()\"/:;""%?¿!¡´\u200b\n\r'

    new_stopwords = ['person', 'personality', 'type', 'period', 'info']
    stop_words = stopwords.words('english')
    stop_words.extend(new_stopwords)
    stop_words = set(stop_words)
    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=100, filters=filter_char, lower=True, split=' ')
    tokenizer.fit_on_texts(x_train_data_array)

    # Index each word from the tokenized x_train_data_array
    word_index = tokenizer.word_index

    # Reverse the word index (value: key)
    reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

    # Clean word_index by getting rid of stopwords
    meaningful_words = np.array([i for i in word_index if i not in stop_words])
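My rough idea for the next step, turning the cleaned descriptions into model inputs, is something like this (only a sketch; `pad_sequences`, `LabelEncoder`, and the `maxlen` value are assumptions on my part, not settled choices):

    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from sklearn.preprocessing import LabelEncoder

    # Convert each description to a list of word indices, then pad to equal length
    sequences = tokenizer.texts_to_sequences(x_train_data_array)
    x_train = pad_sequences(sequences, maxlen=200, padding='post')

    # Encode the 16 personality-type strings (INTJ, ENTP, ...) as integers 0-15
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(y_labels)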
I’m still unsure how to correctly prepare custom datasets for text classification, especially for this kind of problem.
If there’s anything I can do to improve this question, let me know; I’ll be glad to fix it.