python – How to prepare custom dataset for text classification in Tensorflow 2.x?-Exceptionshub

Posted by: admin February 24, 2020 Leave a comment

Questions:

I’m trying to predict personality types from the Myers-Briggs test.

I created my own CSV file with 16 rows and 2 columns, which looks like this:

   Personality Type | Description
1       INTJ        | This personality type is(...)
2       ENTP        | (...)
16      (...)       | (...)

What have I tried already?
Preprocessing: I tried to preprocess the description column, removing stopwords and applying tokenization:

import numpy as np
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer

filter_char = '“”,.()"/:;%?¿!¡´\u200b\n\r'  # \u200b: zero-width space, rendered as an unrecognised blank
new_stopwords = ['person', 'personality', 'type', 'period', 'info']

stop_words = stopwords.words('english')
stop_words.extend(new_stopwords)

stop_words = set(stop_words)
tokenizer = Tokenizer(num_words = 100,
                      filters = filter_char,
                      lower = True,
                      split = ' ')

tokenizer.fit_on_texts(x_train_data_array)

# Index of each word seen in the tokenized x_train_data_array
word_index = tokenizer.word_index

# Reverse the word index: {index: word}
reverse_word_index = dict((value, key) for (key, value) in word_index.items())

# Vocabulary with the stopwords removed
meaningful_words = np.array([word for word in word_index if word not in stop_words])
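A minimal sketch of the remaining preparation steps: turning the texts into padded integer sequences and mapping the 16 MBTI labels to integer ids, which is the usual input shape for a Keras text classifier. The example texts, the `maxlen=20` padding length, and the `label_to_id` mapping are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stand-ins for the cleaned description column and its labels
descriptions = [
    "this personality type is analytical and reserved",
    "this personality type is outgoing and inventive",
]
labels = ["INTJ", "ENTP"]

tokenizer = Tokenizer(num_words=100, lower=True, split=' ')
tokenizer.fit_on_texts(descriptions)

# Texts -> lists of word indices, then pad to a fixed length
sequences = tokenizer.texts_to_sequences(descriptions)
x = pad_sequences(sequences, maxlen=20, padding='post')

# Map each of the personality-type classes to an integer id,
# suitable for a sparse categorical cross-entropy loss
class_names = sorted(set(labels))
label_to_id = {name: i for i, name in enumerate(class_names)}
y = np.array([label_to_id[label] for label in labels])

print(x.shape, y.shape)  # (2, 20) (2,)
```

With `x` and `y` in this form, the pair can be fed directly to `model.fit(x, y, ...)` or wrapped in a `tf.data.Dataset`.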

I’m still unsure how to correctly prepare custom datasets for text classification, especially for this kind of problem.

If there’s anything I can do to improve this question, let me know; I will be glad to fix it.

Answers: