Hi, I’m learning about text classification, and I have a dataset like this one:
My question: if I split the dataset into training and testing sets and do the feature extraction separately on each (I’m working with word embeddings), is it correct to pass the resulting features (named feature_array_trainingset and feature_array_testingset) directly to the pipeline this way:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn import svm

# The pipeline holds only the classifier; features are extracted beforehand
pipeline = Pipeline([('classifier', svm.SVC())])
pipeline.fit(feature_array_trainingset, train['Category'])
predictions = pipeline.predict(feature_array_testingset)

# classification_report expects the true labels first, then the predictions
print(classification_report(test['Category'], predictions))
It returns the classification result, but I’m not sure whether this is the correct process.
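In case it helps to reproduce this, here is a minimal self-contained sketch of the same flow. It is only an approximation of my setup: the random arrays stand in for my real embedding features, and the label values ("sports", "politics"), the array sizes, and the names train_labels and test_labels are made up for illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Placeholder features: random vectors standing in for word-embedding features
# (assumption: 40 training rows, 10 testing rows, 8 dimensions each)
feature_array_trainingset = rng.normal(size=(40, 8))
feature_array_testingset = rng.normal(size=(10, 8))

# Placeholder labels standing in for train['Category'] and test['Category']
train_labels = np.array(["sports", "politics"] * 20)
test_labels = np.array(["sports", "politics"] * 5)

# Same pipeline as in the question: a single SVC step on precomputed features
pipeline = Pipeline([("classifier", svm.SVC())])
pipeline.fit(feature_array_trainingset, train_labels)
predictions = pipeline.predict(feature_array_testingset)

# True labels first, predictions second
print(classification_report(test_labels, predictions, zero_division=0))
```

Since the features here are random noise, the scores themselves mean nothing; the sketch only shows the shapes and the order of the calls.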