In this project, wine reviews are used to predict the wine variant by training on an imbalanced dataset with classification algorithms such as SVM, Naive Bayes and Random Forest. Neural network models (CNN, RNN and LSTM) and LLM models (DistilBERT and RoBERTa) were also applied, followed by error analysis using SHAP.
We have been provided with a wine reviews dataset with two columns, “review_text” and “wine_variant”, and the goal is to create a wine recommendation system using text classification.
- Target variable – ‘wine_variant’
- Categories – 8 Types - 'Pinot Noir', 'Sauvignon Blanc', 'Cabernet Sauvignon', 'Chardonnay', 'Syrah', 'Riesling', 'Merlot', 'Zinfandel'
- Train data – 10,000 observations, split with a 25% test set (2,500 observations). Stratified sampling was used to preserve the representation of the above-mentioned classes. An additional validation set of 5,000 observations was also used.
- Distribution – class shares reported as percentages
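The stratified 75/25 split described above can be sketched in plain Python (the project itself would typically use scikit-learn's `train_test_split(..., stratify=y)`; the toy labels below are purely illustrative):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=42):
    """Split indices per class so each class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * test_frac))
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)

# Toy imbalanced labels: one majority and one minority class.
labels = ["Pinot Noir"] * 8 + ["Riesling"] * 4
train, test = stratified_split(labels, test_frac=0.25)
# Each class contributes ~25% of its samples to the test set,
# so the minority class is not accidentally squeezed out.
```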
- TF-IDF vectorization
- Latent Semantic Analysis
- Sentence Transformer (all-mpnet-base-v2)
- torchtext.vocab
- Linear and Non-linear SVM
- SGD Classifier
- Multinomial Naive Bayes
- Random Forest Classifier
- CNN
- LSTM
- DistilBERT
- RoBERTa
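For reference, the TF-IDF weighting that feeds the classical models above can be sketched in plain Python. This is a simplified variant, tf-idf = tf · log(N/df); the project would in practice use scikit-learn's `TfidfVectorizer`, which additionally applies idf smoothing and L2 normalisation:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Per-document term weights: term frequency * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

reviews = [
    "bold red with dark berries",
    "light white with citrus",
    "bold red with oak",
]
w = tfidf(reviews)
# "with" appears in every review, so its idf (and hence tf-idf weight) is 0;
# rarer, more discriminative words like "berries" get higher weights.
```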
From the above results, the four best classifiers, listed in descending order of macro-averaged F1 score on the validation set, are:
- RoBERTa (0.80)
- DistilBERT (0.79)
- TFIDF Vectorization + Linear SVC (with hyperparameter tuning) (0.78)
- CNN (0.77)

We can conclude the following from the above analysis:
- Given the size of the training set, the transfer-learning models (RoBERTa and DistilBERT) provide much better results, as seen in the table above.
- Given the class imbalance in the dataset, the best way to group the categories is on the basis of domain knowledge, as stated above. Grouping by taste and flavour is more appropriate when building a wine recommendation system than grouping by the distribution of the target variable. This led to a significant improvement, raising classification accuracy from the low 70s to almost 80%.
- Although our model has shown a significant improvement over the baseline SVC model, the macro F1 score does not exceed 0.80 even after experimenting with multiple models. This is a clear indication that more training data is needed to improve classification performance.
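The macro-averaged F1 used to rank the models above weights every class equally regardless of its support, which is why it is the appropriate headline metric for this imbalanced dataset. A minimal plain-Python version (equivalent to scikit-learn's `f1_score(..., average="macro")` when every class appears in `y_true`):

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving rare classes equal weight."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy example: the missed minority class ("Riesling") drags the macro score
# down to ~0.43 even though plain accuracy is 75%.
y_true = ["Merlot", "Merlot", "Merlot", "Riesling"]
y_pred = ["Merlot", "Merlot", "Merlot", "Merlot"]
```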
We have used the RoBERTa model for performing error analysis using SHAP. We took a sample of 30 mis-predicted observations from the provided test set of 500 observations for this analysis. We will look at a few samples in this report; for a more detailed analysis, please refer to the code.
While words like “light” and “oak” incline the results towards “Medium to Full-bodied Reds”, the final outcome seems to be influenced by the use of “powerful”, “refrain” and “berries”.
In this example, we see that the use of words like “TONS” and “more fruit” pushed the classifier to predict “Bold Red”.
In the given scenario, the word “medium” clearly influences the result.
The use of the word “champagne”, which is a “Full-bodied white”, has steered the prediction accordingly. From the above analysis we see that the errors are primarily related to domain knowledge. However, the reviews also contain text that is redundant and does not contribute to classifying the taste or quality of the wine, as seen below. Hence, a recommendation would be to carefully curate the samples used to train the wine-recommendation model in order to obtain more accurate results.
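The curation recommendation above could start from a simple heuristic filter that drops reviews with no taste- or body-related content. The sketch below is one such heuristic; the keyword list is purely illustrative and a real one would be built with wine-domain knowledge:

```python
# Hypothetical vocabulary of taste/body terms; a real list would be curated
# with domain expertise, not hard-coded like this.
TASTE_TERMS = {"oak", "berries", "citrus", "tannin", "light", "medium",
               "full-bodied", "bold", "crisp", "fruit"}

def is_informative(review):
    """Keep a review only if it mentions at least one taste-related term."""
    tokens = set(review.lower().split())
    return bool(tokens & TASTE_TERMS)

reviews = [
    "Powerful tannin and dark berries, medium finish.",
    "Bought this for my cousin's wedding, shipping was fast!",
]
curated = [r for r in reviews if is_informative(r)]
# Only the first review survives; the second says nothing about the wine itself.
```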
For more details, please refer to the Project Report.