Anime Hybrid Recommendation System
Content-based plus collaborative filtering on MyAnimeList (2023) data, using TF-IDF cosine similarity and Surprise SVD/NMF/KNNBasic models, evaluated with cross-validation and tuned with grid search.
Overview
A two-part anime recommendation system built on MyAnimeList (MAL) data.
- Part 1 assembles a strong content-based engine using genres, studios, type, source, episodes, rating class, and synopsis text (TF-IDF + cosine similarity).
- Part 2 adds collaborative filtering with Surprise models (SVD, NMF, KNNBasic), evaluates via cross-validation (RMSE/MAE), performs GridSearchCV hyperparameter tuning, and persists the best model.
Key features
- Content similarity: Build a feature “soup” from genres, studios, type, source, episodes, rating (e.g., PG-13), and synopsis → vectorize with TF-IDF → rank with cosine similarity.
- NLP cleanup: Tokenization/lemmatization using spaCy for cleaner synopsis signals.
- Collaborative filtering: Train/evaluate Surprise SVD, NMF, and KNNBasic on user–anime ratings.
- Model selection & tuning: Compare RMSE/MAE across models via cross_validate; refine SVD with GridSearchCV and save the best estimator.
- Hybrid scoring: Blend content similarity with CF predictions; fall back to content-only for cold-start users.
- Explainability: Surface partial scores (similarity vs. CF) so users see why a title is recommended.
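The content-similarity feature can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the four rows are made-up placeholder titles, the column names follow the metadata fields listed in this README, and the helper `recommend` is a hypothetical name. The spaCy lemmatization step is left out to keep the sketch short.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up rows for illustration only; real metadata comes from the MAL 2023 dataset.
anime = pd.DataFrame(
    [
        ("Show A", "Action Adventure", "StudioX", "TV", "Manga", 24, "PG-13",
         "A young pirate sails the seas searching for treasure"),
        ("Show B", "Action Adventure", "StudioX", "TV", "Manga", 24, "PG-13",
         "A pirate crew hunts treasure across dangerous seas"),
        ("Show C", "Romance Drama", "StudioY", "Movie", "Original", 1, "G",
         "Two students fall in love during a quiet summer"),
        ("Show D", "Romance Comedy", "StudioY", "TV", "Light novel", 12, "PG",
         "A shy student confesses at the school festival"),
    ],
    columns=["name", "genres", "studios", "type", "source",
             "episodes", "rating", "synopsis"],
)

SOUP_COLUMNS = ["genres", "studios", "type", "source",
                "episodes", "rating", "synopsis"]

# Concatenate every feature into one text "soup" per title.
anime["soup"] = anime[SOUP_COLUMNS].astype(str).agg(" ".join, axis=1)

# Vectorize the soup; English stop words are dropped.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(anime["soup"])

# TfidfVectorizer L2-normalizes rows by default, so the plain dot product
# computed by linear_kernel equals cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title`, excluding itself."""
    idx = anime.index[anime["name"] == title][0]
    scores = pd.Series(cosine_sim[idx], index=anime["name"]).drop(title)
    return scores.nlargest(top_n)


print(recommend("Show A"))
```

Precomputing the full similarity matrix keeps lookups fast for a small catalogue; for the full MAL catalogue the matrix grows quadratically, so computing one row on demand may be the more practical choice.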
Tech stack
Python, Pandas, NumPy, scikit-learn (TF-IDF, preprocessing, cosine), spaCy, scikit-surprise (SVD/NMF/KNNBasic, Dataset/Reader, CV & GridSearch), Matplotlib
Architecture (simplified)
- Ingest & clean MAL data; normalize key categorical fields and prepare the ratings matrix.
- Content pipeline: build the “soup” → TF-IDF → precompute cosine similarity for fast nearest-neighbour lookup.
- CF pipeline: load ratings → train SVD/NMF/KNNBasic → cross-validate → grid search SVD → persist best model.
- Hybrid rank: for a seed title or known user, combine content score + CF prediction; return Top-N with brief why-this cues.
- Cold start: if no user history, rely on content similarity (optionally mix in MAL mean score as a popularity prior).
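The hybrid rank and cold-start steps above could look something like the sketch below. The README does not state how the two scores are combined, so the linear blend, the weight `alpha`, the 1 to 10 rating scale, and the names `HybridScore`, `hybrid_score`, and `rank_hybrid` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HybridScore:
    title: str
    content: float        # cosine similarity in [0, 1]
    cf: Optional[float]   # CF prediction rescaled to [0, 1]; None on cold start
    final: float          # blended score used for ranking


def hybrid_score(title, content_sim, cf_rating=None, alpha=0.5, scale=(1.0, 10.0)):
    """Blend content similarity with a CF rating (assumed linear blend)."""
    if cf_rating is None:
        # Cold start: no CF prediction available, rank on content alone.
        return HybridScore(title, content_sim, None, content_sim)
    low, high = scale
    cf_norm = (cf_rating - low) / (high - low)  # map the rating onto [0, 1]
    final = alpha * content_sim + (1 - alpha) * cf_norm
    return HybridScore(title, content_sim, cf_norm, final)


def rank_hybrid(content_scores: Dict[str, float],
                cf_predictions: Optional[Dict[str, float]] = None,
                top_n: int = 10,
                alpha: float = 0.5) -> List[HybridScore]:
    """Score every candidate and return the Top-N, keeping partial scores."""
    cf_predictions = cf_predictions or {}
    scored = [
        hybrid_score(title, sim, cf_predictions.get(title), alpha)
        for title, sim in content_scores.items()
    ]
    return sorted(scored, key=lambda s: s.final, reverse=True)[:top_n]


# Placeholder scores for illustration only.
content = {"Show B": 0.8, "Show C": 0.2, "Show D": 0.5}
cf = {"Show B": 5.5, "Show C": 10.0, "Show D": 1.0}

known_user = rank_hybrid(content, cf)   # blended ranking
cold_start = rank_hybrid(content)       # content-only ranking

for s in known_user:
    print(f"{s.title}: final={s.final:.2f} (content={s.content:.2f}, cf={s.cf:.2f})")
```

Returning the partial scores alongside the final one is what makes the "why-this" cues possible: the caller can show whether a title ranked highly because of similarity, the CF prediction, or both.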
Datasets
- Anime metadata (2023): fields like anime_id, name, genres, studios, type, source, episodes, rating, score, synopsis.
Sources: MyAnimeList · Kaggle (Anime datasets)
- User ratings: users-score-2023.csv (MAL user–anime scores), used to train and evaluate the CF models.
Project parts
- Part 1: Content-Based Recommender (MAL 2023)
Build the text/metadata “soup,” compute TF-IDF vectors, and use cosine similarity to find the nearest titles.
Notebook: AnimeRecommender101.ipynb
Docs: TF-IDF · Cosine similarity · spaCy
- Part 2: Collaborative + Hybrid (Surprise)
Train SVD/NMF/KNNBasic, compare RMSE/MAE, run GridSearchCV for SVD, persist the best model, and blend with content scores for final ranking.
Notebook: Collaborative_Anime_Recommendation_System.ipynb
Docs: scikit-surprise · Dataset/Reader · GridSearchCV
Links
- Datasets: MyAnimeList · Kaggle (Anime datasets)
- Libraries: scikit-learn · spaCy · scikit-surprise
- Notebooks: Part 1 · Part 2