Anime Hybrid Recommendation System

Content-based and collaborative filtering on MyAnimeList (2023) data, using TF-IDF cosine similarity and Surprise SVD/NMF/KNN models tuned via cross-validation and grid search.

Overview

A two-part anime recommendation system built on MyAnimeList (MAL) data.

  • Part 1 assembles a strong content-based engine using genres, studios, type, source, episodes, rating class, and synopsis text (TF-IDF + cosine similarity).
  • Part 2 adds collaborative filtering with Surprise models (SVD, NMF, KNNBasic), evaluates via cross-validation (RMSE/MAE), performs GridSearchCV hyperparameter tuning, and persists the best model.

Key features

  • Content similarity: Build a feature “soup” from genres, studios, type, source, episodes, rating (e.g., PG-13), and synopsis → vectorize with TF-IDF → rank with cosine similarity.
  • NLP cleanup: Tokenization/lemmatization using spaCy for cleaner synopsis signals.
  • Collaborative filtering: Train/evaluate Surprise SVD, NMF, and KNNBasic on user–anime ratings.
  • Model selection & tuning: Compare RMSE/MAE across models via cross_validate; refine SVD with GridSearchCV and save the best estimator.
  • Hybrid scoring: Blend content similarity with CF predictions; fall back to content-only for cold-start users.
  • Explainability: Surface partial scores (similarity vs. CF) so users see why a title is recommended.
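
As a rough illustration of the content-similarity feature, the sketch below builds a "soup" per title, vectorizes it with TF-IDF, and ranks neighbours by cosine similarity. The four catalogue rows, the helper names (build_soup, similar_titles), and the choice of English stop words are assumptions made for this example, not taken from the project code. The spaCy lemmatization step is omitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue: every title and field value below is invented for illustration.
catalogue = [
    {"name": "Space Saga", "genres": "Action Sci-Fi", "studios": "Studio Alpha",
     "type": "TV", "source": "Original", "episodes": 26, "rating": "PG-13",
     "synopsis": "bounty hunters chase criminals across space"},
    {"name": "Star Drift", "genres": "Action Sci-Fi", "studios": "Studio Alpha",
     "type": "TV", "source": "Original", "episodes": 24, "rating": "PG-13",
     "synopsis": "a pilot crew drifts through space hunting pirates"},
    {"name": "Cafe Days", "genres": "Slice of Life Comedy", "studios": "Studio Beta",
     "type": "Movie", "source": "Manga", "episodes": 1, "rating": "G",
     "synopsis": "friends run a small cafe in a quiet town"},
    {"name": "Sword Oath", "genres": "Action Fantasy", "studios": "Studio Gamma",
     "type": "TV", "source": "Light novel", "episodes": 12, "rating": "PG-13",
     "synopsis": "a knight swears an oath to protect the kingdom"},
]

FIELDS = ("genres", "studios", "type", "source", "episodes", "rating", "synopsis")


def build_soup(row):
    """Join the metadata fields and synopsis into one lowercase string."""
    return " ".join(str(row[field]).lower() for field in FIELDS)


names = [row["name"] for row in catalogue]
soups = [build_soup(row) for row in catalogue]

# TfidfVectorizer L2-normalizes rows by default, so the plain dot product
# computed by linear_kernel equals cosine similarity.
matrix = TfidfVectorizer(stop_words="english").fit_transform(soups)
similarity = linear_kernel(matrix, matrix)


def similar_titles(name, top_n=2):
    """Return the top_n most similar titles to `name`, excluding itself."""
    idx = names.index(name)
    ranked = sorted(
        ((similarity[idx, j], names[j]) for j in range(len(names)) if j != idx),
        reverse=True,
    )
    return [(title, round(float(score), 3)) for score, title in ranked[:top_n]]


print(similar_titles("Space Saga"))
```

Precomputing the full similarity matrix is convenient for a small catalogue; for a very large one, computing a single row on demand would use less memory.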

Tech stack

Python, Pandas, NumPy, scikit-learn (TF-IDF, preprocessing, cosine), spaCy, scikit-surprise (SVD/NMF/KNNBasic, Dataset/Reader, CV & GridSearch), Matplotlib

Architecture (simplified)

  1. Ingest & clean MAL data; normalize key categorical fields and prepare the ratings matrix.
  2. Content pipeline: build the “soup” → TF-IDF → precompute cosine similarity for fast nearest-neighbour lookup.
  3. CF pipeline: load ratings → train SVD/NMF/KNNBasic → cross-validate → grid search SVD → persist best model.
  4. Hybrid rank: for a seed title or known user, combine content score + CF prediction; return Top-N with brief why-this cues.
  5. Cold start: if no user history, rely on content similarity (optionally mix in MAL mean score as a popularity prior).
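
Steps 4 and 5 can be sketched as a small scoring function. This is one possible blend under stated assumptions, not necessarily the formula the project uses: the weight alpha, the linear rescaling of a 1-10 predicted rating onto 0-1, and the names hybrid_score and rank_top_n are all hypothetical.

```python
def hybrid_score(content_sim, cf_pred=None, alpha=0.5, rating_scale=(1.0, 10.0)):
    """Blend a content similarity in [0, 1] with a CF rating prediction.

    alpha is the (assumed) weight on the content score. When cf_pred is
    None, i.e. a cold-start user, the content score is used on its own.
    The partial scores are returned so the caller can show why a title ranked.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if cf_pred is None:
        return {"content": content_sim, "cf": None,
                "score": content_sim, "mode": "content-only"}
    low, high = rating_scale
    # Rescale the predicted rating onto 0-1 and clamp it to that range.
    cf_norm = min(1.0, max(0.0, (cf_pred - low) / (high - low)))
    score = alpha * content_sim + (1.0 - alpha) * cf_norm
    return {"content": content_sim, "cf": cf_norm,
            "score": score, "mode": "hybrid"}


def rank_top_n(content_sims, cf_preds=None, top_n=3, alpha=0.5):
    """Rank candidate titles by blended score.

    content_sims: dict of title -> content similarity to the seed title.
    cf_preds: dict of title -> predicted rating, or None for cold start.
    """
    scored = []
    for title, sim in content_sims.items():
        pred = cf_preds.get(title) if cf_preds else None
        scored.append((title, hybrid_score(sim, pred, alpha)))
    scored.sort(key=lambda item: item[1]["score"], reverse=True)
    return scored[:top_n]


# Invented example values: a known user versus a cold-start user.
sims = {"Title A": 0.9, "Title B": 0.5}
print(rank_top_n(sims, cf_preds={"Title A": 1.0, "Title B": 10.0}))
print(rank_top_n(sims))  # no CF predictions: falls back to content-only
```

Returning the partial scores alongside the blended one is what makes the "why-this" cues cheap to produce. A popularity prior such as the MAL mean score could be mixed in as a third weighted term in the same way.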

Datasets

  • Anime metadata (2023): fields like anime_id, name, genres, studios, type, source, episodes, rating, score, synopsis.
    Sources: MyAnimeList · Kaggle (Anime datasets)
  • User ratings: users-score-2023.csv (MAL user–anime scores) used to train and evaluate the CF models.

Project parts