Movies Hybrid Recommendation System
Content + CF hybrid with IMDb-weighted rating and a recency boost, tested on larger MovieLens metadata.
Overview
A two-part project that builds a movie recommender from the ground up.
- Part 1 builds a content-based system on TMDB metadata (genres, keywords, cast/crew, overview) using TF-IDF/CountVectorizer and cosine similarity.
- Part 2 upgrades it to a hybrid that blends Surprise SVD (collaborative filtering) with an IMDb-style weighted rating and a light recency boost, then validates on the larger MovieLens metadata.
Key features
- Content similarity: create a unified “movie soup” (genres, keywords, cast, crew, overview) → TF-IDF / CountVectorizer → cosine similarity for fast top-N retrieval.
- Personalization (CF): user-aware scoring with SVD (from scikit-surprise) using historical ratings.
- Popularity prior: IMDb-style weighted rating (WR) to avoid small-vote bias (uses corpus mean C and 65th-percentile m).
- Recency boost: scale release year to [0,1] and blend with a tunable recency_weight (e.g., 0.2) to gently favor newer titles.
- Cold-start handling: for brand-new users, fall back to content + popularity; for known users, blend in SVD predictions (weight increases with user activity).
- Explainability: expose partial scores (similarity, popularity, recency, SVD) to show “why this recommendation”.
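The popularity prior and recency boost above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's code: it assumes the commonly cited IMDb-style formula WR = v/(v+m)·R + m/(v+m)·C, the TMDB column names vote_average and vote_count, and a hypothetical release_year column derived beforehand from the release date. The helper name add_popularity_and_recency is made up for this example.

```python
import pandas as pd


def add_popularity_and_recency(movies: pd.DataFrame, quantile: float = 0.65) -> pd.DataFrame:
    """Return a copy with 'weighted_rating' and 'recency' columns added.

    Assumed columns: vote_average, vote_count, release_year (hypothetical).
    """
    out = movies.copy()
    C = out["vote_average"].mean()            # corpus mean rating
    m = out["vote_count"].quantile(quantile)  # vote-count threshold (65th percentile)
    v = out["vote_count"]
    R = out["vote_average"]
    # Titles with few votes are pulled toward the corpus mean C.
    out["weighted_rating"] = (v / (v + m)) * R + (m / (v + m)) * C

    # Min-max scale the release year to [0, 1]; newest title gets 1.0.
    year = out["release_year"]
    span = year.max() - year.min()
    out["recency"] = (year - year.min()) / span if span > 0 else 0.0
    return out


# Toy data, not taken from TMDB.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [9.0, 7.0, 8.0, 6.0],
    "vote_count": [10, 1000, 500, 50],
    "release_year": [1990, 2000, 2010, 2020],
})
scored = add_popularity_and_recency(movies)
print(scored[["title", "weighted_rating", "recency"]])
```

In this toy frame, title A has a 9.0 average from only 10 votes, so its weighted rating lands close to the corpus mean rather than near 9.0, which is the small-vote correction the prior is meant to provide.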
Tech stack
Python, Pandas, NumPy, scikit-learn (TF-IDF, cosine similarity), scikit-surprise (SVD), Matplotlib
Architecture (simplified)
- Preprocess content (TMDB): clean text; build a soup from genres/keywords/cast/crew/overview; vectorize with TF-IDF / CountVectorizer.
- Compute similarity: precompute cosine similarity matrix for fast lookups.
- Get candidates: for a seed title, fetch top-K similar movies (exclude itself).
- Score candidates with:
  - IMDb weighted rating (mean C, vote threshold m at ~65th percentile).
  - Similarity score (from cosine).
  - Recency score (scaled release year).
  - SVD prediction (if user is known; else skip).
- Hybrid blend:
final_score = α·similarity + β·weighted_rating + γ·recency + δ·svd_pred
(typical starting point: α=0.55–0.65, β≈0.20, γ≈0.10–0.20, δ grows with user history).
- Return Top-N with brief “why” signals (e.g., “similar to Inception; strong votes; recent”).
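The blend step can be illustrated with a small function. This is a sketch under stated assumptions rather than the project's implementation: all four inputs are assumed to be pre-scaled to [0, 1], and the way δ grows with user history (a linear ramp capped at delta_max after ratings_for_full_weight ratings) is a hypothetical choice, since the README only says that δ grows. The function name and its parameters are illustrative.

```python
def hybrid_score(similarity, weighted_rating, recency,
                 svd_pred=None, n_user_ratings=0,
                 alpha=0.6, beta=0.2, gamma=0.2,
                 delta_max=0.3, ratings_for_full_weight=50):
    """Blend partial scores; all inputs are assumed to lie in [0, 1]."""
    # Cold start: no SVD prediction or no rating history means delta = 0.
    delta = 0.0
    if svd_pred is not None and n_user_ratings > 0:
        # Hypothetical ramp: delta grows linearly with user activity, then caps.
        delta = delta_max * min(1.0, n_user_ratings / ratings_for_full_weight)

    # Keep the weighted parts separate so they can be shown as "why" signals.
    parts = {
        "similarity": alpha * similarity,
        "popularity": beta * weighted_rating,
        "recency": gamma * recency,
        "svd": delta * (svd_pred if svd_pred is not None else 0.0),
    }
    return {"final_score": sum(parts.values()), "parts": parts}


cold = hybrid_score(0.8, 0.7, 0.5)                                   # new user
warm = hybrid_score(0.8, 0.7, 0.5, svd_pred=0.9, n_user_ratings=25)  # known user
print(cold["final_score"], warm["final_score"])
```

Returning the weighted parts alongside the final score is one way to support the explainability feature: the largest entries in parts can be turned into the short "why" text shown with each recommendation.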
Datasets
- TMDB 5000 Movies & Credits (Part 1): titles, overviews, genres, keywords, cast/crew.
Sources: TMDB · Kaggle mirrors (e.g., tmdb_5000_movies.csv, tmdb_5000_credits.csv).
- MovieLens (latest / 25M metadata) (Part 2): ratings + rich metadata; uses links.csv to map TMDb IDs ↔ MovieLens IDs.
Source: GroupLens MovieLens.
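The ID mapping through links.csv might look like the following. The sketch uses small in-memory frames with toy values in place of the real files; it relies on the movieId, imdbId, and tmdbId columns that MovieLens ships in links.csv and on the id column of tmdb_5000_movies.csv. Dropping rows with a missing tmdbId is an assumption about how unmapped titles are handled.

```python
import pandas as pd

# Toy stand-in for MovieLens links.csv (real file: pd.read_csv("links.csv")).
links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [1111, 2222, 3333],
    "tmdbId": [862.0, 8844.0, None],  # tmdbId can be missing for some titles
})

# Toy stand-in for the TMDB metadata, keyed by the TMDB 'id' column.
tmdb = pd.DataFrame({
    "id": [862, 8844, 99999],
    "title": ["Movie X", "Movie Y", "Unmapped"],
})

# Drop unmapped rows, then cast the float IDs to integers so the keys match.
links_clean = links.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})

# Inner join keeps only titles present in both datasets.
merged = tmdb.merge(
    links_clean[["movieId", "tmdbId"]],
    left_on="id",
    right_on="tmdbId",
    how="inner",
)
print(merged[["movieId", "id", "title"]])
```

An inner join is used here so that every remaining row has both content features and a MovieLens ID for ratings; a left join would instead keep TMDB titles that have no ratings.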
Project parts
- Part 1 (Content-Based on TMDB):
Build the “soup,” compute cosine similarity, and rank similar titles.
Notebook: TMDB_RecommenderSystem.ipynb
Docs: TF-IDF · Cosine similarity · TMDB
- Part 2 (Improved Hybrid on MovieLens):
Add IMDb-weighted rating, recency boost, and Surprise SVD; handle cold-start and blend scores into a final rank; validate on a larger corpus.
Notebook: Testing_The_Improved_Hybrid_Recoomendation_System.ipynb
Docs: scikit-surprise SVD · MovieLens dataset
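For orientation, the Part 1 pipeline (soup, vectorize, cosine similarity, top-K) can be reduced to a short scikit-learn sketch. The titles and soup strings below are invented toy data, and top_k_similar is a hypothetical helper name; the notebook builds the soup from real TMDB fields and may use CountVectorizer instead of TF-IDF.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "soup": genres, keywords, cast, crew, and overview joined into one string.
movies = pd.DataFrame({
    "title": ["Space Heist", "Dream Heist", "Farm Romance"],
    "soup": [
        "scifi thriller heist space crew",
        "scifi thriller heist dream",
        "romance drama farm countryside",
    ],
})

# Vectorize the soup and precompute the full similarity matrix once.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(movies["soup"])
sim = cosine_similarity(matrix)


def top_k_similar(title: str, k: int = 2) -> pd.Series:
    """Return the k most similar titles to the seed, excluding the seed itself."""
    idx = movies.index[movies["title"] == title][0]
    scores = pd.Series(sim[idx], index=movies["title"]).drop(title)
    return scores.nlargest(k)


print(top_k_similar("Space Heist", k=2))
```

Precomputing the matrix makes lookups fast but costs memory quadratic in the number of titles, which is worth keeping in mind when moving from the TMDB 5000 set to the larger MovieLens metadata.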
Links
- Datasets: TMDB · MovieLens
- Libraries: scikit-learn · scikit-surprise
- Notebooks: Part 1 · Part 2