Movies Hybrid Recommendation System

A two-part project that builds a movie recommender from the ground up.

Part 1 builds a content-based system on TMDB metadata (genres, keywords, cast/crew, overview) using TF-IDF/CountVectorizer and cosine similarity.
Part 2 upgrades it to a hybrid: it blends Surprise SVD (collaborative filtering) with an IMDb-style weighted rating and a light recency boost, then validates the result on the larger MovieLens metadata.

Content similarity: create a unified “movie soup” (genres, keywords, cast, crew, overview) → TF-IDF / CountVectorizer → cosine similarity for fast top-N retrieval.
Personalization (CF): user-aware scoring with SVD (from scikit-surprise) using historical ratings.
Popularity prior: IMDb-style weighted rating (WR) to avoid small-vote bias; it uses the corpus mean rating C and a minimum vote count m set at the 65th percentile of vote counts.
Recency boost: scale release year to [0,1] and blend with a tunable recency_weight (e.g., 0.2) to gently favor newer titles.
Cold-start handling: for brand-new users, fall back to content + popularity; for known users, blend in SVD predictions (the SVD weight increases with user activity).
Explainability: expose partial scores (similarity, popularity, recency, SVD) to show “why this recommendation”.
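
The popularity prior and recency boost described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the rows are toy data, the column names (vote_average, vote_count, year) are assumed to mirror the TMDB CSV, and the formula is the commonly cited IMDb weighted rating, WR = v/(v+m)·R + m/(v+m)·C.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [8.0, 9.5, 6.0, 7.0],
    "vote_count": [1000, 10, 500, 200],
    "year": [2010, 2016, 1995, 2004],
})

C = movies["vote_average"].mean()        # corpus mean rating
m = movies["vote_count"].quantile(0.65)  # vote threshold at the 65th percentile


def weighted_rating(row, m=m, C=C):
    """IMDb-style weighted rating: shrink low-vote titles toward the corpus mean."""
    v, R = row["vote_count"], row["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C


movies["wr"] = movies.apply(weighted_rating, axis=1)

# Recency: min-max scale the release year to [0, 1] (guarding against a zero span).
span = movies["year"].max() - movies["year"].min()
movies["recency"] = (movies["year"] - movies["year"].min()) / span if span else 0.0

print(movies[["title", "wr", "recency"]])
```

Movie B has the highest raw average but only 10 votes, so its weighted rating is pulled close to C and it ranks below movie A on popularity.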

Python, Pandas, NumPy, scikit-learn (TF-IDF, cosine similarity), scikit-surprise (SVD), Matplotlib

Preprocess content (TMDB): clean text; build a soup from genres/keywords/cast/crew/overview; vectorize with TF-IDF / CountVectorizer.
Compute similarity: precompute cosine similarity matrix for fast lookups.
Get candidates: for a seed title, fetch top-K similar movies (exclude itself).
Score candidates with:
- IMDb weighted rating (mean C, vote threshold m at ~65th percentile).
- Similarity score (from cosine).
- Recency score (scaled release year).
- SVD prediction (if user is known; else skip).
Hybrid blend: final_score = α·similarity + β·weighted_rating + γ·recency + δ·svd_pred
(typical starting point: α=0.55–0.65, β≈0.20, γ≈0.10–0.20, δ grows with user history).
Return Top-N with brief “why” signals (e.g., “similar to Inception; strong votes; recent”).
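
The blend and cold-start fallback can be sketched roughly as below. Several details are assumptions made for illustration rather than taken from the notebooks: the svd_weight schedule (linear up to 50 ratings, capped at 0.3), normalizing WR by 10 and the SVD prediction by 5 so every term lies in [0, 1], and shrinking α, β, γ by (1 − δ) so the weights still sum to 1.

```python
def svd_weight(n_ratings, max_weight=0.3, saturation=50):
    """Hypothetical schedule: 0 for a new user, rising linearly to max_weight."""
    return max_weight * min(n_ratings, saturation) / saturation


def hybrid_score(similarity, wr, recency, svd_pred=None, n_ratings=0,
                 alpha=0.6, beta=0.2, gamma=0.2, wr_max=10.0, svd_max=5.0):
    """Return (final_score, partial_scores) for one candidate."""
    delta = svd_weight(n_ratings) if svd_pred is not None else 0.0
    content_share = 1.0 - delta  # assumption: rescale so all weights sum to 1
    parts = {
        "similarity": content_share * alpha * similarity,
        "popularity": content_share * beta * (wr / wr_max),
        "recency": content_share * gamma * recency,
        "svd": delta * (svd_pred / svd_max) if delta > 0 else 0.0,
    }
    return sum(parts.values()), parts


# Toy candidates; svd_pred stands in for a prediction from a trained SVD model.
candidates = [
    {"title": "Movie A", "similarity": 0.8, "wr": 7.5, "recency": 0.5, "svd_pred": 2.0},
    {"title": "Movie B", "similarity": 0.7, "wr": 6.0, "recency": 0.9, "svd_pred": 4.5},
]


def rank(candidates, n_ratings=0):
    """Score every candidate and sort best-first; new users skip the SVD term."""
    scored = []
    for c in candidates:
        pred = c["svd_pred"] if n_ratings > 0 else None
        score, parts = hybrid_score(c["similarity"], c["wr"], c["recency"],
                                    pred, n_ratings)
        scored.append((c["title"], round(score, 3), parts))
    return sorted(scored, key=lambda t: t[1], reverse=True)


print(rank(candidates))                # cold start: content + popularity + recency
print(rank(candidates, n_ratings=50))  # known user: the SVD term joins the blend
```

Returning the partial scores alongside the total is what makes the “why” signals possible: the largest entries in parts indicate whether similarity, votes, recency, or the personal prediction drove the rank.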

TMDB 5000 Movies & Credits (Part 1): titles, overviews, genres, keywords, cast/crew.
Sources: TMDB · Kaggle mirrors (e.g., tmdb_5000_movies.csv, tmdb_5000_credits.csv).
MovieLens (latest / 25M metadata) (Part 2): ratings plus rich metadata; links.csv maps MovieLens IDs to TMDB IDs and back.
Source: GroupLens MovieLens.
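
A minimal sketch of the ID mapping, assuming the standard links.csv columns (movieId, imdbId, tmdbId) and a TMDB frame keyed by id. The small in-memory frames below are toy stand-ins for the real files.

```python
import pandas as pd

# Toy stand-ins for links.csv and the TMDB movies table.
links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, None],  # tmdbId can be missing, so it loads as float
})
tmdb = pd.DataFrame({"id": [862, 8844], "title": ["Toy Story", "Jumanji"]})

# Drop rows without a TMDB ID, then cast to int so the join keys match.
links = links.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})

mapped = links.merge(tmdb, left_on="tmdbId", right_on="id", how="inner")
mapped = mapped[["movieId", "tmdbId", "title"]]

# Lookup tables in both directions.
tmdb_to_ml = dict(zip(mapped["tmdbId"], mapped["movieId"]))
ml_to_tmdb = dict(zip(mapped["movieId"], mapped["tmdbId"]))

print(mapped)
```

An inner join keeps only titles present in both sources, which is usually what the hybrid needs: a candidate must have content features (TMDB) and ratings (MovieLens).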

Part 1 — Content-Based on TMDB:
Build the “soup,” compute cosine similarity, and rank similar titles.
Notebook: TMDB_RecommenderSystem.ipynb
Docs: TF-IDF · Cosine similarity · TMDB
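
The Part 1 flow can be sketched as below. This is an illustrative sketch, not the notebook's code: the three movies are invented, the columns are assumed to be already parsed into lists, and the helper names (clean, make_soup, top_k) are hypothetical. CountVectorizer is used here; TfidfVectorizer from the same module can be swapped in with no other changes.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented rows for illustration; real data comes from the TMDB CSVs.
movies = pd.DataFrame({
    "title": ["Space Heist", "Dream Heist", "Farm Romance"],
    "genres": [["Action", "Sci-Fi"], ["Action", "Thriller"], ["Romance"]],
    "keywords": [["heist", "spaceship"], ["heist", "dream"], ["farm", "love"]],
    "cast": [["Ann Lee"], ["Ann Lee"], ["Bob Ray"]],
    "director": ["Kim Roe", "Kim Roe", "Sam Park"],
    "overview": [
        "A crew robs a space station.",
        "A crew robs a bank inside a dream.",
        "Two farmers fall in love.",
    ],
})


def clean(tokens):
    """Lowercase and remove spaces so 'Ann Lee' becomes one token, 'annlee'."""
    return [t.lower().replace(" ", "") for t in tokens]


def make_soup(row):
    """Concatenate all metadata fields into one text string per movie."""
    parts = (clean(row["genres"]) + clean(row["keywords"])
             + clean(row["cast"]) + clean([row["director"]]))
    return " ".join(parts) + " " + row["overview"].lower()


movies["soup"] = movies.apply(make_soup, axis=1)

matrix = CountVectorizer(stop_words="english").fit_transform(movies["soup"])
sim = cosine_similarity(matrix)  # dense (n_movies x n_movies) matrix

indices = pd.Series(movies.index, index=movies["title"])


def top_k(title, k=2):
    """Return the k most similar titles to the seed, excluding the seed itself."""
    i = indices[title]
    scores = sorted(enumerate(sim[i]), key=lambda p: p[1], reverse=True)
    return [(movies["title"][j], float(s)) for j, s in scores if j != i][:k]


print(top_k("Space Heist"))
```

Precomputing the full matrix is fine for a corpus of a few thousand titles; for much larger catalogs the memory cost grows quadratically, so computing similarities per query may be preferable.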
Part 2 — Improved Hybrid on MovieLens:
Add IMDb-weighted rating, recency boost, and Surprise SVD; handle cold-start and blend scores into a final rank; validate on a larger corpus.
Notebook: Testing_The_Improved_Hybrid_Recoomendation_System.ipynb
Docs: scikit-surprise SVD · MovieLens dataset