Portfolio Details

Aspect Sentiment BI

A real-time pipeline that ingests Reddit comments, processes them with NLP, stores results in SQLite, and presents insights in BI dashboards.

Python, Airflow, SQLite, Streamlit, Power BI

NLP & BI

2025

Aspect Sentiment BI

Transforming unstructured Reddit discussions into structured, actionable insights through NLP pipelines and BI dashboards.

The project builds a pipeline that automatically ingests Reddit comments, applies NLP tasks such as aspect extraction, sentiment analysis, and topic modeling, and stores the results in a database. The outputs are then visualized using Streamlit dashboards for public access and Power BI for advanced reporting.

Real-world application: Product managers and analysts can monitor customer feedback at scale, tracking which features (e.g., “battery,” “screen,” “pricing”) drive positive or negative sentiment, enabling data-driven business decisions.

The Challenge

Unstructured data: Reddit comments are noisy, with slang, emojis, and inconsistent formatting.
Aspect extraction: Identifying product features (e.g., “camera” vs. “battery life”) from free-text is non-trivial.
Model scalability: Running topic models and transformer-based sentiment models in real-time is computationally heavy.
Data freshness: Stakeholders required up-to-date insights, not monthly snapshots.
Sharing limitations: Power BI licensing restricted wide sharing outside the org.

The Solution

Automated ingestion: Airflow DAGs scheduled regular data pulls from the Reddit API.
NLP pipeline: KeyBERT for aspect extraction, RoBERTa for sentiment classification, BERTopic for clustering themes.
Lightweight persistence: SQLite used for portability and quick querying during prototyping.
Dashboards: Streamlit for open access and interactivity; Power BI for richer analytics where licensing allowed.
Scalability plan: Architecture designed to move from SQLite to Postgres or cloud DB as data grows.
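
The persistence step can be sketched with Python's built-in sqlite3 module. This is a minimal illustration rather than the project's actual schema: the table name `aspect_sentiment`, its columns, the score range, and the sample rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: one row per (comment, aspect) pair,
# with a sentiment score assumed to lie in [-1, 1].
SCHEMA = """
CREATE TABLE IF NOT EXISTS aspect_sentiment (
    comment_id TEXT NOT NULL,
    aspect     TEXT NOT NULL,
    sentiment  REAL NOT NULL,
    PRIMARY KEY (comment_id, aspect)
)
"""


def store_results(conn, rows):
    """Insert (comment_id, aspect, sentiment) tuples, replacing rows seen before."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO aspect_sentiment VALUES (?, ?, ?)", rows
    )
    conn.commit()


def average_sentiment_by_aspect(conn):
    """Return {aspect: mean sentiment}, the kind of aggregate a dashboard could chart."""
    cursor = conn.execute(
        "SELECT aspect, AVG(sentiment) FROM aspect_sentiment GROUP BY aspect"
    )
    return dict(cursor.fetchall())


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
store_results(conn, [
    ("c1", "battery", -0.5),
    ("c2", "battery", -0.25),
    ("c2", "screen", 0.75),
])
summary = average_sentiment_by_aspect(conn)
```

Keeping one row per comment and aspect makes the per-feature aggregates a single `GROUP BY`, and the same SQL would carry over largely unchanged if the store later moved to Postgres.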

Key Features

Automated Reddit ingestion
Aspect-based sentiment analysis
Topic modeling with BERTopic
Streamlit + Power BI dashboards

Smart Invoice Extractor

OCR + NLP pipeline to parse invoices into structured fields, powered by FastAPI, Docker, and Streamlit.

OCR, NLP, FastAPI, Docker, Streamlit

OCR & NLP

2025

Smart Invoice Extractor

Automating financial workflows by converting unstructured invoice scans into structured fields such as invoice number, vendor details, dates, and totals — reducing manual effort and errors in accounting pipelines.

By combining Tesseract OCR with spaCy NLP, this solution extracts structured data from invoices. The backend was built with FastAPI, deployed in Docker containers for portability, and integrated with a Streamlit dashboard for end-user validation and export to CSV/BI tools.

Real-world application: SMEs and finance teams can auto-process vendor invoices, validate tax compliance, and feed clean structured data directly into ERP systems like SAP or QuickBooks.

The Challenge

Low-quality scans: Blurred, rotated, or noisy invoice images affected OCR accuracy.
No standard format: Layouts varied widely by vendor/country, making extraction rules brittle.
Entity ambiguity: Invoice numbers were confused with PO numbers, and invoice dates with due dates.
Currency/localization: Handling multiple currencies, date formats, and languages.
End-user adoption: Users needed a simple interface to review and correct outputs.

The Solution

OCR preprocessing: Applied OpenCV-based denoising, thresholding, and skew correction before text extraction.
Hybrid NLP: Regex patterns for dates/numbers + spaCy NER for vendor names and tax fields.
Validation rules: Cross-checking invoice totals vs. line items ensured accuracy.
API-first: FastAPI endpoints for batch uploads, making it extensible for ERP integration.
Scalability: Dockerized containers deployable on local machines or cloud infra.
User dashboard: Streamlit UI allowed accountants to upload invoices and download verified structured data.
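
The regex side of the hybrid extraction and the totals cross-check can be sketched as follows. The patterns, field names, sample text, and one-cent tolerance are assumptions chosen for illustration; real invoices would need vendor-specific patterns alongside the spaCy NER step.

```python
import re
from decimal import Decimal

# Illustrative patterns only; real invoices vary widely by vendor and country.
INVOICE_NO = re.compile(
    r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")


def extract_fields(text):
    """Pull an invoice number and the first ISO-formatted date out of OCR text."""
    number = INVOICE_NO.search(text)
    date = ISO_DATE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "invoice_date": date.group(1) if date else None,
    }


def totals_match(line_items, stated_total, tolerance=Decimal("0.01")):
    """Check that line items sum to the stated total within a small tolerance."""
    computed = sum((Decimal(item) for item in line_items), Decimal("0"))
    return abs(computed - Decimal(stated_total)) <= tolerance


sample = "Invoice No: INV-2025-001\nDate: 2025-03-14\nTotal: 150.00"
fields = extract_fields(sample)
ok = totals_match(["100.00", "50.00"], "150.00")
```

Using `Decimal` rather than floats avoids rounding surprises when amounts are compared to the cent.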

Key Features

OCR preprocessing (OpenCV + Tesseract)
NLP + regex hybrid entity extraction
FastAPI REST API with Docker
Streamlit dashboard for validation

smartcurvefit (R Package)

Robust nonlinear curve fitting R package with Rcpp backend and automated loss selection.

R, Rcpp, C++, ggplot2

R Package

2025

smartcurvefit

An R package that enables robust nonlinear curve fitting with multiple loss functions (L1, L2, Huber) and automatic cross-validated loss selection using an Rcpp backend.

Developed as part of a graduate-level Data Science course, smartcurvefit provides robust nonlinear regression functions with S3 objects, summaries, diagnostics, and visualization. Models supported include power-law, exponential, and user-defined forms.

Real-world application: Scientists and analysts can use it to fit growth curves, reliability functions, and dose-response models with resilience to outliers.

The Challenge

Nonlinear regression is sensitive to outliers.
Users often lack clarity on which loss function to choose.
Balancing performance with package usability and documentation.
Integration of Rcpp with R S3 methods while maintaining CRAN standards.

The Solution

Implemented L1, L2, and Huber loss in C++ for efficiency.
Cross-validation automatically selects the best loss for given data.
Provided S3 methods: `print`, `summary`, `plot`, `predict` for user-friendliness.
Detailed vignettes and examples for reproducibility.
Future-ready: allows extension to user-defined nonlinear models.
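
The three loss functions behave very differently on outliers, which a short sketch makes concrete. The package itself implements them in C++ via Rcpp; the Python below is only an illustration, and the function names, the 1/2 factor on the squared loss, and the default `delta=1.0` are assumptions, not the package's API.

```python
def l2_loss(residual):
    """Squared error with the conventional 1/2 factor."""
    return 0.5 * residual ** 2


def l1_loss(residual):
    """Absolute error."""
    return abs(residual)


def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear beyond delta, so large residuals count less."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)


def total_loss(loss, observed, predicted):
    """Sum a per-residual loss over paired observations and predictions."""
    return sum(loss(y - y_hat) for y, y_hat in zip(observed, predicted))


# One gross outlier (30.0 where the model predicts 3.0) dominates the L2 total.
observed = [1.0, 2.0, 30.0]
predicted = [1.0, 2.0, 3.0]
totals = {
    "L2": total_loss(l2_loss, observed, predicted),
    "L1": total_loss(l1_loss, observed, predicted),
    "Huber": total_loss(huber_loss, observed, predicted),
}
```

Here the single outlier contributes 364.5 under L2 but only 27.0 under L1 and 26.5 under Huber, which is why a fit driven by L2 bends toward outliers while the other two stay closer to the bulk of the data.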

Key Features

Rcpp-powered optimization for speed
Cross-validated loss selection (L1, L2, Huber)
Intuitive S3 object interface
Publication-ready plots with ggplot2

Heart Attack Prediction with Logistic Regression (Spark MLlib)

Predicting heart attack risk using Big Data and Machine Learning on a Hadoop cluster with Apache Spark MLlib.

Apache Spark, Hadoop, Python, MLlib

Big Data & Healthcare

2025

Heart Attack Prediction with Spark MLlib

Large-scale predictive modeling of cardiovascular risk using Apache Spark MLlib logistic regression pipelines on CDC BRFSS 2022 survey data (~450,000 records).

Built an end-to-end ML pipeline in Apache Spark MLlib to classify whether a patient had a heart attack, using the CDC BRFSS 2022 dataset (450k+ rows, 40 features). Explored logistic regression variants including:

Raw Logistic Regression
Logistic Regression + MinMax Scaling
Logistic Regression + PCA (k=5,10,15)

Real-world application: Scalable risk models to help healthcare providers identify high-risk patients for preventive interventions across populations.

The Challenge

Imbalanced target: Only ~6% positive cases (`HadHeartAttack = Yes`).
Large dataset: 139MB, too big for in-memory Python-only workflows.
Feature redundancy: Many correlated survey features.
Healthcare context: Need for interpretable models rather than black-box predictions.

The Solution

Distributed preprocessing pipeline using Spark DataFrames for cleaning & feature engineering.
Class balancing with oversampling to address imbalance.
Scaling + PCA experiments to reduce redundancy & improve efficiency.
Evaluation: Accuracy, ROC-AUC, confusion matrix, runtime benchmarks.
Result: Logistic Regression + Scaling achieved the best AUC (0.8813) and interpretability.
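
A rough sketch of the scaled logistic regression pipeline and the oversampling step, using the PySpark ML API. It assumes the target has already been encoded as a numeric 0/1 column named `label`; that column name, the intermediate column names, the seed, and the helper names are hypothetical and not taken from the project.

```python
def oversample_fraction(n_negative, n_positive):
    """Sampling fraction (with replacement) that lifts positives to the negative count."""
    return n_negative / n_positive


def build_scaled_lr_pipeline(feature_cols, label_col="label"):
    """Assemble features, MinMax-scale them, then fit logistic regression."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import MinMaxScaler, VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
    scaler = MinMaxScaler(inputCol="raw_features", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, scaler, lr])


def oversample_minority(df, label_col="label", seed=42):
    """Resample positive rows with replacement until the classes are roughly balanced."""
    positives = df.filter(df[label_col] == 1)
    negatives = df.filter(df[label_col] == 0)
    fraction = oversample_fraction(negatives.count(), positives.count())
    resampled = positives.sample(withReplacement=True, fraction=fraction, seed=seed)
    return negatives.unionByName(resampled)


# With roughly 6% positives, each positive row is drawn about 15.7 times on average.
fraction = oversample_fraction(94, 6)
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class ratio and the reported AUC stays comparable across variants.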

Key Features

End-to-end Spark MLlib pipeline
Handles 450k+ rows on Hadoop cluster
Logistic Regression + Scaling = best AUC 0.8813
Healthcare-ready & interpretable

US Crime Analysis with GAMs

Generalized Additive Models applied to a US crime dataset to analyze socio-economic predictors.

R, mgcv, dplyr, ggplot2

Statistical Modeling

2025

US Crime Analysis with GAMs

Analyzed violent crime rates across US communities using socio-economic predictors with flexible GAM models.

Modeled violent crime as a function of poverty, unemployment, and housing factors. GAMs allowed non-linear relationships to emerge without forcing strict parametric assumptions.

Real-world application: Policy makers can identify non-linear effects (e.g., how marginal increases in poverty disproportionately increase crime).

The Challenge

Multicollinearity between socio-economic variables.
Explaining GAM smooth terms to non-technical stakeholders.
Dealing with missing community-level data.

The Solution

Filtered features using VIF thresholds.
Visualized smooth functions to explain non-linear effects.
Applied multiple imputation for missing values.
Validated with cross-validation to ensure generalizability.
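
VIF filtering can be illustrated with a small dependency-free sketch. The project itself was done in R; the Python below only shows the idea, using the identity that each predictor's VIF is the matching diagonal entry of the inverse correlation matrix. The threshold of 5.0 is a common rule of thumb assumed here, and the predictor names and values are made up for the example.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sqrt(sum(v * v for v in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


def drop_high_vif(named_columns, threshold=5.0):
    """Drop the predictor with the largest VIF until all fall below the threshold."""
    kept = dict(named_columns)
    while len(kept) > 1:
        scores = dict(zip(kept, vif(list(kept.values()))))
        worst = max(scores, key=scores.get)
        if scores[worst] < threshold:
            break
        del kept[worst]
    return list(kept)


# Two made-up predictors with correlation 0.8, so each has VIF = 1 / (1 - 0.64).
predictors = {"poverty": [1, 2, 3, 4, 5], "unemployment": [2, 1, 4, 3, 5]}
scores = vif(list(predictors.values()))
kept = drop_high_vif(predictors)
```

Perfectly collinear predictors make the correlation matrix singular, so a production version would need to guard the division by the pivot.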

Key Features

Non-linear GAM modeling
Variable selection with VIF
Cross-validated model performance

Cricket Analytics App

An R Shiny dashboard for cricket match analytics and player performance insights.

R, Shiny, dplyr, ggplot2

Sports Analytics

2024

Cricket Analytics App

An interactive platform for analyzing player and team performance metrics in cricket matches.

The app allows users to filter by tournament, player, or match, visualizing batting strike rates, bowling economy, and win contributions. Built entirely in R Shiny for live interactivity.

Real-world application: Coaches and analysts can quickly evaluate strategy decisions (e.g., batting order, bowler matchups) during live series.

The Challenge

Data scraping: Cricket data APIs are not standardized.
Performance bottlenecks in Shiny for large datasets.
Visual overload when comparing multiple players.

The Solution

Automated scraping from ESPNcricinfo APIs.
Pre-aggregated stats to reduce Shiny server load.
Used faceted ggplot2 charts for clean comparisons.
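
The pre-aggregation idea can be sketched as a single pass over ball-by-ball records that produces the two headline KPIs. The app itself is written in R; this Python version is only illustrative, and the record fields (`batter`, `bowler`, `runs`) are hypothetical. It also assumes every record is a legal delivery with no extras, which real scorecards would need to handle.

```python
from collections import defaultdict


def aggregate_kpis(deliveries):
    """Roll ball-by-ball records up into batting strike rate and bowling economy."""
    batting = defaultdict(lambda: {"runs": 0, "balls": 0})
    bowling = defaultdict(lambda: {"runs": 0, "balls": 0})
    for d in deliveries:
        batting[d["batter"]]["runs"] += d["runs"]
        batting[d["batter"]]["balls"] += 1
        bowling[d["bowler"]]["runs"] += d["runs"]
        bowling[d["bowler"]]["balls"] += 1
    # Strike rate: runs per 100 balls faced. Economy: runs conceded per 6-ball over.
    strike_rate = {p: 100 * s["runs"] / s["balls"] for p, s in batting.items()}
    economy = {p: 6 * s["runs"] / s["balls"] for p, s in bowling.items()}
    return strike_rate, economy


# One over bowled by X to batters A and B (made-up data).
over = [
    {"batter": "A", "bowler": "X", "runs": 4},
    {"batter": "A", "bowler": "X", "runs": 0},
    {"batter": "A", "bowler": "X", "runs": 1},
    {"batter": "B", "bowler": "X", "runs": 2},
    {"batter": "B", "bowler": "X", "runs": 1},
    {"batter": "A", "bowler": "X", "runs": 0},
]
strike_rate, economy = aggregate_kpis(over)
```

Computing these tables once, outside the reactive code, means the dashboard only filters small summary tables instead of re-scanning every delivery on each interaction.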

Key Features

Interactive match filters
Batting + bowling KPIs
Exportable performance reports

Neural Networks - Applied Projects (5-in-1)

A collection of five applied deep learning experiments, covering feedforward networks, variational autoencoders, graph neural networks, and diffusion-based distribution mapping.

Python, TensorFlow, Keras, PyTorch, NumPy

Deep Learning

2024

Neural Networks - Applied Projects

Five portfolio projects exploring neural networks in diverse contexts: from generative models to graph neural networks and diffusion processes.

Spherical Projection of 3D Gaussian Data: Learned a mapping from 3D Gaussians to the unit sphere. Tuned architecture (3→20→20→3) with Adam; achieved test MSE ≈ 0.000871.
Generative Modeling: Converted Gaussian ↔ Uniform distributions and trained a VAE for 1D→2D Gaussian mapping.
Shape Generation: Compared Autoencoders vs Variational Autoencoders on generating 28×28 images of shapes, evaluating MSE, SSIM, and latent interpolations.
Tracking Random Walks with GNNs: Applied Graph Attention Networks to predict visited sensors in noisy random-walk simulations; ~80% accuracy after tuning.
Distribution Mapping with Diffusion: Compared SDE vs ODE diffusion processes for mapping distributions (Gaussian→Dog, Cat→Dog); ODE consistently outperformed in metrics and speed.
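
For the spherical projection task, the training pairs and the evaluation metric can be sketched without any deep learning framework. The sketch assumes standard-normal inputs and assumes the target is the radial projection x / ||x||; neither detail is stated above, and the function names are hypothetical. The 3→20→20→3 network would then be trained to regress `targets` from `inputs`.

```python
import math
import random


def project_to_sphere(point):
    """Scale a 3D point to unit length (assumed target: radial projection)."""
    norm = math.sqrt(sum(c * c for c in point))
    return tuple(c / norm for c in point)


def make_dataset(n_samples, seed=0):
    """Draw standard-normal 3D points and pair each with its projection."""
    rng = random.Random(seed)
    inputs = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(n_samples)]
    targets = [project_to_sphere(p) for p in inputs]
    return inputs, targets


def mse(predictions, targets):
    """Mean squared error averaged over all samples and coordinates."""
    total = sum(
        (p - t) ** 2
        for pred, tgt in zip(predictions, targets)
        for p, t in zip(pred, tgt)
    )
    return total / (len(targets) * 3)


inputs, targets = make_dataset(100)
# Baseline: predicting the raw input instead of its projection.
baseline_error = mse(inputs, targets)
```

Comparing a trained model's test MSE against a naive baseline like this one helps put a small reported error into context.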

Real-world application: Skills here mirror how deep learning powers computer vision (projection, generative VAEs), IoT & networks (GNNs), and state-of-the-art generative AI (diffusion models like Stable Diffusion).

The Challenge

Overfitting: Small synthetic datasets made models prone to memorization.
Architecture search: Balancing depth, width, and activation functions for convergence.
Training stability: Diffusion and VAE models required careful loss balancing and normalization.
Scalability: CNNs and GNNs needed GPU acceleration for practical runtimes.
Interpretability: Latent spaces and diffusion outputs needed visualization for trust and validation.

The Solution

Regularization: Dropout, L2 penalties, and data augmentation mitigated overfitting.
Optimization: Used learning-rate schedulers and Adam optimizer for stable convergence.
Comparisons: Benchmarked AE vs VAE, SDE vs ODE, and showed trade-offs in metrics (MSE, SSIM, Wasserstein).
Visualization: Latent interpolations, t-SNE, and distributional plots validated learned representations.
Deployment-ready: Each project structured as independent, reproducible Colab notebooks.

Key Features

Covers 5 neural net paradigms: AE, VAE, GNN, Diffusion, FFN
Achieved test MSE ≈ 0.000871 (spherical projection)
Shape generation SSIM > 0.993
GNN tracking ~80% accuracy
Diffusion ODE models faster + more accurate than SDE

Spiral Motion Analysis

Quantitative analysis of hand-drawn spirals to detect motor control differences between Parkinson’s patients and healthy controls.

R, ggplot2, ClusterCrit, t-SNE, DBSCAN

Biomedical Data Science

2024

Spiral Motion Analysis for Parkinson’s vs Controls

Digitized spirals are more than drawings — they’re biomarkers. This project transforms stylus signal files into structured features that capture motor instability in Parkinson’s patients.

We parsed raw .svc files capturing spiral drawings into time series of x, y, timestamp, pressure, azimuth, and altitude. The cleaned data was enriched with features such as speed, acceleration, and jaggedness. Clustering and dimensionality reduction revealed clear separation between Parkinson’s patients and controls.

Real-world application: Moving from subjective neurological assessments toward objective, digital biomarkers for early Parkinson’s screening and remote monitoring.

The Challenge

Noisy stylus data: Timestamp irregularities, pen-up/down states, and outliers across patients.
Feature variability: Speed and pressure signals highly heterogeneous between individuals.
Class imbalance: Fewer Parkinson’s cases relative to controls made clustering validation difficult.
Subjectivity: Traditional clinical spiral tests depend on human interpretation.

The Solution

Preprocessing: Outlier removal with IQR; alignment of timestamps; separation of Ill vs Control IDs.
Feature engineering: Computed velocity, acceleration, pen pressure variation, active speed, and coefficient of variation of motion.
Clustering: Applied K-Means, DBSCAN, and Agglomerative (Ward’s); validated with Silhouette, Davies–Bouldin, Dunn, ARI.
Visualization: Used PCA and t-SNE to reveal latent groupings between Parkinson’s and Control subjects.
Digital biomarkers: Identified jaggedness and pressure variability as strong indicators of motor instability.
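
Two of the engineered features, point-to-point speed and the coefficient of variation of motion, can be sketched as below. The analysis itself was done in R; this Python version is illustrative only. It assumes samples arrive as (x, y, timestamp) tuples in consistent units, and the sample traces are made up.

```python
import math
import statistics


def speeds(samples):
    """Point-to-point speed from (x, y, t) samples; skips non-positive time steps."""
    result = []
    for (x0, y0, t0), (x1, y1, t1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt > 0:
            result.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return result


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; higher means less steady motion."""
    return statistics.pstdev(values) / statistics.mean(values)


# A steady trace moves 5 units per tick; the shaky one stalls and then jumps.
steady = [(0, 0, 0), (3, 4, 1), (6, 8, 2), (9, 12, 3)]
shaky = [(0, 0, 0), (3, 4, 1), (3, 4, 2), (9, 12, 3)]
cv_steady = coefficient_of_variation(speeds(steady))
cv_shaky = coefficient_of_variation(speeds(shaky))
```

Both traces cover the same path in the same total time, yet only the shaky one has a non-zero coefficient of variation, which is the kind of contrast a motor-instability feature is meant to capture.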

Key Features

Parses raw .svc stylus spiral files
Feature engineering for speed, pressure, jaggedness
Multi-algorithm clustering & validation metrics
Dimensionality reduction with PCA & t-SNE
Demonstrates spirals as digital biomarkers

Results at a Glance

Method	Silhouette ↑	Davies–Bouldin ↓	Dunn ↑	ARI ↑
K-Means	0.640	0.996	0.288	0.464
DBSCAN	0.624	1.063	0.393	0.464
AHC (Ward)	0.740	0.891	0.421	0.474

↑ higher is better, ↓ lower is better. Ward’s method gave the clearest cluster separation.

Music Genre Classification (Kaggle)

End-to-end pipeline for Kaggle competition: feature engineering, model building, and leaderboard submission for music genre prediction.

Python, Pandas, Scikit-learn, CatBoost, Keras

Audio ML (Tabular)

2024

Music Genre Classification - Kaggle

Competition project to classify songs into genres using Kaggle-provided CSV datasets (audio metadata + features). Pipeline included EDA, preprocessing, feature engineering, and multiple modeling strategies culminating in a top CatBoost submission.

We worked with structured .csv data provided by Kaggle (no raw audio). The dataset contained ~40 features such as tempo, duration_ms, loudness, acousticness, speechiness, instrumentalness, etc. The workflow included missing value handling, distribution corrections, feature engineering (e.g., binning acousticness/instrumentalness with Gaussian Mixture Models), and dimensionality reduction before training models.

Real-world application: Automated tagging of music libraries, powering recommendation systems, playlist curation, and catalog management at scale.

The Challenge

Missingness: Features like tempo, duration_ms, and artist_name had thousands of invalid or missing entries.
Skewed distributions: Variables like speechiness and liveness had heavy tails requiring transformation.
Bimodal features: instrumentalness showed clear two-mode behavior, challenging traditional scaling methods.
Feature redundancy: Strong collinearity (e.g., loudness vs energy ~0.81) risked leakage and instability.
Multi-class complexity: Several genres had overlapping distributions of acousticness/speechiness, complicating separation.

The Solution

Preprocessing: Normalized invalid values (“?”, -1) → NaN; imputed with Iterative Imputer to capture feature relationships.
Feature Engineering:
- Dropped collinear features (energy).
- Box-Cox/Power transforms for skewed columns.
- GMM binning for instrumentalness and acousticness.
- Mutual Information + RFE for feature selection (~15 features kept).
Modeling: Benchmarked Logistic Regression, Random Forest, Gradient Boosting, SVM, CatBoost, and Neural Networks (Keras dense model with BN + dropout).
Final model: CatBoostClassifier (500 iterations, no artist_name encoding) → 81% validation accuracy, 81.6% on the Kaggle leaderboard, ROC-AUC 98.4%.
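
The first preprocessing step, turning placeholder values into NaN before imputation, can be sketched in plain Python. The helper names and sample rows are hypothetical. Treating -1 as invalid is made configurable per column here, on the assumption that it is a sentinel for fields like duration_ms but a legitimate value for fields like loudness.

```python
import math


def clean_numeric(raw, invalid_numbers=(-1.0,)):
    """Convert a raw CSV cell to float, mapping placeholders to NaN."""
    text = str(raw).strip()
    if text in ("", "?"):
        return math.nan
    try:
        number = float(text)
    except ValueError:
        return math.nan
    return math.nan if number in invalid_numbers else number


def clean_column(rows, column, invalid_numbers=(-1.0,)):
    """Apply clean_numeric to one column of a list of row dictionaries."""
    return [clean_numeric(row[column], invalid_numbers) for row in rows]


rows = [
    {"tempo": "120.5", "duration_ms": "-1", "loudness": "-1"},
    {"tempo": "?", "duration_ms": "210000", "loudness": "-7.2"},
]
tempo = clean_column(rows, "tempo")
duration = clean_column(rows, "duration_ms")
# For loudness, -1 is treated as a real measurement, so no numeric sentinel is passed.
loudness = clean_column(rows, "loudness", invalid_numbers=())
```

Once every placeholder is a proper NaN, an imputer can treat all missing entries uniformly instead of learning from sentinel values as if they were data.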

Key Features

Advanced preprocessing: imputation, transformation, GMM binning
Ensemble and boosting models benchmarked
CatBoost with native categorical handling chosen
Achieved top 20% leaderboard rank (81.6% acc)

Results Summary

Model	Validation Acc	Kaggle	Notes
Logistic Regression	61.8%	–	Simple baseline
Random Forest	69%	65%	Tree ensemble
Gradient Boosting	76%	65.3%	Strong baseline
Voting Ensemble	76.9%	69%	CatBoost+GBC+HistGB
Neural Net (Keras)	79.4%	79.4%	Dense NN + BN/Dropout
CatBoost (final)	81%	81.6%	ROC-AUC 98.4%

Fruit Image Classification

End-to-end deep learning pipeline for fruit image classification using CNNs and transfer learning with ResNet-50.

PyTorch, ResNet-50, Python, NumPy, Matplotlib

Computer Vision

2024

Fruit Image Classification with CNNs (ResNet-50)

Developed an image classifier to distinguish between cherries, tomatoes, and strawberries, comparing MLP, custom CNN, and ResNet-50 architectures. Final model achieved 97.6% test accuracy.

This project demonstrates the full workflow of modern computer vision: exploratory data analysis, preprocessing & augmentation, baseline models (MLP, CNN), and transfer learning with ResNet-50.

Dataset: 6,000 images (~300×300, RGB) from Flickr, evenly split across 3 fruit classes. Train/test = 4,500/1,500.

Final results: ResNet-50 fine-tuned → 97.65% accuracy, with per-class metrics: Cherry 95.1%, Tomato 99.6%, Strawberry 98.2%.

The Challenge

Dataset variability: Different lighting, angles, and backgrounds across Flickr images.
Model overfitting: Custom CNN struggled without augmentation and regularization.
Compute cost: Training ResNet-50 on 6,000 images required GPU acceleration.
Generalization: Ensuring robustness beyond the training dataset.

The Solution

Preprocessing: Resized to 300×300, normalized with dataset mean/std.
Augmentation: Random flips, rotations, resized crops, and color jitter to increase robustness.
Baselines: Implemented MLP and custom CNN for comparison.
Transfer Learning: Fine-tuned ResNet-50 (ImageNet pretrain) with new 3-class output head.
Training strategy: Adam optimizer, StepLR scheduler, weight decay regularization, 5-fold CV checks.
Deployment: Saved as model.pth, test script provided for reproducible evaluation.
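
The transfer-learning head swap and the step-decay schedule can be sketched as follows. The base learning rate, step size, and decay factor are placeholder values rather than the project's settings, the helper names are hypothetical, and the `ResNet50_Weights.DEFAULT` enum assumes a torchvision version that provides the weights API (0.13 or later).

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate under a StepLR-style schedule: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def build_resnet50_classifier(num_classes=3):
    """Load ImageNet-pretrained ResNet-50 and replace its final layer with a new head."""
    # Imported lazily so the schedule helper works without PyTorch installed.
    import torch.nn as nn
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


# Placeholder hyperparameters: the rate drops tenfold at epochs 5 and 10.
schedule = [step_lr(1e-3, epoch, step_size=5, gamma=0.1) for epoch in (0, 4, 5, 10)]
```

Only the final fully connected layer is new; the convolutional backbone starts from its pretrained weights, so a modest dataset of a few thousand images can be enough for fine-tuning.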

Key Features

Progression: MLP → custom CNN → ResNet-50
Augmentation pipeline for generalization
Fine-tuned ResNet-50: 97.6% test accuracy
Reproducible results with .pth model weights
