Aspect Sentiment BI

A real-time pipeline that ingests Reddit comments, processes them with NLP, stores results in SQLite, and presents insights in BI dashboards.

Python, Airflow, SQLite, Streamlit, Power BI
NLP & BI
2025

Transforming unstructured Reddit discussions into structured, actionable insights through NLP pipelines and BI dashboards.

The project builds a pipeline that automatically ingests Reddit comments, applies NLP tasks such as aspect extraction, sentiment analysis, and topic modeling, and stores the results in a database. The outputs are then visualized using Streamlit dashboards for public access and Power BI for advanced reporting.

Real-world application: Product managers and analysts can monitor customer feedback at scale, tracking which features (e.g., “battery,” “screen,” “pricing”) drive positive or negative sentiment, enabling data-driven business decisions.

Challenges

  • Unstructured data: Reddit comments are noisy, with slang, emojis, and inconsistent formatting.
  • Aspect extraction: Identifying product features (e.g., “camera” vs. “battery life”) from free text is non-trivial.
  • Model scalability: Running topic models and transformer-based sentiment models in real time is computationally heavy.
  • Data freshness: Stakeholders required up-to-date insights, not monthly snapshots.
  • Sharing limitations: Power BI licensing restricted wide sharing outside the org.

Solutions

  • Automated ingestion: Airflow DAGs scheduled regular data pulls from the Reddit API.
  • NLP pipeline: KeyBERT for aspect extraction, RoBERTa for sentiment classification, BERTopic for clustering themes.
  • Lightweight persistence: SQLite used for portability and quick querying during prototyping.
  • Dashboards: Streamlit for open access and interactivity; Power BI for richer analytics where licensing allowed.
  • Scalability plan: Architecture designed to move from SQLite to Postgres or cloud DB as data grows.
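
The persistence step can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the project's actual schema: the table name `aspect_sentiment`, its columns, and the helper functions are all hypothetical, and the rows stand in for output of the NLP step.

```python
import sqlite3

# Hypothetical schema: one row per (comment, aspect) pair produced by the NLP step.
SCHEMA = """
CREATE TABLE IF NOT EXISTS aspect_sentiment (
    comment_id TEXT NOT NULL,
    aspect     TEXT NOT NULL,
    label      TEXT NOT NULL CHECK (label IN ('positive', 'neutral', 'negative')),
    score      REAL NOT NULL,
    PRIMARY KEY (comment_id, aspect)
);
"""


def store_results(conn, rows):
    """Insert (comment_id, aspect, label, score) tuples; re-runs overwrite duplicates."""
    conn.executemany(
        "INSERT OR REPLACE INTO aspect_sentiment VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def sentiment_by_aspect(conn):
    """Return {aspect: (n_positive, n_negative)}, the kind of query a dashboard might run."""
    cur = conn.execute(
        """
        SELECT aspect,
               SUM(label = 'positive') AS n_pos,
               SUM(label = 'negative') AS n_neg
        FROM aspect_sentiment
        GROUP BY aspect
        ORDER BY aspect
        """
    )
    return {aspect: (n_pos, n_neg) for aspect, n_pos, n_neg in cur}


# Made-up rows standing in for model output.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_results(conn, [
    ("c1", "battery", "negative", 0.91),
    ("c2", "battery", "positive", 0.77),
    ("c2", "screen", "positive", 0.88),
    ("c3", "battery", "negative", 0.64),
])
summary = sentiment_by_aspect(conn)
```

Keying rows on (comment_id, aspect) keeps scheduled re-pulls idempotent, and the same SQL should carry over to Postgres with minor changes if the data outgrows SQLite.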

Key Features

  • Automated Reddit ingestion
  • Aspect-based sentiment analysis
  • Topic modeling with BERTopic
  • Streamlit + Power BI dashboards

Smart Invoice Extractor

OCR + NLP pipeline to parse invoices into structured fields, powered by FastAPI, Docker, and Streamlit.

OCR, NLP, FastAPI, Docker, Streamlit
OCR & NLP
2025

Automating financial workflows by converting unstructured invoice scans into structured fields such as invoice number, vendor details, dates, and totals — reducing manual effort and errors in accounting pipelines.

By combining Tesseract OCR with spaCy NLP, this solution extracts structured data from invoices. The backend was built with FastAPI, deployed in Docker containers for portability, and integrated with a Streamlit dashboard for end-user validation and export to CSV/BI tools.

Real-world application: SMEs and finance teams can auto-process vendor invoices, validate tax compliance, and feed clean structured data directly into ERP systems like SAP or QuickBooks.

Challenges

  • Low-quality scans: Blurred, rotated, or noisy invoice images affected OCR accuracy.
  • No standard format: Layouts varied widely by vendor/country, making extraction rules brittle.
  • Entity ambiguity: Invoice numbers were easily confused with PO numbers, and invoice dates with due dates.
  • Currency/localization: Handling multiple currencies, date formats, and languages.
  • End-user adoption: Users needed a simple interface to review and correct outputs.

Solutions

  • OCR preprocessing: Applied OpenCV-based denoising, thresholding, and skew correction before text extraction.
  • Hybrid NLP: Regex patterns for dates/numbers + spaCy NER for vendor names and tax fields.
  • Validation rules: Cross-checked invoice totals against the sum of line items to flag extraction errors.
  • API-first: FastAPI endpoints for batch uploads, making it extensible for ERP integration.
  • Scalability: Dockerized containers deployable on local machines or cloud infra.
  • User dashboard: Streamlit UI allowed accountants to upload invoices and download verified structured data.
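
The regex side of the hybrid extraction and the totals cross-check could look roughly like the sketch below. The patterns, function names, and sample text are illustrative assumptions; real invoices would need per-vendor and per-locale variants, and the spaCy NER step is not shown.

```python
import re
from decimal import Decimal

# Illustrative patterns only (assumptions, not the project's rules).
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})\b")
TOTAL_RE = re.compile(r"(?i)\btotal\b[^\d\n]*(\d+(?:,\d{3})*(?:\.\d{2})?)")
# A line item here is "description<2+ spaces>amount" on a single line.
LINE_ITEM_RE = re.compile(
    r"(?m)^[ \t]*(.+?)[ \t]{2,}(\d+(?:,\d{3})*\.\d{2})[ \t]*$"
)


def parse_amount(text):
    """Convert '1,030.50' to Decimal('1030.50'); Decimal avoids float rounding."""
    return Decimal(text.replace(",", ""))


def extract_fields(ocr_text):
    """Pull dates, the stated total, and line items out of OCR text."""
    total_match = TOTAL_RE.search(ocr_text)
    items = [
        (desc.strip(), parse_amount(amount))
        for desc, amount in LINE_ITEM_RE.findall(ocr_text)
        if not desc.strip().lower().startswith("total")
    ]
    return {
        "dates": DATE_RE.findall(ocr_text),
        "total": parse_amount(total_match.group(1)) if total_match else None,
        "line_items": items,
    }


def totals_match(fields, tolerance=Decimal("0.01")):
    """Validation rule: the stated total should equal the sum of line items."""
    if fields["total"] is None:
        return False
    line_sum = sum((amount for _, amount in fields["line_items"]), Decimal("0"))
    return abs(line_sum - fields["total"]) <= tolerance


sample = """Invoice Date: 2025-03-14
Widget A      120.00
Widget B    1,030.50
Total: 1,150.50
"""
fields = extract_fields(sample)
is_consistent = totals_match(fields)
```

A failed check does not say which field is wrong, so in a review UI it is better treated as a flag for manual correction than as an automatic rejection.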

Key Features

  • OCR preprocessing (OpenCV + Tesseract)
  • NLP + regex hybrid entity extraction
  • FastAPI REST API with Docker
  • Streamlit dashboard for validation

smartcurvefit (R Package)

Robust nonlinear curve fitting R package with Rcpp backend and automated loss selection.

smartcurvefit
R, Rcpp, C++, ggplot2
R Package
2025

An R package that enables robust nonlinear curve fitting with multiple loss functions (L1, L2, Huber) and automatic cross-validated loss selection using an Rcpp backend.

Developed as part of a graduate-level Data Science course, smartcurvefit provides robust nonlinear regression functions with S3 objects, summaries, diagnostics, and visualization. Models supported include power-law, exponential, and user-defined forms.

Real-world application: Scientists and analysts can use it to fit growth curves, reliability functions, and dose-response models with resilience to outliers.

Challenges

  • Nonlinear regression is sensitive to outliers.
  • Users often lack clarity on which loss function to choose.
  • Balancing performance with package usability and documentation.
  • Integration of Rcpp with R S3 methods while maintaining CRAN standards.

Solutions

  • Implemented L1, L2, and Huber loss in C++ for efficiency.
  • Cross-validation automatically selects the best loss for given data.
  • Provided S3 methods: `print`, `summary`, `plot`, `predict` for user-friendliness.
  • Detailed vignettes and examples for reproducibility.
  • Future-ready: allows extension to user-defined nonlinear models.
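
For readers unfamiliar with the three losses, the sketch below shows their standard textbook definitions in Python. The package implements them in C++, and its exact scaling conventions (for example the 0.5 factor on the squared term, or the default Huber threshold `delta`) are assumptions here.

```python
def l1_loss(residual):
    """Absolute error: grows linearly, so outliers have bounded influence."""
    return abs(residual)


def l2_loss(residual):
    """Squared error (with the conventional 0.5 factor): outliers dominate the fit."""
    return 0.5 * residual * residual


def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it; delta=1.0 is an assumed default."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


def total_loss(residuals, loss):
    """Sum a per-residual loss over all residuals."""
    return sum(loss(r) for r in residuals)


# Three small residuals and one outlier: the outlier dominates L2 but not L1 or Huber.
residuals = [0.1, -0.2, 0.3, 8.0]
totals = {
    "L1": total_loss(residuals, l1_loss),
    "L2": total_loss(residuals, l2_loss),
    "Huber": total_loss(residuals, huber_loss),
}
```

On this toy input the single outlier contributes almost all of the L2 total, which is why a robust loss can change the fitted curve substantially.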

Key Features

  • Rcpp-powered optimization for speed
  • Cross-validated loss selection (L1, L2, Huber)
  • Intuitive S3 object interface
  • Publication-ready plots with ggplot2

Heart Attack Prediction with Logistic Regression (Spark MLlib)

Predicting heart attack risk using Big Data and Machine Learning on a Hadoop cluster with Apache Spark MLlib.

Heart Attack Prediction with Spark
Apache Spark, Hadoop, Python, MLlib
Big Data & Healthcare
2025

Heart Attack Prediction with Spark MLlib

Large-scale predictive modeling of cardiovascular risk using Apache Spark MLlib logistic regression pipelines on CDC BRFSS 2022 survey data (~450,000 records).

Built an end-to-end ML pipeline in Apache Spark MLlib to classify whether a patient had a heart attack, using the CDC BRFSS 2022 dataset (450k+ rows, 40 features). Explored logistic regression variants including:

  • Raw Logistic Regression
  • Logistic Regression + MinMax Scaling
  • Logistic Regression + PCA (k=5,10,15)

Real-world application: Scalable risk models to help healthcare providers identify high-risk patients for preventive interventions across populations.

Challenges

  • Imbalanced target: Only ~6% positive cases (`HadHeartAttack = Yes`).
  • Large dataset: 139MB, too big for in-memory Python-only workflows.
  • Feature redundancy: Many correlated survey features.
  • Healthcare context: Need for interpretable models rather than black-box predictions.

Solutions

  • Distributed preprocessing pipeline using Spark DataFrames for cleaning & feature engineering.
  • Oversampled the minority class to address the class imbalance.
  • Scaling + PCA experiments to reduce redundancy & improve efficiency.
  • Evaluation: Accuracy, ROC-AUC, confusion matrix, runtime benchmarks.
  • Result: Logistic Regression + Scaling achieved the best AUC (0.8813) and interpretability.
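
The idea behind random oversampling can be shown in plain Python. This is not the Spark code used in the project: the function name and the toy rows are hypothetical, and on a cluster the same effect would be achieved with DataFrame operations.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Resample smaller classes with replacement until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the shortfall with replacement; k is 0 for the majority class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the ~6% positive rate described above.
rows = [{"HadHeartAttack": "No"} for _ in range(94)]
rows += [{"HadHeartAttack": "Yes"} for _ in range(6)]
balanced = oversample_minority(rows, "HadHeartAttack")
counts = Counter(row["HadHeartAttack"] for row in balanced)
```

Oversampling is normally applied to the training split only, so that duplicated positives cannot appear in both training and evaluation data and inflate the reported metrics.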

Key Features

  • End-to-end Spark MLlib pipeline
  • Handles 450k+ rows on Hadoop cluster
  • Logistic Regression + Scaling = best AUC 0.8813
  • Healthcare-ready & interpretable

US Crime Analysis with GAMs

Generalized Additive Models applied to US crime dataset to analyze socio-economic predictors.

US Crime GAM
R, mgcv, dplyr, ggplot2
Statistical Modeling
2025

Analyzed violent crime rates across US communities using socio-economic predictors with flexible GAM models.

Modeled violent crime as a function of poverty, unemployment, and housing factors. GAMs allowed non-linear relationships to emerge without forcing strict parametric assumptions.

Real-world application: Policy makers can identify non-linear effects (e.g., how marginal increases in poverty disproportionately increase crime).

Challenges

  • Multicollinearity between socio-economic variables.
  • Explaining GAM smooth terms to non-technical stakeholders.
  • Dealing with missing community-level data.

Solutions

  • Filtered features using VIF thresholds.
  • Visualized smooth functions to explain non-linear effects.
  • Applied multiple imputation for missing values.
  • Validated with cross-validation to ensure generalizability.
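
To make the VIF filter concrete, here is a small dependency-free Python sketch (the project itself is in R). It uses the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse correlation matrix. The column names, the toy values, and the threshold of 5 are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF} using the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Toy predictors: poverty and unemployment are nearly collinear by construction.
columns = {
    "poverty": [10, 12, 15, 20, 25, 30],
    "unemployment": [4, 5, 6, 8, 10, 12],
    "vacancy": [3, 1, 4, 1, 5, 9],
}
scores = vif(columns)
flagged = [name for name, value in scores.items() if value > 5]  # assumed threshold
```

In practice one variable is usually dropped at a time and the VIFs recomputed, because removing one member of a collinear pair lowers the VIF of the other.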

Key Features

  • Non-linear GAM modeling
  • Variable selection with VIF
  • Cross-validated model performance

Cricket Analytics App

An R Shiny dashboard for cricket match analytics and player performance insights.

Cricket Analytics
R, Shiny, dplyr, ggplot2
Sports Analytics
2024

An interactive platform for analyzing player and team performance metrics in cricket matches.

The app allows users to filter by tournament, player, or match, visualizing batting strike rates, bowling economy, and win contributions. Built entirely in R Shiny for live interactivity.

Real-world application: Coaches and analysts can quickly evaluate strategy decisions (e.g., batting order, bowler matchups) during live series.

Challenges

  • Data scraping: cricket data APIs not standardized.
  • Performance bottlenecks in Shiny for large datasets.
  • Visual overload when comparing multiple players.

Solutions

  • Automated scraping from ESPNcricinfo APIs.
  • Pre-aggregated stats to reduce Shiny server load.
  • Used faceted ggplot2 charts for clean comparisons.
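
The two headline KPIs follow standard cricket definitions, sketched below in Python (the app itself is written in R). The function names are hypothetical. Overs are taken as strings such as "3.4", meaning 3 overs and 4 balls, because treating them as decimals would miscount balls.

```python
def strike_rate(runs, balls_faced):
    """Batting strike rate: runs scored per 100 balls faced."""
    if balls_faced == 0:
        return None
    return 100.0 * runs / balls_faced


def overs_to_balls(overs):
    """Convert cricket overs notation ('3.4' = 3 overs + 4 balls) to a ball count."""
    whole, _, part = str(overs).partition(".")
    balls = int(part) if part else 0
    if balls >= 6:
        raise ValueError(f"invalid overs value: {overs!r}")
    return int(whole) * 6 + balls


def economy_rate(runs_conceded, overs):
    """Bowling economy: runs conceded per six-ball over."""
    balls = overs_to_balls(overs)
    if balls == 0:
        return None
    return 6.0 * runs_conceded / balls


batting = strike_rate(45, 30)     # 45 runs off 30 balls
bowling = economy_rate(33, "3.4")  # 33 runs conceded in 22 balls
```

Computing these once per player and match, then storing the results, is one way to realise the pre-aggregation mentioned above so the dashboard only filters and plots.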

Key Features

  • Interactive match filters
  • Batting + bowling KPIs
  • Exportable performance reports

Neural Networks - Applied Projects (5-in-1)

A collection of five applied deep learning experiments, covering feedforward networks, variational autoencoders, graph neural networks, and diffusion-based distribution mapping.

Neural Networks Projects
Python, TensorFlow, Keras, PyTorch, NumPy
Deep Learning
2024

Neural Networks - Applied Projects

Five portfolio projects exploring neural networks in diverse contexts: from generative models to graph neural networks and diffusion processes.

  • Spherical Projection of 3D Gaussian Data: Learned a mapping from 3D Gaussians to the unit sphere. Tuned architecture (3→20→20→3) with Adam; achieved test MSE ≈ 0.000871.
  • Generative Modeling: Converted Gaussian ↔ Uniform distributions and trained a VAE for 1D→2D Gaussian mapping.
  • Shape Generation: Compared Autoencoders vs Variational Autoencoders on generating 28×28 images of shapes, evaluating MSE, SSIM, and latent interpolations.
  • Tracking Random Walks with GNNs: Applied Graph Attention Networks to predict visited sensors in noisy random-walk simulations; ~80% accuracy after tuning.
  • Distribution Mapping with Diffusion: Compared SDE vs ODE diffusion processes for mapping distributions (Gaussian→Dog, Cat→Dog); the ODE formulation consistently outperformed the SDE one on both metrics and speed.

Real-world application: Skills here mirror how deep learning powers computer vision (projection, generative VAEs), IoT & networks (GNNs), and state-of-the-art generative AI (diffusion models like Stable Diffusion).

Challenges

  • Overfitting: Small synthetic datasets made models prone to memorization.
  • Architecture search: Balancing depth, width, and activation functions for convergence.
  • Training stability: Diffusion and VAE models required careful loss balancing and normalization.
  • Scalability: CNNs and GNNs needed GPU acceleration for practical runtimes.
  • Interpretability: Latent spaces and diffusion outputs needed visualization for trust and validation.

Solutions

  • Regularization: Dropout, L2 penalties, and data augmentation mitigated overfitting.
  • Optimization: Used learning-rate schedulers and Adam optimizer for stable convergence.
  • Comparisons: Benchmarked AE vs VAE, SDE vs ODE, and showed trade-offs in metrics (MSE, SSIM, Wasserstein).
  • Visualization: Latent interpolations, t-SNE, and distributional plots validated learned representations.
  • Deployment-ready: Each project structured as independent, reproducible Colab notebooks.
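
As a concrete picture of the spherical projection task, the sketch below generates 3D Gaussian samples and the targets a network would be trained to reproduce. It assumes the target is radial normalisation, x / ||x||, which the description implies but does not state. The network itself is not shown, and the sample count and seed are arbitrary.

```python
import math
import random


def project_to_sphere(point):
    """Map a non-zero 3D point to the unit sphere by dividing by its norm."""
    norm = math.sqrt(sum(c * c for c in point))
    if norm == 0.0:
        raise ValueError("the origin has no projection onto the sphere")
    return tuple(c / norm for c in point)


def mse(predictions, targets):
    """Mean squared error averaged over every coordinate of every sample."""
    total = sum(
        (p - t) ** 2
        for pred, target in zip(predictions, targets)
        for p, t in zip(pred, target)
    )
    return total / (len(predictions) * len(predictions[0]))


rng = random.Random(42)
inputs = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(1000)]
targets = [project_to_sphere(x) for x in inputs]

# Baseline for comparison: an identity "model" that returns its input unchanged.
baseline_error = mse(inputs, targets)
```

A trained model's test MSE can be read against a trivial baseline like this one to judge how much of the mapping the network has actually learned.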

Key Features

  • Covers 5 neural net paradigms: AE, VAE, GNN, Diffusion, FFN
  • Achieved test MSE ≈ 0.000871 (spherical projection)
  • Shape generation SSIM > 0.993
  • GNN tracking ~80% accuracy
  • Diffusion ODE models faster + more accurate than SDE

Spiral Motion Analysis

Quantitative analysis of hand-drawn spirals to detect motor control differences between Parkinson’s patients and healthy controls.

R, ggplot2, ClusterCrit, t-SNE, DBSCAN
Biomedical Data Science
2024

Spiral Motion Analysis for Parkinson’s vs Controls

Digitized spirals are more than drawings — they’re biomarkers. This project transforms stylus signal files into structured features that capture motor instability in Parkinson’s patients.

We parsed raw .svc files capturing spiral drawings into time series of x, y, timestamp, pressure, azimuth, and altitude. The cleaned data was enriched with features such as speed, acceleration, and jaggedness. Clustering and dimensionality reduction revealed clear separation between Parkinson’s patients and controls.

Real-world application: Moving from subjective neurological assessments toward objective, digital biomarkers for early Parkinson’s screening and remote monitoring.

Challenges

  • Noisy stylus data: Timestamp irregularities, pen-up/down states, and outliers across patients.
  • Feature variability: Speed and pressure signals highly heterogeneous between individuals.
  • Class imbalance: Fewer Parkinson’s cases relative to controls made clustering validation difficult.
  • Subjectivity: Traditional clinical spiral tests depend on human interpretation.

Solutions

  • Preprocessing: Outlier removal with IQR; alignment of timestamps; separation of Ill vs Control IDs.
  • Feature engineering: Computed velocity, acceleration, pen pressure variation, active speed, and coefficient of variation of motion.
  • Clustering: Applied K-Means, DBSCAN, and Agglomerative (Ward’s); validated with Silhouette, Davies–Bouldin, Dunn, ARI.
  • Visualization: Used PCA and t-SNE to reveal latent groupings between Parkinson’s and Control subjects.
  • Digital biomarkers: Identified jaggedness and pressure variability as strong indicators of motor instability.
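
The kinematic features can be derived from consecutive (x, y, timestamp) samples by finite differences. The Python sketch below (the project itself is in R) shows one plausible way to do it; the function names are hypothetical, and it assumes samples are already sorted by time after cleaning.

```python
import math
import statistics


def kinematics(samples):
    """Return (speeds, accelerations) from (x, y, t) samples sorted by t.

    Pairs with a non-positive time step (duplicate timestamps) are skipped.
    """
    speeds, times = [], []
    for (x0, y0, t0), (x1, y1, t1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
        times.append(t1)
    accelerations = [
        (speeds[i + 1] - speeds[i]) / (times[i + 1] - times[i])
        for i in range(len(speeds) - 1)
    ]
    return speeds, accelerations


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (NaN if the mean is zero)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("nan")
    return statistics.pstdev(values) / mean


# Toy trace: the pen moves, pauses, then moves faster.
trace = [(0, 0, 0), (3, 4, 1), (3, 4, 2), (9, 12, 3)]
speeds, accelerations = kinematics(trace)
speed_cv = coefficient_of_variation(speeds)
```

A high coefficient of variation in speed is one simple proxy for irregular motion, though the jaggedness measure used in the project may be defined differently.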

Key Features

  • Parses raw .svc stylus spiral files
  • Feature engineering for speed, pressure, jaggedness
  • Multi-algorithm clustering & validation metrics
  • Dimensionality reduction with PCA & t-SNE
  • Demonstrates spirals as digital biomarkers

Results at a Glance

Method       Silhouette ↑   Davies–Bouldin ↓   Dunn ↑   ARI ↑
K-Means      0.640          0.996              0.288    0.464
DBSCAN       0.624          1.063              0.393    0.464
AHC (Ward)   0.740          0.891              0.421    0.474

↑ higher is better, ↓ lower is better. Ward’s method gave the clearest cluster separation.

Music Genre Classification (Kaggle)

End-to-end pipeline for Kaggle competition: feature engineering, model building, and leaderboard submission for music genre prediction.

Music Genre Classification
Python, Pandas, Scikit-learn, CatBoost, Keras
Audio ML (Tabular)
2024

Music Genre Classification - Kaggle

Competition project to classify songs into genres using Kaggle-provided CSV datasets (audio metadata + features). Pipeline included EDA, preprocessing, feature engineering, and multiple modeling strategies culminating in a top CatBoost submission.

We worked with structured .csv data provided by Kaggle (no raw audio). The dataset contained ~40 features such as tempo, duration_ms, loudness, acousticness, speechiness, instrumentalness, etc. The workflow included missing value handling, distribution corrections, feature engineering (e.g., binning acousticness/instrumentalness with Gaussian Mixture Models), and dimensionality reduction before training models.

Real-world application: Automated tagging of music libraries, powering recommendation systems, playlist curation, and catalog management at scale.

Challenges

  • Missingness: Features like tempo, duration_ms, and artist_name had thousands of invalid or missing entries.
  • Skewed distributions: Variables like speechiness and liveness had heavy tails requiring transformation.
  • Bimodal features: instrumentalness showed clear two-mode behavior, challenging traditional scaling methods.
  • Feature redundancy: Strong collinearity (e.g., loudness vs. energy, correlation ~0.81) risked redundant inputs and unstable model fits.
  • Multi-class complexity: Several genres had overlapping distributions of acousticness/speechiness, complicating separation.

Solutions

  • Preprocessing: Normalized invalid values (“?”, -1) → NaN; imputed with Iterative Imputer to capture feature relationships.
  • Feature engineering: Dropped collinear features (energy); applied Box-Cox/power transforms to skewed columns; binned instrumentalness and acousticness with GMMs; selected features with Mutual Information + RFE (~15 features kept).
  • Modeling: Benchmarked Logistic Regression, Random Forest, Gradient Boosting, SVM, CatBoost, and Neural Networks (Keras dense model with BN + dropout).
  • Final model: CatBoostClassifier (500 iterations, no artist_name encoding) → 81% validation accuracy, 81.6% on the Kaggle leaderboard, ROC-AUC 98.4%.
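
The sentinel clean-up step can be sketched without any libraries. The function below is a hypothetical illustration: which columns use "?" and which use -1 as a missing marker is an assumption, so the -1 rule is opt-in per column (loudness, for example, is legitimately negative).

```python
import math

INVALID_TOKENS = {"?", ""}


def clean_numeric(value, negative_one_is_missing=False):
    """Convert a raw cell to float, mapping known sentinels to NaN."""
    if value is None:
        return math.nan
    text = str(value).strip()
    if text in INVALID_TOKENS:
        return math.nan
    try:
        number = float(text)
    except ValueError:
        return math.nan
    if negative_one_is_missing and number == -1:
        return math.nan
    return number


# Assumed column rules: -1 is a missing marker only where it cannot be a real value.
MINUS_ONE_INVALID = {"duration_ms"}

raw_row = {"tempo": "?", "duration_ms": "-1", "loudness": "-1", "speechiness": "0.05"}
clean_row = {
    name: clean_numeric(cell, negative_one_is_missing=name in MINUS_ONE_INVALID)
    for name, cell in raw_row.items()
}
```

Mapping every sentinel to NaN first lets a downstream imputer treat all missing cells uniformly instead of learning from placeholder values.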

Key Features

  • Advanced preprocessing: imputation, transformation, GMM binning
  • Ensemble and boosting models benchmarked
  • CatBoost with native categorical handling chosen
  • Achieved top 20% leaderboard rank (81.6% acc)

Results Summary

Model                 Validation Acc   Kaggle   Notes
Logistic Regression   61.8%            -        Simple baseline
Random Forest         69%              65%      Tree ensemble
Gradient Boosting     76%              65.3%    Strong baseline
Voting Ensemble       76.9%            69%      CatBoost+GBC+HistGB
Neural Net (Keras)    79.4%            79.4%    Dense NN + BN/Dropout
CatBoost (final)      81%              81.6%    ROC-AUC 98.4%

Fruit Image Classification

End-to-end deep learning pipeline for fruit image classification using CNNs and transfer learning with ResNet-50.

Fruit Classification
PyTorch, ResNet-50, Python, NumPy, Matplotlib
Computer Vision
2024

Fruit Image Classification with CNNs (ResNet-50)

Developed an image classifier to distinguish between cherries, tomatoes, and strawberries, comparing MLP, custom CNN, and ResNet-50 architectures. Final model achieved 97.6% test accuracy.

This project demonstrates the full workflow of modern computer vision: exploratory data analysis, preprocessing & augmentation, baseline models (MLP, CNN), and transfer learning with ResNet-50.

Dataset: 6,000 images (~300×300, RGB) from Flickr, evenly split across 3 fruit classes. Train/test = 4,500/1,500.

Final results: ResNet-50 fine-tuned → 97.65% accuracy, with per-class metrics: Cherry 95.1%, Tomato 99.6%, Strawberry 98.2%.

Challenges

  • Dataset variability: Different lighting, angles, and backgrounds across Flickr images.
  • Model overfitting: Custom CNN struggled without augmentation and regularization.
  • Compute cost: Training ResNet-50 on 6,000 images required GPU acceleration.
  • Generalization: Ensuring robustness beyond the training dataset.

Solutions

  • Preprocessing: Resized to 300×300, normalized with dataset mean/std.
  • Augmentation: Random flips, rotations, resized crops, and color jitter to increase robustness.
  • Baselines: Implemented MLP and custom CNN for comparison.
  • Transfer Learning: Fine-tuned ResNet-50 (ImageNet pretrain) with new 3-class output head.
  • Training strategy: Adam optimizer, StepLR scheduler, weight decay regularization, 5-fold CV checks.
  • Deployment: Saved as model.pth, test script provided for reproducible evaluation.
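
A step learning-rate schedule of the kind named above multiplies the learning rate by a factor `gamma` every `step_size` epochs. The pure-Python sketch below reproduces that rule for illustration. The base rate, step size, and gamma are assumed values, not the ones used in training.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate at a given epoch under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Assumed hyperparameters for illustration only.
BASE_LR, STEP_SIZE, GAMMA = 1e-3, 10, 0.1
schedule = [step_lr(BASE_LR, epoch, STEP_SIZE, GAMMA) for epoch in range(30)]
```

With these values the rate stays at its base value for the first ten epochs, then drops tenfold at epoch 10 and again at epoch 20, a common pattern when fine-tuning a pretrained backbone.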

Key Features

  • Progression: MLP → custom CNN → ResNet-50
  • Augmentation pipeline for generalization
  • Fine-tuned ResNet-50: 97.6% test accuracy
  • Reproducible results with .pth model weights