Introduction

For this project, my team and I compared several prediction models to identify key risk factors for heart disease across age groups. We used a dataset from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) and applied three models (Random Forest, Logistic Regression, and a Neural Network) to predict heart disease outcomes.

Initially, we experimented with a different dataset that included more intuitive, lifestyle-based features such as family history of heart disease, smoking status, and hours of sleep. Despite extensive tuning, our models performed poorly on it. When we switched to the second dataset (the Kaggle dataset linked above), which focuses on clinical features like cholesterol levels, resting heart rate, and exercise-induced angina, we achieved significantly better results.

This experience showed how strongly dataset quality and relevance determine model performance: models trained on well-structured clinical data can outperform models trained on larger, less specific datasets.
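
As a rough illustration of the three-model comparison described above, the sketch below fits a Random Forest, a Logistic Regression, and a small Neural Network on the same train/test split and reports test accuracy for each. It is not the project's actual code: it assumes scikit-learn, uses synthetic data from make_classification as a stand-in for the Kaggle dataset, and the helper name compare_models, the feature count, and all hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, seed=0):
    """Fit three classifiers on one stratified split; return test accuracy per model.

    Hypothetical helper for illustration only, not taken from the project repository.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        # Scaling helps the two gradient-based models converge.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "neural_network": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=seed),
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows
    return scores


# Synthetic stand-in data (NOT the Kaggle dataset); sizes are arbitrary assumptions.
X, y = make_classification(
    n_samples=500, n_features=13, n_informative=8, random_state=0
)
scores = compare_models(X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

To try this on real data, X and y would be replaced with the feature columns and the outcome label from the downloaded CSV. Accuracy on synthetic data says nothing about the results we obtained on either dataset.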


Resources

Presentation Slides:

https://docs.google.com/presentation/d/1L90xoD7ywazlPYA_4-t4Kv6jY2hGpE2E/edit?usp=sharing&ouid=116118024865008435192&rtpof=true&sd=true

GitHub Repository: https://github.com/SeungwonMJ/CS-470-Data-Mining

Final Paper:

Final_Report.pdf