Guidelines for Identifying and Selecting Your Data Analytics Project
Applied Data-Driven Solutions Course
1. Purpose of the Project
The goal of your project is to solve a real-world problem using data analytics, from defining the problem through to delivering insights or predictive models that help stakeholders make better decisions.
Your project should demonstrate:
- Clear problem understanding
- Appropriate data selection and preparation
- Correct application of analytical or machine learning methods
- Meaningful insights and interpretation
2. Steps to Identify and Select a Project
Step 1: Identify a Problem of Interest
Choose a topic that:
- Interests you personally (you’ll work on it for weeks)
- Has available, relevant data
- Can be completed within the course timeline
Ask yourself:
- Who are the stakeholders?
- What decision or problem will this analysis help address?
- Is the problem descriptive, diagnostic, predictive, or prescriptive?
Step 2: Search for Potential Datasets
You can find project datasets from:
- Kaggle Competitions & Datasets (https://www.kaggle.com/competitions, https://www.kaggle.com/datasets)
- UCI Machine Learning Repository (https://archive.ics.uci.edu)
- OpenML (https://www.openml.org)
- Hugging Face Datasets (https://huggingface.co/datasets)
- Government open data portals (https://data.gov, https://data.europa.eu)
- Domain-specific repositories (see examples in Section 3)
Step 3: Check Feasibility
Before committing, ensure:
- Data is accessible and well-structured (or can be cleaned within time limits)
- Labels exist for supervised learning problems
- There is enough variety for feature engineering
- The scope fits your skills and the course timeframe
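The first two checks above can be partly automated before you commit to a dataset. The following is a minimal sketch using only the Python standard library. The function name `feasibility_report`, the 30% missing-value threshold, and the inline sample rows are illustrative assumptions, not course requirements or real data.

```python
import csv
import io


def feasibility_report(csv_text, label_column, max_missing_ratio=0.3):
    """Summarise a CSV sample: size, missing values per column, label presence.

    max_missing_ratio is an assumed cut-off; choose your own for your project.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []

    # Share of empty cells in each column.
    missing = {}
    for col in columns:
        empty = sum(1 for r in rows if (r[col] or "").strip() == "")
        missing[col] = empty / len(rows)

    # Distinct non-empty label values (needed for supervised learning).
    labels = {
        r[label_column]
        for r in rows
        if label_column in r and (r[label_column] or "").strip()
    }

    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "missing_ratio": missing,
        "has_label": label_column in columns,
        "label_classes": sorted(labels),
        "columns_too_sparse": [c for c, m in missing.items() if m > max_missing_ratio],
    }


# Made-up rows for illustration only; replace with a sample of your own file.
SAMPLE = """customer_id,tenure_months,monthly_charge,churned
1,12,29.5,no
2,,55.0,yes
3,40,,no
4,5,70.2,yes
"""

report = feasibility_report(SAMPLE, label_column="churned")
for key, value in report.items():
    print(f"{key}: {value}")
```

For a real file, read its text with `open(path, newline="").read()` and pass it in. A report like this does not replace judgment about scope and skills, but it flags missing labels and sparse columns early.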
Step 4: Define Clear Goals and Success Metrics
Examples:
- Predict whether a customer will churn → Metric: F1-score, Precision, Recall
- Forecast daily energy usage → Metric: Mean Absolute Error (MAE)
- Classify product reviews → Metric: Accuracy, Macro-F1
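To make the metrics above concrete, here is a small sketch that computes them from scratch in plain Python. The labels and predictions are invented for illustration. In a real project you would more likely use a library implementation (for example the metric functions in scikit-learn), and the handling of zero denominators below (returning 0.0) is an assumption of this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and forecast values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Churn (binary): 1 = churned, 0 = stayed. Invented values.
churn_true = [1, 0, 1, 1, 0, 0]
churn_pred = [1, 0, 0, 1, 1, 0]
churn_precision, churn_recall, churn_f1 = precision_recall_f1(churn_true, churn_pred, positive=1)

# Daily energy usage forecast. Invented values.
energy_true = [100.0, 120.0, 90.0]
energy_pred = [110.0, 115.0, 95.0]
energy_mae = mean_absolute_error(energy_true, energy_pred)

# Review classification (three classes). Invented values.
review_true = ["pos", "neg", "neu", "pos"]
review_pred = ["pos", "neg", "pos", "pos"]
review_accuracy = accuracy(review_true, review_pred)
review_macro_f1 = macro_f1(review_true, review_pred)

print(f"Churn: precision={churn_precision:.3f} recall={churn_recall:.3f} F1={churn_f1:.3f}")
print(f"Energy: MAE={energy_mae:.3f}")
print(f"Reviews: accuracy={review_accuracy:.3f} macro-F1={review_macro_f1:.3f}")
```

In the review example, accuracy is 0.75 while Macro-F1 is only 0.6, because the "neu" class is never predicted. This is why Macro-F1 is worth reporting alongside accuracy when classes are imbalanced.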
Step 5: Get Inspiration from Kaggle Competitions
Joining a Kaggle competition is a great way to:
- Work with a real dataset
- Benchmark your solution against others
- Learn from public notebooks
Examples:
- House Prices: Advanced Regression Techniques – Predict real estate prices
- Titanic: Machine Learning from Disaster – Binary classification (survival prediction)
- Retail Product Demand Forecasting – Time series forecasting
- RSNA Pneumonia Detection Challenge – Medical image classification
- Predict Future Sales – Sales forecasting from time series data
Step 6: Narrow Down to One Project
Select a project based on:
- Interest – Will keep you motivated
- Impact – Useful to a stakeholder or community
- Feasibility – Can be done with your current skills in the available time
- Data Quality – Enough records, few missing values, relevant variables
- Learning Value – Gives you the chance to apply course techniques
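One possible way to apply these five criteria is a simple weighted scoring matrix. The sketch below is an assumption about how you might do this, not a prescribed course method: the weights, the 1-to-5 rating scale, and the ratings themselves are hypothetical and should be replaced with your own.

```python
# Hypothetical weights for the five selection criteria (they sum to 1.0).
CRITERIA_WEIGHTS = {
    "interest": 0.25,
    "impact": 0.20,
    "feasibility": 0.25,
    "data_quality": 0.20,
    "learning_value": 0.10,
}


def rank_projects(candidates, weights=CRITERIA_WEIGHTS):
    """Return (name, weighted score) pairs sorted from best to worst."""
    scored = []
    for name, ratings in candidates.items():
        total = sum(weights[criterion] * ratings[criterion] for criterion in weights)
        scored.append((name, round(total, 2)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative self-ratings on a 1 (poor) to 5 (excellent) scale.
candidates = {
    "Customer churn prediction": {
        "interest": 4, "impact": 4, "feasibility": 5, "data_quality": 4, "learning_value": 3,
    },
    "Hospital readmission": {
        "interest": 5, "impact": 5, "feasibility": 2, "data_quality": 3, "learning_value": 5,
    },
    "Air quality forecasting": {
        "interest": 3, "impact": 4, "feasibility": 4, "data_quality": 3, "learning_value": 4,
    },
}

ranking = rank_projects(candidates)
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```

Treat the score as a prompt for discussion rather than a verdict: a project with a very low feasibility rating may be worth dropping even if its weighted total is high.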
3. Example Project Ideas by Domain
| Domain | Example Project | Possible Dataset Source |
|---|---|---|
| Healthcare | Predict hospital readmission within 30 days | MIMIC-IV (PhysioNet) |
| Finance | Credit card fraud detection | Kaggle: Credit Card Fraud Dataset |
| Retail | Sales forecasting for products | Kaggle: Predict Future Sales |
| Marketing | Customer churn prediction | Kaggle: Telco Customer Churn |
| IT Operations | Predict long-resolution IT tickets | ServiceNow (synthetic), Kaggle IT ticket data |
| Education | Predict student dropout risk | Open University Learning Analytics Dataset |
| Transportation | Predict taxi demand in NYC | NYC Open Data, Kaggle Taxi Data |
| Sports Analytics | Predict NBA game outcomes | Kaggle NBA Stats datasets |
| Environmental | Air quality forecasting | UCI Air Quality Dataset, OpenAQ |
4. Dataset Repository References
- Kaggle Datasets – https://www.kaggle.com/datasets
- UCI Machine Learning Repository – https://archive.ics.uci.edu
- OpenML – https://www.openml.org
- Hugging Face Datasets – https://huggingface.co/datasets
- Data.gov – https://www.data.gov
- NYC Open Data – https://opendata.cityofnewyork.us
- Our World in Data – https://ourworldindata.org
- Google Dataset Search – https://datasetsearch.research.google.com
5. Deliverables for the Selection Phase
By the end of Week 2 of the course, you should submit:
- Project Title
- Problem Statement (max 150 words)
- Dataset Description (source, number of records, variables)
- Planned Analytical Approach (descriptive, predictive, etc.)
- Success Metric
- Why You Chose This Project (interest, feasibility, impact)
Project identification and selection template: Project_selection_template.docx
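Before submitting, you can sanity-check a draft against the deliverables listed above. The sketch below is a hypothetical helper, not part of the official template: the class name `ProjectProposal`, its field names, and the example values (including the record count) are placeholders. The 150-word limit comes from the deliverables list, and the four approach types come from Step 1.

```python
from dataclasses import dataclass

MAX_PROBLEM_WORDS = 150  # limit stated in the deliverables list
APPROACH_TYPES = {"descriptive", "diagnostic", "predictive", "prescriptive"}


@dataclass
class ProjectProposal:
    title: str
    problem_statement: str
    dataset_source: str
    n_records: int
    variables: list
    approach: str
    success_metric: str
    motivation: str

    def problems(self):
        """Return a list of human-readable issues; an empty list means the draft passes."""
        issues = []
        text_fields = ("title", "problem_statement", "dataset_source",
                       "approach", "success_metric", "motivation")
        for field_name in text_fields:
            if not getattr(self, field_name).strip():
                issues.append(f"{field_name} is empty")
        n_words = len(self.problem_statement.split())
        if n_words > MAX_PROBLEM_WORDS:
            issues.append(f"problem_statement has {n_words} words (max {MAX_PROBLEM_WORDS})")
        if self.n_records <= 0:
            issues.append("n_records must be positive")
        if not self.variables:
            issues.append("variables list is empty")
        if self.approach.strip() and self.approach.lower() not in APPROACH_TYPES:
            issues.append(f"approach should be one of {sorted(APPROACH_TYPES)}")
        return issues


# Placeholder values for illustration; fill in the facts of your own dataset.
draft = ProjectProposal(
    title="Customer churn prediction",
    problem_statement="Identify which customers are likely to cancel so retention offers can be targeted.",
    dataset_source="Kaggle: Telco Customer Churn",
    n_records=1000,  # placeholder, not the real record count
    variables=["tenure_months", "monthly_charge", "churned"],
    approach="predictive",
    success_metric="F1-score",
    motivation="Interesting, feasible within the course, and useful to a marketing team.",
)

draft_issues = draft.problems()
print(draft_issues if draft_issues else "Draft covers all required items.")
```

The check only covers what can be verified mechanically; whether the problem statement is clear and the metric is appropriate still needs your own review.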