Data Science Workflow
- Project Scoping and Financial Viability
- Data Collection - See notebook
- Data Cleaning/Wrangling - See notebook
- Exploratory Data Analysis - See notebook
- Data Processing/Feature Engineering
- Model Building, Tuning, and Experimentation
- Model Evaluation
- Model Deployment
- Model Monitoring and Observability
A data science project must begin with business objectives in mind and an understanding of how the project will deliver value to customers. We must also have an approximate understanding of the expense the project requires and the value we expect to gain [1]. In this project, we will use machine learning to predict the sale price of a used car.
We will use Beautiful Soup and Selenium to scrape www.cars.com for used vehicles on sale within the NYC area. Next, we will clean this data and store it in a format suitable for visualizing and understanding the data.
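The parsing step can be sketched in miniature. To keep the example self-contained, it uses Python's standard-library `html.parser` on an inline HTML snippet; the class names (`listing`, `title`, `price`, `mileage`) are hypothetical stand-ins, and the actual project uses Beautiful Soup with Selenium to handle the rendered cars.com pages.

```python
from html.parser import HTMLParser

# Hypothetical listing markup; real cars.com pages use different class names.
HTML = """
<div class="listing"><span class="title">2018 Honda Civic</span>
<span class="price">$15,500</span><span class="mileage">42,000 mi.</span></div>
<div class="listing"><span class="title">2016 Toyota Camry</span>
<span class="price">$13,900</span><span class="mileage">58,300 mi.</span></div>
"""

class ListingParser(HTMLParser):
    """Collects one dict of fields per listing <div>."""
    def __init__(self):
        super().__init__()
        self.rows, self.field = [], None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == "listing":
            self.rows.append({})          # start a new listing record
        elif cls in ("title", "price", "mileage"):
            self.field = cls              # next text node belongs to this field

    def handle_data(self, data):
        if self.field and self.rows:
            self.rows[-1][self.field] = data.strip()
            self.field = None

parser = ListingParser()
parser.feed(HTML)
print(parser.rows)
```

Each row then becomes one record in the cleaned dataset, ready for wrangling in the notebook.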
We will continue with exploratory data analysis and gain a deeper understanding of the data, outliers, correlated features, and nuances of the dataset, which will guide our feature engineering. For example, if we take the cube root of the mileage column, we get a distribution very close to the normal distribution.
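The effect of the cube-root transform can be checked with a quick skewness comparison. This sketch uses synthetic right-skewed mileage values (exponentially distributed, a rough stand-in for the real scraped data), not the actual dataset:

```python
import math
import random

random.seed(0)
# Synthetic mileage values: right-skewed, like real used-car listings.
mileage = [random.expovariate(1 / 40000) for _ in range(5000)]

def skewness(xs):
    """Sample skewness: third standardized moment."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

transformed = [x ** (1 / 3) for x in mileage]
print(round(skewness(mileage), 2), round(skewness(transformed), 2))
```

The transformed column has skewness much closer to zero, which is what the near-normal histogram in the notebook reflects.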
We continue with the model experimentation phase of the ML lifecycle and attempt to build the best-performing model we can. For this, we can try multiple models and perform hyperparameter tuning. Afterwards, the model needs to be rigorously tested for performance.
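Hyperparameter tuning boils down to scoring candidate settings on held-out data and keeping the best. As a minimal self-contained sketch, here is a grid search over `k` for a toy one-feature k-nearest-neighbors regressor on synthetic data; the actual project would tune real models (e.g. from scikit-learn) on the scraped features.

```python
import random

random.seed(1)
# Synthetic stand-in: price falls roughly linearly with mileage, plus noise.
data = []
for _ in range(200):
    m = random.uniform(0, 15)                     # "mileage" in 10k-mile units
    data.append((m, 30 - 1.5 * m + random.gauss(0, 1)))
train, valid = data[:150], data[150:]

def knn_predict(x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def valid_mse(k):
    """Mean squared error of k-NN predictions on the validation split."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in valid) / len(valid)

best_k = min(range(1, 21), key=valid_mse)
print(best_k, round(valid_mse(best_k), 3))
```

The same loop generalizes to any hyperparameter grid; the essential ingredients are a held-out split and a single scalar metric to rank candidates.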
If the model passes its evaluations, then we can move on to deploying the model. Even after deployment, we must continually monitor the model's performance for drift, accuracy, and business value, and retrain with newer data [2].
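One common drift check is the Population Stability Index (PSI), which compares the binned distribution of a feature (or of predictions) at training time against live traffic. The conventional rule-of-thumb thresholds are roughly: below 0.1 stable, 0.1 to 0.25 moderate drift, above 0.25 major drift. A minimal sketch on synthetic price data:

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-4):
    """Population Stability Index between two samples of one variable."""
    lo, hi = min(expected), max(expected)
    def fractions(xs):
        counts = [0] * bins
        for x in xs:
            # Clamp out-of-range live values into the edge bins.
            i = min(max(int((x - lo) / (hi - lo) * bins), 0), bins - 1)
            counts[i] += 1
        return [c / len(xs) + eps for c in counts]  # eps avoids log(0)
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(2)
# Synthetic stand-ins: training-time prices vs. two batches of live prices.
train_prices = [random.gauss(20000, 5000) for _ in range(2000)]
live_same = [random.gauss(20000, 5000) for _ in range(2000)]
live_shifted = [random.gauss(26000, 5000) for _ in range(2000)]  # market moved up

print(round(psi(train_prices, live_same), 3),
      round(psi(train_prices, live_shifted), 3))
```

In production this check would run on a schedule over recent predictions and inputs, and a PSI above the alert threshold would trigger investigation and retraining.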
References
[1] Understanding the Financial Value and Return on Investment (ROI) of Machine Learning