Data Science Projects with Python A case study approach to gaining valuable insights from real data with machine learning

Gain hands-on experience of Python programming with industry-standard machine learning techniques using pandas, scikit-learn, and XGBoost

Key Features:
  • Think critically about data and use it to form and test a hypothesis
  • Choose an appropriate machine learning model and train it on your data
  • Communicate...

Bibliographic Details
Main Author: Klosterman, Stephen
Format: eBook
Language: English
Published: Birmingham: Packt Publishing, 2021
Edition: 2
ISBN: 9781800564480; 1800564481
DOI: 10.0000/9781800569447


Table of Contents:
  • Table of Contents -- Data Exploration and Cleaning -- Introduction to Scikit-Learn and Model Evaluation -- Details of Logistic Regression and Feature Exploration -- The Bias-Variance Trade-off -- Decision Trees and Random Forests -- Gradient Boosting, XGBoost, and SHAP (SHapley Additive exPlanations) Values -- Test Set Analysis, Financial Insights, and Delivery to the Client
  • Title Page -- Preface -- Table of Contents -- 1. Data Exploration and Cleaning -- 2. Introduction to Scikit-Learn and Model Evaluation -- 3. Details of Logistic Regression and Feature Exploration -- 4. The Bias-Variance Trade-off -- 5. Decision Trees and Random Forests -- 6. Gradient Boosting, XGBoost, and SHAP Values -- 7. Test Set Analysis, Financial Insights, and Delivery to the Client -- Appendix -- Index
  • Cover -- FM -- Copyright -- Table of Contents -- Preface -- Chapter 1: Data Exploration and Cleaning -- Introduction -- Python and the Anaconda Package Management System -- Indexing and the Slice Operator -- Exercise 1.01: Examining Anaconda and Getting Familiar with Python -- Different Types of Data Science Problems -- Loading the Case Study Data with Jupyter and pandas -- Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook -- Getting Familiar with Data and Performing Data Cleaning -- The Business Problem -- Data Exploration Steps -- Exercise 1.03: Verifying Basic Data Integrity -- Boolean Masks -- Exercise 1.04: Continuing Verification of Data Integrity -- Exercise 1.05: Exploring and Cleaning the Data -- Data Quality Assurance and Exploration -- Exercise 1.06: Exploring the Credit Limit and Demographic Features -- Deep Dive: Categorical Features -- Exercise 1.07: Implementing OHE for a Categorical Feature -- Exploring the Financial History Features in the Dataset -- Activity 1.01: Exploring the Remaining Financial Features in the Dataset -- Summary -- Chapter 2: Introduction to Scikit-Learn and Model Evaluation -- Introduction -- Exploring the Response Variable and Concluding the Initial Exploration -- Introduction to Scikit-Learn -- Generating Synthetic Data -- Data for Linear Regression -- Exercise 2.01: Linear Regression in Scikit-Learn -- Model Performance Metrics for Binary Classification -- Splitting the Data: Training and Test Sets -- Classification Accuracy -- True Positive Rate, False Positive Rate, and Confusion Matrix -- Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python -- Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions? -- Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model
  • The Receiver Operating Characteristic (ROC) Curve -- Precision -- Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve -- Summary -- Chapter 3: Details of Logistic Regression and Feature Exploration -- Introduction -- Examining the Relationships Between Features and the Response Variable -- Pearson Correlation -- Mathematics of Linear Correlation -- F-test -- Exercise 3.01: F-test and Univariate Feature Selection -- Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions -- Hypotheses and Next Steps -- Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable -- Univariate Feature Selection: What it Does and Doesn't Do -- Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python -- Exercise 3.03: Plotting the Sigmoid Function -- Scope of Functions -- Why Is Logistic Regression Considered a Linear Model? -- Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression -- From Logistic Regression Coefficients to Predictions Using Sigmoid -- Exercise 3.05: Linear Decision Boundary of Logistic Regression -- Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients -- Summary -- Chapter 4: The Bias-Variance Trade-Off -- Introduction -- Estimating the Coefficients and Intercepts of Logistic Regression -- Gradient Descent to Find Optimal Parameter Values -- Exercise 4.01: Using Gradient Descent to Minimize a Cost Function -- Assumptions of Logistic Regression -- The Motivation for Regularization: The Bias-Variance Trade-Off -- Exercise 4.02: Generating and Modeling Synthetic Classification Data -- Lasso (L1) and Ridge (L2) Regularization -- Cross-Validation: Choosing the Regularization Parameter
  • Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem -- Options for Logistic Regression in Scikit-Learn -- Scaling Data, Pipelines, and Interaction Features in Scikit-Learn -- Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data -- Summary -- Chapter 5: Decision Trees and Random Forests -- Introduction -- Decision Trees -- The Terminology of Decision Trees and Connections to Machine Learning -- Exercise 5.01: A Decision Tree in Scikit-Learn -- Training Decision Trees: Node Impurity -- Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions -- Training Decision Trees: A Greedy Algorithm -- Training Decision Trees: Different Stopping Criteria and Other Options -- Using Decision Trees: Advantages and Predicted Probabilities -- A More Convenient Approach to Cross-Validation -- Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree -- Random Forests: Ensembles of Decision Trees -- Random Forest: Predictions and Interpretability -- Exercise 5.03: Fitting a Random Forest -- Checkerboard Graph -- Activity 5.01: Cross-Validation Grid Search with Random Forest -- Summary -- Chapter 6: Gradient Boosting, XGBoost, and SHAP Values -- Introduction -- Gradient Boosting and XGBoost -- What Is Boosting? -- Gradient Boosting and XGBoost -- XGBoost Hyperparameters -- Early Stopping -- Tuning the Learning Rate -- Other Important Hyperparameters in XGBoost -- Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters -- Another Way of Growing Trees: XGBoost's grow_policy -- Explaining Model Predictions with SHAP Values -- Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values -- Missing Data -- Saving Python Variables to a File
  • Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP -- Summary -- Chapter 7: Test Set Analysis, Financial Insights, and Delivery to the Client -- Introduction -- Review of Modeling Results -- Feature Engineering -- Ensembling Multiple Models -- Different Modeling Techniques -- Balancing Classes -- Model Performance on the Test Set -- Distribution of Predicted Probability and Decile Chart -- Exercise 7.01: Equal-Interval Chart -- Calibration of Predicted Probabilities -- Financial Analysis -- Financial Conversation with the Client -- Exercise 7.02: Characterizing Costs and Savings -- Activity 7.01: Deriving Financial Insights -- Final Thoughts on Delivering a Predictive Model to the Client -- Model Monitoring -- Ethics in Predictive Modeling -- Summary -- Appendix -- Index