Guide to Intelligent Data Science: How to Intelligently Make Use of Real Data

Making use of data is no longer a niche project but central to almost every project. With access to massive compute resources and vast amounts of data, it seems, at least in principle, possible to solve any problem. However, successful data science projects result from the intelligent application of...

Bibliographic Details
Main Authors	Berthold, Michael R.; Borgelt, Christian; Höppner, Frank; Klawonn, Frank; Silipo, Rosaria
Format	eBook, Book
Language	English
Published	Cham: Springer Nature / Springer International Publishing AG, 2020
Edition	2
Series	Texts in Computer Science
Subjects	Artificial intelligence; Big Data/Analytics; Computer Science; Data mining; Data Mining and Knowledge Discovery; Data processing; Machine Learning; Mathematical statistics; Mathematical statistics > Data processing
ISBN	9783030455743, 3030455742, 9783030455736, 3030455734
ISSN	1868-0941, 1868-095X
DOI	10.1007/978-3-030-45574-3

Table of Contents:

Intro -- Guide to Intelligent Data Science -- Preface -- Contents -- Symbols -- 1 Introduction -- 1.1 Motivation -- 1.1.1 Data and Knowledge -- 1.1.2 Tycho Brahe and Johannes Kepler -- 1.1.3 Intelligent Data Science -- 1.2 The Data Science Process -- 1.3 Methods, Tasks, and Tools -- 1.4 How to Read This Book -- References -- 2 Practical Data Science: An Example -- 2.1 The Setup -- 2.2 Data Understanding and Pattern Finding -- 2.3 Explanation Finding -- 2.4 Predicting the Future -- 2.5 Concluding Remarks -- 3 Project Understanding -- 3.1 Determine the Project Objective -- 3.2 Assess the Situation -- 3.3 Determine Analysis Goals -- 3.4 Further Reading -- References -- 4 Data Understanding -- 4.1 Attribute Understanding -- 4.2 Data Quality -- 4.3 Data Visualization -- 4.4 Correlation Analysis -- 4.5 Outlier Detection -- 4.5.1 Outlier Detection for Single Attributes -- 4.5.2 Outlier Detection for Multidimensional Data -- 4.6 Missing Values -- 4.7 A Checklist for Data Understanding -- 4.8 Data Understanding in Practice -- 4.8.1 Visualizing the Iris Data -- References -- 5 Principles of Modeling -- 5.1 Model Classes -- 5.2 Fitting Criteria and Score Functions -- 5.3 Algorithms for Model Fitting -- 5.3.1 Closed-Form Solutions -- 5.3.2 Gradient Method -- 5.4 Types of Errors -- 5.5 Model Validation -- 5.5.1 Training and Test Data -- 5.5.2 Cross-Validation -- 5.5.3 Bootstrapping -- 5.6 Model Errors and Validation in Practice -- 5.6.1 Scoring Models for Classification -- 5.7 Further Reading -- References -- 6 Data Preparation -- 6.1 Select Data -- 6.1.1 Feature Selection -- 6.2 Clean Data -- 6.2.1 Improve Data Quality -- 6.2.2 Missing Values -- 6.3 Construct Data -- 6.3.1 Provide Operability -- 6.4 Complex Data Types -- 6.5 Data Integration -- 6.5.1 Vertical Data Integration -- 6.5.2 Horizontal Data Integration -- 6.6 Data Preparation in Practice
6.6.1 Removing Empty or Almost Empty Attributes and Records in a Data Set -- 6.7 Further Reading -- References -- 7 Finding Patterns -- 7.1 Hierarchical Clustering -- 7.2 Notion of (Dis-)Similarity -- 7.3 Prototype- and Model-Based Clustering -- 7.3.1 Overview -- 7.4 Density-Based Clustering -- 7.4.1 Overview -- 7.5 Self-organizing Maps -- 7.5.1 Overview -- 7.6 Frequent Pattern Mining and Association Rules -- 7.6.1 Overview -- 7.6.2 Construction -- 7.7 Deviation Analysis -- 7.7.1 Overview -- 7.7.2 Construction -- 7.8 Finding Patterns in Practice -- 7.8.1 Hierarchical Clustering -- 7.9 Further Reading -- References -- 8 Finding Explanations -- 8.1 Decision Trees -- 8.1.1 Overview -- 8.2 Bayes Classifiers -- 8.2.1 Overview -- 8.2.2 Construction -- 8.3 Regression -- 8.3.1 Overview -- 8.4 Rule learning -- 8.4.1 Propositional Rules -- 8.4.1.1 Extracting Rules from Decision Trees -- 8.4.1.2 Extracting Propositional Rules -- 8.5 Finding Explanations in Practice -- 8.5.1 Decision Trees -- 8.6 Further Reading -- References -- 9 Finding Predictors -- 9.1 Nearest-Neighbor Predictors -- 9.1.1 Overview -- 9.2 Artificial Neural Networks -- 9.2.1 Overview -- 9.3 Deep Learning -- 9.3.1 Recurrent Neural Networks and Long-Short Term Memory Units -- 9.4 Support Vector Machines -- 9.5 Ensemble Methods -- 9.5.1 Overview -- 9.5.2 Construction -- 9.5.3 Variations and Issues -- 9.5.3.1 Tree Ensembles and Random Forests (Bagging) -- 9.6 Finding Predictors in Practice -- 9.6.1 k Nearest Neighbor (kNN) -- 9.7 Further Reading -- References -- 10 Deployment and Model Management -- 10.1 Model Deployment -- 10.1.1 Interactive Applications -- 10.1.2 Model Scoring as a Service -- 10.1.3 Model Representation Standards -- 10.1.4 Frequent Causes for Deployment Failures -- 10.2 Model Management -- 10.2.1 Model Updating and Retraining -- 10.3 Model Deployment and Management in Practice
10.3.1 Deployment to a Dashboard -- References -- A Statistics -- A.1 Terms and Notation -- A.2 Descriptive Statistics -- A.2.1 Tabular Representations -- A.3 Probability Theory -- A.3.1 Probability -- A.3.1.1 Intuitive Notions of Probability -- A.3.1.2 The Formal Definition of Probability -- A.3.2 Basic Methods and Theorems -- A.3.2.1 Combinatorial Methods -- A.3.2.2 Geometric Probabilities -- A.3.2.3 Conditional Probability and Independent Events -- A.3.2.4 Total Probability and Bayes' Rule -- A.3.2.5 Bernoulli's Law of Large Numbers -- A.3.3 Random Variables -- A.3.3.1 Real-Valued Random Variables -- A.3.3.2 Discrete Random Variables -- A.3.3.3 Continuous Random Variables -- A.3.3.4 Random Vectors -- A.4 Inferential Statistics -- A.4.1 Random Samples -- A.4.2 Parameter Estimation -- A.4.2.1 Point Estimation -- A.4.2.2 Point Estimation Examples -- A.4.2.3 Maximum Likelihood Estimation -- A.4.2.4 Maximum Likelihood Estimation Example -- A.4.2.5 Maximum A Posteriori Estimation -- A.4.2.6 Maximum A Posteriori Estimation Example -- A.4.2.7 Interval Estimation -- A.4.2.8 Interval Estimation Examples -- A.4.3 Hypothesis Testing -- A.4.3.1 Error Types and Significance Level -- A.4.3.2 Parameter Test -- A.4.3.3 Parameter Test Example -- A.4.3.4 Power of a Hypothesis Test -- A.4.3.5 Goodness-of-Fit Test -- A.4.3.6 Goodness-of-Fit Test Example -- A.4.3.7 (In)Dependence Test -- B KNIME -- B.1 Installation and Overview -- B.2 Building Workflows -- B.3 Example Workflow -- References -- Index