Data Science Fundamentals with R, Python, and Open Data

Introduction to essential concepts and techniques of the fundamentals of R and Python needed to start data science projects. Organized with a strong focus on open data, this book discusses concepts, techniques, tools, and first steps to carry out data science projects, with a focus on Python and RStud...

Bibliographic Details
Main Author Cremonini, Marco
Format eBook Book
Language English
Published Hoboken, New Jersey John Wiley & Sons 2024
Wiley
John Wiley & Sons, Incorporated
John Wiley & Sons (US)
Edition 1
ISBN 9781394213245
1394213247
DOI 10.1002/9781394213276

Table of Contents:
  • Title Page -- Introduction -- Preface -- Table of Contents -- 1. Open-Source Tools for Data Science -- 2. Simple Exploratory Data Analysis -- 3. Data Organization and First Data Frame Operations -- 4. Subsetting with Logical Conditions -- 5. Operations on Dates, Strings, and Missing Values -- 6. Pivoting and Wide-Long Transformations -- 7. Groups and Operations on Groups -- 8. Conditions and Iterations -- 9. Functions and Multicolumn Operations -- 10. Join Data Frames -- 11. List/Dictionary Data Format -- Index
  • Cover -- Title Page -- Copyright -- Contents -- Preface -- About the Companion Website -- Introduction -- Chapter 1 Open‐Source Tools for Data Science -- 1.1 R Language and RStudio -- 1.1.1 R Language -- 1.1.2 RStudio Desktop -- 1.1.3 Package Manager -- 1.1.4 Package Tidyverse -- 1.2 Python Language and Tools -- 1.2.1 Option A: Anaconda Distribution -- 1.2.2 Option B: Manual Installation -- 1.2.3 Google Colab -- 1.2.4 Packages NumPy and Pandas -- 1.3 Advanced Plain Text Editor -- 1.4 CSV Format for Datasets -- Questions -- Chapter 2 Simple Exploratory Data Analysis -- 2.1 Missing Values Analysis -- 2.2 R: Descriptive Statistics and Utility Functions -- 2.3 Python: Descriptive Statistics and Utility Functions -- Questions -- Chapter 3 Data Organization and First Data Frame Operations -- 3.1 R: Read CSV Datasets and Column Selection -- 3.1.1 Reading a CSV Dataset -- 3.1.1.1 Reading Errors -- 3.1.2 Selection by Column Name -- 3.1.3 Selection by Column Index Position -- 3.1.4 Selection by Range -- 3.1.5 Selection by Exclusion -- 3.1.6 Selection with Selection Helper -- 3.2 R: Rename and Relocate Columns -- 3.3 R: Slicing, Column Creation, and Deletion -- 3.3.1 Subsetting and Slicing -- 3.3.2 Column Creation -- 3.3.3 Column Deletion -- 3.3.4 Calculated Columns -- 3.3.5 Function mutate() and Data Masking -- 3.4 R: Separate and Unite Columns -- 3.4.1 Separation -- 3.4.2 Union -- 3.5 R: Sorting Data Frames -- 3.5.1 Sorting by Multiple Columns -- 3.5.2 Sorting by an External List -- 3.6 R: Pipe -- 3.6.1 Forward Pipe -- 3.6.2 Pipe in Base R -- 3.6.2.1 Variant -- 3.6.3 Parameter Placeholder -- 3.7 Python: Column Selection -- 3.7.1 Selecting Columns from Dataset Read -- 3.7.2 Selecting Columns from a Data Frame -- 3.7.3 Selection by Positional Index, Range, or with Selection Helper -- 3.7.4 Selection by Exclusion -- 3.8 Python: Rename and Relocate Columns
  • 3.8.1 Standard Method -- 3.8.2 Functions rename() and reindex() -- 3.9 Python: NumPy Slicing, Selection with Index, Column Creation and Deletion -- 3.9.1 NumPy Array Slicing -- 3.9.2 Slicing of Pandas Data Frames -- 3.9.3 Methods .loc and .iloc -- 3.9.4 Selection with Selection Helper -- 3.9.5 Creating and Deleting Columns -- 3.9.6 Functions insert() and assign() -- 3.10 Python: Separate and Unite Columns -- 3.10.1 Separate -- 3.10.2 Unite -- 3.11 Python: Sorting Data Frame -- 3.11.1 Sorting Columns -- 3.11.2 Sorting Index Levels -- 3.11.3 From Indexed to Non‐indexed Data Frame -- 3.11.4 Sorting by an External List -- Questions -- Chapter 4 Subsetting with Logical Conditions -- 4.1 Logical Operators -- 4.2 R: Row Selection -- 4.2.1 Operator %in% -- 4.2.2 Boolean Mask -- 4.2.3 Examples -- 4.2.3.1 Wrong Disjoint Condition -- 4.2.4 Python: Row Selection -- 4.2.5 Boolean Mask, Base Selection Method -- 4.2.6 Row Selection with query() -- Questions -- Chapter 5 Operations on Dates, Strings, and Missing Values -- 5.1 R: Operations on Dates and Strings -- 5.1.1 Date and Time -- 5.1.1.1 Datetime Data Type -- 5.1.2 Parsing Dates -- 5.1.3 Using Dates -- 5.1.4 Selection with Logical Conditions on Dates -- 5.1.5 Strings -- 5.2 R: Handling Missing Values and Data Type Transformations -- 5.2.1 Missing Values as Replacement -- 5.2.1.1 Keywords for Missing Values -- 5.2.2 Introducing Missing Values in Dataset Reads -- 5.2.3 Verifying the Presence of Missing Values -- 5.2.3.1 Functions any(), all(), and colSums() -- 5.2.4 Replacing Missing Values -- 5.2.5 Omit Rows with Missing Values -- 5.2.6 Data Type Transformations -- 5.3 R: Example with Dates, Strings, and Missing Values -- 5.3.1 When an Invisible Hand Mess with Your Data -- 5.3.2 Base Method -- 5.3.3 A Better Heuristic -- 5.3.4 Specialized Functions -- 5.3.4.1 Function parse_date_time()
  • 5.3.5 Result Comparison -- 5.4 Python: Operations on Dates and Strings -- 5.4.1 Date and Time -- 5.4.1.1 Function pd.to_datetime() -- 5.4.1.2 Function datetime.datetime.strptime() -- 5.4.1.3 Locale Configuration -- 5.4.1.4 Function datetime.datetime.strftime() -- 5.4.1.5 Pandas Timestamp Functions -- 5.4.2 Selection with Logical Conditions on Dates -- 5.4.3 Strings -- 5.5 Python: Handling Missing Values and Data Type Transformations -- 5.5.1 Missing Values as Replacement -- 5.5.1.1 Function pd.replace() -- 5.5.2 Introducing Missing Values in Dataset Reads -- 5.5.3 Verifying the Presence of Missing Values -- 5.5.4 Selection with Missing Values -- 5.5.5 Replacing Missing Values with Actual Values -- 5.5.6 Modifying Values by View or by Copy -- 5.5.7 Data Type Transformations -- 5.6 Python: Examples with Dates, Strings, and Missing Values -- 5.6.1 Example 1: Eurostat -- 5.6.2 Example 2: Open Data Berlin -- Questions -- Chapter 6 Pivoting and Wide‐long Transformations -- 6.1 R: Pivoting -- 6.1.1 From Long to Wide -- 6.1.2 From Wide to Long -- 6.1.3 GOV.UK: Gender Pay Gap -- 6.2 Python: Pivoting -- 6.2.1 From Wide to Long with Columns -- 6.2.2 From Long to Wide with Columns -- 6.2.3 Wide‐long Transformation with Index Levels -- 6.2.4 Indexed Data Frame -- 6.2.4.1 Function unstack() -- 6.2.4.2 Function stack() -- 6.2.5 From Long to Wide with Elements of Numeric Type -- Questions -- Chapter 7 Groups and Operations on Groups -- 7.1 R: Groups -- 7.1.1 Groups and Group Indexes -- 7.1.1.1 Function group_by() -- 7.1.1.2 Index Details -- 7.1.2 Aggregation Operations -- 7.1.2.1 Functions group_by() and summarize() -- 7.1.2.2 Counting Rows: function n() -- 7.1.2.3 Arithmetic Mean: function mean() -- 7.1.2.4 Maximum and Minimum Values: Functions max() and min() -- 7.1.2.5 Summing Values: function sum() -- 7.1.2.6 List of Aggregation Functions
  • 7.1.3 Sorting Within Groups -- 7.1.4 Creation of Columns in Grouped Data Frames -- 7.1.5 Slicing Rows on Groups -- 7.1.5.1 Functions slice_*() -- 7.1.5.2 Combination of Functions filter() and rank() -- 7.1.6 Calculated Columns with Group Values -- 7.2 Python: Groups -- 7.2.1 Group Index and Aggregation Operations -- 7.2.1.1 Functions groupby() and aggregate() -- 7.2.1.2 Counting Rows, Computing Arithmetic Means, and Sum for Each Group -- 7.2.2 Names on Columns with Aggregated Values -- 7.2.3 Sorting Columns -- 7.2.4 Sorting on Index Levels -- 7.2.5 Slicing Rows on Groups -- 7.2.5.1 Functions nlargest() and nsmallest() -- 7.2.6 Calculated Columns with Group Values -- 7.2.7 Sorting Within Groups -- Questions -- Chapter 8 Conditions and Iterations -- 8.1 R: Conditions and Iterations -- 8.1.1 Conditions -- 8.1.1.1 Function if_else() -- 8.1.1.2 Function case_when() -- 8.1.1.3 Function if() and Constructs If‐else and If‐else If‐else -- 8.1.2 Iterations -- 8.1.2.1 Function for() -- 8.1.2.2 Function Foreach() -- 8.1.3 Nested Iterations -- 8.1.3.1 Replacing a Single‐Element Value -- 8.1.3.2 Iterate on the First Column -- 8.1.3.3 Iterate on all Columns -- 8.2 Python: Conditions and Iterations -- 8.2.1 Conditions -- 8.2.1.1 Function if() -- 8.2.1.2 Constructs If‐else and If‐elif‐else -- 8.2.1.3 Function np.where() -- 8.2.1.4 Function np.select() -- 8.2.1.5 Functions pd.where() and pd.mask() -- 8.2.2 Iterations -- 8.2.2.1 Functions for() and while() -- 8.2.3 Nested Iterations -- 8.2.3.1 Execution Time -- 8.2.4 Iterating on Multi‐index -- 8.2.4.1 Function join() -- 8.2.4.2 Function items() -- Questions -- Chapter 9 Functions and Multicolumn Operations -- 9.1 R: User‐defined Functions -- 9.1.1 Using Functions -- 9.1.2 Data Masking -- 9.1.3 Anonymous Functions -- 9.2 R: Multicolumn Operations -- 9.2.1 Base Method -- 9.2.1.1 Functions apply(), lapply(), and sapply()
  • 9.2.1.2 Mapping -- 9.2.2 Mapping and Anonymous Functions: purrr‐style Syntax -- 9.2.3 Conditional Mapping -- 9.2.4 Subsetting Rows with Multicolumn Logical Condition -- 9.2.4.1 Combination of Functions filter() and if_any() -- 9.2.5 Multicolumn Transformations -- 9.2.5.1 Combination of Functions mutate() and across() -- 9.2.6 Introducing Missing Values -- 9.2.7 Use Cases and Execution Time Measurement -- 9.2.7.1 Case 1 -- 9.2.7.2 Case 2 -- 9.3 Python: User‐defined and Lambda Functions -- 9.3.1 User‐defined Functions -- 9.3.1.1 Lambda Functions -- 9.3.2 Python: Multicolumn Operations -- 9.3.2.1 Execution Time -- 9.3.3 General Case -- 9.3.3.1 Function apply() -- Questions -- Chapter 10 Join Data Frames -- 10.1 Basic Concepts -- 10.1.1 Keys of a Join Operation -- 10.1.2 Types of Join -- 10.1.3 R: Join Operation -- 10.1.4 Join Functions -- 10.1.4.1 Function inner_join() -- 10.1.4.2 Function full_join() -- 10.1.4.3 Functions left_join() and right_join() -- 10.1.4.4 Function merge() -- 10.1.5 Duplicated Keys -- 10.1.6 Special Join Functions -- 10.1.6.1 Semi Join -- 10.1.6.2 Anti Join -- 10.2 Python: Join Operations -- 10.2.1.1 Function merge() -- 10.2.1.2 Inner Join -- 10.2.1.3 Outer/Full Join -- 10.2.2 Join Operations with Indexed Data Frames -- 10.2.3 Duplicated Keys -- 10.2.4 Special Join Types -- 10.2.4.1 Semi Join: Function isin() -- 10.2.4.2 Anti Join: Variants -- Questions -- Chapter 11 List/Dictionary Data Format -- 11.1 R: List Data Format -- 11.1.1 Transformation of List Columns to Ordinary Rows and Columns -- 11.1.1.1 Other Options -- 11.1.2 Function map in List Column Transformations -- 11.2 R: JSON Data Format and Use Cases -- 11.2.1 Memory Problem when Reading Very Large Datasets -- 11.3 Python: Dictionary Data Format -- 11.3.1 Methods -- 11.3.2 From Dictionary to Data Frame With a Single Level of Nesting
  • 11.3.2.1 Functions pd.DataFrame() and pd.DataFrame.from_dict()