Machine learning with Spark and Python: essential techniques for predictive analytics

By:

Bowles, Michael

Material type: Text

Publication details: Indianapolis: John Wiley & Sons Inc, 2020
Edition: 2nd ed
Description: xxvii, 340 p.; pbk
ISBN: 9781119561934

DDC classification:

006.31 BOW



List(s) this item appears in: New Arrivals


Holdings
Item type	Current library	Collection	Call number	Status	Date due	Barcode
Book	Plaksha University Library	Computer science	006.31 BOW	Available		005325


"Machine Learning with Spark and Python: Essential Techniques for Predictive Analytics, Second Edition simplifies ML for practical uses by focusing on two key algorithms. This new second edition improves with the addition of Spark, an ML framework from the Apache foundation. By implementing Spark, machine learning students can easily process much larger data sets and call the Spark algorithms using ordinary Python code.

Machine Learning with Spark and Python focuses on two algorithm families (linear methods and ensemble methods) that effectively predict outcomes. This type of problem covers many use cases, such as choosing what ad to place on a web page, predicting prices in securities markets, or detecting credit card fraud. The focus on two families gives enough room for full descriptions of the mechanisms at work in the algorithms. The code examples then illustrate the workings of the machinery with specific, hackable code.

Table of Contents

Introduction xxi

Chapter 1

The Two Essential Algorithms for Making Predictions 1
Why are These Two Algorithms So Useful? 2
What are Penalized Regression Methods? 7
What are Ensemble Methods? 9
How to Decide Which Algorithm to Use 11
The Process Steps for Building a Predictive Model 13
Framing a Machine Learning Problem 15
Feature Extraction and Feature Engineering 17
Determining Performance of a Trained Model 18
Chapter Contents and Dependencies 18
Summary 20

Chapter 2

Understand the Problem by Understanding the Data 23
The Anatomy of a New Problem 24
Different Types of Attributes and Labels Drive Modeling Choices 26
Things to Notice about Your New Data Set 27
Classification Problems: Detecting Unexploded Mines Using Sonar 28
Physical Characteristics of the Rocks Versus Mines Data Set 29
Statistical Summaries of the Rocks Versus Mines Data Set 32
Visualization of Outliers Using a Quantile-Quantile Plot 34
Statistical Characterization of Categorical Attributes 35
How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set 36
Visualizing Properties of the Rocks Versus Mines Data Set 39
Visualizing with Parallel Coordinates Plots 39
Visualizing Interrelationships between Attributes and Labels 41
Visualizing Attribute and Label Correlations Using a Heat Map 48
Summarizing the Process for Understanding the Rocks Versus Mines Data Set 50
Real-Valued Predictions with Factor Variables: How Old is Your Abalone? 50
Parallel Coordinates for Regression Problems—Visualize Variable Relationships for the Abalone Problem 55
How to Use a Correlation Heat Map for Regression—Visualize Pair-Wise Correlations for the Abalone Problem 59
Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes 61
Multiclass Classification Problem: What Type of Glass is That? 67
Using PySpark to Understand Large Data Sets 72
Summary 75

Chapter 3

Predictive Model Building: Balancing Performance, Complexity, and Big Data 77
The Basic Problem: Understanding Function Approximation 78
Working with Training Data 79
Assessing Performance of Predictive Models 81
Factors Driving Algorithm Choices and Performance—Complexity and Data 82
Contrast between a Simple Problem and a Complex Problem 82
Contrast between a Simple Model and a Complex Model 85
Factors Driving Predictive Algorithm Performance 89
Choosing an Algorithm: Linear or Nonlinear? 90
Measuring the Performance of Predictive Models 91
Performance Measures for Different Types of Problems 91
Simulating Performance of Deployed Models 105
Achieving Harmony between Model and Data 107
Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size 107
Using Forward Stepwise Regression to Control Overfitting 109
Evaluating and Understanding Your Predictive Model 114
Control Overfitting by Penalizing Regression Coefficients—Ridge Regression 116
Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets 124
Summary 127

Chapter 4

Penalized Linear Regression 129
Why Penalized Linear Regression Methods are So Useful 130
Extremely Fast Coefficient Estimation 130
Variable Importance Information 131
Extremely Fast Evaluation When Deployed 131
Reliable Performance 131
Sparse Solutions 132
Problem May Require Linear Model 132
When to Use Ensemble Methods 132
Penalized Linear Regression: Regulating Linear Regression for Optimum Performance 132
Training Linear Models: Minimizing Errors and More 135
Adding a Coefficient Penalty to the OLS Formulation 136
Other Useful Coefficient Penalties—Manhattan and ElasticNet 137
Why Lasso Penalty Leads to Sparse Coefficient Vectors 138
ElasticNet Penalty Includes Both Lasso and Ridge 140
Solving the Penalized Linear Regression Problem 141
Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression 141
How LARS Generates Hundreds of Models of Varying Complexity 145
Choosing the Best Model from the Hundreds LARS Generates 147
Using Glmnet: Very Fast and Very General 152
Comparison of the Mechanics of Glmnet and LARS Algorithms 153
Initializing and Iterating the Glmnet Algorithm 153
Extension of Linear Regression to Classification Problems 157
Solving Classification Problems with Penalized Regression 157
Working with Classification Problems Having More Than Two Outcomes 161
Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems 161
Incorporating Non-Numeric Attributes into Linear Methods 163
Summary 166

Chapter 5

Building Predictive Models Using Penalized Linear Methods 169
Python Packages for Penalized Linear Regression 170
Multivariable Regression: Predicting Wine Taste 171
Building and Testing a Model to Predict Wine Taste 172
Training on the Whole Data Set before Deployment 175
Basis Expansion: Improving Performance by Creating New Variables from Old Ones 179
Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines 182
Build a Rocks Versus Mines Classifier for Deployment 191
Multiclass Classification: Classifying Crime Scene Glass Samples 200
Linear Regression and Classification Using PySpark 203
Using PySpark to Predict Wine Taste 204
Logistic Regression with PySpark: Rocks Versus Mines 208
Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings 213
Multiclass Logistic Regression with Meta Parameter Optimization 217
Summary 219

Chapter 6

Ensemble Methods 221
Binary Decision Trees 222
How a Binary Decision Tree Generates Predictions 224
How to Train a Binary Decision Tree 225
Tree Training Equals Split Point Selection 227
How Split Point Selection Affects Predictions 228
Algorithm for Selecting Split Points 229
Multivariable Tree Training—Which Attribute to Split? 229
Recursive Splitting for More Tree Depth 230
Overfitting Binary Trees 231
Measuring Overfit with Binary Trees 231
Balancing Binary Tree Complexity for Best Performance 232
Modifications for Classification and Categorical Features 235
Bootstrap Aggregation: “Bagging” 235
How Does the Bagging Algorithm Work? 236
Bagging Performance—Bias Versus Variance 239
How Bagging Behaves on Multivariable Problem 241
Bagging Needs Tree Depth for Performance 245
Summary of Bagging 246
Gradient Boosting 246
Basic Principle of Gradient Boosting Algorithm 246
Parameter Settings for Gradient Boosting 249
How Gradient Boosting Iterates toward a Predictive Model 249
Getting the Best Performance from Gradient Boosting 250
Gradient Boosting on a Multivariable Problem 253
Summary for Gradient Boosting 256
Random Forests 256
Random Forests: Bagging Plus Random Attribute Subsets 259
Random Forests Performance Drivers 260
Random Forests Summary 261
Summary 262

Chapter 7

Building Ensemble Models with Python 265
Solving Regression Problems with Python Ensemble Packages 265
Using Gradient Boosting to Predict Wine Taste 266
Using the Class Constructor for GradientBoostingRegressor 266
Using GradientBoostingRegressor to Implement a Regression Model 268
Assessing the Performance of a Gradient Boosting Model 271
Building a Random Forest Model to Predict Wine Taste 272
Constructing a RandomForestRegressor Object 273
Modeling Wine Taste with RandomForestRegressor 275
Visualizing the Performance of a Random Forest Regression Model 279
Incorporating Non-Numeric Attributes in Python Ensemble Models 279
Coding the Sex of Abalone for Gradient Boosting Regression in Python 280
Assessing Performance and the Importance of Coded Variables with Gradient Boosting 282
Coding the Sex of Abalone for Input to Random Forest Regression in Python 284
Assessing Performance and the Importance of Coded Variables 287
Solving Binary Classification Problems with Python Ensemble Methods 288
Detecting Unexploded Mines with Python Gradient Boosting 288
Determining the Performance of a Gradient Boosting Classifier 291
Detecting Unexploded Mines with Python Random Forest 292
Constructing a Random Forest Model to Detect Unexploded Mines 294
Determining the Performance of a Random Forest Classifier 298
Solving Multiclass Classification Problems with Python Ensemble Methods 300
Dealing with Class Imbalances 301
Classifying Glass Using Gradient Boosting 301
Determining the Performance of the Gradient Boosting Model on Glass Classification 306
Classifying Glass with Random Forests 307
Determining the Performance of the Random Forest Model on Glass Classification 310
Solving Regression Problems with PySpark Ensemble Packages 311
Predicting Wine Taste with PySpark Ensemble Methods 312
Predicting Abalone Age with PySpark Ensemble Methods 317
Distinguishing Mines from Rocks with PySpark Ensemble Methods 321
Identifying Glass Types with PySpark Ensemble Methods 325
Summary 327

Index 329"
