Machine Learning for Text

Text analytics is a field that lies at the interface of information retrieval, machine learning, and natural language processing, and this textbook covers a coherently organized framework drawn from these intersecting topics. The chapters of this textbook are organized into three categories: ...

Bibliographic Details
Main Author: Aggarwal, Charu C.
Format: eBook
Language: English
Published: Cham: Springer Nature, 2018
Edition: 1
ISBN: 9783319735313; 3319735314; 9783319735306; 3319735306
DOI: 10.1007/978-3-319-73531-3

Table of Contents:
  • Intro -- Preface -- Acknowledgments -- Contents -- Author Biography -- 1 Machine Learning for Text: An Introduction -- 1.1 Introduction -- 1.1.1 Chapter Organization -- 1.2 What Is Special About Learning from Text? -- 1.3 Analytical Models for Text -- 1.3.1 Text Preprocessing and Similarity Computation -- 1.3.2 Dimensionality Reduction and Matrix Factorization -- 1.3.3 Text Clustering -- 1.3.3.1 Deterministic and Probabilistic Matrix Factorization Methods -- 1.3.3.2 Probabilistic Mixture Models of Documents -- 1.3.3.3 Similarity-Based Algorithms -- 1.3.3.4 Advanced Methods -- 1.3.4 Text Classification and Regression Modeling -- 1.3.4.1 Decision Trees -- 1.3.4.2 Rule-Based Classifiers -- 1.3.4.3 Naïve Bayes Classifier -- 1.3.4.4 Nearest Neighbor Classifiers -- 1.3.4.5 Linear Classifiers -- 1.3.4.6 Broader Topics in Classification -- 1.3.5 Joint Analysis of Text with Heterogeneous Data -- 1.3.6 Information Retrieval and Web Search -- 1.3.7 Sequential Language Modeling and Embeddings -- 1.3.8 Text Summarization -- 1.3.9 Information Extraction -- 1.3.10 Opinion Mining and Sentiment Analysis -- 1.3.11 Text Segmentation and Event Detection -- 1.4 Summary -- 1.5 Bibliographic Notes -- 1.5.1 Software Resources -- 1.6 Exercises -- 2 Text Preparation and Similarity Computation -- 2.1 Introduction -- 2.1.1 Chapter Organization -- 2.2 Raw Text Extraction and Tokenization -- 2.2.1 Web-Specific Issues in Text Extraction -- 2.3 Extracting Terms from Tokens -- 2.3.1 Stop-Word Removal -- 2.3.2 Hyphens -- 2.3.3 Case Folding -- 2.3.4 Usage-Based Consolidation -- 2.3.5 Stemming -- 2.4 Vector Space Representation and Normalization -- 2.5 Similarity Computation in Text -- 2.5.1 Is idf Normalization and Stemming Always Useful? -- 2.6 Summary -- 2.7 Bibliographic Notes -- 2.7.1 Software Resources -- 2.8 Exercises -- 3 Matrix Factorization and Topic Modeling
  • 3.1 Introduction -- 3.1.1 Chapter Organization -- 3.1.2 Normalizing a Two-Way Factorization into a Standardized Three-Way Factorization -- 3.2 Singular Value Decomposition -- 3.2.1 Example of SVD -- 3.2.2 The Power Method of Implementing SVD -- 3.2.3 Applications of SVD/LSA -- 3.2.4 Advantages and Disadvantages of SVD/LSA -- 3.3 Nonnegative Matrix Factorization -- 3.3.1 Interpretability of Nonnegative Matrix Factorization -- 3.3.2 Example of Nonnegative Matrix Factorization -- 3.3.3 Folding in New Documents -- 3.3.4 Advantages and Disadvantages of Nonnegative Matrix Factorization -- 3.4 Probabilistic Latent Semantic Analysis -- 3.4.1 Connections with Nonnegative Matrix Factorization -- 3.4.2 Comparison with SVD -- 3.4.3 Example of PLSA -- 3.4.4 Advantages and Disadvantages of PLSA -- 3.5 A Bird's Eye View of Latent Dirichlet Allocation -- 3.5.1 Simplified LDA Model -- 3.5.2 Smoothed LDA Model -- 3.6 Nonlinear Transformations and Feature Engineering -- 3.6.1 Choosing a Similarity Function -- 3.6.1.1 Traditional Kernel Similarity Functions -- 3.6.1.2 Generalizing Bag-of-Words to N-Grams -- 3.6.1.3 String Subsequence Kernels -- 3.6.1.4 Speeding Up the Recursion -- 3.6.1.5 Language-Dependent Kernels -- 3.6.2 Nyström Approximation -- 3.6.3 Partial Availability of the Similarity Matrix -- 3.7 Summary -- 3.8 Bibliographic Notes -- 3.8.1 Software Resources -- 3.9 Exercises -- 4 Text Clustering -- 4.1 Introduction -- 4.1.1 Chapter Organization -- 4.2 Feature Selection and Engineering -- 4.2.1 Feature Selection -- 4.2.1.1 Term Strength -- 4.2.1.2 Supervised Modeling for Unsupervised Feature Selection -- 4.2.1.3 Unsupervised Wrappers with Supervised Feature Selection -- 4.2.2 Feature Engineering -- 4.2.2.1 Matrix Factorization Methods -- 4.2.2.2 Nonlinear Dimensionality Reduction -- 4.2.2.3 Word Embeddings -- 4.3 Topic Modeling and Matrix Factorization
  • 4.3.1 Mixed Membership Models and Overlapping Clusters -- 4.3.2 Non-overlapping Clusters and Co-clustering: A Matrix Factorization View -- 4.3.2.1 Co-clustering by Bipartite Graph Partitioning -- 4.4 Generative Mixture Models for Clustering -- 4.4.1 The Bernoulli Model -- 4.4.2 The Multinomial Model -- 4.4.3 Comparison with Mixed Membership Topic Models -- 4.4.4 Connections with Naïve Bayes Model for Classification -- 4.5 The k-Means Algorithm -- 4.5.1 Convergence and Initialization -- 4.5.2 Computational Complexity -- 4.5.3 Connection with Probabilistic Models -- 4.6 Hierarchical Clustering Algorithms -- 4.6.1 Efficient Implementation and Computational Complexity -- 4.6.2 The Natural Marriage with k-Means -- 4.7 Clustering Ensembles -- 4.7.1 Choosing the Ensemble Component -- 4.7.2 Combining the Results from Different Components -- 4.8 Clustering Text as Sequences -- 4.8.1 Kernel Methods for Clustering -- 4.8.1.1 Kernel k-Means -- 4.8.1.2 Explicit Feature Engineering -- 4.8.1.3 Kernel Trick or Explicit Feature Engineering? -- 4.8.2 Data-Dependent Kernels: Spectral Clustering -- 4.9 Transforming Clustering into Supervised Learning -- 4.9.1 Practical Issues -- 4.10 Clustering Evaluation -- 4.10.1 The Pitfalls of Internal Validity Measures -- 4.10.2 External Validity Measures -- 4.10.2.1 Relationship of Clustering Evaluation to Supervised Learning -- 4.10.2.2 Common Mistakes in Evaluation -- 4.11 Summary -- 4.12 Bibliographic Notes -- 4.12.1 Software Resources -- 4.13 Exercises -- 5 Text Classification: Basic Models -- 5.1 Introduction -- 5.1.1 Types of Labels and Regression Modeling -- 5.1.2 Training and Testing -- 5.1.3 Inductive, Transductive, and Deductive Learners -- 5.1.4 The Basic Models -- 5.1.5 Text-Specific Challenges in Classifiers -- 5.1.5.1 Chapter Organization -- 5.2 Feature Selection and Engineering -- 5.2.1 Gini Index
  • 5.2.2 Conditional Entropy -- 5.2.3 Pointwise Mutual Information -- 5.2.4 Closely Related Measures -- 5.2.5 The χ2-Statistic -- 5.2.6 Embedded Feature Selection Models -- 5.2.7 Feature Engineering Tricks -- 5.3 The Naïve Bayes Model -- 5.3.1 The Bernoulli Model -- 5.3.1.1 Prediction Phase -- 5.3.1.2 Training Phase -- 5.3.2 Multinomial Model -- 5.3.3 Practical Observations -- 5.3.4 Ranking Outputs with Naïve Bayes -- 5.3.5 Example of Naïve Bayes -- 5.3.5.1 Bernoulli Model -- 5.3.5.2 Multinomial Model -- 5.3.6 Semi-Supervised Naïve Bayes -- 5.4 Nearest Neighbor Classifier -- 5.4.1 Properties of 1-Nearest Neighbor Classifiers -- 5.4.2 Rocchio and Nearest Centroid Classification -- 5.4.3 Weighted Nearest Neighbors -- 5.4.3.1 Bagged and Subsampled 1-Nearest Neighbors as Weighted Nearest Neighbor Classifiers -- 5.4.4 Adaptive Nearest Neighbors: A Powerful Family -- 5.5 Decision Trees and Random Forests -- 5.5.1 Basic Procedure for Decision Tree Construction -- 5.5.2 Splitting a Node -- 5.5.2.1 Prediction -- 5.5.3 Multivariate Splits -- 5.5.4 Problematic Issues with Decision Trees in Text Classification -- 5.5.5 Random Forests -- 5.5.6 Random Forests as Adaptive Nearest Neighbor Methods -- 5.6 Rule-Based Classifiers -- 5.6.1 Sequential Covering Algorithms -- 5.6.1.1 Learn-One-Rule -- 5.6.1.2 Rule Pruning -- 5.6.2 Generating Rules from Decision Trees -- 5.6.3 Associative Classifiers -- 5.6.4 Prediction -- 5.7 Summary -- 5.8 Bibliographic Notes -- 5.8.1 Software Resources -- 5.9 Exercises -- 6 Linear Classification and Regression for Text -- 6.1 Introduction -- 6.1.1 Geometric Interpretation of Linear Models -- 6.1.2 Do We Need the Bias Variable? -- 6.1.3 A General Definition of Linear Models with Regularization -- 6.1.4 Generalizing Binary Predictions to Multiple Classes -- 6.1.5 Characteristics of Linear Models for Text -- 6.1.5.1 Chapter Notations
  • 6.1.5.2 Chapter Organization -- 6.2 Least-Squares Regression and Classification -- 6.2.1 Least-Squares Regression with L2-Regularization -- 6.2.1.1 Efficient Implementation -- 6.2.1.2 Approximate Estimation with Singular Value Decomposition -- 6.2.1.3 Relationship with Principal Components Regression -- 6.2.1.4 The Path to Kernel Regression -- 6.2.2 LASSO: Least-Squares Regression with L1-Regularization -- 6.2.2.1 Interpreting LASSO as a Feature Selector -- 6.2.3 Fisher's Linear Discriminant and Least-Squares Classification -- 6.2.3.1 Linear Discriminant with Multiple Classes -- 6.2.3.2 Equivalence of Fisher Discriminant and Least-Squares Regression -- 6.2.3.3 Regularized Least-Squares Classification and LLSF -- 6.2.3.4 The Achilles Heel of Least-Squares Classification -- 6.3 Support Vector Machines -- 6.3.1 The Regularized Optimization Interpretation -- 6.3.2 The Maximum Margin Interpretation -- 6.3.3 Pegasos: Solving SVMs in the Primal -- 6.3.3.1 Sparsity-Friendly Updates -- 6.3.4 Dual SVM Formulation -- 6.3.5 Learning Algorithms for Dual SVMs -- 6.3.6 Adaptive Nearest Neighbor Interpretation of Dual SVMs -- 6.4 Logistic Regression -- 6.4.1 The Regularized Optimization Interpretation -- 6.4.2 Training Algorithms for Logistic Regression -- 6.4.3 Probabilistic Interpretation of Logistic Regression -- 6.4.3.1 Probabilistic Interpretation of Stochastic Gradient Descent Steps -- 6.4.3.2 Relationships Among Primal Updates of Linear Models -- 6.4.4 Multinomial Logistic Regression and Other Generalizations -- 6.4.5 Comments on the Performance of Logistic Regression -- 6.5 Nonlinear Generalizations of Linear Models -- 6.5.1 Kernel SVMs with Explicit Transformation -- 6.5.2 Why Do Conventional Kernels Promote Linear Separability? -- 6.5.3 Strengths and Weaknesses of Different Kernels -- 6.5.3.1 Capturing Linguistic Knowledge with Kernels
  • 6.5.4 The Kernel Trick