Principles and methods for data science

Principles and Methods for Data Science, Volume 43 in the Handbook of Statistics series, highlights new advances in the field. This updated volume presents interesting and timely topics, including Competing risks: Aims and methods; Data analysis and mining of microbial community dynamics; Sup...

Bibliographic Details
Main Authors: Srinivasa Rao, Arni S. R.; Rao, Calyampudi Radhakrishna
Format: eBook
Language: English
Published: Amsterdam: North-Holland / Elsevier Science & Technology, 2020
Edition: 1
Series: Handbook of Statistics
ISBN: 0444642110; 9780444642110


Table of Contents:
  • Front Cover -- Principles and Methods for Data Science -- Copyright -- Contents -- Contributors -- Preface -- Chapter 1: Markov chain Monte Carlo methods: Theory and practice -- 1. Introduction -- 2. Introduction to Bayesian statistical analysis -- 2.1. Noninformative prior distributions -- 2.2. Informative prior distributions -- 2.2.1. Conjugate prior distributions -- 2.2.2. Nonconjugate prior distributions -- 2.3. Bayesian estimation -- 3. Markov chain Monte Carlo background -- 3.1. Discrete-state Markov chains -- 3.2. General state space Markov chain theory -- 4. Common MCMC algorithms -- 4.1. The Metropolis-Hastings algorithm -- 4.2. Multivariate Metropolis-Hastings -- 4.3. The Gibbs sampler -- 4.3.1. Sampling from intractable full conditional distributions -- 4.3.2. Rejection sampling -- 4.3.3. Adaptive rejection sampling -- 4.3.4. The tangent method -- 4.4. Slice sampling -- 4.5. Reversible jump MCMC -- 5. Markov chain Monte Carlo in practice -- 5.1. MCMC in regression models -- 5.2. Random effects models -- 5.3. Bayesian generalized linear models -- 5.4. Hierarchical models -- 6. Assessing Markov chain behavior -- 6.1. Using the theory to bound the mixing time -- 6.2. Output-based convergence diagnostics -- 6.2.1. Trace plots -- 6.2.2. Heidelberger and Welch (1983) diagnostic -- 6.2.3. Geweke (1992) spectral density diagnostic -- 6.2.4. Gelman and Rubin (1992) diagnostic -- 6.2.5. Yu and Mykland (1994) CUSUM plot diagnostic -- 6.3. Using auxiliary simulations to bound mixing time -- 6.3.1. Cowles and Rosenthal (1998) auxiliary simulation approach -- 6.3.2. An auxiliary simulation approach for random-scan random-walk Metropolis samplers -- 6.3.3. Auxiliary simulation approach for full-updating Metropolis samplers -- 6.4. Examining sampling frequency -- 7. Conclusion -- References -- Further reading
  • Chapter 2: An information and statistical analysis pipeline for microbial metagenomic sequencing data -- 1. Introduction -- 2. A brief overview of shotgun metagenomic sequencing analysis -- 2.1. Sequence assembly and contig binning -- 2.2. Annotation of taxonomy, protein, metabolic, and biological functions -- 2.3. Statistical analysis and machine learning -- 2.4. Reconstruction of pseudo-dynamics and mathematical modeling -- 2.5. Construction of analysis pipeline with reproducibility and portability -- 3. Computational tools and resources -- 3.1. Tools and software -- 3.1.1. BLAST -- 3.1.2. BWA -- 3.1.3. SAMtools -- 3.1.4. CD-HIT -- 3.1.5. MEGAHIT -- 3.1.6. MaxBin -- 3.1.7. Prodigal -- 3.1.8. VirFinder -- 3.1.9. DIAMOND -- 3.1.10. MEGAN -- 3.1.11. TSCAN -- 3.2. Public resources and databases -- 3.2.1. NCBI reference database (RefSeq) -- 3.2.2. Integrated Microbial Genomes (IMG) and Genome OnLine Database (GOLD) -- 3.2.3. UniProt -- 3.2.4. InterPro -- 3.2.5. KEGG -- 3.2.6. SEED -- 3.2.7. EggNOG -- 3.3. Do-It-Yourself information analysis pipeline for metagenomic sequences -- 4. Notes -- Acknowledgments -- References -- Chapter 3: Machine learning algorithms, applications, and practices in data science -- Abbreviations -- 1. Introduction -- 2. Supervised methods -- 2.1. Data sets -- 2.2. Linear regression -- 2.2.1. Polynomial fitting -- 2.2.2. Thresholding and linear regression -- 2.3. Logistic regression -- 2.4. Support vector machine-Linear kernel -- 2.5. Decision tree -- Outline of decision tree -- 2.6. Ensemble methods -- 2.6.1. Boosting algorithms -- 2.6.2. Gradient boosting algorithm -- 2.7. Bias-variance trade off -- 2.7.1. Bias variance experiments -- 2.8. Cross validation and model selection -- 2.8.1. Model selection process -- 2.8.2. Learning curves -- 2.9. Multiclass and multivariate scenarios -- 2.9.1. Multivariate linear regression
  • 2.9.2. Multiclass classification -- 2.9.2.1. Multiclass SVM -- 2.9.2.2. Multiclass logistic regression -- 2.10. Regularization -- 2.10.1. Regularization in gradient methods -- 2.10.2. Regularization in other methods -- 2.11. Metrics in machine learning -- 2.11.1. Confusion matrix -- 2.11.2. Precision-recall curve -- 2.11.3. ROC curve -- 2.11.4. Metrics for the multiclass classification -- 3. Practical considerations in model building -- 3.1. Noise in the data -- 3.2. Missing values -- 3.3. Class imbalance -- 3.4. Model maintenance -- 4. Unsupervised methods -- 4.1. Clustering -- 4.1.1. K-means -- 4.1.2. Hierarchical clustering -- 4.1.3. Density-based clustering -- 4.2. Comparison of clustering algorithms over data sets -- 4.3. Matrix factorization -- 4.4. Principal component analysis -- 4.5. Understanding the SVD algorithm -- 4.5.1. LU decomposition -- 4.5.2. QR decomposition -- 4.6. Data distributions and visualization -- 4.6.1. Multidimensional scaling -- 4.6.2. tSNE -- 4.6.3. PCA-based visualization -- 4.6.4. Research directions -- 5. Graphical methods -- 5.1. Naive Bayes algorithm -- 5.2. Expectation maximization -- Example of email spam and nonspam problem-Posing as graphical model -- 5.2.1. E and M steps -- 5.2.2. Sampling error minimization -- 5.3. Markovian networks -- 5.3.1. Hidden Markov model -- 5.3.2. Latent Dirichlet analysis -- Topic modeling of audio data -- Topic modeling of image data -- 6. Deep learning -- 6.1. Neural network -- 6.1.1. Gradient magnitude issues -- 6.1.2. Relation to ensemble learning -- 6.2. Encoder -- 6.2.1. Vectorization of text -- 6.2.2. Autoencoder -- 6.2.3. Restricted Boltzmann machine -- 6.3. Convolutional neural network -- 6.3.1. Filter learning -- 6.3.2. Convolution layer -- 6.3.3. Max pooling -- 6.3.4. Fully connected layer -- 6.3.5. Popular CNN architectures -- 6.4. Recurrent neural network
  • 6.4.1. Anatomy of simple RNN -- 6.4.2. Training a simple RNN -- 6.4.3. LSTM -- 6.4.4. Examples of sequence learning problem statements -- 6.4.5. Sequence to sequence mapping -- 6.5. Generative adversarial network -- 6.5.1. Training GAN -- 6.5.2. Applications of GANs -- 7. Optimization -- 8. Artificial intelligence -- 8.1. Notion of state space and search -- 8.2. State space-Search algorithms -- 8.2.1. Enumerative search methods -- 8.2.2. Heuristic search methods-Example A* algorithm -- 8.3. Planning algorithms -- 8.3.1. Example of a state -- 8.4. Formal logic -- 8.4.1. Predicate or propositional logic -- 8.4.2. First-order logic -- 8.4.3. Automated theorem proof -- 8.4.3.1. Forward chaining -- 8.4.3.2. Incompleteness of the forward chaining -- 8.4.3.3. Backward chaining -- 8.5. Resolution by refutation method -- 8.6. AI framework adaptability issues -- 9. Applications and laboratory exercises -- 9.1. Automatic differentiation -- 9.2. Machine learning exercises -- 9.3. Clustering exercises -- 9.4. Graphical model exercises -- 9.4.1. Exercise-Topics in text data -- 9.4.2. Exercise-Topics in image data -- 9.4.3. Exercise-Topics in audio data -- 9.5. Data visualization exercises -- 9.6. Deep learning exercises -- References -- Chapter 4: Bayesian model selection for high-dimensional data -- 1. Introduction -- 2. Classical variable selection methods -- 2.1. Best subset selection -- 2.2. Stepwise selection methods -- 2.3. Criterion functions -- 3. The penalization framework -- 3.1. LASSO and generalizations -- 3.1.1. Strong irrepresentable condition -- 3.1.2. Adaptive LASSO -- 3.1.3. Elastic net -- 3.2. Nonconvex penalization -- 3.3. Variable screening -- 4. The Bayesian framework for model selection -- 5. Spike and slab priors -- 5.1. Point mass spike prior -- 5.1.1. g-priors -- 5.1.2. Nonlocal priors -- 5.2. Continuous spike priors
  • 5.3. Spike and slab LASSO -- 6. Continuous shrinkage priors -- 6.1. Bayesian LASSO -- 6.2. Horseshoe prior -- 6.3. Global-local shrinkage priors -- 6.4. Regularization of Bayesian priors -- 6.5. Prior elicitation-Hyperparameter selection -- 6.5.1. Empirical Bayes -- 6.5.2. Criterion-based tuning -- 7. Computation -- 7.1. Direct exploration of the model space -- 7.1.1. Shotgun stochastic search -- 7.2. Gibbs sampling -- 7.3. EM algorithm -- 7.4. Approximate algorithms -- 7.4.1. Laplace approximation -- 7.4.2. Variational approximation -- 8. Theoretical properties -- 8.1. Consistency properties of the posterior mode -- 8.2. Posterior concentration -- 8.3. Pairwise model comparison consistency -- 8.4. Strong model selection consistency -- 9. Implementation -- 10. An example -- 11. Discussion -- Acknowledgments -- References -- Chapter 5: Competing risks: Aims and methods -- 1. Introduction -- 2. Research aim: Explanation vs prediction -- 2.1. In-hospital infection and discharge -- 2.1.1. Marginal distribution: Discharge as censoring event -- 2.1.2. Cause-specific distribution: Discharge as competing event -- 2.2. Causes of death after HIV infection -- 2.3. AIDS and pre-AIDS death -- 3. Basic quantities and their estimators -- 3.1. Definitions and notation -- 3.1.1. Competing risks -- 3.1.2. Multistate approach -- 3.2. Data setup -- 3.3. Nonparametric estimation -- 3.3.1. Complete data -- 3.3.2. Cause-specific hazard: Aalen-Johansen estimator -- 3.3.3. Incomplete data: Weights -- 3.3.4. Subdistribution hazard: Weighted PL estimator -- 3.3.5. Weighted ECDF estimator -- 3.3.6. Equivalence -- 3.4. Standard errors and confidence intervals -- 3.5. Regression models -- 3.6. Software -- 4. Time-varying covariables and the subdistribution hazard -- 4.1. Overall survival -- 4.2. Spectrum in causes of death -- 4.2.1. Internal approach
  • 4.2.2. Pseudo-individual approach