Handbook of big data technologies

This handbook offers comprehensive coverage of recent advancements in Big Data technologies and related paradigms. Chapters are authored by leading international experts in the field and have been reviewed and revised for maximum reader value.

Bibliographic Details
Main Authors	Zomaya, Albert Y., Sakr, Sherif, Sahni, Sartaj
Format	eBook Book
Language	English
Published	Cham: Springer International Publishing AG, 2017
Edition	1
Subjects	Big data; Big data > Technological innovations; Big Data/Analytics; Communications Engineering, Networks; Computer Science; Computer Systems Organization and Communication Networks; Data Storage Representation; Programming Techniques
ISBN	3319493396 9783319493398
DOI	10.1007/978-3-319-49340-4

Table of Contents:

Intro -- Foreword -- Preface -- Contents -- Part I Fundamentals of Big Data Processing -- Big Data Storage and Data Models -- 1 Storage Models -- 1.1 Block-Based Storage -- 1.2 File-Based Storage -- 1.3 Object-Based Storage -- 1.4 Comparison of Storage Models -- 2 Data Models -- 2.1 NoSQL (Not only SQL) -- 2.2 Relational-Based -- 2.3 Summary of Data Models -- References -- Big Data Programming Models -- 1 MapReduce -- 1.1 Features -- 1.2 Examples -- 2 Functional Programming -- 2.1 Features -- 2.2 Example Frameworks -- 3 SQL-Like -- 3.1 Features -- 3.2 Examples -- 4 Actor Model -- 4.1 Features -- 4.2 Examples -- 5 Statistical and Analytical -- 5.1 Features -- 5.2 Examples -- 6 Dataflow-Based -- 6.1 Features -- 6.2 Examples -- 7 Bulk Synchronous Parallel -- 7.1 Features -- 7.2 Examples -- 8 High Level DSL -- 8.1 Pig Latin -- 8.2 Crunch/FlumeJava -- 8.3 Cascading -- 8.4 Dryad LINQ -- 8.5 Trident -- 8.6 Green Marl -- 8.7 Asterix Query Language (AQL) -- 8.8 IBM Jaql -- 9 Discussion and Conclusion -- References -- Programming Platforms for Big Data Analysis -- 1 Introduction -- 2 Requirements of Big Data Programming Support -- 3 Classification of Programming Platforms -- 3.1 Data Source -- 3.2 Processing Technique -- 4 Major Existing Programming Platforms -- 4.1 Data Parallel Programming Platforms -- 4.2 Graph Parallel Programming Platforms -- 4.3 Task Parallel Platforms -- 4.4 Stream Processing Programming Platforms -- 5 A Unifying Framework -- 5.1 Comparison of Existing Programming Platforms -- 5.2 Need for Unifying Framework -- 5.3 MatrixMap Framework -- 6 Conclusion and Future Directions -- References -- Big Data Analysis on Clouds -- 1 Introduction -- 2 Introducing Cloud Computing -- 2.1 Basic Concepts -- 2.2 Cloud Service Distribution and Deployment Models -- 3 Cloud Solutions for Big Data -- 3.1 Microsoft Azure -- 3.2 Amazon Web Services
3.3 OpenNebula -- 3.4 OpenStack -- 4 Systems for Big Data Analytics in the Cloud -- 4.1 MapReduce -- 4.2 Spark -- 4.3 Mahout -- 4.4 Hunk -- 4.5 Sector/Sphere -- 4.6 BigML -- 4.7 Kognitio Analytical Platform -- 4.8 Data Analysis Workflows -- 4.9 NoSQL Models for Data Analytics -- 4.10 Visual Analytics -- 4.11 Big Data Funding Projects -- 4.12 Historical Review -- 4.13 Summary -- 5 Research Trends -- 6 Conclusions -- References -- Data Organization and Curation in Big Data -- 1 Big Data Indexing Techniques -- 1.1 Overview -- 1.2 Record-Level Non-adaptive Indexing -- 1.3 Record-Level Adaptive Indexing -- 1.4 Split-Level Indexing -- 1.5 Hadoop-RDBMS Hybrid Indexing -- 2 Data Organization and Layout Techniques -- 2.1 Overview -- 2.2 Result Materialization and Caching Techniques -- 2.3 Pre-processing and Colocation Techniques -- 2.4 None Row-Oriented Storage Layouts -- 3 Non-traditional Workloads in Big Data -- 3.1 Overview -- 3.2 Techniques for Recurring Workloads -- 3.3 Techniques for Fast Online Analytics -- 4 Curation and Metadata Management in Big Data -- 4.1 Overview -- 4.2 Execution-Centric Metadata Approach -- 4.3 Provenance-Centric Metadata Approach -- 4.4 Data-Centric Metadata Approach -- 5 Conclusion -- References -- Big Data Query Engines -- 1 Introduction -- 1.1 MPP Query Engines -- 1.2 Hadoop Query Engines -- 1.3 Chapter Organization -- 2 Massively Parallel Query Engines -- 2.1 Teradata -- 2.2 Greenplum -- 2.3 Vertica -- 3 Hadoop Query Engines -- 3.1 MapReduce -- 3.2 Hive -- 3.3 Spark -- 4 SQL on Hadoop -- 4.1 HAWQ -- 4.2 Impala -- 4.3 Presto -- 5 Query Optimization -- 5.1 Research Problems -- 5.2 Orca -- 5.3 Catalyst -- 5.4 V2Opt -- 5.5 Impala Query Optimizer -- 6 Query Execution -- 6.1 Research Problems -- 6.2 Hadoop-Based Execution Engines -- 6.3 Parallel Databases Execution Engines -- 6.4 Code Generation -- 7 Summary -- References
Large-Scale Data Stream Processing Systems -- 1 Introduction -- 1.1 Stream Processing and Its Precursors -- 1.2 Large-Scale Data Stream Processing on Commodity Clusters -- 1.3 Distinctive Features of Data Stream Processing Systems -- 1.4 Chapter Overview -- 2 Programming Models -- 2.1 Programming with Streams -- 2.2 Lower-Level Dataflow Programming -- 2.3 Functional APIs -- 2.4 Stream Windows -- 3 System Support for Distributed Data Streaming -- 3.1 An Analysis of Large-Scale Stream Processing Systems -- 3.2 Execution Models -- 3.3 Processing Guarantees Upon Failure -- 3.4 Flow Control -- 3.5 Execution Plan Optimisations -- 4 Case Study: Stream Processing with Apache Flink -- 4.1 The Apache Flink Stack -- 4.2 The Apache Flink System Architecture -- 4.3 Lightweight Asynchronous Snapshots -- 5 Applications, Trends and Open Challenges -- 5.1 Graph Stream Processing -- 5.2 Online Learning -- 5.3 Complex Event Processing -- 6 Conclusions and Outlook -- References -- Part II Semantic Big Data Management -- Semantic Data Integration -- 1 An Important Challenge -- 1.1 Linked Data -- 1.2 Ontologies -- 1.3 Ontology and Data Alignment -- 2 Current State-of-the-Art -- 2.1 Interactive and Collaborative Approaches -- 2.2 Visualizing the Data Integration Process -- 2.3 Integrating Geospatial Data -- 2.4 Integrating Biomedical Data -- 3 The Path Forward -- 3.1 Moving Beyond 1-to-1 Equivalence Mappings -- 3.2 Advancing Alignment Evaluation -- 3.3 Contextualizing Alignments -- References -- Linked Data Management -- 1 Introduction -- 2 Background Information -- 3 Native Linked Data Stores -- 3.1 Quadruple Systems -- 3.2 Index Permuted Stores -- 3.3 Graph-Based Systems -- 4 Provenance for Linked Data -- 4.1 Provenance Representations -- 4.2 Provenance in Data Management Systems -- References -- Non-native RDF Storage Engines -- 1 Introduction
2 Storing Linked Data Using Relational Databases -- 2.1 Statement Table -- 2.2 Optimizing Data Storage -- 2.3 Property Tables -- 2.4 Query Execution -- 3 No-SQL Stores -- 4 Massively Parallel Processing for Linked Data -- 4.1 Data Storage and Partitioning -- 4.2 Query Execution -- References -- Exploratory Ad-Hoc Analytics for Big Data -- 1 Exploratory Analytics for Big Data -- 1.1 Requirements -- 1.2 Architecture Overview -- 2 A Top-K Entity Augmentation System -- 2.1 Motivation and Challenges -- 2.2 Requirements -- 2.3 Top-k Consistent Entity Augmentation -- 2.4 Related Work -- 3 DrillBeyond -- Processing Open World SQL -- 3.1 Motivation and Challenges -- 3.2 Requirements -- 3.3 The DrillBeyond System -- 3.4 Processing Multi-result Queries -- 3.5 Related Work -- 4 Summary and Future Work -- 4.1 Future Work -- References -- Pattern Matching Over Linked Data Streams -- 1 Overview -- 2 Linked Data Dissemination System -- 2.1 System Overview -- 2.2 TP-Automata for Single Triple Pattern Query Matching -- 2.3 CTP-Automata for Conjunctive Triple Pattern Query Matching -- 3 Experimental Evaluation -- 3.1 Experimental Setup -- 3.2 Evaluation of TP-Automata -- 3.3 Evaluation of CTP-Automata -- 3.4 Limitations -- 4 Related Work -- 5 Summary -- References -- Searching the Big Data: Practices and Experiences in Efficiently Querying Knowledge Bases -- 1 Introduction -- 2 Background -- 2.1 Knowledge Base Preliminary -- 3 The Framework of Cache-Based Knowledge Base Querying -- 4 Similar Queries Suggestion -- 4.1 Query Distance Calculation -- 4.2 Feature Modeling -- 5 Cache Replacement -- 5.1 Modified Simple Exponential Smoothing -- 5.2 Replacement Algorithms -- 6 Implementation and Experimental Evaluation -- 6.1 Setup -- 6.2 Performance of Cache Replacement Algorithm -- 6.3 Comparison of Feature Modeling Approaches
6.4 Performance Comparison with the State-of-the-Art Work -- 6.5 Experimental Conclusion -- 7 Related Work -- 7.1 Semantic Caching -- 7.2 Query Suggestion -- 8 Discussion and Conclusion -- References -- Part III Big Graph Analytics -- Management and Analysis of Big Graph Data: Current Systems and Open Challenges -- 1 Introduction -- 2 Graph Databases -- 2.1 Recent Graph Database Systems -- 2.2 Graph Data Models -- 2.3 Query Language Support -- 3 Graph Processing -- 3.1 General Architecture -- 3.2 Think Like a Vertex -- 3.3 Think Like a Graph -- 4 Graph Dataflow Systems -- 4.1 Apache Flink -- 4.2 Apache Flink Gelly -- 4.3 Comparison to Other Graph Dataflow Frameworks -- 5 Gradoop -- 5.1 Architecture -- 5.2 Extended Property Graph Model -- 6 Comparison -- 7 Current Research and Open Challenges -- 7.1 Graph Data Allocation and Partitioning -- 7.2 Benchmarking and Evaluation of Graph Data Systems -- 7.3 Analysis of Dynamic Graphs -- 7.4 Graph-Based Data Integration and Knowledge Graphs -- 7.5 Interactive Graph Analytics -- 8 Conclusions and Outlook -- References -- Similarity Search in Large-Scale Graph Databases -- 1 Introduction -- 2 Preliminaries -- 3 The Pruning-Verification Framework -- 4 State-of-the-Art Approaches -- 4.1 A Tree-Based Approach: K-Adjacent Tree -- 4.2 A Star-Based Approach: SEGOS -- 4.3 A Path-Based Approach: GSimJoin -- 4.4 A Partition-Based Approach: Pars -- 5 Future Research Directions -- 5.1 New GED Bounds and Search Algorithms -- 5.2 Rich Semantics of Similarity Search -- 5.3 Graph Query Formulation and Understanding -- 6 Summary -- References -- Big-Graphs: Querying, Mining, and Beyond -- 1 Introduction -- 2 Graph Data Models -- 2.1 RDF -- 2.2 Property Graph -- 3 Pattern Matching Techniques Over Big-Graphs -- 3.1 SQL and NoSQL Approaches -- 3.2 Keyword Search -- 3.3 Graph Matching Query -- 3.4 Graph Query by Example
4 Mining Techniques Over Big-Graphs