Divide and recombine (D&R) data science projects for deep analysis of big data and high computational complexity

Bibliographic Details
Published in	Japanese journal of statistics and data science Vol. 1; no. 1; pp. 139-156
Main Authors	Tung, Wen-wen; Barthur, Ashrith; Bowers, Matthew C.; Song, Yuying; Gerth, John; Cleveland, William S.
Format	Journal Article
Language	English
Published	Singapore: Springer Singapore, 01.06.2018
Subjects	Chemistry and Earth Sciences; Computer Science; Economics; Finance; Health Sciences; Humanities; Insurance; Law; Management; Mathematics and Statistics; Medicine; Physics; Statistical Theory and Methods; Statistics; Statistics and Computing/Statistics Programs; Statistics for Business; Statistics for Engineering; Statistics for Life Sciences; Statistics for Social Sciences; Hadoop; Data science; Blacklisting IP addresses; Big data; Parallel; Weather and climate data analysis; Distributed computing
ISSN	2520-8756 2520-8764
DOI	10.1007/s42081-018-0008-4

Summary:	The focus of data science is data analysis. This article begins with a categorization of the data science technical areas that play a direct role in data analysis. Next, big data are addressed, which create computational challenges due to the data size, as does the computational complexity of many analytic methods. Divide and recombine (D&R) is a statistical approach whose goal is to meet the challenges. In D&R, the data are divided into subsets, an analytic method is applied independently to each subset, and the outputs are recombined. This enables a large component of embarrassingly parallel computation, the fastest parallel computation. DeltaRho open-source software implements D&R. At the front end, the analyst programs in R. The back end is the Hadoop distributed file system and parallel compute engine. The goals of D&R are the following: access to thousands of methods of machine learning, statistics, and data visualization; deep analysis of the data, which means analysis of the detailed data at their finest granularity; easy programming of analyses; and high computational performance. To succeed, D&R requires research in all of the technical areas of data science. Network cybersecurity and climate science are two subject-matter areas with big, complex data benefiting from D&R. We illustrate this by discussing two datasets, one from each area. The first is the measurements of 13 variables for each of 10,615,054,608 queries to the Spamhaus IP address blacklisting service. The second has 50,632 3-hourly satellite rainfall estimates at 576,000 locations.
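
The divide, apply, and recombine steps described in the summary can be illustrated with a minimal Python sketch. It is not the DeltaRho API and not the authors' code: the function names (`divide`, `fit_slope`, `recombine`, `divide_and_recombine`), the toy data, the choice of a least-squares slope as the analytic method, and averaging as the recombination rule are all assumptions made for illustration. A local thread pool stands in for the Hadoop back end.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def divide(rows, n_subsets):
    """Divide: split the rows into n_subsets roughly equal subsets."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def fit_slope(subset):
    """Analytic method (hypothetical choice): least-squares slope of y on x."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def recombine(slopes):
    """Recombine (hypothetical choice): average the per-subset estimates."""
    return fmean(slopes)


def divide_and_recombine(rows, n_subsets=4):
    subsets = divide(rows, n_subsets)
    # Each subset is processed independently of the others, so this step
    # is embarrassingly parallel; the thread pool is only a stand-in for
    # a distributed compute engine.
    with ThreadPoolExecutor() as pool:
        slopes = list(pool.map(fit_slope, subsets))
    return recombine(slopes)


# Toy data lying exactly on the line y = 2x + 1 (illustrative only).
rows = [(float(x), 2.0 * x + 1.0) for x in range(100)]
estimate = divide_and_recombine(rows)
```

In general, an averaged per-subset estimate differs from the estimate computed on all of the data at once, so the choice of division and recombination methods affects statistical accuracy as well as speed.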