The advancement of the Internet of Things and big data analytics requires new methods for analyzing the large volumes of data produced by software systems, machines, and embedded sensors in application areas such as natural ecosystems, bioinformatics, smart homes, smart cities, smart cars, and airplanes. These methods must address the challenges of near-real-time collection, processing, indexing, analysis, and sharing of data from, and among, sensors, machines, and humans.
SmartMLchain has implemented CALSTATDN (US Patent No. 10,176,435), a patented machine learning model that iterates over a sequence of computing methods drawn from calculus (CAL), statistics (STAT), and database normalization (DN) to reduce both the error and the processing time of extremely large volumes of streaming data. Applied to a Smart Home Analytics system, the model improved performance by several orders of magnitude.
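Read purely as control flow, the iteration can be pictured as a loop that repeats the three-stage sequence until the error stops improving. The sketch below is an illustration under our own assumptions (the stage stubs, stopping rule, and tolerance are all hypothetical), not the patented implementation:

```python
# Illustrative control loop for the CAL -> STAT -> DN iteration.
# The stage stubs, stopping rule, and tolerance are assumptions made
# for exposition; they are not taken from the patent.

def calstatdn(data, stages, error_of, max_iters=10, tol=1e-3):
    """Repeat the CAL -> STAT -> DN sequence until error stops improving."""
    prev_error = float("inf")
    for _ in range(max_iters):
        for stage in stages:          # CAL, then STAT, then DN
            data = stage(data)
        error = error_of(data)
        if prev_error - error < tol:  # diminishing returns: stop iterating
            break
        prev_error = error
    return data

# Stand-in stages so the skeleton runs; concrete stages are sketched below.
identity = lambda d: d
result = calstatdn([1.0, 2.0, 3.0],
                   stages=[identity, identity, identity],
                   error_of=lambda d: 0.0)
```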
The model executes a sequence of stages. The first stage applies calculus (CAL) to compute variations over sensor data values captured in near-real time: the model captures how the data varies across space, time, and other dimensions by computing derivatives of one or more variables with respect to one or more others. The second stage computes generalizations over these values and derivatives using a statistics-based (STAT) model such as clustering, regression, or a support vector machine. The third stage models the generalized values as primary and foreign keys in a database normalization (DN) process, which yields normalized database tables and an efficient data partitioning scheme that eases parallel query execution in subsequent iterations. The model is applicable to any system of sensors and intelligent devices, and it significantly reduces the time needed to analyze data for insight while maintaining a high level of correctness.
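To make the stages concrete, here is a minimal, self-contained sketch in Python of a single CAL, STAT, DN pass over a simulated univariate sensor stream. The libraries, the cluster count, and the table layout are illustrative assumptions rather than details taken from the patent:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Simulated near-real-time sensor readings: (timestamp, value).
t = np.arange(0.0, 10.0, 0.01)
values = np.sin(t) + 0.05 * np.random.randn(t.size)

# CAL: rate of change of the sensor value with respect to time.
dv_dt = np.gradient(values, t)

# STAT: generalize over (value, derivative) pairs with clustering.
features = np.column_stack([values, dv_dt])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# DN: the cluster id becomes the primary key of a small dimension
# table, a foreign key in the fact table of raw readings, and the
# partitioning column for parallel query execution.
readings = pd.DataFrame({"ts": t, "value": values, "dv_dt": dv_dt,
                         "cluster_id": labels})           # fact table
clusters = (readings.groupby("cluster_id")[["value", "dv_dt"]]
            .mean().reset_index())                        # dimension table
partitions = dict(tuple(readings.groupby("cluster_id")))  # one partition per key
```

In a subsequent iteration, each partition can be queried and re-analyzed in parallel with the same CAL and STAT steps, which is where the claimed reduction in processing time comes from.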
The diagram shows how the CALSTATDN model exploits the calculus concepts of rates of change, differentiation, and integration, together with statistical generalization over sets of values, to derive the best hypothesis g(x). The derived hypothesis explains the behavior of any function f(x) using fewer data points and with lower generalization error on unseen data points than conventional machine learning methods. The final hypothesis is then applied to large volumes of data collected from the Internet of Things (IoT) and other sources, and the data goes through further iterations of analysis based on the calculus, statistics, and data normalization models to generate the final output with the best accuracy.
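One plausible reading of this, stated in our own notation rather than the patent's: given samples (x_i, f(x_i)) and estimated derivatives, the STAT stage selects the hypothesis that matches both the observed values and the observed rates of change, with a weight λ on the derivative term:

```latex
g \;=\; \arg\min_{h \in \mathcal{H}} \;\sum_{i=1}^{n}
  \left[ \bigl( h(x_i) - f(x_i) \bigr)^2
       + \lambda \bigl( h'(x_i) - \widehat{f'}(x_i) \bigr)^2 \right]
```

Because each sample constrains both the value and the slope of h, fewer samples are needed to pin down g than when fitting values alone, which is consistent with the claim of deriving g(x) from fewer data points.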