19th of August, 2014

Machine Learning is now quite a matured discipline and captures in general any activity that involves automated learning from data or experience. At the core of ML is the ability of a software or machine to improve the performance of certain tasks through being exposed to data and experiences. Typical machine learning model first learns the knowledge from the data it is exposed to and then applies this knowledge to deliver predictions about the new, same type but previously unseen data [1]- [7].

The quality of the predictions delivered by ML model depend on a number of factors: how well the relevant knowledge is represented by the model, how complete and representative were the learning data, how predictable is the problem in general i.e. how explanative are the input data about the object of predictions and many others. Machine learning became very good at certain recognition, identification or categorisation tasks like fingerprint, faces or voice recognition [1]- [4]. Likewise recent clustering algorithms became very good at automatically grouping people according to their profiles, forming communities, identify market segments and even segment images or distinguish genes [1], [6]. The successful application of ML models to these problems was possible not only thanks to the learning model ingenuity but primarily thanks to the very careful identification, transformation and generation of multidimensional data that made the learning problem discriminative enough to make easy distinctions between the predicted targets based on the data [1], [2], [6].

To learn to distinguish between an elephant and a mouse it is sufficient to just compare their weight but to learn to distinguish a spam from genuine emails requires a lot of complex analysis and extraction of explanative complementary data features that make recognition possible. Generalising on this point and in line with the recent observations it becomes clear that attaining a highly explanative data is crucial for learning and the evidence shows that it is more efficient to generate quality explanative data and then use simple learning models than throwing all kinds of data at complex learning models and hoping for the best [1], [4], [5].

Depending on what is the depth of knowledge that is available for learning, ML models can be categorised into supervised, unsupervised and semi-supervised learning algorithms [1], [3], [5], [7].

Supervised learning algorithms use labelled training examples. The input data and the target outputs are given explicitly for the model to learn the mapping or function between them. Once this is captured the model then uses the learned mapping and the new unseen input data to predict their outputs. Within supervised learning family we can further distinguish between classification models which focus on prediction of discrete (categorical) outputs or regression models which predict continues outputs. Among large number of models reported in the literature linear and nonlinear density based classifiers, decision trees, naïve Bayes, support vector machines, neural networks and nearest neighbour are the most frequently cited and applied in practical applications [1]- [5].

Decision trees despite their simplicity were particularly successful in various industrial applications due to their robustness, transparency and the portability of the model effectively resulting in a set of SQL queries. Despite their operational simplicity induction a decision tree is at best O(n log n) complex and still requires significant data reduction effort to handle very large data [1]- [5].

Support vector machines (SVM) were also reported as very powerful in terms of achieved performance on the numerical data thanks to their explicit effort to maximise the margin of misclassification along the boundaries among the classes of data [1]- [5]. Traditional SVMs are quadratically complex O(n2), yet there are successful attempts to cleverly reduce their complexity to linear complexity [15]. We intend to expand on these developments.

Neural networks (NN) are yet another example of very powerful, flexible and robust predictors for both classification and regression problems [1]- [5]. Its generic structure allows to learn virtually any function successfully with enough hidden layers of processing nodes and training data. Like for SVM, NNs are typically O(n3) or at best quadratically complex with the number of examples. Complex structures further expand computational demands of learning NN. There have been recent attempts to redesign NN into more scalable architectures [16], further work is required to make NNs fully scalable for big data processing.

Figure 1 captures a brief demonstration of a supervised learning applied to the famous Iris species detection problem based on just 2-dimensional data of petal width and petal length:

Figure 1. Visualisation of the supervised learning: detecting iris species from petal width and length. |

Unsupervised learningtypically referred to as clustering is concerned with recognising natural grouping among the data i.e. separating similar from dissimilar data based on multidimensional data. Figure 2 visually depicts some examples of with k-means and mixture of Gaussians clustering methods

Many clustering methods from k-means through density based DBSCAN and hierarchical clustering models were reported for simple problems on small data sizes [6], [10]- [13]. Traditional hierarchical clustering usually involves computation of the pairwise distances among the input data examples which normally involves quadratically complex computation step. Assigning data samples to different clusters based on the distance hierarchy poses yet another computational overhead.

DBSCAN is one of the most cited clustering algorithms. It is based on the data density and uses the concept local reachability among points to decide about cluster memberships. However, even if optimised with efficient indexing structure it is at best O(n log n) complex in the number of examples [6].

K-means clustering, on the other hand, is linearly complex in the number of examples and data dimensionality but assumes fixed number of algorithm iterations and fixed number of clusters. Due to its linear complexity it is however the most scalable clustering model to date which also evolved into a successful artificial neural network equivalent – the Kohonen Nets [6]. Some recent research reported further improved scalability of k-means model when the average aggregation is replaced with the median [10].

We intend to expand on the promising scalable supervised and unsupervised learning models as well as develop the new models that flexibly manage the balance between the computational complexity of learning and the prediction performance through modularisation and recurrence at different data granularity. ** **

There is an immense body of evidence demonstrating applications of both supervised and unsupervised machine learning models reported in the literature and across different industries [1]- [5], [21]- [28]. The refocus of ML research from accuracy and complexity towards scalability and deployment on big data is already evident in a very recent work: [8]- [20]. It is our aim to be on the forefront of the innovation in this fast-paced area by significantly enhancing the applicability of ML on large, distributed data resources and thereby enable a truly new breed of applications that require access and processing of large scale distributed data resources.

Figure 2. Visualisation of the unsupervised learning (clustering): a pair Gaussians fitted after 20 epochs with Gaussian k-means clustering (left), mixture of Gaussian clusters applied to the banana shape artificaial dataset (right). |

Reinforcement learning lies somewhat in between supervised and unsupervised learning in a sense that learning is not done on the fully labelled or evaluated data but a hint in a form of local reward is provided to every actionable data. The objective of the reinforcement learning is to maximise the long term reward through exploration and learning the optimal action policy in response to the environmental data. Reinforcement learning found prime applications in robotics, agent technologies and many complex exploration systems requiring continuous response to the changing environment while maximising strategic longer-term objectives. Figure 3 below illustrates the typical reinforcement learning problem applied to learn the optimal checkers play strategy. The strategy is continuously refined through the experience and observation of what movements lead to immediate and long term rewards (winning the game). Our interest in supervised learning lies in devising novel scalable reinforcement learning methods suitable for big data scales and robust hierarchical representations of the reinforcement learning problems.

Figure 3. Reinforcement learning example: checkers. The agent continuously improves his strategy to move in response to the board state by learning from examples that give him advantage (short-term reward) and ultimately lead to winning the game (long-term reward). |

The emergence of the notion of Big Data (BD) initiated the true revolution in the way we store, manage and process the data. Coming now in Exabytes per day big data and its popularity continue to grow exponentially as it can be seen from the Google Trend chart presented in Figure 4.

Figure 4. Exponential growth of the “big data” popularity on google trends, along geographical distribution and the most popular related terms, source http://www.google.com/trends/. |

Big data are linked to the explosive growth in mobile devices and their ability to generate, collect, share and access massive amounts of textual, numeric, imagery and video data. Combined with a correlated expansion of internet resources, networked services and the cult of sharing on ever growing social networks what we are experiencing is a true data revolution.

Published in 2013, Gartner’s Emerging Technologies Hype Cycle, shown in Figure 5, indicates that Big Data hype is currently at the peak of possibly inflated expectations and there is long way before it gets efficiently consumed across the portfolio of matured applications.

Figure 5. Gartner’s emerging technologies hype cycle for 2013, source http://www.gartner.com/newsroom/id/2575515. |

Irrespective of the level to which big data is just a buzz word, the truth is that due to its exponential growth, stored big data coming in a variety of heterogeneous, imprecise and noisy data significantly outpaced our progress in algorithmic ability to manage, model and learn from it. Big companies, service providers and research centres are in a race to try to tap into the immense big data resources but these efforts were hampered by slow access to the data and inability to process data and extract useful insights from it within reasonable timeframe. In fact even basic access queries against petabytes of data records is likely to stall traditional database management system.

This situation led to a paradox of individual computer or database systems sitting on ever growing massive amounts of data but not being able to extract value from it. Attempts to resolve this problem through parallel database systems failed to scale up data infrastructure efficiently as it still followed the vertical data infrastructure in which improvements were directed only towards costly and ever changing critical components, typically involving servers and memory.

The breakthrough change came from the data technology leader Google, who effectively had no choice but to develop a novel technology to store, access and process immense data from web searches in real time. Google solution was to link their home-grown distributed file system with the distributed Map Reduce processing model that were jointly capable to run processing tasks against massive data on large clusters of cheap “commodity computing” resources. Apache Hadoop very quickly emerged as an open source software framework that implements Map-Reduce model for large-scale storage and processing of big data on virtual clusters of commodity hardware. All these new technologies effectively enabled the old-school parallel processing models through the efficient distribution and management of data that ensures that it is at all times as close to the processing units as possible to eliminate access and distribution delays.

Various implementations of Map Reduce model have been proposed, however the open-source Apache Hadoop framework appears to be the top choice for both academic research and commercial applications and it successfully provides the core capability of big data technology: the storage, fast access and large scale parallel processing of/on massively distributed big data resources.

The core abilities of the data analytics and the machine learning to model, learn from, and predict data were very deeply affected by the emergence of big data. Almost in an instant the challenge came from all sides: accessing large amounts of data, learning the models from it and carry out mass predictions all in a reasonable time.

Since big data processing requires decomposition, parallelism, modularity and/or recurrence, inflexible black-box type ML models failed in an outset. In fact all ML algorithms with computational complexity of O(n2)+ immediately become intractable when faced with billions of data points. Quadratically complex ML models actually account for very large and significant part of machine learning. All pairwise proximity or distance based models including kernel methods unfortunately fall into this category. Stochastic, “simulatory” learning also becomes intractable as the required number of samples to approximate or model the input becomes astronomical. Even worse outlook emerges for the evolutionary (genetic) models working with a whole population of solutions which in these new conditions is an unaffordable luxury.

The obvious question emerges therefore: how can we learn from big data and is there actually any advantage of learning from big data?

There is an evidence that for a certain class of problems, notably where the input data and hence the approximating function come in large number of varieties, learning from very large data set significantly improves the predictive performance. Already in 2001, two researchers from Microsoft (Banko and Brill) [29] illustrated that the complex problem of automated words disambiguation keeps improving after learning from beyond billion of words.

Figure 6. Learning from big data significantly improves words disambiguation, source Banko and Brill
[29]. |

If however the objective of learning or prediction is trivial for example guarded by some normal distribution, submitting huge number of input samples for learning beyond millions would be virtually no better than using thousands. What it comes down to is the complexity of the input space, and using big data appears to be particularly suitable for problems where the input space exhibit large varieties. The core human cognitive abilities to learn by using natural language and recognize complex structures from images are exactly these type of problems with huge variety in the input space and this is where big data are expected to provide huge improvements. Overwhelmed with the ease humans deal with these complex cognitive tasks, human brain continues to provide guidance as to how to efficiently learn from such a large variety of complex, uncertain data coming in multiple forms.

The emergence of the Machine Learning introduced a set of new qualities and rigor that a new ML model will have to meet. These requirements can be divided into a set that simply enables the application of the Ml model to very large data and another not necessarily exclusive group of requirements to actually deliver anticipated predictive performance improvements from using big data. The key enabler requirement here is that ML model allows problem partitioning and follows a conquer-and-divide approach to solve the learning problem. Ideally if the partitioning could be also applied to the learning algorithm itself such that the learning task could be decomposed into a sequence of much simpler tasks where the output from one task would feed as input to the next task etc. Such decomposition promotes modularity and recurrence and enables in fact a learning structure that is vital for so-called deep learning [30], recently reported as a very successful in learning from highly varying learning problems. As far as the computational processing is concerned, a strictly linear complexity in both data size and dimensionality is required to achieve the true future-proof model scalability. Robust linear models or simplified more complex models combined with the powerful so-called distributed data representations should be able to deliver new highly scalable learning paradigms which may see an evolution towards structure learning in addition to the traditional parameters learning.

BigML is a project run at EBTIC that aims at reenabling machine learning on big data and provide a generic capability for highly scalable learning from large distributed data resources to deliver novel high impact predictive services across multiple industries: telecom, health, energy, finance, education. The project scope aims at a complete rediscovery and reenablement of machine learning for big data environment across unsupervised, supervised and semi-supervised learning and along all the steps of ML methodology from data acquisition up to model fusion and predictions post-processing in both static and temporal contexts.

The project will focus on several modelling work-packages:

- PREDICTIVE DATA FILTERING
- RECURRENT HIGHLY SCALABLE LEARNING
- MULTITARGET PREDICTION MODELS

The outcomes from these work-packages will be combined together to deliver new breed of scalable, reusable predictive service capabilities:

- PREDICTIVE TEXT ANALYTICS
- PREDICTIVE SEQUENCE MODELLING
- PREDICTIVE CONTENT MANAGEMENT
- PREDICTIVE INVESTMENT AND RISK MANAGEMENT
- PREDICTIVE NETWORK ANALYTICS

The above capabilities will be used to deliver novel high impact applications within finance, telecom, health, education and energy industries which are of strategic importance for the UAE and its 2030 vision.

[1]. T.M. Mitchell. *Machine Learning*. McGraw-Hill, New York, 1997.

[2]. R.O. Duda, P.E. Hart and D.G. Stork. *Pattern Classification*. John Wiley & Sons, New York, 2001.

[3]. C.M. Bishop. *Pattern Recognition and Machine Learning*. Springer, New York, 2006.

[4]. S. Marsland. *Machine Learning, An Algorithmic Perspective*. Chapman and Hall / CRC Press, Boca Raton, 2009.

[5]. A. Mohri, A. Rostamizadeh and A. Talwalker. *Foundations of Machine Learning*. The MIT Press, Cambridge, 2012.

[6]. C.C. Aggarwal and S.K. Reddy. *Data Clustering, Algorithms and Applications*. Chapman and Hall / CRC, Boca Raton, 2014.

[7]. O. Chapelle, B. Cholkopf, A. Zien. *Semi-Supervised Learning (Adaptive Computation and Machine Learning Series)*. MIT Press, Cambridge, 2006.

[8]. J. Liebowitz. Big Data and Business Analytics. CRC Press, Boca Raton, 2013.

[9]. V. Mayer-Schonberger and K. Cukier. Big Data: A Revolution That Will Transform How We Live, Work and Think. Houghton Mifflin Harcourt Publishing, New York, 2013.

[10].YC Kwon, D. Nunley, J.P. Gardner, M. Balazinska, B. Howe, S. Loebman. Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster. Scientific and Statistical Database Management 6187: 132-150, 2010.

[11].A. Kumar, Y.Sabharwal, S. Sen. Linear-Time Approximation Schemes for Clustering Problems in Any Dimensions. Journal of ACM 57(2), article no. 5, 2010

[12].M. Suguyama and A. Yamamoto. A Fast and Flexible Clustering Algorithm Using Binary Discretization. Proc. of the 11th Int. Conf. on Data Mining, pp: 1212-1217, 2009

[13].R. Paredes, A. Ulges, T. Breuel. *Fast Discriminative Linear Models for Scalable Video Tagging*. Proc. Of the Int. Conf. on Machine Learning and Applications (ICMLA’09), pp: 571-576, 2009

[14].W.H. Hsu and S.R. Ray. Quantitative Model Selection for Heterogeneous Time Series Learning. AAAI Technical Report WS-86-16, pp 8-12, 1998.

[15].T. Joachims. Training Linear SVMs in Linear Time. Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’06), pp: 217-226, 2006.

[16].A. Zhu, H. He, J. A. Starzyk, C. Tseng. Self-Organizing Learning Array and its Application to Economic and Financial Problems. Information Sciences 177, pp: 1180-1192, Elsevier, 2007

[17].S. Singh, J. Kubica, S. Larsen, D. Sorokina. Parallel Large Scale Feature Selection for Logistic Regression. Proc. of the SIAM Int. Conf. on Data Mining (SDM’09), pp: 1172-1183, 2009

[18].G. Aloisio, S. Fiore, I. Foster, D. Williams. Scientific Big Data Challenges at Large Scale. White paper, Proc. of the Int. Workshop on Big Data and Extreme Computing, 2013

[19].K. Dembczynski, W. Waegeman, W. Cheng, E. Hullermeier. On Label Dependence and Loss Minimization in Multi-Label Classification. Machine Learning 88: 5-45, 2012.

[20].Y. Low, D. Bickson, J. Gonzales, C. Guestrin, A. Kyrola and J.M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. of the VLDB Endowment 5(8): 716-727, 2012.

[21].D. Ruta, P. Kazienko, P. Bródka. Network-Aware Customer Value in Telecommunication Social Networks.IC-AI 2009: 261-267

[22].D. Ruta, B. Majeed. Business Process Forecasting in Telecom Industry. IEEE/GCC Conference for Sustainable and Ubiquitous Technology, Dubai, UAE, 2011.

[23].D. Ruta, B. Gabrys and C. Lemke. A Generic Multilevel Architecture for Time Series Prediction. IEEE Trans. On Knowledge Data Engineering 23(3): 350-359, 2011.

[24].D. Ruta andB. Gabrys. A Framework for Machine Learning Based on Dynamic Physical Fields. Natural Computing 8(2): 219-237, 2009.

[25].D. Ruta, C. Adl, D. Nauck. Data Mining Strategies for Churn Prediction in Telecom Industry. In G.F. Wang (eds) Intelligent Data Analysis Through Pattern Discovery and Recovery. IGI Global, New York, pp 218235, 2008.

[26].D. Ruta. Mixture of Gaussians Model for Robust Pedestrian Images Detection. Proc. of the Eur. Conf. of the Artificial Intelligence (ECAI’08), pp 713-717, IOS Press, 2008.

[27].B. Gabrys, D. Ruta: Genetic Algorithms in Classifier Fusion. Appl. Soft Computing 6(4): 337-347, 2006.

[28].D. Ruta. Towards Automated Share Investment Systems. In Perception-based Data Mining and Decision Making in Economics and Finance, Springer, pp 135-153, 2007

[29].M. Banko and E. Brill. "Scaling to very very large corpora for natural language disambiguation." *Proceedings of the 39th Annual Meeting on Association for Computational Linguistics*. Association for Computational Linguistics, 2001.

[30].Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning 2(1):1-127, 2009.

Back© Copyright 2017 EBTIC. All Rights Reserved.