Big data classification is an important step in the efficient analysis of large datasets. Highly parallelized learning algorithms make such classification efficient, and one way to obtain them is to apply parallel computation to the Extreme Learning Machine Tree (ELM-Tree) model. In the ELM-Tree model, decision tree nodes are split based on uncertainty measures such as information entropy and ambiguity. Two problems need to be addressed: over-partitioning, caused by compressing the data into a fixed amount of memory, and high computation time, caused by the large number of iterations. Over-partitioning can be overcome by embedding ELMs as leaf nodes when the gain ratios of all available splits fall below a given threshold. The input weights in an ELM are assigned randomly in order to approximate the training instances; this approach can be improved by optimizing the weights, which can be achieved only by identifying the optimal cut-points for each attribute. Similarly, calculating the information gain and gain ratio of all attributes and their cut-points increases the computation time. This time can be reduced by optimally scheduling the computation tasks on the available host nodes. In this paper, optimization algorithms are first used to optimize the cut-points for each attribute, which helps in determining the optimal weights of the attributes. The genetic algorithm, Particle Swarm Optimization (PSO), and the firefly optimization algorithm are used to determine the optimal cut-points. The available cut-points of an attribute are randomly assigned to efficient host nodes using the scheduling algorithm, and the cut-point with the best gain values is chosen as the optimal cut-point. The scheduling algorithm is invoked during this process to schedule the cut-point gain computation tasks efficiently.
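To make the gain-based split criterion concrete, the following is a minimal sketch of how the information gain and gain ratio of a candidate cut-point on a continuous attribute might be computed, and how the threshold test for embedding an ELM leaf could look. It assumes the standard C4.5-style definitions of entropy and gain ratio and binary splits at midpoints between distinct values; the function names (`gain_ratio`, `best_cut_point`, `should_embed_elm_leaf`) are hypothetical and are not taken from the paper, and the exhaustive search shown here stands in for the genetic, PSO, or firefly search.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, cut):
    """Gain ratio of the binary split 'value <= cut' versus 'value > cut'."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    p_left, p_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
    split_info = -(p_left * math.log2(p_left) + p_right * math.log2(p_right))
    return info_gain / split_info


def candidate_cut_points(values):
    """Midpoints between consecutive distinct attribute values (an assumption)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_cut_point(values, labels):
    """Exhaustively pick the cut-point with the highest gain ratio."""
    cuts = candidate_cut_points(values)
    if not cuts:
        return None, 0.0
    return max(((c, gain_ratio(values, labels, c)) for c in cuts),
               key=lambda pair: pair[1])


def should_embed_elm_leaf(values_by_attribute, labels, threshold):
    """True when every attribute's best gain ratio is below the threshold."""
    return all(best_cut_point(values, labels)[1] < threshold
               for values in values_by_attribute)
```

Each call to `gain_ratio` is independent of the others, which is what allows the per-cut-point computations to be distributed across host nodes.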
Computing the information gain and gain ratio of the attributes is time-consuming, and efficient scheduling reduces this time. The task scheduling algorithm uses the optimization algorithms to decide which nodes perform a task efficiently and allocates tasks randomly among those nodes. Rather than sending all the data to the host nodes, this strategy sends only randomly selected data, which reduces the overall number of task iterations and thus the computation time. Experimental results show that the presented technique effectively selects the optimal cut-points and reduces the computation time.
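As an illustration of the scheduling idea, the sketch below assigns cut-point gain computation tasks to host nodes so that the latest finishing time (the makespan) stays small. It is not the optimizer-driven scheduler described above: a simple greedy heuristic (largest task first, placed on the host that would finish it earliest) is substituted for the optimization algorithms, and the task costs, host speeds, and the name `schedule_tasks` are all hypothetical.

```python
def schedule_tasks(task_costs, host_speeds):
    """Greedily assign tasks to hosts to keep the makespan small.

    task_costs:  dict mapping a task name to its estimated cost (assumed known).
    host_speeds: list of relative host speeds; at least one host is assumed.
    Returns (assignment, makespan), where assignment maps a host index to the
    list of task names placed on that host.
    """
    loads = [0.0] * len(host_speeds)
    assignment = {h: [] for h in range(len(host_speeds))}

    # Place the most expensive tasks first so small tasks can fill the gaps.
    for task, cost in sorted(task_costs.items(), key=lambda kv: kv[1], reverse=True):
        # Choose the host on which this task would finish earliest.
        host = min(range(len(host_speeds)),
                   key=lambda h: loads[h] + cost / host_speeds[h])
        loads[host] += cost / host_speeds[host]
        assignment[host].append(task)

    return assignment, max(loads)


# Four cut-point tasks with illustrative costs, two equally fast hosts.
assignment, makespan = schedule_tasks(
    {"cut_1": 4.0, "cut_2": 3.0, "cut_3": 2.0, "cut_4": 1.0},
    [1.0, 1.0],
)
```

In this example both hosts end up with a total load of 5.0, compared with 10.0 if a single node evaluated every cut-point.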
S. Gayathri Devi and M. Sabrigiriraj. An Efficient Method for Big Data Classification.
DOI: https://doi.org/10.36478/ajit.2016.5051.5059
URL: https://www.makhillpublications.co/view-article/1682-3915/ajit.2016.5051.5059