The Application of Gray Model and Support Vector Machine in the Forecast of Online Public Opinion

The forecast of online public opinion is a complex forecasting problem characterized by poor information, small samples and uncertainty. To improve forecasting accuracy, a new forecasting method based on a gray model and a support vector machine is proposed. The method first pretreats the network data by clustering the text, extracting the hotspots and aggregating the data. It then builds a GM (1, 1) model for the time series of online public opinion, corrects the GM (1, 1) forecasts with a support vector machine, and is tested through a simulation experiment. The experimental results show that, compared with traditional forecasting methods, combining the gray model and the support vector machine improves the accuracy of online public opinion forecasting and offers a new method for it.

I. INTRODUCTION
Online public opinion, also known as network public opinion, refers to the opinions or remarks, carrying a certain influence and tendency, that netizens express through the Internet on social public affairs, especially hot social issues [1][2]. With the rapid development of the Internet in China, the network has become one of the main carriers of social public opinion [3]. China's economic and social development is currently at a crucial stage in which deeply rooted contradictions and problems keep arising, so hotspots of online public opinion emerge one after another, spanning broad regions and extensive content. In this situation, negative online public opinion can seriously harm national security and social stability if it is not guided and supervised correctly [4]. Accurately forecasting the development trend of online public opinion has therefore become a research hotspot.
In recent years, a growing number of studies have focused on the forecast of online public opinion. They fall into two broad categories: traditional forecasting methods and modern forecasting methods. Traditional methods convert the online public opinion data into a time series and build the model with time series forecasting techniques such as autoregressive moving average and exponential smoothing. These methods are simple and easy to implement. However, they assume that online public opinion changes linearly, which is inconsistent with its actual behavior, so the forecasting results are not ideal. Modern methods build the model on nonlinear theory and improve forecasting accuracy over traditional methods. The main models include the Hidden Markov Model [5], the Galam (G) model [6], intuitionistic fuzzy reasoning [7] and the support vector machine [8][9]. Forecasting online public opinion is an uncertain problem with poor information and small samples. To improve accuracy further, some scholars have proposed combined forecasting models based on combination optimization theory and the advantages of each single model. For example, Zhang Jue put forward an online public opinion forecasting model based on ARIMA and a BP neural network and achieved good forecasting results [10].
Gray forecasting theory [11] was first proposed by the Chinese scholar Deng Julong in 1982. It studies uncertain systems with "small samples" and "poor information", in which part of the information is known and part is unknown. The GM (1, 1) model, an important component of gray forecasting theory, requires little data to build. The support vector machine (SVM) is a modern machine learning algorithm based on statistical learning theory (SLT). It is designed for small-sample, uncertain forecasting problems and is widely applied in nonlinear time series forecasting.
In this study, the gray model is combined with the support vector machine and applied to the forecast of online public opinion. First, GM (1, 1) is used to establish the forecasting model of online public opinion. Second, the forecasting result of GM (1, 1) is corrected by the support vector machine. Finally, the performance of the model is verified by a simulation experiment.
... to collect the various information sources of online public opinion [12][13]. However, the data acquired in this way is messy and disordered, and must be converted into usable data by text clustering.

A. Text Clustering
The hierarchical clustering algorithm [14][15][16] is used in this study to cluster the data of network public opinion, and the quality of the clustering is evaluated with the purity index. After text clustering, the purity of cluster $r$ is defined as follows:
$$P(S_r) = \frac{1}{n_r} \max_i \, n_r^i$$
In the formula, $n_r$ is the number of documents in the $r$-th cluster, and $n_r^i$ is the number of texts of predefined category $i$ that were assigned to the $r$-th cluster.
The overall purity of the text clustering result is then defined as follows:
$$P = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$$
In the formula, $k$ is the number of clusters and $n$ is the total number of documents.
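The purity index can be illustrated with a minimal Python sketch. It assumes the standard definition of purity (the share of a cluster taken by its dominant category, averaged over clusters with weights proportional to cluster size). The function names and the sample labels are hypothetical and are not taken from the study.

```python
from collections import Counter

def cluster_purity(labels_in_cluster):
    """Purity of one cluster: share of its documents in the dominant category."""
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)

def overall_purity(clusters):
    """Size-weighted average of the per-cluster purities."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_purity(c) for c in clusters)

# Hypothetical clustering result: each inner list holds the predefined
# category of every document that ended up in that cluster.
clusters = [
    ["sports", "sports", "sports", "politics"],
    ["politics", "politics", "economy"],
    ["economy", "economy", "economy"],
]
per_cluster = [cluster_purity(c) for c in clusters]
total = overall_purity(clusters)
```

A purity close to 1 means that almost every cluster is dominated by a single predefined category.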

B. Hotspot Acquisition
A hot topic on the network is a body of information that uses the network as its communication medium, receives wide and sustained attention from a certain group of people, and reflects the state of online public opinion [17][18]. The process of hotspot acquisition is as follows:
(1) The reporting frequency, continuous reporting time and network click rate of the topic are adopted as the characteristics of the hotspot topic, and statistics are collected on them.
(2) The values of media attention and public attention are calculated.
(3) The weight balance factor and the threshold value are set, and the public attention is calculated.
(4) If the public attention exceeds the threshold value, the topic is a hotspot topic.
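The thresholding step can be sketched as follows. The paper does not give the formula that combines the two attention values, so the weighted sum below, the function names, and the default values of `balance` and `threshold` are all assumptions made for illustration only.

```python
def topic_attention(media_attention, public_attention, balance=0.5):
    """Assumed form: weighted sum of the two attention values.

    `balance` plays the role of the balance factor; the study does not
    state how the factor is actually applied.
    """
    return balance * media_attention + (1.0 - balance) * public_attention

def is_hotspot(media_attention, public_attention, balance=0.5, threshold=0.6):
    """A topic counts as a hotspot when its attention exceeds the threshold."""
    return topic_attention(media_attention, public_attention, balance) > threshold

# Hypothetical attention scores scaled to [0, 1].
hot = is_hotspot(0.9, 0.7)
cold = is_hotspot(0.2, 0.3)
```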

C. Data Aggregation
Data aggregation software organizes the collected online public opinion information from its different vectors and converts it into a discrete time series for each hotspot topic.
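As a rough illustration of this conversion, the sketch below counts posts per day to form a discrete time series. The daily granularity, the function name and the sample dates are assumptions. The study does not describe the aggregation software it used.

```python
from collections import Counter
from datetime import date, timedelta

def daily_post_counts(post_dates, start, end):
    """Count posts per day from start to end inclusive; empty days yield 0."""
    counts = Counter(post_dates)
    days = (end - start).days + 1
    return [counts.get(start + timedelta(days=i), 0) for i in range(days)]

# Hypothetical posting dates for one hotspot topic.
posts = [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 3)]
series = daily_post_counts(posts, date(2024, 1, 1), date(2024, 1, 4))
```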

III. FORECASTING MODEL OF ONLINE PUBLIC OPINION BASED ON GM (1, 1) AND SUPPORT VECTOR MACHINE
The gray model can reveal the development trend of the data but is not suited to forecasting time-varying and nonlinear data, whereas the support vector machine is well suited to describing nonlinear, small-sample data series. Combining the advantages of both therefore yields a forecasting model of online public opinion based on GM (1, 1) and the support vector machine.

A. GM (1, 1) Model
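In recent years, the gray model GM (1, N) has been widely applied and studied. GM (1, 1), the simplest and most common gray model, consists of a first-order differential equation in a single variable and is a special case of GM (1, N). Assume that the original data series used to build GM (1, 1) is $X^{(0)}$, that is to say:
$$X^{(0)} = \{x^{(0)}(1), x^{(0)}(2), \ldots, x^{(0)}(n)\}$$
The original data series is accumulated once to generate the 1-AGO series:
$$X^{(1)} = \{x^{(1)}(1), x^{(1)}(2), \ldots, x^{(1)}(n)\}$$
In the formula,
$$x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i), \quad k = 1, 2, \ldots, n$$
So, the gray differential equation model of GM (1, 1) is as follows:
$$\frac{dx^{(1)}}{dt} + a x^{(1)} = u \tag{6}$$
where $a$ is the development coefficient and $u$ is the gray input. Replacing the derivative by the difference $x^{(1)}(k+1) - x^{(1)}(k) = x^{(0)}(k+1)$ and $x^{(1)}$ by the mean $\frac{1}{2}[x^{(1)}(k) + x^{(1)}(k+1)]$ gives:
$$x^{(0)}(k+1) = -\frac{a}{2}\left[x^{(1)}(k) + x^{(1)}(k+1)\right] + u, \quad k = 1, 2, \ldots, n-1$$
The above equation can be converted into the matrix equation:
$$Y_N = B P$$
In the formula, $B$ is the data matrix, $Y_N$ is the data vector and $P$ is the parameter vector, that is to say:
$$B = \begin{bmatrix} -\frac{1}{2}[x^{(1)}(1) + x^{(1)}(2)] & 1 \\ -\frac{1}{2}[x^{(1)}(2) + x^{(1)}(3)] & 1 \\ \vdots & \vdots \\ -\frac{1}{2}[x^{(1)}(n-1) + x^{(1)}(n)] & 1 \end{bmatrix}, \quad Y_N = \begin{bmatrix} x^{(0)}(2) \\ x^{(0)}(3) \\ \vdots \\ x^{(0)}(n) \end{bmatrix}, \quad P = \begin{bmatrix} a \\ u \end{bmatrix}$$
The parameters are estimated by the least squares method:
$$\hat{P} = (B^T B)^{-1} B^T Y_N$$
The obtained coefficients are put into formula (6), and the differential equation is solved to obtain the GM (1, 1) model:
$$\hat{x}^{(1)}(k+1) = \left(x^{(0)}(1) - \frac{u}{a}\right) e^{-ak} + \frac{u}{a}$$
The forecast of the original series is restored by $\hat{x}^{(0)}(k+1) = \hat{x}^{(1)}(k+1) - \hat{x}^{(1)}(k)$. The model is checked with the relative residual $\varepsilon(k)$, that is to say:
$$\varepsilon(k) = \frac{\left|x^{(0)}(k) - \hat{x}^{(0)}(k)\right|}{x^{(0)}(k)} \times 100\%$$
The smaller the residual, the higher the accuracy of the model. The general requirement is $\varepsilon(k) < 20\%$, and the best condition is $\varepsilon(k) < 10\%$.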
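The GM (1, 1) procedure described above can be sketched in plain Python. The sketch assumes the standard GM (1, 1) formulation, with the symbols `a` (development coefficient) and `u` (gray input) following common usage. The function names and the sample series are hypothetical, and the study's own implementation was written in MATLAB. The two-parameter least squares problem is solved in closed form.

```python
import math

def fit_gm11(x0):
    """Estimate (a, u) of GM(1,1) from the raw series x0 by least squares."""
    n = len(x0)
    # 1-AGO series: running sums of the raw series.
    x1, total = [], 0.0
    for v in x0:
        total += v
        x1.append(total)
    # Rows of B are [-z(k), 1], where z(k) is the mean of consecutive
    # 1-AGO values; Y_N holds x0(2), ..., x0(n).
    z = [0.5 * (x1[k - 1] + x1[k]) for k in range(1, n)]
    y = x0[1:]
    m = n - 1
    szz = sum(zk * zk for zk in z)
    sz = sum(z)
    szy = sum(zk * yk for zk, yk in zip(z, y))
    sy = sum(y)
    # Closed-form solution of the 2x2 normal equations (B^T B) P = B^T Y_N.
    det = m * szz - sz * sz
    a = (sz * sy - m * szy) / det
    u = (szz * sy - sz * szy) / det
    return a, u

def forecast_gm11(x0, a, u, steps=0):
    """Fitted values for x0 plus `steps` forecasts beyond the sample."""
    n = len(x0) + steps
    # Time response of the accumulated series, then restored by differencing.
    x1_hat = [(x0[0] - u / a) * math.exp(-a * k) + u / a for k in range(n)]
    return [x0[0]] + [x1_hat[k] - x1_hat[k - 1] for k in range(1, n)]

def relative_residuals(x0, x0_hat):
    """Relative residual in percent for each in-sample point."""
    return [abs(x - xh) / x * 100.0 for x, xh in zip(x0, x0_hat)]

# Hypothetical post counts growing 10% per period (not data from the study).
posts = [100.0 * 1.1 ** k for k in range(6)]
a, u = fit_gm11(posts)
fitted = forecast_gm11(posts, a, u, steps=2)
residuals = relative_residuals(posts, fitted)
```

On a smooth exponential series such as this one the residuals stay well below the 10% level. Real post counts are noisier, which is the motivation for correcting the GM (1, 1) output with a second model.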

B. Model of Support Vector Machine
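The standard support vector machine must solve a quadratic programming problem, and the more complex that problem is, the slower the computation. The least squares support vector machine (LSSVM) modifies the support vector machine model and reduces the complexity of the solution, so it requires fewer computing resources and solves and converges quickly. Therefore, LSSVM is adopted as the forecasting model in this study. For the time series of online public opinion, the regression function of LSSVM is as follows:
$$f(x) = w^T \varphi(x) + b$$
In the formula, $w$ is the weight vector and $b$ is the bias constant.
According to the principle of structural risk minimization, the least squares support vector machine model for the regression problem is as follows:
$$\min_{w, b, \xi} J(w, \xi) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{n} \xi_i^2$$
The constraint condition is as follows:
$$y_i = w^T \varphi(x_i) + b + \xi_i, \quad i = 1, 2, \ldots, n$$
In the formula, $\gamma$ is the regularization parameter and $\xi_i$ is the slack variable.
Lagrange multipliers are introduced to obtain the following formula:
$$L(w, b, \xi, \alpha) = J(w, \xi) - \sum_{i=1}^{n} \alpha_i \left[w^T \varphi(x_i) + b + \xi_i - y_i\right]$$
In the formula, $\alpha_i$ $(i = 1, 2, \ldots, n)$ are the Lagrange multipliers.
The following conditions are obtained according to the KKT (Karush-Kuhn-Tucker) conditions in optimization theory:
$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i \varphi(x_i), \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i = 0, \quad \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \alpha_i = \gamma \xi_i, \quad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow w^T \varphi(x_i) + b + \xi_i - y_i = 0$$
Eliminating $w$ and $\xi$ gives the final solution:
$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$$
In the formula, $y = [y_1, \ldots, y_n]^T$, $\mathbf{1} = [1, \ldots, 1]^T$, $\alpha = [\alpha_1, \ldots, \alpha_n]^T$ and $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j)$. According to the Mercer condition, the kernel function is defined as follows:
$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$
Introducing the kernel function into the solution yields the forecasting model of LSSVM:
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b$$
The radial basis function (RBF) is widely applicable and represents time series problems better than other kernel functions, so in this paper the RBF is used as the kernel function of LSSVM. Its expression is as follows:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
In the formula, $\sigma^2$ is the kernel width of the RBF.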
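A minimal LSSVM regression with an RBF kernel can be sketched in plain Python as follows. The sketch assumes the standard LSSVM linear system for the bias `b` and the multipliers `alpha`, and an RBF kernel of the form exp(-||x - x'||^2 / (2 sigma^2)). The function names, the default `gamma` and `sigma`, and the sample data are hypothetical, and the study's own implementation was written in MATLAB.

```python
import math

def rbf(xi, xj, sigma):
    """RBF kernel between two feature vectors (assumed 2*sigma^2 scaling)."""
    d2 = sum((p - q) ** 2 for p, q in zip(xi, xj))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_lssvm(X, y, gamma=100.0, sigma=1.0):
    """Build and solve the LSSVM linear system; returns (b, alpha)."""
    n = len(X)
    A = [[0.0] + [1.0] * n]                 # first row: [0, 1^T]
    for i in range(n):
        row = [1.0] + [rbf(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma           # kernel matrix plus I / gamma
        A.append(row)
    sol = solve_linear(A, [0.0] + list(y))
    return sol[0], sol[1:]

def predict_lssvm(x, X, b, alpha, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X))

# Hypothetical one-dimensional training data (not from the study).
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.1, 0.4, 0.5, 0.3, 0.2]
b, alpha = fit_lssvm(X, y, gamma=100.0, sigma=1.0)
fit = [predict_lssvm(x, X, b, alpha, sigma=1.0) for x in X]
```

For time series forecasting, each row of `X` would typically hold a window of past values and `y` the value that follows. That windowing scheme is a common choice rather than something the paper specifies.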

IV. EXPERIMENTAL RESULT AND ANALYSIS
To verify the performance of the gray model and support vector machine in the forecast of online public opinion, the algorithm is implemented in MATLAB on a machine with an Intel Core i5 3.2 GHz CPU and 4 GB RAM running Microsoft Windows Server 2003. A hot topic on the Internet is selected and 30 data points on the number of relevant posts are obtained, as shown in Figure 1. To speed up model training and better reflect the variation trend of the online public opinion, the time series is pretreated by normalization as follows:
$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$
In the formula, $x_i'$ is the data after normalization, $x_i$ is the original data, and $x_{\max}$ and $x_{\min}$ represent the maximum and minimum of the time series of online public opinion, respectively.
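The normalization step can be sketched as follows, assuming the usual min-max scaling to the interval [0, 1]. The function names and the sample values are hypothetical. The inverse mapping is included because forecasts made on the normalized scale have to be converted back to post counts.

```python
def min_max_normalize(series):
    """Scale a series to [0, 1] using its own minimum and maximum."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def denormalize(values, lo, hi):
    """Map normalized values back to the original scale."""
    return [v * (hi - lo) + lo for v in values]

# Hypothetical post counts (not data from the study).
raw = [10.0, 20.0, 40.0]
scaled = min_max_normalize(raw)
restored = denormalize(scaled, min(raw), max(raw))
```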
The data is divided into two parts. The first 22 data points are used as the training sample set, and the last 8 as the test sample set. The test sample set is then forecast by each of several models. The forecasting results are shown in Figure 2.
The forecasts for the test samples in Figure 2 show that the deviations between the results of the single gray model and the single support vector machine model and the actual values are large, and the errors are quite high. This indicates that a single model can explain only fragments of the variation rules of complex online public opinion and has difficulty describing its time-varying and nonlinear behavior completely and accurately. The combined model of GM (1, 1) and the support vector machine, by contrast, draws on the advantages of both to overcome the defects of a single model. It accounts for the poor-information and small-sample characteristics of online public opinion, captures its time-varying and uncertain behavior as well as its variation trend, and thereby improves the forecasting accuracy.

V. CONCLUSIONS
Online public opinion is a complex, time-varying system with strong burstiness and volatility. If it can be forecast accurately, particularly for hot topics that attract the attention of most netizens, the relevant departments can identify potential risks in time, study and respond to online public opinion actively, improve their ability to communicate with the public, and guide online public opinion toward healthy development. To improve forecasting accuracy, a forecasting model of online public opinion combining a gray model and a support vector machine was proposed, which targets the characteristics of online public opinion and draws on the advantages of both models. The experimental results show that the combination of the gray model and the support vector machine not only improves the forecasting accuracy effectively and makes up for the deficiencies of a single forecasting model, but also provides a new idea for the study of online public opinion forecasting.