20 similar documents found; search took 31 ms
1.
Christian H. Weiß 《Statistics and Computing》2008,18(2):185-194
This article applies stochastic reasoning to association rule mining and provides a formal statistical view
of the discipline. A simple stochastic model is proposed, under which support and confidence are reasonable estimates
of certain probabilities of the model. Statistical properties of the corresponding estimators, such as moments and confidence
intervals, are derived, and items and itemsets are examined for correlations.
After a brief review of interestingness measures for association rules, with the main focus on measures motivated
by statistical principles, two new measures are described. These measures, called α- and σ-precision, respectively, rely on statistical properties of the estimators discussed before. Experimental results demonstrate
the effectiveness of both measures.
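The stochastic view above treats support and confidence as relative-frequency estimates of event probabilities: support(X → Y) estimates P(X and Y), and confidence estimates P(Y | X). A minimal sketch, with invented market-basket transactions rather than the paper's notation:

```python
# Toy transactions; item names are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Relative-frequency estimate of P(consequent | antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3
```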
2.
《Journal of Statistical Computation and Simulation》2012,82(2):384-396
DNA microarrays allow for measuring expression levels of a large number of genes between different experimental conditions and/or samples. Association rule mining (ARM) methods are helpful in finding associational relationships between genes. However, classical association rule mining (CARM) algorithms extract only a subset of the associations that exist among different binary states, and can therefore infer only part of the relationships on gene regulation. To resolve this problem, we developed an extended association rule mining (EARM) strategy along with a new way of defining association rules. Compared with the CARM method, our new approach extracted more frequent genesets from a public microarray data set. The EARM method discovered some biologically interesting association rules that were not detected by CARM. Therefore, EARM provides an effective tool for exploring relationships among genes.
3.
Balaji Padmanabhan 《Journal of applied statistics》2004,31(8):1019-1035
Noting that several rule discovery algorithms in data mining can produce a large number of irrelevant or obvious rules from data, there has been substantial research in data mining addressing the issue of what makes rules truly 'interesting'. This resulted in the development of a number of interestingness measures and algorithms that find all interesting rules from data. However, these approaches have the drawback that many of the discovered rules, while supposed to be interesting by definition, may actually (1) be obvious in that they logically follow from other discovered rules or (2) be expected given some of the other discovered rules and some simple distributional assumptions. In this paper we argue that this is a paradox, since rules that are supposed to be interesting are, in reality, uninteresting for the above reasons. We show that this paradox exists for various popular interestingness measures and present an abstract characterization of an approach to alleviate the paradox. We finally discuss existing work in data mining that addresses this issue and show how these approaches can be viewed with respect to the characterization presented here.
4.
Co-movement analysis of stock sector indices based on association rule mining
This paper applies association rule algorithms from data mining to an empirical analysis of the relationships among sector indices in the Chinese stock market. Using the Apriori algorithm, strong association rules among the sector indices can be mined from a large amount of market data, then visualized and evaluated. The resulting rules can help market participants discover patterns of sector rotation and, on that basis, hedge against risk in the securities market.
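The Apriori algorithm mentioned above can be sketched as follows. The "daily rising sectors" transactions and sector names are invented for illustration and are not the paper's data:

```python
from itertools import combinations

# Each transaction: the set of sectors whose indices rose on one day (invented).
transactions = [
    frozenset(t) for t in
    [{"bank", "energy"}, {"bank", "tech"}, {"bank", "energy", "tech"}, {"energy"}]
]

def apriori(transactions, min_support):
    """Minimal Apriori: level-wise generation of frequent itemsets."""
    n = len(transactions)
    current = [frozenset([i]) for i in {i for t in transactions for i in t}]
    freq, k = {}, 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: m / n for c, m in counts.items() if m / n >= min_support}
        freq.update(level)
        # Join frequent k-itemsets into (k+1)-candidates; prune any candidate
        # with an infrequent k-subset (the Apriori property).
        current, seen = [], set()
        for a, b in combinations(list(level), 2):
            cand = a | b
            if (len(cand) == k + 1 and cand not in seen
                    and all(frozenset(s) in level for s in combinations(cand, k))):
                seen.add(cand)
                current.append(cand)
        k += 1
    return freq

frequent = apriori(transactions, min_support=0.5)
```

Frequent itemsets can then be turned into rules by splitting each itemset into antecedent and consequent and filtering on confidence.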
5.
6.
7.
With the rapid growth of vehicle ownership in Chinese cities, traffic congestion has become a chronic urban problem. Congestion propagates outward through the road network: a congested road segment tends to spread congestion to adjacent segments. This propagation property had not previously been studied systematically. After comparing the applicability of various methods, we model congestion from a temporal, large-scale rule-mining perspective: a time-series rule mining algorithm is used to build a model of congestion propagation patterns, and future traffic conditions are predicted from the mined propagation rules. More importantly, the mined propagation rules are intuitive and directly usable; they can support early-warning and prevention mechanisms for congestion and help correct unreasonable parts of road network planning, thereby improving traffic efficiency. The results show that the model achieves these goals: the mined propagation rules accurately characterize congestion and predict future traffic conditions, and can therefore serve as an important reference for congestion management decisions.
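A minimal sketch of mining first-order propagation rules of the form "segment A congested at time t → segment B congested at t + 1". The road labels and the binary congestion series are invented; the abstract does not specify the paper's actual algorithm, so this only illustrates the general idea:

```python
from collections import Counter

# Each entry: the set of congested road segments at one time step (invented).
series = [{"A"}, {"A", "B"}, {"B", "C"}, {"A"}, {"A", "B"}, set()]

def propagation_rules(series, min_conf=0.6):
    """Rules (a, b) -> confidence that b is congested one step after a."""
    ante = Counter()   # steps at which segment a was congested (with a successor)
    joint = Counter()  # steps with a congested at t and b congested at t + 1
    for now, nxt in zip(series, series[1:]):
        for a in now:
            ante[a] += 1
            for b in nxt:
                if b != a:
                    joint[(a, b)] += 1
    return {(a, b): c / ante[a] for (a, b), c in joint.items()
            if c / ante[a] >= min_conf}

rules = propagation_rules(series)
```

Rules passing the confidence threshold could then drive a simple early-warning check: when segment `a` becomes congested, flag every `b` with a high-confidence rule `(a, b)`.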
8.
《Journal of statistical planning and inference》2006,136(6):1962-1984
We consider two-stage adaptive designs for clinical trials where data from the two stages are dependent. This occurs when additional data are obtained from patients during their second stage follow-up. While the proposed flexible approach allows modifications of trial design, sample size, or statistical analysis using the first stage data, there is no need for a complete prespecification of the adaptation rule. Methods are provided for an adaptive closed testing procedure, for calculating overall adjusted p-values, and for obtaining unbiased estimators and confidence bounds for parameters that are invariant to modifications. A motivating example is used to illustrate these methods.
9.
M.C. Wang 《Communications in Statistics: Theory and Methods》2013,42(2):405-427
A multinomial classification rule is proposed based on a prior-valued smoothing for the state probabilities. Asymptotically, the proposed rule has an error rate that converges uniformly and strongly to that of the Bayes rule. For a fixed sample size, the prior-valued smoothing is effective in obtaining reasonable classifications in situations such as missing data. Empirically, the proposed rule compares favorably with other commonly used multinomial classification rules in Monte Carlo sampling experiments.
10.
Christophe Denis, Charlotte Dion, Miguel Martinez 《Scandinavian Journal of Statistics》2020,47(2):516-554
The recent advent of modern technology has generated a large number of datasets which can frequently be modeled as functional data. This paper focuses on the problem of multiclass classification for stochastic diffusion paths. In this context we establish a closed formula for the optimal Bayes rule. We provide new statistical procedures which are built either on the plug-in principle or on the empirical risk minimization principle. We show the consistency of these procedures under mild conditions. We apply our methodologies to the parametric case and illustrate their accuracy through a simulation study.
11.
The problem of classification into two univariate normal populations with a common mean is considered. Several classification rules are proposed based on efficient estimators of the common mean. Detailed numerical comparisons of probabilities of misclassifications using these rules have been carried out. It is shown that the classification rule based on the Graybill-Deal estimator of the common mean performs the best. Classification rules are also proposed for the case when variances are assumed to be ordered. Comparison of these rules with the rule based on the Graybill-Deal estimator has been done with respect to individual probabilities of misclassification.
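For context, the Graybill-Deal estimator mentioned above is the weighted average of the two sample means with weights n_i / s_i², the inverse estimated variances of the means. A minimal sketch with invented data:

```python
import statistics

def graybill_deal(x, y):
    """Graybill-Deal common-mean estimate from two samples."""
    wx = len(x) / statistics.variance(x)   # weight n1 / s1^2
    wy = len(y) / statistics.variance(y)   # weight n2 / s2^2
    return (wx * statistics.fmean(x) + wy * statistics.fmean(y)) / (wx + wy)

# The low-variance sample dominates: the estimate stays near its mean.
est = graybill_deal([1.0, 1.2, 0.8], [3.0, 5.0, 4.0])
print(est)  # 87/78 ≈ 1.115
```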
12.
We consider the empirical Bayes decision theory where the component problems are the optimal fixed sample size decision problem and a sequential decision problem. With these components, an empirical Bayes decision procedure selects both a stopping rule function and a terminal decision rule function. Empirical Bayes stopping rules are constructed for each case and the asymptotic behaviours are investigated.
13.
Missing data are an important factor affecting the quality of survey questionnaire data, and imputing the missing values can significantly improve data quality. Questionnaire data are mostly categorical; classification algorithms from data mining are a standard tool for such problems, and the random forest model is among the most accurate of these algorithms. We introduce the random forest model into the imputation of missing questionnaire data, propose a random-forest-based imputation method for categorical missing values, and discuss the corresponding imputation steps under different missingness patterns. Empirical simulations comparing the method with alternatives show that random forest imputation yields more accurate and more reliable imputed values.
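The workflow described above can be sketched in a few lines, assuming scikit-learn is available. The variables, the MCAR missingness mechanism, and all sizes are invented for illustration; the paper's actual imputation steps for different missingness patterns are more elaborate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
x1 = rng.integers(0, 3, n)            # observed categorical predictor
x2 = rng.integers(0, 2, n)            # observed binary predictor
y = (x1 + x2) % 3                     # the categorical item to be imputed
missing = rng.random(n) < 0.2         # MCAR missingness indicator

X = np.column_stack([x1, x2])
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X[~missing], y[~missing])  # train on complete cases only
y_imputed = y.copy()
y_imputed[missing] = forest.predict(X[missing])

# Since the true values are known in this simulation, we can check accuracy.
accuracy = (y_imputed[missing] == y[missing]).mean()
```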
14.
Data mining seeks to extract useful, but previously unknown, information from typically massive collections of non-experimental, sometimes non-traditional data. From the perspective of statisticians, this paper surveys techniques used and contributions from fields such as data warehousing, machine learning from artificial intelligence, and visualization as well as statistics. It concludes that statistical thinking and design of analysis, as exemplified by achievements in clinical epidemiology, may fit well with the emerging activities of data mining and 'knowledge discovery in databases' (DM&KDD).
15.
Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques—including supervised, unsupervised, semi-supervised and active learning based—have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, standard accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval and machine learning, are used. These are often combined into the popular F-measure, which is the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights which depend on the linkage method being used. This reformulation reveals that the F-measure has a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and the researcher or user, but not of the particular linkage method being used. We suggest alternative measures which do not suffer from this fundamental flaw. 
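The reformulation referred to above can be checked numerically: with weight w = (TP + FP) / (2·TP + FP + FN), the weighted arithmetic mean w·P + (1 − w)·R equals the harmonic-mean F-measure. The counts are illustrative, and this is one weight for which the identity holds; the weight depends on how many links the classifier declares, which is exactly the conceptual weakness the abstract points out:

```python
# Illustrative confusion counts for a linkage classifier.
TP, FP, FN = 80, 20, 40

precision = TP / (TP + FP)                 # 0.8
recall = TP / (TP + FN)                    # 2/3
f_harmonic = 2 * precision * recall / (precision + recall)

w = (TP + FP) / (2 * TP + FP + FN)         # weight induced by the classifier
f_weighted = w * precision + (1 - w) * recall

# Both equal 2*TP / (2*TP + FP + FN) = 160/220.
print(f_harmonic, f_weighted)
```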
16.
17.
An expanded class of multiplicative-interaction (M-I) models is proposed for two-way contingency tables. These models, a generalization of Goodman's association models, fill the gap between the independence and the saturated models. Diagnostic rules based on a transformation of the data are proposed for the detection of such models. These rules, utilizing the singular value decomposition of the transformed data, are very easy to use. Maximum likelihood estimation is considered and the computational algorithms discussed. A data set from Goodman (1981) and another from Gabriel and Zamir (1979) are used to demonstrate the diagnostic rules.
18.
Ayça Çakmak Pehlivanlı 《Journal of applied statistics》2016,43(6):1140-1154
Classification of high-dimensional data sets is a big challenge for statistical learning and data mining algorithms. To effectively apply classification methods to high-dimensional data sets, feature selection is an indispensable pre-processing step of the learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets which have a small sample size and a large number of features. A novel feature selection approach, named four-Staged Feature Selection, has been proposed to overcome the high-dimensional data classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union and voting stages, respectively, to obtain final feature subsets. Several statistical learning and data mining methods have been carried out to verify the efficiency of the selected features. In order to test the adequacy of the proposed method, 10 different microarray data sets are employed due to their high number of features and small sample size.
19.
Charles L. Dunn 《Communications in Statistics: Simulation and Computation》2013,42(4):1013-1026
Two stopping rules are defined for the purpose of minimizing the number of iterations needed to provide simulated percentile points with a certain precision: one stopping rule is a result of defining precision relative to the scale of the random variable, while the other is a result of defining precision relative to the tail area of the distribution. A simulation experiment is conducted to investigate the effects of the stopping rules as well as the effects of changes in scale. The effects of interest are the precision of the simulated percentile point and the number of iterations needed to achieve that precision. It is shown that the stopping rules are effective in reducing the number of iterations while providing an acceptable precision in the percentile points. Also, increases in scale produce increases in the number of iterations and/or decreases in certain measures of precision.
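A sketch of the first kind of stopping rule described above: sampling continues until a confidence interval for the p-th percentile is narrow relative to the scale (sample standard deviation) of the variable. The N(0, 1) target, batch size, and tolerance are illustrative choices, not the paper's settings:

```python
import math
import random

def percentile_with_stopping(p, eps=0.05, batch=1000, z=1.96, seed=1):
    """Simulate until a CI for the p-th percentile is narrower than eps * sd."""
    rng = random.Random(seed)
    xs = []
    while True:
        xs.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
        xs.sort()
        n = len(xs)
        mean = sum(xs) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in xs) / (n - 1))
        # Distribution-free CI for the percentile from order statistics,
        # using the normal approximation to the binomial.
        half = z * math.sqrt(p * (1.0 - p) / n)
        lo = xs[max(0, int((p - half) * n))]
        hi = xs[min(n - 1, int((p + half) * n))]
        if hi - lo < eps * sd:        # precision relative to scale
            return xs[int(p * n)], n

median_est, n_used = percentile_with_stopping(0.5)
```

The tail-area version of the rule would instead fix the CI width on the probability scale rather than relative to the standard deviation.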
20.
《Journal of Statistical Computation and Simulation》2012,82(1-4):157-172
The normal linear discriminant rule (NLDR) and the normal quadratic discriminant rule (NQDR) are popular classifiers when working with normal populations. Several papers in the literature have been devoted to a comparison of these rules with respect to classification performance. An aspect which has, however, not received any attention is the effect of an initial variable selection step on the relative performance of these classification rules. Cross model validation variable selection has been found to perform well in the linear case, and can be extended to the quadratic case. We report the results of a simulation study comparing the NLDR and the NQDR with respect to the post variable selection classification performance. It is of interest that the NQDR generally benefits from an initial variable selection step. We also comment briefly on the problem of estimating the post selection error rates of the two rules.
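A minimal sketch of the two rules compared above, for univariate normal classes with equal priors: the quadratic rule (NQDR) keeps separate class variances, while the linear rule (NLDR) pools them. The parameters are illustrative plug-in values, not estimates from any study:

```python
import math

def log_density_score(x, mu, var):
    """Log normal density at x, dropping constants common to both classes."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2.0 * var)

def classify(x, mu0, var0, mu1, var1, pooled=False):
    """Assign x to class 0 or 1; pooled=True gives the linear (NLDR) rule."""
    if pooled:                              # NLDR: a common variance
        var0 = var1 = (var0 + var1) / 2.0
    s0 = log_density_score(x, mu0, var0)
    s1 = log_density_score(x, mu1, var1)
    return 0 if s0 >= s1 else 1

# With equal means but unequal variances, only the quadratic rule separates
# the classes, assigning far-out points to the high-variance class.
```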