首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This article utilizes stochastic ideas for reasoning about association rule mining, and provides a formal statistical view of this discipline. A simple stochastic model is proposed, based on which support and confidence are reasonable estimates for certain probabilities of the model. Statistical properties of the corresponding estimators, like moments and confidence intervals, are derived, and items and itemsets are observed for correlations. After a brief review of measures of interest of association rules, with the main focus on interestingness measures motivated by statistical principles, two new measures are described. These measures, called α- and σ-precision, respectively, rely on statistical properties of the estimators discussed before. Experimental results demonstrate the effectivity of both measures.  相似文献   

2.
DNA microarrays allow for measuring expression levels of a large number of genes between different experimental conditions and/or samples. Association rule mining (ARM) methods are helpful in finding associational relationships between genes. However, classical association rule mining (CARM) algorithms extract only a subset of the associations that exist among different binary states; therefore can only infer part of the relationships on gene regulations. To resolve this problem, we developed an extended association rule mining (EARM) strategy along with a new way of the association rule definition. Compared with the CARM method, our new approach extracted more frequent genesets from a public microarray data set. The EARM method discovered some biologically interesting association rules that were not detected by CARM. Therefore, EARM provides an effective tool for exploring relationships among genes.  相似文献   

3.
Noting that several rule discovery algorithms in data mining can produce a large number of irrelevant or obvious rules from data, there has been substantial research in data mining that addressed the issue of what makes rules truly 'interesting'. This resulted in the development of a number of interestingness measures and algorithms that find all interesting rules from data. However, these approaches have the drawback that many of the discovered rules, while supposed to be interesting by definition, may actually (1) be obvious in that they logically follow from other discovered rules or (2) be expected given some of the other discovered rules and some simple distributional assumptions. In this paper we argue that this is a paradox since rules that are supposed to be interesting, in reality are uninteresting for the above reason. We show that this paradox exists for various popular interestingness measures and present an abstract characterization of an approach to alleviate the paradox. We finally discuss existing work in data mining that addresses this issue and show how these approaches can be viewed with respect to the characterization presented here.  相似文献   

4.
基于关联规则挖掘的股票板块指数联动分析   总被引:7,自引:0,他引:7  
本文应用数据挖掘中的关联规则算法,对我国股票市场中板块指数的关系进行实证分析。通过采用关联规则的Apriori算法技术,可以从大量数据中挖掘出我国股票市场的板块指数之间的强关联规则,并对其进行可视化描述与评价。文中所得的关联规则可以帮助市场参与者发现股票板块轮动的规则模式,并在此基础上规避证券市场风险。  相似文献   

5.
徐雪松  王四春 《统计研究》2012,29(4):108-112
根据免疫否定选择原理,设计了基于掩码分段匹配的否定选择分类器,克服连续r位匹配法的缺陷。给出了适用于免疫优化的分类规则编码及分类信息分的评价。通过免疫进化对其进行群体优化以约简数据规则集。避免了传统分类算法缺乏全局优化能力的缺点,提高了对样本的识别能力。实验结果表明本文方法提高了数据分类的准确性,在数据分类准确率及平均信息分上优于传统的分类方法。  相似文献   

6.
基于聚类关联规则的缺失数据处理研究   总被引:2,自引:1,他引:2       下载免费PDF全文
 本文提出了基于聚类和关联规则的缺失数据处理新方法,通过聚类方法将含有缺失数据的数据集相近的记录归到一类,然后利用改进后的关联规则方法对各子数据集挖掘变量间的关联性,并利用这种关联性来填补缺失数据。通过实例分析,发现该方法对缺失数据处理,尤其是海量数据集具有较好的效果。  相似文献   

7.
随着中国城市机动车保有量的急剧增多,交通拥堵已经成为现代城市病。交通拥堵在道路网络中呈现向四周放射的传导特性,拥堵路段倾向于将拥堵扩散传导到其他相邻路段,该特性此前未被系统研究过,综合比较各种方法的适用性,从时间和大数据规则挖掘角度对拥堵建模;使用时间序列规则挖掘算法建立交通拥堵传导规律模型,并基于传导规则预测未来交通流状况;更重要的是,挖掘出来的拥堵传导规则直观可用,能够用于建立拥堵预警防治机制,完善道路路网建设规划中不合理的部分,从而达到提升交通效率的目的。研究结果证明本模型能够较好达到研究目的,挖掘出的拥堵传导规则可以精确分析交通拥堵状况并预测未来交通流状况,因此可以为交通拥堵治理决策提供重要参考。  相似文献   

8.
We consider two-stage adaptive designs for clinical trials where data from the two stages are dependent. This occurs when additional data are obtained from patients during their second stage follow-up. While the proposed flexible approach allows modifications of trial design, sample size, or statistical analysis using the first stage data, there is no need for a complete prespecification of the adaptation rule. Methods are provided for an adaptive closed testing procedure, for calculating overall adjusted p-values, and for obtaining unbiased estimators and confidence bounds for parameters that are invariant to modifications. A motivating example is used to illustrate these methods.  相似文献   

9.
A multinomial classification rule is proposed based on a prior-valued smoothing for the state probabilities. Asymptotically, the proposed rule has an error rate that converges uniformly and strongly to that of the Bayes rule. For a fixed sample size the prior-valued smoothing is effective in obtaining reason¬able classifications to the situations such as missing data. Empirically, the proposed rule is compared favorably with other commonly used multinomial classification rules via Monte Carlo sampling experiments  相似文献   

10.
The recent advent of modern technology has generated a large number of datasets which can be frequently modeled as functional data. This paper focuses on the problem of multiclass classification for stochastic diffusion paths. In this context we establish a closed formula for the optimal Bayes rule. We provide new statistical procedures which are built either on the plug-in principle or on the empirical risk minimization principle. We show the consistency of these procedures under mild conditions. We apply our methodologies to the parametric case and illustrate their accuracy with a simulation study through examples.  相似文献   

11.
The problem of classification into two univariate normal populations with a common mean is considered. Several classification rules are proposed based on efficient estimators of the common mean. Detailed numerical comparisons of probabilities of misclassifications using these rules have been carried out. It is shown that the classification rule based on the Graybill-Deal estimator of the common mean performs the best. Classification rules are also proposed for the case when variances are assumed to be ordered. Comparison of these rules with the rule based on the Graybill-Deal estimator has been done with respect to individual probabilities of misclassification.  相似文献   

12.
We consider the empirical Bayes decision theory where the component problems are the optimal fixed sample size decision problem and a sequential decision problem. With these components, an empirical Bayes decision procedure selects both a stopping rule function and a terminal decision rule function. Empirical Bayes stopping rules are constructed for each case and the asymptotic behaviours are investigated.  相似文献   

13.
缺失数据是影响调查问卷数据质量的重要因素,对调查问卷中的缺失值进行插补可以显著提高调查数据的质量。调查问卷的数据类型多以分类型数据为主,数据挖掘技术中的分类算法是处理属性分类问题的常用方法,随机森林模型是众多分类算法中精度较高的方法之一。将随机森林模型引入调查问卷缺失数据的插补研究中,提出了基于随机森林模型的分类数据缺失值插补方法,并根据不同的缺失模式探讨了相应的插补步骤。通过与其它方法的实证模拟比较,表明随机森林插补法得到的插补值准确度更优、可信度更高。  相似文献   

14.
Data mining seeks to extract useful, but previously unknown, information from typically massive collections of non-experimental, sometimes non-traditional data. From the perspective of statisticians, this paper surveys techniques used and contributions from fields such as data warehousing, machine learning from artificial intelligence, and visualization as well as statistics. It concludes that statistical thinking and design of analysis, as exemplified by achievements in clinical epidemiology, may fit well with the emerging activities of data mining and 'knowledge discovery in databases' (DM&KDD).  相似文献   

15.
A note on using the F-measure for evaluating record linkage algorithms   总被引:1,自引:0,他引:1  
Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques—including supervised, unsupervised, semi-supervised and active learning based—have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, standard accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval and machine learning, are used. These are often combined into the popular F-measure, which is the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights which depend on the linkage method being used. This reformulation reveals that the F-measure has a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and the researcher or user, but not of the particular linkage method being used. We suggest alternative measures which do not suffer from this fundamental flaw.  相似文献   

16.
17.
An expanded class of multiplicative-interaction (M-I) models is proposed for two-way contingency tables. These models a generalization of Goodman's association models, fill in the gap between the independence and the saturated models. Diagnostic rules based on a transformation of the data are proposed for the detection of such models. These rules, utilizing the singular value decomposition of the transformed data, are very easy to use. Maximum likelihood estimation is considered and the computational algorithms discussed. A data set from Goodman (1981) and another from Gabriel and Zamir (1979) are used to demostrate the diagnostic rules.  相似文献   

18.
Classification of high-dimensional data set is a big challenge for statistical learning and data mining algorithms. To effectively apply classification methods to high-dimensional data sets, feature selection is an indispensable pre-processing step of learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data set which has a small number of sample size with a large number of features. A novel feature selection approach, named four-Staged Feature Selection, has been proposed to overcome high-dimensional data classification problem by selecting informative features. The proposed method first selects candidate features with number of filtering methods which are based on different metrics, and then it applies semi-wrapper, union and voting stages, respectively, to obtain final feature subsets. Several statistical learning and data mining methods have been carried out to verify the efficiency of the selected features. In order to test the adequacy of the proposed method, 10 different microarray data sets are employed due to their high number of features and small sample size.  相似文献   

19.
Two stopping rules are defined for the purpose of minimizing the number of iterations needed to provide simulated percentile points with a certain precision: one stopping rule is a result of defining precision relative to the scale of the random variable while the other is a result of defining precision relative to the tail area of the distribution. A simulation experiment is conducted to investigate the effects of the stopping rules as well as the effects of changes in scale. The effects of interest are the precision of the simulated percentile point and the number of iterations needed to achieve that precision. It is shown that the stopping rules are effective in reducing the number of iterations while providing an acceptable precision in the percentile points. Also, increases in scale produce increases in the number of iterations and/or decreases in certain measures of precision  相似文献   

20.
The normal linear discriminant rule (NLDR) and the normal quadratic discriminant rule (NQDR) are popular classifiers when working with normal populations. Several papers in the literature have been devoted to a comparison of these rules with respect to classification performance. An aspect which has, however, not received any attention is the effect of an initial variable selection step on the relative performance of these classification rules. Cross model validation variabie selection has been found to perform well in the linear case, and can be extended to the quadratic case. We report the results of a simulation study comparing the NLDR and the NQDR with respect to the post variable selection classification performance. It is of interest that the NQDR generally benefits from an initial variable selection step. We also comment briefly on the problem of estimating the post selection error rates of the two rules.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号