Similar Documents
20 similar documents found (search time: 31 ms)
1.
Real-world applications of association rule mining have a well-known problem: they discover a large number of rules, many of which are not interesting or useful for the application at hand. Algorithms for closed and maximal itemset mining significantly reduce the volume of rules discovered and the complexity associated with the task, but the implications of their use, and the important differences in generalization power, precision and recall when they are used in the classification problem, have not been examined. In this paper, we present a systematic evaluation of the association rules discovered by frequent, closed and maximal itemset mining algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed on a number of real-world datasets representing diverse characteristics of data/items, and a detailed evaluation of the rule sets is provided both as a whole and w.r.t. individual classes. Empirical results confirm that with a proper combination of data mining and statistical analysis, a large number of non-significant, redundant and contradictory rules can be eliminated while preserving relatively high precision and recall. More importantly, the results reveal the important characteristics of, and differences between, using frequent, closed and maximal itemsets for the classification task, and the effect of incorporating statistical/heuristic measures to optimize such rule sets. With closed itemset mining already a preferred choice for reducing complexity and redundancy during rule generation, this study further confirms that closed itemset-based association rules are also of better quality in terms of classification precision and recall, both overall and on individual class examples.
Maximal itemset-based association rules, on the other hand, which are a subset of the closed itemset-based rules, prove insufficient in this regard and typically have worse recall and generalization power. Empirical results also show the drawback of applying the confidence measure at the start of rule generation, as is typically done within the association rule framework. Removing rules that fall below a certain confidence threshold also removes any knowledge that contradictions to the relatively higher-confidence rules exist in the data; precision can therefore be increased by disregarding contradictory rules before applying the confidence constraint.
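The relationship among the three itemset families discussed above can be illustrated with a minimal sketch (hypothetical toy transactions; a plain enumeration for clarity, not an optimized miner):

```python
from itertools import combinations

def mine_itemsets(transactions, min_support):
    """Enumerate frequent itemsets, then derive the closed and maximal ones."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    support = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= min_support:
                support[frozenset(cand)] = s
    # closed: frequent with no proper superset of the same support
    closed = {x for x in support
              if not any(x < y and support[y] == support[x] for y in support)}
    # maximal: frequent with no frequent proper superset at all
    maximal = {x for x in support if not any(x < y for y in support)}
    return support, closed, maximal

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
freq, closed, maximal = mine_itemsets(transactions, min_support=0.5)
# maximal ⊆ closed ⊆ frequent: here {a,b},{a,c} are maximal,
# while {a} is closed but not maximal
```

Every maximal itemset is closed, and the closed sets preserve all support information of the full frequent collection, which is why closed mining loses nothing for rule generation whereas maximal mining discards support detail.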

2.
To learn about the progression of a complex disease, it is necessary to understand the physiology and function of many genes operating together in distinct interactions as a system. In order to significantly advance our understanding of the function of a system, we need to learn the causal relationships among its modeled genes. To this end, it is desirable to compare experiments of the system under complete interventions of some genes, e.g., knock-out of some genes, with experiments of the system without interventions. However, it is expensive and difficult (if not impossible) to conduct wet lab experiments of complete interventions of genes in animal models, e.g., a mouse model. Thus, it will be helpful if we can discover promising causal relationships among genes with observational data alone in order to identify promising genes to perturb in the system that can later be verified in wet laboratories. While causal Bayesian networks have been actively used in discovering gene pathways, most of the algorithms that discover pairwise causal relationships from observational data alone identify only a small number of significant pairwise causal relationships, even with a large dataset. In this article, we introduce new causal discovery algorithms—the Equivalence Local Implicit latent variable scoring Method (EquLIM) and EquLIM with Markov chain Monte Carlo search algorithm (EquLIM-MCMC)—that identify promising causal relationships even with a small observational dataset.

3.
Research on Missing Data Handling Based on Clustering and Association Rules
This paper proposes a new method for handling missing data based on clustering and association rules. A clustering method first groups similar records of the data set containing missing values into classes; an improved association rule method then mines the associations among variables within each sub-dataset, and these associations are used to impute the missing values. Case analysis shows that the method handles missing data well, especially on massive data sets.
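A simplified stand-in for the idea can be sketched as follows: records are grouped by their complete attributes (a crude "clustering"), and a missing value is filled with the modal value of its group, which plays the role of the highest-confidence rule consequent. All record names and values are hypothetical:

```python
from collections import Counter

# toy records: the "buys" field may be missing (None)
records = [
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "young", "income": "low",  "buys": "no"},
    {"age": "old",   "income": "high", "buys": "yes"},
    {"age": "old",   "income": "high", "buys": "yes"},
    {"age": "old",   "income": "high", "buys": None},   # missing value
]

def impute(records, key):
    """Group records by their complete attributes, then fill a missing
    value with the most frequent value of `key` within its group."""
    filled = []
    for r in records:
        if r[key] is not None:
            filled.append(dict(r))
            continue
        cluster = [s for s in records
                   if s[key] is not None
                   and all(s[k] == r[k] for k in r if k != key)]
        mode = Counter(s[key] for s in cluster).most_common(1)[0][0]
        filled.append({**r, key: mode})
    return filled

completed = impute(records, "buys")  # the None is replaced by "yes"
```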

4.
ABSTRACT

Useful knowledge acquisition from known and systematized information (data) is a major challenge for researchers, users and, ultimately, decision makers. In this sense, the knowledge discovery from data (KDD) process represents a valuable tool for information analysis. This work presents a KDD approach to time-series pattern identification in anchovy and sardine fisheries and environmental data from northern Chile, combining time series, multivariate analysis and data mining techniques with a technical literature review for validation of the results. The KDD approach and the data mining techniques implemented achieved an integration between these variables, identifying relevant patterns associated with fluctuations in fisheries abundance and strong associations with environmental changes such as El Niño and long-term cold–warm regimes, and establishing predominant anchovy and sardine time periods associated with environmental conditions. The latter establishes groundwork for studying underlying functional relationships that could reduce gaps in the national management policies for those fisheries.

5.
With the rapid growth of vehicle ownership in Chinese cities, traffic congestion has become a modern urban disease. Congestion propagates outward through the road network: a congested road segment tends to spread congestion to adjacent segments, a property that had not previously been studied systematically. After comparing the applicability of various methods, this work models congestion from the perspectives of time and large-scale rule mining: a time-series rule mining algorithm is used to build a model of congestion propagation rules, and future traffic conditions are predicted on the basis of these rules. More importantly, the mined propagation rules are intuitive and directly usable; they can support congestion early-warning and prevention mechanisms and help correct unreasonable parts of road network planning, thereby improving traffic efficiency. The results show that the model achieves the research goals: the mined propagation rules can accurately analyze congestion conditions and predict future traffic flow, and can therefore serve as an important reference for congestion management decisions.
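One minimal way to mine such propagation rules can be sketched as follows: from a log of congested segments per time step, count rules of the form "X congested at t → Y congested at t+1" and keep those above a confidence threshold. The segment names and the snapshot log are hypothetical, and this is a much-reduced stand-in for a full time-series rule mining algorithm:

```python
from collections import Counter

# hypothetical log: for each time step, the set of congested road segments
snapshots = [{"A"}, {"A", "B"}, {"B", "C"}, {"A"}, {"A", "B"}, {"B"}]

def propagation_rules(snapshots, min_conf=0.5):
    """Mine rules 'X congested at t -> Y congested at t+1' with
    confidence = P(Y congested at t+1 | X congested at t)."""
    pair, single = Counter(), Counter()
    for now, nxt in zip(snapshots, snapshots[1:]):
        for x in now:
            single[x] += 1
            for y in nxt:
                if y != x:
                    pair[(x, y)] += 1
    return {r: c / single[r[0]] for r, c in pair.items()
            if c / single[r[0]] >= min_conf}

rules = propagation_rules(snapshots)
# e.g. ("A", "B"): whenever A is congested, B is congested one step later
```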

6.
This article considers a problem of normal-based two-group classification when the groups are artificially dichotomized by a screening variable. Each group distribution is derived and the best regions for the classification are obtained. These derivations yield yet another classification rule. The rule is studied from several aspects, such as the distribution of the rule, the optimal error rate, and the testing of a hypothesis. This article gives the relationships among these aspects along with an investigation of the performance of the rule. The classification method and ideas are illustrated in detail with two examples.

7.
1. Introduction. Data mining is an emerging discipline that has arisen in recent years with the development of artificial intelligence and database technology. It is the high-level process of sifting implicit, credible, novel and useful information out of large amounts of data. Association rules are an important research topic within it and one of the main techniques of data mining; they are also, in unsupervised learning systems…

8.
In many experimental situations we need to test the hypothesis concerning the equality of parameters of two or more binomial populations. Of special interest is the knowledge of the sample sizes needed to detect certain differences among the parameters, for a specified power, and at a given level of significance. Al-Bayyati (1971) derived a rule of thumb for a quick calculation of the sample size needed to compare two binomial parameters. The rule is defined in terms of the difference desired to be detected between the two parameters.

In this paper, we introduce a generalization of Al-Bayyati's rule to several independent proportions. The generalized rule gives a conservative estimate of the sample size needed to achieve a specified power in detecting certain differences among the binomial parameters at a given level of significance. The method is illustrated with an example.
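Al-Bayyati's exact rule of thumb is not reproduced in the abstract; the following sketch instead uses the standard normal-approximation formula for comparing two proportions, which addresses the same question of how the required sample size grows as the difference to detect shrinks:

```python
import math
from statistics import NormalDist

def two_proportion_n(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect |p1 - p2| via the standard
    normal-approximation formula (not Al-Bayyati's specific rule)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2) ** 2)

n = two_proportion_n(0.5, 0.3)  # 91 per group
```

Halving the difference to be detected roughly quadruples the required sample size, which is why a quick rule of thumb in terms of that difference is useful at the design stage.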

9.
This article uses stochastic ideas for reasoning about association rule mining and provides a formal statistical view of the discipline. A simple stochastic model is proposed, under which support and confidence are reasonable estimates of certain probabilities of the model. Statistical properties of the corresponding estimators, such as moments and confidence intervals, are derived, and items and itemsets are examined for correlations. After a brief review of measures of interest for association rules, focusing on interestingness measures motivated by statistical principles, two new measures are described. These measures, called α- and σ-precision, respectively, rely on statistical properties of the estimators discussed before. Experimental results demonstrate the effectiveness of both measures.
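The view of support and confidence as empirical probability estimates can be made concrete in a few lines (toy baskets with hypothetical item names):

```python
def support(itemset, transactions):
    """Empirical estimate of P(itemset ⊆ basket)."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Empirical estimate of P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

baskets = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "jam"}, {"jam"}]
s = support({"bread", "butter"}, baskets)        # 2/4 = 0.5
c = confidence({"bread"}, {"butter"}, baskets)   # 0.5 / 0.75 = 2/3
```

Since both quantities are relative frequencies, they inherit the sampling variability of any estimator, which is what makes the moments and confidence intervals studied in the article meaningful.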

10.
Interactions among multiple genes across the genome may contribute to the risks of many complex human diseases. Whole-genome single nucleotide polymorphism (SNP) data collected for many thousands of SNP markers from thousands of individuals under the case-control design promise to shed light on our understanding of such interactions. However, nearby SNPs are highly correlated due to linkage disequilibrium (LD) and the number of possible interactions is too large for exhaustive evaluation. We propose a novel Bayesian method for simultaneously partitioning SNPs into LD-blocks and selecting SNPs within blocks that are associated with the disease, either individually or interactively with other SNPs. When applied to homogeneous population data, the method gives posterior probabilities for LD-block boundaries, which not only result in accurate block partitions of SNPs, but also provide measures of partition uncertainty. When applied to case-control data for association mapping, the method implicitly filters out SNP associations created merely by LD with disease loci within the same blocks. A simulation study showed that this approach is more powerful in detecting multi-locus associations than the other methods we tested, including one of ours. When applied to the WTCCC type 1 diabetes data, the method identified many previously known T1D associated genes, including PTPN22, CTLA4, MHC, and IL2RA. The method also revealed some interesting two-way associations that are undetected by single SNP methods. Most of the significant associations are located within the MHC region. Our analysis showed that the MHC SNPs form long-distance joint associations over several known recombination hotspots. By controlling the haplotypes of the MHC class II region, we identified additional associations in both MHC class I (HLA-A, HLA-B) and class III regions (BAT1). We also observed significant interactions between the genes PRSS16 and ZNF184 in the extended MHC region and the MHC class II genes.
The proposed method can be broadly applied to the classification problem with correlated discrete covariates.

11.
The integration of different data sources is a widely discussed topic among both researchers and official statistics institutes. Integrating data helps contain the costs and time required by new data collections. Non-parametric micro Statistical Matching (SM) makes it possible to integrate ‘live’ data using only the observed information, potentially avoiding misspecification bias and reducing the computational effort. Despite these advantages, the assessment of integration goodness when this method is used is not robust. Moreover, several applications follow commonly accepted practices which recommend, e.g., using the biggest data set as the donor. We propose a validation strategy to assess integration goodness. We apply it to investigate these practices and to explore how different combinations of SM techniques and distance functions perform in terms of the reliability of the synthetic (complete) data set generated. The validation strategy takes advantage of the relationships existing among the variables before and after the integration. The results show that the ‘the biggest, the best’ rule should no longer be considered mandatory: integration goodness increases with the variability of the matching variables rather than with the dimensionality ratio between the recipient and the donor data set.

12.
We consider a challenging problem of testing any possible association between a response variable and a set of predictors, when the dimensionality of predictors is much greater than the number of observations. In the context of generalized linear models, a new approach is proposed for testing against high-dimensional alternatives. Our method uses soft-thresholding to suppress stochastic noise and applies the independence rule to borrow strength across the predictors. Moreover, the method can provide a ranked predictor list and automatically select “important” features to retain in the test statistic. We compare the performance of this method with some competing approaches via real data and simulation studies, demonstrating that our method maintains relatively higher power against a wide family of alternatives.

13.
In a clinical trial with a biased allocation rule, whereby all and only those patients at risk are given the new treatment, Robbins and Zhang (1989) derived an asymptotically normal and efficient estimator of the mean difference between the new and old treatments on those at risk. Using a well-known identity of Stein (1981), this paper generalizes the result to the multivariate situation.

14.
We propose a new meta-analysis method to pool univariate p-values across independent studies. Through simulations, we compare our method with those of Fisher, Stouffer, and George, identify sub-spaces where each of these methods is optimal, and propose a strategy for choosing the best meta-analysis method over the different sub-spaces. We then compare these meta-analysis approaches using p-values from periodicity tests of 4,940 S. pombe genes from 10 independent time-course experiments and show that, compared with the other methods, our new approach ranks the periodic, conserved, and cycling genes much higher and detects at least as many genes among the top 1,000.
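Two of the classical pooling rules compared against, Fisher's and Stouffer's, can be sketched with the standard library alone (the chi-square tail probability uses the closed form available for even degrees of freedom):

```python
import math
from statistics import NormalDist

def fisher_combined_p(pvals):
    """Fisher: X = -2 * sum(ln p_i) ~ chi-square with 2k df under H0."""
    x = -2.0 * sum(math.log(p) for p in pvals)
    k = len(pvals)  # df = 2k is even, so the survival function is a finite sum
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def stouffer_combined_p(pvals):
    """Stouffer: Z = sum(Phi^{-1}(1 - p_i)) / sqrt(k) ~ N(0, 1) under H0."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1 - nd.cdf(z)

pvals = [0.01, 0.02, 0.03]
pf = fisher_combined_p(pvals)    # small combined p under consistent evidence
ps = stouffer_combined_p(pvals)
```

Fisher's statistic is dominated by the smallest p-values, while Stouffer's averages evidence on the z-scale; this difference in weighting is exactly what makes each method optimal in different sub-spaces.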

15.
Co-movement Analysis of Stock Sector Indices Based on Association Rule Mining
This paper applies association rule algorithms from data mining to an empirical analysis of the relationships among sector indices in China's stock market. Using the Apriori algorithm, strong association rules among the sector indices can be mined from large amounts of data and then described and evaluated visually. The association rules obtained can help market participants discover rule patterns of sector rotation and, on that basis, avoid risks in the securities market.

16.
The ecological fallacy is related to Simpson's paradox (1951), where relationships among group means may be counterintuitive and substantially different from relationships within groups; the groups are usually geographic entities such as census tracts. We consider the problem of estimating the correlation between two jointly normal random variables when only ecological data (group means) are available. Two empirical Bayes estimators and one fully Bayesian estimator are derived and compared with the usual ecological estimator, which is simply the Pearson correlation coefficient of the group sample means. We simulate the bias and mean squared error performance of these estimators, and also give an example employing a dataset where the individual-level data are available for model checking. The results indicate the superiority of the empirical Bayes estimators in a variety of practical situations where, though we lack individual-level data, other relevant prior information is available.
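The usual ecological estimator that the article benchmarks against is simply the Pearson correlation computed on group means rather than on individuals; a minimal sketch with hypothetical "census tract" data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# individual-level data grouped into three "census tracts" (toy values)
groups = [([1.0, 2.0, 3.0], [2.1, 2.9, 4.2]),
          ([4.0, 5.0],      [5.1, 6.2]),
          ([6.0, 7.0, 8.0], [6.8, 8.1, 9.0])]
x_means = [sum(xs) / len(xs) for xs, _ in groups]
y_means = [sum(ys) / len(ys) for _, ys in groups]
r_ecological = pearson(x_means, y_means)  # uses the group means only
```

Because aggregation discards all within-group variation, this estimator can differ sharply from the individual-level correlation, which is the gap the Bayesian estimators in the article aim to close.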

17.
The purpose of this article is to review two text mining packages, namely, WordStat and SAS TextMiner. WordStat is developed by Provalis Research. SAS TextMiner is a product of SAS. We review the features offered by each package on each of the following key steps in analyzing unstructured data: (1) data preparation, including importing and cleaning; (2) performing association analysis; and (3) presenting the findings, including illustrative quotes and graphs. We also evaluate each package on its ability to help researchers extract major themes from a dataset. Both packages offer a variety of features that effectively help researchers run associations and present results. However, in extracting themes from unstructured data, both packages were only marginally helpful. The researcher still needs to read the data and make all the difficult decisions. This finding stems from the fact that the software can search only for specific terms in documents or categorize documents based on common terms. Respondents, however, may use the same term or combination of terms to mean different things. This implies that a text mining approach, which is based on analysis units other than terms, may be more powerful in extracting themes, an idea we touch upon in the conclusion section.

18.
There has been ever-increasing interest in the use of microarray experiments as a basis for the provision of prediction (discriminant) rules for improved diagnosis of cancer and other diseases. Typically, microarray cancer studies provide only a limited number of tissue samples from the specified classes of tumours or patients, whereas each tissue sample may contain the expression levels of thousands of genes. Researchers are thus faced with the problem of forming a prediction rule on the basis of a small number of classified tissue samples of very high dimension. Usually, some form of feature (gene) selection is adopted in forming the prediction rule. As the subset of genes used in the final form of the rule has not been randomly selected, but rather chosen according to some criterion designed to reflect the predictive power of the rule, there will be a selection bias inherent in estimates of the error rates of the rules if care is not taken. We present various situations where selection bias arises in the formation of a prediction rule and where there is a consequent need to correct for this bias. We describe the design of cross-validation schemes that are able to correct for the various selection biases.
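The kind of corrected scheme described, in which gene selection is redone inside each cross-validation fold rather than once on the full data, can be sketched as follows. The data, the mean-difference gene filter, and the nearest-centroid rule are all hypothetical stand-ins; the point is the structure of the two loops:

```python
import random

random.seed(0)
n, p, top, folds = 40, 500, 10, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]          # labels carry no real signal

def rank_genes(idx):
    """Rank genes by |difference of class means| over the given samples."""
    def score(j):
        m = {c: [X[i][j] for i in idx if y[i] == c] for c in (0, 1)}
        return abs(sum(m[0]) / len(m[0]) - sum(m[1]) / len(m[1]))
    return sorted(range(p), key=score, reverse=True)[:top]

def classify(train, test, genes):
    """Nearest-centroid accuracy on the test indices, using only `genes`."""
    cent = {c: [sum(X[i][j] for i in train if y[i] == c) /
                sum(1 for i in train if y[i] == c) for j in genes]
            for c in (0, 1)}
    hits = sum(
        y[i] == min((0, 1), key=lambda c: sum(
            (X[i][j] - cent[c][k]) ** 2 for k, j in enumerate(genes)))
        for i in test)
    return hits / len(test)

def cv_accuracy(select_inside):
    accs = []
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        # the crucial difference: does selection ever see the test fold?
        genes = rank_genes(train) if select_inside else rank_genes(range(n))
        accs.append(classify(train, test, genes))
    return sum(accs) / folds

biased = cv_accuracy(select_inside=False)   # selection saw the test folds
honest = cv_accuracy(select_inside=True)    # external CV corrects the bias
```

With pure-noise genes and random labels, the honest estimate should hover near chance, while the biased scheme tends to report optimistically high accuracy because the selected genes were chosen with the test samples in view.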

19.
Relationships between species and their environment are a key component in understanding ecological communities. Usually, this kind of data is collected repeatedly over time or space for communities and their environment, which leads to a sequence of pairs of ecological tables, i.e. multi-way matrices. This work proposes a new method, a combined approach of the STATICO and Tucker3 techniques, that addresses the problem of describing not only the stable part of the dynamics of structure–function relationships between communities and their environment (in different locations and/or at different times), but also the interactions and changes associated with the ecosystems' dynamics. At the same time, emphasis is given to a comparison with the STATICO method on the same (real) data set, where advantages and drawbacks are explored and discussed. The study thus produces a general methodological framework and develops a new technique to facilitate the use of these practices by researchers. Furthermore, this first approach with estuarine environmental data shows that one of the major advantages of modeling ecological data sets with the CO-TUCKER model is the gain in interpretability.

20.
Currently there is much interest in using microarray gene-expression data to form prediction rules for the diagnosis of patient outcomes. A process of gene selection is usually carried out first to find those genes that are most useful according to some criterion for distinguishing between the given classes of tissue samples. However, there is a bias (selection bias) introduced in the estimate of the final version of a prediction rule that has been formed from a smaller subset of the genes that have been selected according to some optimality criterion. In this paper, we focus on the bias that arises when a full data set is not available in the first instance and the prediction rule is formed subsequently by working with the top-ranked genes from the full set. We demonstrate how large the subset of top genes must be before this selection bias is not of practical consequence.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号