Similar Documents
20 similar documents found.
1.
Real-world applications of association rule mining have a well-known problem: they discover a large number of rules, many of which are not interesting or useful for the application at hand. Algorithms for closed and maximal itemset mining significantly reduce the volume of rules discovered and the complexity associated with the task, but the implications of their use, and the important differences in generalization power, precision and recall when they are used for classification, have not been examined. In this paper, we present a systematic evaluation of the association rules discovered by frequent, closed and maximal itemset mining algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed on a number of real-world datasets representing diverse characteristics of data/items, and rule sets are evaluated in detail both as a whole and with respect to individual classes. Empirical results confirm that, with a proper combination of data mining and statistical analysis, a large number of non-significant, redundant and contradictive rules can be eliminated while preserving relatively high precision and recall. More importantly, the results reveal the important characteristics of, and differences between, using frequent, closed and maximal itemsets for the classification task, and the effect of incorporating statistical/heuristic measures when optimizing such rule sets. With closed itemset mining already a preferred choice for reducing complexity and redundancy during rule generation, this study further confirms that closed itemset based association rules are also of better quality in terms of overall classification precision and recall, and precision and recall on individual classes. Maximal itemset based association rules, which form a subset of the closed itemset based rules, prove insufficient in this regard and typically have worse recall and generalization power. Empirical results also show the drawback of applying the confidence measure at the start of rule generation, as is typically done within the association rule framework: removing rules that fall below a confidence threshold also removes evidence of contradictions to the relatively higher confidence rules, so precision can be increased by discarding contradictive rules before the confidence constraint is applied.
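To make the contrast concrete, here is a minimal sketch, assuming a toy one-hot transaction table and using the mlxtend library, of mining all frequent itemsets versus only the maximal ones, and of generating rules before any confidence cut-off is applied; it illustrates the general workflow, not the authors' evaluation pipeline.

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpmax, association_rules

# Toy one-hot transaction table (illustrative data only)
transactions = pd.DataFrame(
    [[1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 1, 1, 1]],
    columns=["bread", "milk", "eggs", "butter"],
).astype(bool)

frequent = apriori(transactions, min_support=0.5, use_colnames=True)  # all frequent itemsets
maximal = fpmax(transactions, min_support=0.5, use_colnames=True)     # maximal frequent itemsets only
print(len(frequent), "frequent itemsets vs", len(maximal), "maximal itemsets")

# Generate rules on support alone, so low-confidence (possibly contradictory) rules
# remain visible; a confidence filter can then be applied as a later, deliberate step.
rules = association_rules(frequent, metric="support", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())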

2.
DNA microarrays allow expression levels of a large number of genes to be measured across different experimental conditions and/or samples. Association rule mining (ARM) methods are helpful in finding associational relationships between genes. However, classical association rule mining (CARM) algorithms extract only a subset of the associations that exist among different binary states and can therefore infer only part of the relationships governing gene regulation. To resolve this problem, we developed an extended association rule mining (EARM) strategy along with a new definition of association rules. Compared with the CARM method, our new approach extracted more frequent gene sets from a public microarray data set. The EARM method discovered some biologically interesting association rules that were not detected by CARM. EARM therefore provides an effective tool for exploring relationships among genes.

3.
In clinical research, early and prompt detection of a new patient's risk class can play a crucial role in determining the effectiveness of treatment and, consequently, in achieving a satisfactory prognosis. There exist a number of popular rule-based algorithms for classification whose performance is very attractive whenever data on a large number of patients are available. However, when datasets include data on only a few hundred patients, the most common approaches give unstable results, and developing effective decision-support systems becomes scientifically challenging. Since rules can be derived from different models as well as from expert knowledge, each with its own advantages and weaknesses, this article suggests a “hybrid” approach to the classification problem when the number of patients is too small to use a single technique effectively. The hybrid strategy was applied to a case study and its predictive performance was compared with that of each single approach; because misclassification of high-risk patients is so serious, special attention was paid to specificity. The results show that the hybrid strategy outperforms each single strategy involved.

4.
One unknown element of an n-element set is sought by asking whether it is contained in given subsets. The question sets are assumed to be of size at most k, and all questions are decided in advance; the choice of the next question cannot depend on previous answers. At most l of the answers can be incorrect. The minimum number of such questions is determined when the order of magnitude of k is n^α with α < 1. The problem can be formulated as determining the maximum size of an l-error-correcting code (of length n) in which the number of ones in any given position is at most k.
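For readers who prefer a formula, the final sentence can be transcribed directly, assuming the standard convention that an l-error-correcting code has minimum Hamming distance at least 2l+1 (the notation d_H and the set name C are ours):

\[
\max\Bigl\{\, |C| \;:\; C \subseteq \{0,1\}^{n},\;\; d_H(c,c') \ge 2l+1 \ \ \forall\, c \neq c' \in C,\;\; \textstyle\sum_{c \in C} c_i \le k \ \text{for every position } i \Bigr\}.
\]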

5.
Although devised by Fisher in 1936, discriminant analysis is still evolving rapidly, as the complexity of contemporary data sets grows exponentially. Our classification rules address these complexities by modeling various correlations in higher-order data. Moreover, our classification rules are suited to data sets where the number of response variables is comparable to, or larger than, the number of observations. We assume that the higher-order observations have a separable variance-covariance matrix and two different Kronecker product structures on the mean vector. In this article, we develop quadratic classification rules among g different populations where each individual has κth-order (κ ≥ 2) measurements. We also provide computational algorithms for the maximum likelihood estimates of the model parameters and, from them, the sample classification rules.
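As a point of reference, here is a hedged sketch of the familiar first-order quadratic rule with a separable covariance; the article's κth-order rules with two Kronecker-structured mean models generalize this, and the symbols below are generic rather than the authors' notation. An observation x = vec(X) is assigned to the population g that maximizes

\[
d_g(x) \;=\; \log \pi_g \;-\; \tfrac{1}{2}\log\lvert\Sigma_g\rvert \;-\; \tfrac{1}{2}\,(x-\mu_g)^{\top}\Sigma_g^{-1}(x-\mu_g),
\qquad \Sigma_g \;=\; \Sigma_{g,1}\otimes\Sigma_{g,2},
\]

where \pi_g is the prior probability of population g, \mu_g its mean vector, and the Kronecker factorization expresses the separable variance-covariance assumption.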

6.
Clustering algorithms are important methods, widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer from a loss of cluster purity and become unstable when the input data streams contain many noisy variables. In this article, we propose a clustering algorithm for data streams with noisy variables. Simulation results show that our proposed method improves on previous studies by adding a variable-selection step to the clustering algorithm. The results of two experiments indicate that clustering data streams with variable selection is more stable and yields better purity than clustering without it. Another experiment on the KDD-CUP99 dataset also shows that our algorithm produces more stable results.
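As an illustration of why a variable-screening step helps, the rough sketch below (not the authors' algorithm; the scoring heuristic and the number of variables kept are assumptions) clusters a batch once with scikit-learn's MiniBatchKMeans, scores each variable by the fraction of its variance explained by the clusters, and re-clusters on the top-scoring variables only.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(5, 1, (200, 3))])
noise = rng.normal(0, 1, (400, 17))                      # 17 purely noisy variables
X = np.hstack([informative, noise])

labels = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0).fit_predict(X)
grand = X.mean(axis=0)
between = sum(np.sum(labels == c) * (X[labels == c].mean(axis=0) - grand) ** 2
              for c in np.unique(labels))
score = between / (X.var(axis=0) * len(X))               # share of each variable's variance explained by clusters
keep = np.argsort(score)[-3:]                            # retain the most informative variables
labels_clean = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0).fit_predict(X[:, keep])
print("kept variables:", sorted(int(i) for i in keep))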

7.
This paper is about techniques for clustering sequences such as nucleic or amino acids. Our application is defining viral subtypes of HIV on the basis of similarities among V3 loop region amino acids of the envelope (env) gene. The techniques introduced here could apply, with virtually no change, to other HIV genes as well as to other problems and data not necessarily of viral origin. As applied to quantitative data, these algorithms have found much use in engineering for compressing images and speech. They are called vector quantization and involve a mapping from a large number of possible inputs to a much smaller number of outputs. Many implementations, in particular those that go by the name generalized Lloyd or k-means, exist for choosing sets of possible outputs and mappings. Each attempts to maximize the similarity among inputs that map to any single output or, alternatively, to minimize some measure of distortion between input and output. Here, two standard types of vector quantization are brought to bear on the cited problem of clustering V3 loop amino acid sequences. The results of this clustering are compared with those of the well-known UPGMA algorithm, the unweighted pair group method with arithmetic averages.
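A minimal sketch of the k-means (generalized Lloyd) side of this idea, assuming aligned sequences of equal length and a simple one-hot encoding; the toy sequences, the one_hot helper and the choice of two clusters are illustrative, not the paper's data or distortion measure.

import numpy as np
from sklearn.cluster import KMeans

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def one_hot(seq):
    # Encode an aligned amino-acid sequence as a flat 0/1 vector.
    vec = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        if aa in ALPHABET:           # gaps/ambiguous symbols stay all-zero
            vec[i, ALPHABET.index(aa)] = 1.0
    return vec.ravel()

seqs = ["CTRPNNNTRKSI", "CTRPNNNTRKGI", "CIRPNNNTRKSV", "CTRPGNNTRKSI"]  # toy aligned fragments
X = np.vstack([one_hot(s) for s in seqs])

# Lloyd/k-means iterations minimize within-cluster squared distortion, the same
# objective used by the generalized Lloyd vector quantizer.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))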

8.
We consider the detection of changes in the means of a set of time series. The breakpoints are allowed to be series specific, and the series are assumed to be correlated. The correlation between the series is assumed to be constant over time but may take an arbitrary form. We show that such a dependence structure can be encoded in a factor model. Thanks to this representation, the breakpoints can be inferred via dynamic programming, which remains one of the most efficient algorithms for this task. We propose a model selection procedure to determine both the number of breakpoints and the number of factors. The proposed method is implemented in the FASeg R package, which is available on CRAN. We demonstrate the performance of our procedure through simulation experiments and present an application to geodesic data.
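The dynamic-programming idea can be illustrated on a single series. The sketch below is a generic optimal-partitioning recursion with a quadratic segment cost, not the factor-model method implemented in FASeg, and the helper names are ours.

import numpy as np

def segment_cost(prefix, prefix_sq, i, j):
    # Sum of squared deviations from the mean on y[i:j] (half-open), via prefix sums.
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n

def best_breakpoints(y, n_segments):
    # Exact dynamic program: minimal total cost over all partitions into n_segments pieces.
    n = len(y)
    prefix = np.concatenate(([0.0], np.cumsum(y, dtype=float)))
    prefix_sq = np.concatenate(([0.0], np.cumsum(np.asarray(y, float) ** 2)))
    cost = np.full((n_segments + 1, n + 1), np.inf)
    arg = np.zeros((n_segments + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1, i] + segment_cost(prefix, prefix_sq, i, j)
                if c < cost[k, j]:
                    cost[k, j], arg[k, j] = c, i
    bps, j = [], n                    # backtrack the segment boundaries
    for k in range(n_segments, 0, -1):
        j = arg[k, j]
        bps.append(j)
    return sorted(bps)[1:]            # drop the leading 0

y = np.r_[np.zeros(20), 3 + np.zeros(20), np.zeros(20)] + 0.1 * np.random.default_rng(0).standard_normal(60)
print(best_breakpoints(y, 3))         # expected: breakpoints near 20 and 40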

9.
On Testing Equality of Distributions of Technical Efficiency Scores
The challenge of the econometric problem in production efficiency analysis is that the efficiency scores to be analyzed are unobserved. Statistical properties have recently been established for a type of estimator popular in the literature, known as data envelopment analysis (DEA). This opens up a wide range of possibilities for well-grounded statistical inference about the true efficiency scores from their DEA estimates. In this paper we investigate the possibility of using existing tests for the equality of two distributions in such a context. Given the statistical complications pertinent to this context, we consider several approaches to adapting the Li test and explore their performance in terms of the size and power of the test in various Monte Carlo experiments. One of these approaches shows good performance for both the size and the power of the test, encouraging its use in empirical studies. We also present an empirical illustration analyzing the efficiency distributions of countries in the world, following up a recent study by Kumar and Russell (2002), and report very interesting results.

10.
The core of this paper is a dialogue between "Harry", an experienced survey statistician and "Fred", a young mathematical statistician. They have contrasting approaches to the problem of estimating a regression relationship from a stratified sample, but after one or two red herrings are dragged out, both realize that the situation is not as simple as they had supposed. The role played by the probabilities of selection is a central issue. Estimation sampling for means, totals, and ratios is also considered, and seen to be a special case of the general analytical sampling synthesis they had already agreed upon.

11.
The importance of variable selection in regression has grown in recent years as computing power has encouraged the modelling of data sets of ever-increasing size. Data mining applications in finance, marketing and bioinformatics are obvious examples. A limitation of nearly all existing variable selection methods is the need to specify the correct model before selection. When the number of predictors is large, model formulation and validation can be difficult or even infeasible. On the basis of the theory of sufficient dimension reduction, we propose a new class of model-free variable selection approaches. The methods proposed assume no model of any form, require no nonparametric smoothing and allow for general predictor effects. The efficacy of the methods proposed is demonstrated via simulation, and an empirical example is given.

12.
A tutorial on spectral clustering
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance, spectral clustering appears slightly mysterious, and it is not obvious why it works at all or what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
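A minimal sketch of the unnormalized spectral clustering algorithm described in the tutorial, assuming a Gaussian similarity graph; the toy data and the sigma value are illustrative choices.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    # Unnormalized spectral clustering: embed with Laplacian eigenvectors, then run k-means.
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))  # Gaussian similarity graph
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                               # unnormalized graph Laplacian L = D - W
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_clusters]                                  # first k eigenvectors as the embedding
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(U)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])  # two well-separated blobs
print(spectral_clustering(X, 2))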

13.
Cluster analysis is the distribution of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait according to some distance measure. Unlike classification, clustering requires first deciding the optimum number of clusters and then assigning the objects to the different clusters. Solving such problems for a large number of high-dimensional data points is quite complicated, and most existing algorithms do not perform properly. In the present work, a new clustering technique applicable to large data sets has been used to cluster the spectra of 702,248 galaxies and quasars, each having 1,540 points in the wavelength range imposed by the instrument. The proposed technique successfully discovered five clusters from this 702,248 × 1,540 data matrix.

14.
Concept drift in data stream classification is a frontier and difficult problem in data mining: the class distribution may drift as the data sequence evolves. Although algorithms that estimate dynamic drift and adjust the classifier accordingly have been proposed, existing algorithms do not perform well at estimating concept drift because examples from the target distribution are scarce, and the number of such examples strongly affects the quality of the estimate. We therefore propose a new parameter estimation method, called transfer estimation, which uses target distribution data together with similar-distribution theory to improve existing algorithms, so that concept drift in data stream classification can be correctly detected and estimated. Simulation experiments on synthetic and real data sets show that the improved algorithm outperforms existing algorithms at estimating concept drift in data stream classification.

15.
Most approaches to applying knowledge-based techniques to data analysis concentrate on context-independent statistical support. EXPLORA, however, was developed for subject-specific interpretation with regard to the contents of the data being analyzed (i.e. content interpretation). Its knowledge base therefore also includes the objects and semantic relations of the real system that produces the data. In this paper we describe the functional model representing the process of content interpretation, summarize the software architecture of the system, and give some examples of its applications by pilot users in survey analysis. EXPLORA addresses applications in which data are produced regularly and have to be analyzed in a routine way. The system systematically searches for statistical results (facts) to detect relations that could be overlooked by a human analyst. At the same time, EXPLORA helps overcome the large bulk of information that is usually still produced when presenting such data. A second knowledge process of content interpretation therefore consists in discovering messages about the data by condensing the facts. Approaches to inductive generalization developed for machine learning are used to identify common attribute values of the objects to which the facts relate. At a later stage the system searches for interesting facts by applying redundancy rules and domain-dependent selection rules. EXPLORA formulates the messages in terms of the domain, groups and orders them, and even provides flexible navigation in the fact spaces.

16.
In many financial applications, Poisson mixture regression models are commonly used to analyze heterogeneous count data. When fitting these models, the observed counts are assumed to come from two or more subpopulations, and parameter estimation is typically performed by maximum likelihood via the Expectation–Maximization algorithm. In this study, we briefly discuss the procedure for fitting Poisson mixture regression models by maximum likelihood, model selection and goodness-of-fit tests. These models are applied to a real data set for credit-scoring purposes. We aim to reveal the impact of demographic and financial variables in creating different groups of clients and to predict the group to which each client belongs, as well as the client's expected number of defaulted payments. The model's conclusions are very interesting, revealing that the population consists of three groups, in contrast to the traditional good-versus-bad categorization of credit-scoring systems.
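To illustrate the EM iterations, here is a hedged sketch for an intercept-only simplification, i.e. a plain k-component Poisson mixture without covariates; the initialization and the helper name are our own choices, not the authors' fitting code.

import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(y, k, n_iter=200, seed=0):
    # EM for a k-component Poisson mixture (no regressors).
    rng = np.random.default_rng(seed)
    weights = np.full(k, 1.0 / k)
    rates = rng.uniform(0.5, 1.5, size=k) * y.mean()
    for _ in range(n_iter):
        # E-step: posterior probability that each count came from each component
        resp = weights * poisson.pmf(y[:, None], rates)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and Poisson rates
        weights = resp.mean(axis=0)
        rates = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)
    return weights, rates

rng = np.random.default_rng(1)
y = np.concatenate([rng.poisson(0.5, 600), rng.poisson(3.0, 300), rng.poisson(9.0, 100)])
print(em_poisson_mixture(y, k=3))    # three groups with very different expected counts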

17.
Classification of high-dimensional data sets is a big challenge for statistical learning and data mining algorithms. To apply classification methods to high-dimensional data sets effectively, feature selection is an indispensable pre-processing step of the learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets that have a small sample size and a large number of features. A novel feature selection approach, named Four-Staged Feature Selection, is proposed to overcome the high-dimensional classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union and voting stages, respectively, to obtain the final feature subsets. Several statistical learning and data mining methods are carried out to verify the efficiency of the selected features. To test the adequacy of the proposed method, 10 different microarray data sets are employed, owing to their high number of features and small sample sizes.
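A much-simplified sketch of the filter and voting idea, assuming two scikit-learn filter metrics stand in for the paper's larger set of filters; the dataset, the value of k and the way votes are combined are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Small-sample, high-dimensional toy data (60 samples, 500 features)
X, y = make_classification(n_samples=60, n_features=500, n_informative=10, random_state=0)

k = 30
anova = SelectKBest(f_classif, k=k).fit(X, y).get_support()          # filter 1: ANOVA F-score
mi = SelectKBest(mutual_info_classif, k=k).fit(X, y).get_support()   # filter 2: mutual information

union = np.where(anova | mi)[0]      # candidate pool (union stage)
voted = np.where(anova & mi)[0]      # features both filters agree on (voting stage)
print(len(union), "candidates,", len(voted), "voted features:", voted[:10])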

18.
Co-movement Analysis of Stock Sector Indices Based on Association Rule Mining
This paper applies association rule algorithms from data mining to an empirical analysis of the relationships among sector indices in the Chinese stock market. Using the Apriori association rule algorithm, strong association rules among the sector indices of the Chinese stock market can be mined from a large amount of data and then described and evaluated visually. The resulting association rules can help market participants discover rule patterns of stock sector rotation and, on that basis, avoid risk in the securities market.
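A minimal sketch of how such sector co-movement rules can be mined, assuming hypothetical daily sector returns that are binarized into "sector rose" items before running Apriori with the mlxtend library; the sector names and numbers are made up for illustration.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

returns = pd.DataFrame({                         # hypothetical daily returns per sector index
    "banking":  [0.010, -0.020, 0.015, 0.007, -0.010, 0.020],
    "steel":    [0.012, -0.010, 0.020, 0.004, -0.020, 0.018],
    "property": [-0.005, -0.015, 0.010, 0.006, -0.012, 0.011],
})

up = returns > 0                                 # each trading day becomes a transaction of sectors that rose
itemsets = apriori(up, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])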

19.
An interesting feature of successive inter-event times in a Poisson renewal process, when observed through a superimposed (possibly random) grid, can be interpreted as an extended form of the ‘inspection paradox’. Probabilistic measures are determined, and an estimation and testing procedure is outlined that could be used to examine possible departures from randomness (the Poisson form). The problem arose from a study of security systems, and the results have important applications in that field.
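A small simulation, not the paper's derivation, illustrating the basic inspection paradox for exponential inter-event times: the interval that happens to contain a superimposed random observation time is, on average, about twice as long as a typical interval; the rate, horizon and grid size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
rate, horizon = 1.0, 10_000.0

gaps = rng.exponential(1.0 / rate, size=int(2 * rate * horizon))
events = np.cumsum(gaps)
events = events[events < horizon]

inspect = rng.uniform(0, horizon, size=5_000)       # superimposed random "grid" of observation times
idx = np.searchsorted(events, inspect)              # first event after each observation time
ok = (idx > 0) & (idx < len(events))                # keep observations bracketed by two events
containing = events[idx[ok]] - events[idx[ok] - 1]  # length of the interval containing each observation

print("mean of a typical gap:", gaps.mean())                 # about 1/rate
print("mean gap seen through the grid:", containing.mean())  # about 2/rate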

20.
The mixed-Weibull distribution has been used to model a wide range of failure data sets, and in many practical situations the number of components in the mixture is unknown. We therefore consider parameter estimation for a mixed-Weibull distribution and discuss the important issue of how to determine the number of components. Two approaches are proposed to solve this problem: one is the method of moments, and the other is a regularization-type fuzzy clustering algorithm. Finally, numerical examples and two real data sets are given to illustrate the features of the proposed approaches.
