Similar documents
Found 20 similar documents (search time: 31 ms)
1.
We develop classification rules for data that have an autoregressive circulant covariance structure under the assumption of multivariate normality. We also develop classification rules assuming a general circulant covariance structure. The new classification rules are efficient in reducing misclassification error rates when the number of observations is not large enough to estimate the unknown variance–covariance matrix. The proposed rules are validated through a simulation study and illustrated with a real data analysis; both confirm their effectiveness.
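As an illustration of the idea (not the authors' estimator), a circulant structure can be imposed by averaging the sample covariance along its circulant diagonals and plugging the result into a linear discriminant rule. The sketch below assumes a common covariance across classes and equal priors; the function names are hypothetical.

```python
import numpy as np

def circulant_project(S):
    """Average the entries of a (symmetric) sample covariance S along each
    circulant diagonal, producing a circulant covariance estimate."""
    p = S.shape[0]
    c = np.zeros(p)
    for i in range(p):
        for j in range(p):
            c[(j - i) % p] += S[i, j]
    c /= p
    # Row i of a circulant matrix is the first row cyclically shifted by i.
    return np.stack([np.roll(c, i) for i in range(p)])

def classify(x, means, C):
    """Linear discriminant rule with common covariance C and equal priors."""
    Cinv = np.linalg.inv(C)
    scores = [x @ Cinv @ m - 0.5 * m @ Cinv @ m for m in means]
    return int(np.argmax(scores))
```

Because a circulant matrix has only p free parameters instead of p(p+1)/2, far fewer observations are needed to estimate it, which is the source of the efficiency gain described above.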

2.
We study the problem of classification with multiple q-variate observations with and without time effect on each individual. We develop new classification rules for populations with certain structured and unstructured mean vectors and under certain covariance structures. The new classification rules are effective when the number of observations is not large enough to estimate the variance–covariance matrix. Computational schemes for maximum likelihood estimates of required population parameters are given. We apply our findings to two real data sets as well as to a simulated data set.

3.
Various aspects of the classification tree methodology of Breiman et al. (1984) are discussed. A method of displaying classification trees, called block diagrams, is developed. Block diagrams give a clear presentation of the classification and are useful both for pointing out features of the particular data set under consideration and for highlighting deficiencies in the classification method being used. Various splitting criteria are discussed; the usual Gini-Simpson criterion presents difficulties when there is a relatively large number of classes, and improved splitting criteria are obtained. One particular improvement is the introduction of adaptive anti-end-cut factors that take advantage of highly asymmetrical splits where appropriate. These use the number and mix of classes in the current node of the tree to identify whether it is likely to be advantageous to create a very small offspring node. A number of data sets are used as examples.
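The Gini-Simpson criterion mentioned above is easy to state concretely. The sketch below is a generic illustration (not the paper's adaptive variant): it computes node impurity from class counts and the impurity reduction of a candidate split.

```python
def gini(counts):
    """Gini-Simpson impurity 1 - sum(p_k^2) of a node with given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def split_gain(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left` and `right`."""
    n = sum(parent)
    return gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)
```

An end-cut factor, in this vocabulary, multiplies the gain by a penalty when one offspring node is very small; the adaptive version described above turns that penalty on or off depending on the class mix in the current node.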

4.
We present an estimating framework for quantile regression in which the usual L1-norm objective function is replaced by a smooth parametric approximation. An exact path-following algorithm is derived, leading to the well-known ‘basic’ solutions that interpolate exactly a number of observations equal to the number of parameters being estimated. We briefly discuss practical implications of the proposed approach, such as early stopping for large data sets and confidence intervals, along with additional topics for future research.
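One common smooth surrogate for the check function (a generic choice for illustration; the paper's specific parametric approximation may differ) replaces rho_tau(u) = u(tau - 1{u<0}) with tau*u + eps*log(1 + exp(-u/eps)), which converges to the check function as eps -> 0:

```python
import numpy as np

def smooth_check(u, tau, eps):
    """Smooth approximation of the quantile-regression check function;
    recovers rho_tau(u) = u * (tau - 1{u<0}) as eps -> 0."""
    # logaddexp keeps the computation stable for large |u| / eps
    return tau * u + eps * np.logaddexp(0.0, -u / eps)
```

Unlike the L1 check function, this surrogate is differentiable everywhere, which is what makes a path-following algorithm in the smoothing parameter possible.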

5.
In this article we study the problem of classification of three-level multivariate data, where multiple q-variate observations are measured at u sites and over p time points, under the assumption of multivariate normality. The new classification rules with certain structured and unstructured mean vectors and covariance structures are very efficient in small-sample scenarios, when the number of observations is not adequate to estimate the unknown variance–covariance matrix. These classification rules successfully model the correlation structure of successive repeated measurements over time. Computational algorithms for maximum likelihood estimates of the unknown population parameters are presented. Simulation results show that introducing sites into the classification rules improves their performance over existing classification rules without sites.

6.
A Bayesian multi-category kernel classification method is proposed. The algorithm classifies the projections of the data onto the principal axes of the feature space. The advantage of this approach is that the regression coefficients are identifiable and sparse, leading to large computational savings and improved classification performance. The degree of sparsity is regulated in a novel framework based on Bayesian decision theory. A Gibbs sampler is implemented to find the posterior distributions of the parameters, so that predictive probability distributions can be obtained for new data points, giving a more complete picture of the classification. The algorithm is aimed at high-dimensional data sets where the dimension of the measurements exceeds the number of observations. The applications considered in this paper are microarray, image processing and near-infrared spectroscopy data.
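The projection step can be sketched with a plain kernel PCA (linear kernel for brevity); this is a generic illustration of projecting data onto the principal axes of the feature space, not the paper's Bayesian machinery:

```python
import numpy as np

def kernel_pca_scores(X, d):
    """Project n observations onto the top d principal axes of the feature
    space via the eigendecomposition of the double-centred Gram matrix."""
    n = X.shape[0]
    K = X @ X.T                      # linear kernel, for brevity
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                   # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:d] # keep the d largest eigenvalues
    vals, vecs = vals[idx], vecs[:, idx]
    # scores = coordinates of the data along the principal axes
    return vecs * np.sqrt(np.maximum(vals, 0))
```

Because only the n x n Gram matrix is decomposed, the projection is cheap even when the number of measured variables greatly exceeds n, which is exactly the regime the method above targets.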

7.
The naïve Bayes rule (NBR) is a popular and often highly effective technique for constructing classification rules. This study examines the effectiveness of NBR as a method for constructing classification rules (credit scorecards) in the context of screening credit applicants (credit scoring). For this purpose, the study uses two real-world credit scoring data sets to benchmark NBR against linear discriminant analysis, logistic regression analysis, k-nearest neighbours, classification trees and neural networks. Of the two aforementioned data sets, the first one is taken from a major Greek bank whereas the second one is the Australian Credit Approval data set taken from the UCI Machine Learning Repository (available at http://www.ics.uci.edu/~mlearn/MLRepository.html). The predictive ability of scorecards is measured by the total percentage of correctly classified cases, the Gini coefficient and the bad rate amongst accepts. In each of the data sets, NBR is found to have a lower predictive ability than some of the other five methods under all measures used. Reasons that may negatively affect the predictive ability of NBR relative to that of alternative methods in the context of credit scoring are examined.
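The Gini coefficient used as a performance measure here equals 2*AUC - 1. A minimal computation from scorecard scores and good/bad labels (assuming no tied scores, with hypothetical variable names) is:

```python
import numpy as np

def gini_coefficient(scores, labels):
    """Gini = 2*AUC - 1, with AUC from the rank-sum (Mann-Whitney) formula.
    Assumes no tied scores, for brevity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = int(labels.sum())           # e.g. "bad" cases coded as 1
    n0 = len(labels) - n1
    auc = (ranks[labels == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
    return 2 * auc - 1
```

A Gini of 1 means the scorecard ranks every bad case above every good one; 0 means the ranking is no better than chance.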

8.
In this article we study a linear discriminant function of multiple m-variate observations at u-sites and over v-time points under the assumption of multivariate normality. We assume that the m-variate observations have a separable mean vector structure and a “jointly equicorrelated covariance” structure. The new discriminant function is very effective in discriminating individuals in a small sample scenario. No closed-form expression exists for the maximum likelihood estimates of the unknown population parameters, and their direct computation is nontrivial. An iterative algorithm is proposed to calculate the maximum likelihood estimates of these unknown parameters. A discriminant function is also developed for unstructured mean vectors. The new discriminant functions are applied to simulated data sets as well as to a real data set. Results illustrating the benefits of the new classification methods over the traditional one are presented.

9.
Real-world applications of association rule mining have well-known problems of discovering a large number of rules, many of which are not interesting or useful for the application at hand. Algorithms for closed and maximal itemset mining significantly reduce the volume of rules discovered and the complexity of the task, but the implications of their use, and the important differences in generalization power, precision and recall when they are used for classification, have not been examined. In this paper, we present a systematic evaluation of the association rules discovered by frequent, closed and maximal itemset mining algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed on a number of real-world datasets representing diverse data/item characteristics, and rule sets are evaluated in detail both as a whole and with respect to individual classes. Empirical results confirm that with a proper combination of data mining and statistical analysis, a large number of non-significant, redundant and contradictory rules can be eliminated while preserving relatively high precision and recall. More importantly, the results reveal important characteristics of, and differences between, frequent, closed and maximal itemsets for the classification task, and the effect of incorporating statistical/heuristic measures when optimizing such rule sets. With closed itemset mining already a preferred choice for reducing complexity and redundancy during rule generation, this study further confirms that association rules based on closed itemsets are also of better quality in terms of overall classification precision and recall, and precision and recall on individual class examples. Association rules based on maximal itemsets, which form a subset of the closed-itemset rules, prove insufficient in this regard and typically have worse recall and generalization power. Empirical results also expose a drawback of applying the confidence measure at the start of rule generation, as is typically done within the association rule framework: removing rules below a certain confidence threshold also removes the evidence that the higher-confidence rules are contradicted elsewhere in the data, so precision can be increased by discarding contradictory rules before the confidence constraint is applied.
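The relationship between frequent, closed and maximal itemsets can be made concrete with a brute-force miner (illustrative only; real miners such as Apriori or FP-growth avoid the exponential enumeration):

```python
from itertools import combinations

def mine(transactions, minsup):
    """Return the frequent, closed and maximal itemsets of a list of
    transactions (sets of items) at the given minimum support count."""
    items = sorted({i for t in transactions for i in t})
    support = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = sum(1 for t in transactions if set(cand) <= t)
            if s >= minsup:
                support[frozenset(cand)] = s
    frequent = set(support)
    # closed: no proper superset has the same support
    closed = {X for X in frequent
              if not any(X < Y and support[Y] == support[X] for Y in frequent)}
    # maximal: no proper superset is frequent at all
    maximal = {X for X in frequent if not any(X < Y for Y in frequent)}
    return frequent, closed, maximal
```

Maximal itemsets are always a subset of the closed ones, and closed itemsets of the frequent ones; the losses in recall described above come from the supports that the maximal representation discards.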

10.
We consider the classification of high-dimensional data under the strongly spiked eigenvalue (SSE) model. We create a new classification procedure on the basis of the high-dimensional eigenstructure in the high-dimension, low-sample-size context. We propose a distance-based classification procedure using a data transformation, and prove that it is consistent in terms of the misclassification rate. We examine the performance of the procedure in simulations and in real data analyses using microarray data sets.

11.
We explore criteria that data must meet in order for the Kruskal–Wallis test to reject the null hypothesis by computing the number of unique ranked datasets in the balanced case where each of the m alternatives has n observations. We show that the Kruskal–Wallis test tends to be conservative in rejecting the null hypothesis, and we offer a correction that improves its performance. We then compute the number of possible datasets producing unique rank-sums. The most commonly occurring data lead to an uncommonly small set of possible rank-sums. We extend prior findings about row- and column-ordered data structures.
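For reference, the Kruskal–Wallis statistic on m groups is H = 12/(N(N+1)) * sum_i n_i (Rbar_i - (N+1)/2)^2, where Rbar_i is the mean rank of group i. A minimal implementation (ignoring ties, which is safe only for continuous data, and without the correction proposed above) is:

```python
import numpy as np

def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic; no tie correction, for brevity."""
    data = np.concatenate(groups)
    N = data.size
    order = np.argsort(data)
    ranks = np.empty(N)
    ranks[order] = np.arange(1, N + 1)   # ranks 1..N of the pooled sample
    H, start = 0.0, 0
    for g in groups:
        n = len(g)
        rbar = ranks[start:start + n].mean()
        H += n * (rbar - (N + 1) / 2) ** 2
        start += n
    return 12.0 / (N * (N + 1)) * H
```

In the balanced case (m groups of n observations each) the statistic takes only finitely many values, which is what makes the exact enumeration described above feasible.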

12.
We consider the problem of optimal design of experiments for random effects models, especially population models, where a small number of correlated observations can be taken on each individual, while the observations corresponding to different individuals are assumed to be uncorrelated. We focus on c-optimal design problems and show that the classical equivalence theorem and the famous geometric characterization of Elfving (1952) from the case of uncorrelated data can be adapted to the problem of selecting optimal sets of observations for the n individual patients. The theory is demonstrated by finding optimal designs for a linear model with correlated observations and a nonlinear random effects population model, which is commonly used in pharmacokinetics.  相似文献   

13.
Registration of temporal observations is a fundamental problem in functional data analysis. Various frameworks have been developed over the past two decades in which registration is conducted via optimal time warping between functions. Comparison of functions based solely on time warping, however, may have limited applicability, in particular when certain constraints are desired in the registration. In this paper, we study registration with a norm-preserving constraint. A closely related problem is signal estimation, where the goal is to estimate the ground-truth template given random observations with both compositional and additive noise. We propose to adopt the Fisher–Rao framework to compute the underlying template, and mathematically prove that this framework leads to a consistent estimator. We then illustrate the constrained Fisher–Rao registration using simulations as well as two real data sets. The constrained method is found to be robust to additive noise and to have superior alignment and classification performance compared with conventional, unconstrained registration methods.
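The Fisher–Rao framework is usually implemented through the square-root velocity transform, under which the Fisher–Rao metric becomes the ordinary L2 metric. A minimal sketch of the transform (the norm-preserving constraint studied in the paper is an additional step not shown here):

```python
import numpy as np

def srvf(f, t):
    """Square-root velocity representation q = sign(f') * sqrt(|f'|).
    Under this transform the Fisher-Rao metric reduces to the L2 metric,
    so registration becomes an L2 alignment problem."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))
```

A useful property: the squared L2 norm of q equals the total variation of f, so constraints on the norm of q translate into constraints on the function itself.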

14.
The proportional odds model (POM) is commonly used in regression analysis to predict the outcome for an ordinal response variable. The maximum likelihood estimation (MLE) approach is typically used to obtain the parameter estimates. The likelihood estimates do not exist when the number of parameters, p, is greater than the number of observations, n. The MLE also does not exist if there are no overlapping observations in the data. When the number of parameters is less than the sample size but p approaches n, the likelihood estimates may not exist, and if they do, they may have quite large standard errors. An estimation method is proposed to address the last two issues, i.e. complete separation and the case when p approaches n, but not the case when p>n. The proposed method uses no penalty term; instead it uses pseudo-observations to regularize the observed responses by downweighting their effect so that they become close to the underlying probabilities. The estimates can be computed easily with all commonly used statistical packages that support fitting POMs with weights. The estimates are compared with the MLE in a simulation study and in an application to real data.

15.
Eunju Hwang 《Statistics》2017,51(4):904-920
For long-memory data sets such as the realized volatilities of financial assets, a sequential test is developed for the detection of structural mean breaks. The long memory, if any, is adjusted for by fitting an HAR (heterogeneous autoregressive) model to the data and taking the residuals. Our test applies the sequential test of Bai and Perron [Estimating and testing linear models with multiple structural changes. Econometrica. 1998;66:47–78] to these residuals. The large-sample validity of the proposed test is established in terms of the consistency of the estimated number of breaks and the asymptotic null distribution of the test statistic. A finite-sample Monte Carlo experiment reveals that the proposed test tends to produce unbiased break time estimates, while the usual sequential test of Bai and Perron tends to produce biased break times under long memory. The experiment also shows that the proposed test has a more stable size than the Bai and Perron test. Applied to two realized volatility data sets, the S&P index and the Korean won/US dollar exchange rate over the past seven years, the proposed test finds two or three breaks, while the Bai and Perron test finds eight or more.
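The HAR adjustment step regresses realized volatility on its lagged daily value and its 5-day and 22-day averages. A minimal OLS sketch (the window lengths are the conventional HAR choices, not taken from the paper) whose residuals would then feed the break test:

```python
import numpy as np

def har_design(rv):
    """Build the HAR regressors: intercept, lagged daily RV, and the
    weekly (5-day) and monthly (22-day) average RVs."""
    T = len(rv)
    rows, y = [], []
    for t in range(22, T):
        daily = rv[t - 1]
        weekly = rv[t - 5:t].mean()
        monthly = rv[t - 22:t].mean()
        rows.append([1.0, daily, weekly, monthly])
        y.append(rv[t])
    return np.array(rows), np.array(y)

def har_fit(rv):
    """OLS fit of the HAR model; the residuals absorb the long memory."""
    X, y = har_design(rv)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta
```

The point of the adjustment is that the HAR residuals behave like a short-memory series, for which the Bai and Perron sequential break test is valid.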

16.
In this paper, we extend the censored linear regression model with normal errors to Student-t errors. A simple EM-type algorithm for iteratively computing maximum likelihood estimates of the parameters is presented. To examine the performance of the proposed model, case-deletion and local influence techniques are developed to demonstrate its robustness against outlying and influential observations. This is done by analysing the sensitivity of the EM estimates under some usual perturbation schemes of the model or data and by inspecting some proposed diagnostic graphics. The efficacy of the method is verified through the analysis of simulated data sets and by modelling a real data set first analysed under normal errors. The proposed algorithm and methods are implemented in the R package CensRegMod.
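The robustness of the Student-t model comes from the E-step of the EM algorithm, which down-weights observations with large standardized residuals. A minimal sketch of the weights (for the uncensored case; the censoring adjustment used in the paper is omitted):

```python
def t_weights(residual, sigma, nu):
    """E-step weight for a Student-t error model with nu degrees of freedom:
    w = (nu + 1) / (nu + delta^2), where delta is the standardized residual.
    Large residuals receive small weights, which is the source of robustness."""
    delta2 = (residual / sigma) ** 2
    return (nu + 1.0) / (nu + delta2)
```

In the M-step these weights enter a weighted least-squares update, so an outlier with a huge residual has almost no influence on the estimates, unlike under normal errors where every weight is 1.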

17.
We study the non-parametric estimation of a continuous distribution function F based on the partially rank-ordered set (PROS) sampling design. A PROS sampling design first selects a random sample from the underlying population and uses judgement ranking to rank them into partially ordered sets, without measuring the variable of interest. The final measurements are then obtained from one of the partially ordered sets. Considering an imperfect PROS sampling procedure, we first develop the empirical distribution function (EDF) estimator of F and study its theoretical properties. Then, we consider the problem of estimating F, where the underlying distribution is assumed to be symmetric. We also find a unique admissible estimator of F within the class of nondecreasing step functions with jumps at observed values and show the inadmissibility of the EDF. In addition, we introduce a smooth estimator of F and discuss its theoretical properties. Finally, we expand on various numerical illustrations of our results via several simulation studies and a real data application and show the advantages of PROS estimates over their counterparts under the simple random and ranked set sampling designs.
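For comparison, the classical EDF from a simple random sample is the step function below; the PROS-based estimator studied in the paper changes how the sample is drawn and weighted, not this evaluation step:

```python
import numpy as np

def edf(sample):
    """Return the empirical distribution function F_n(t) = #{x_i <= t} / n
    of a sample, as a callable step function."""
    x = np.sort(np.asarray(sample, dtype=float))
    def F(t):
        return np.searchsorted(x, t, side='right') / x.size
    return F
```

The admissibility result described above is about exactly this class of estimators: nondecreasing step functions with jumps only at observed values.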

18.
Pharmacokinetic (PK) data often contain concentration measurements below the quantification limit (BQL). While specific values cannot be assigned to these observations, the observed BQL data are nevertheless informative: they are known to lie below the lower limit of quantification (LLQ). Treating BQL values as missing data violates the missing at random (MAR) assumption underlying the usual statistical methods, and therefore leads to biased or less precise parameter estimates. By definition, these data lie within the interval [0, LLQ] and can be treated as censored observations. Statistical methods that handle censored data, such as maximum likelihood and Bayesian methods, are thus useful for modelling such data sets. The main aim of this work was to investigate the impact of the amount of BQL observations on the bias and precision of parameter estimates in population PK models (non-linear mixed effects models in general) under the maximum likelihood method as implemented in SAS and NONMEM, and under a Bayesian approach using Markov chain Monte Carlo (MCMC) as applied in WinBUGS. A second aim was to compare these methods for dealing with BQL or censored data in a practical situation. The evaluation was illustrated by simulation based on a simple PK model: a number of data sets were simulated from a one-compartment, first-order elimination PK model, and several quantification limits were applied to each simulated data set to generate data sets with given amounts of BQL data, the average percentage of BQL ranging from 25% to 75%. The influence on the bias and precision of all population PK model parameters, such as clearance and volume of distribution, was explored and compared under each estimation approach. Copyright © 2009 John Wiley & Sons, Ltd.
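The censored-likelihood idea can be written down directly for a normal model: observed concentrations contribute the density, while BQL values contribute the probability mass below the LLQ. A minimal sketch (normal errors for simplicity; PK models typically work on the log-concentration scale):

```python
import math

def loglik_bql(y, censored, mu, sigma, llq):
    """Log-likelihood with BQL values treated as left-censored at the LLQ:
    observed values contribute the normal log-density, BQL values the
    log-probability of falling below the LLQ."""
    ll = 0.0
    for yi, c in zip(y, censored):
        if c:  # below the quantification limit: P(Y < LLQ)
            z = (llq - mu) / sigma
            ll += math.log(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
        else:  # observed: normal log-density at yi
            ll += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                   - (yi - mu) ** 2 / (2.0 * sigma ** 2))
    return ll
```

Maximizing this likelihood uses the partial information in the BQL observations instead of discarding them, which is why it outperforms the treat-as-missing approach described above.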

19.
20.
Approximate normality and unbiasedness of the maximum likelihood estimate (MLE) of the long-memory parameter H of a fractional Brownian motion hold reasonably well for sample sizes as small as 20 if the mean and scale parameter are known. We show in a Monte Carlo study that if the latter two parameters are unknown the bias and variance of the MLE of H both increase substantially. We also show that the bias can be reduced by using a parametric bootstrap procedure. In very large samples, maximum likelihood estimation becomes problematic because of the large dimension of the covariance matrix that must be inverted. To overcome this difficulty, we propose a maximum likelihood method based upon first differences of the data. These first differences form a short-memory process. We split the data into a number of contiguous blocks consisting of a relatively small number of observations. Computation of the likelihood function in a block then presents no computational problem. We form a pseudo-likelihood function consisting of the product of the likelihood functions in each of the blocks and provide a formula for the standard error of the resulting estimator of H. This formula is shown in a Monte Carlo study to provide a good approximation to the true standard error. The computation time required to obtain the estimate and its standard error from large data sets is an order of magnitude less than that required to obtain the widely used Whittle estimator. Application of the methodology is illustrated on two data sets.  
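The first-difference idea can be sketched directly: differencing fractional Brownian motion yields fractional Gaussian noise with a closed-form autocovariance, and the pseudo-likelihood is a product of within-block Gaussian likelihoods (the block size below is an arbitrary illustration, not the paper's choice):

```python
import numpy as np

def fgn_autocov(k, H, sigma2=1.0):
    """Autocovariance of fractional Gaussian noise (first differences of
    fractional Brownian motion) at lag k, for Hurst parameter H."""
    k = np.abs(k)
    return 0.5 * sigma2 * (np.abs(k + 1) ** (2 * H)
                           - 2.0 * k ** (2 * H)
                           + np.abs(k - 1) ** (2 * H))

def block_pseudo_loglik(diffs, H, block=50):
    """Sum of Gaussian log-likelihoods over contiguous blocks of the
    differenced series; each block needs only a block x block inverse,
    avoiding one huge covariance matrix."""
    ll = 0.0
    for s in range(0, len(diffs) - block + 1, block):
        x = diffs[s:s + block]
        idx = np.arange(block)
        C = fgn_autocov(idx[:, None] - idx[None, :], H)
        sign, logdet = np.linalg.slogdet(C)
        ll += -0.5 * (logdet + x @ np.linalg.solve(C, x)
                      + block * np.log(2.0 * np.pi))
    return ll
```

Maximizing the block pseudo-likelihood over H replaces one n x n inversion with n/block small ones, which is the source of the order-of-magnitude speed-up reported above.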
