Similar Documents
1.
A unit ω is to be classified into one of two correlated homoskedastic normal populations by the linear discriminant function known as the W classification statistic [T.W. Anderson, An asymptotic expansion of the distribution of studentized classification statistic, Ann. Statist. 1 (1973), pp. 964–972; T.W. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd edn, Wiley, New York, 1984; G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, John Wiley and Sons, New York, 1992]. The two populations studied here are two different states of the same population, such as two different states of a disease, where the population is the population of diseased patients. When a sample unit is observed in both states (populations), the observations made on it (which form a pair) become correlated. A training sample is unbalanced when not all sample units are observed in both states. Paired and unbalanced samples are natural in studies related to correlated populations. S. Bandyopadhyay and S. Bandyopadhyay [Choosing better training sample for classifying an individual into one of two correlated normal populations, Calcutta Statist. Assoc. Bull. 54(215–216) (2003), pp. 167–180] studied the effect of an unbalanced training sample structure on the performance of the W statistic in the univariate correlated normal set-up, with the aim of finding an optimal sampling strategy for a better classification rate. In this study, the results are extended to the multivariate case, with a discussion of applications in real scenarios.
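As a rough illustration of the W statistic referred to above, the sketch below classifies a new observation into one of two homoskedastic normal populations from a balanced, independently sampled training set. The dimensions, sample sizes and data are invented for illustration, and the paired-correlation and unbalanced-sample structure studied in the paper is not reproduced.

```python
import numpy as np

def w_statistic(x, xbar1, xbar2, S_pooled):
    """Anderson's W classification statistic for two homoskedastic normal
    populations: classify x into population 1 when W(x) >= 0."""
    Sinv = np.linalg.inv(S_pooled)
    return (x - 0.5 * (xbar1 + xbar2)) @ Sinv @ (xbar1 - xbar2)

# Illustrative balanced training samples (sizes and dimension are assumptions).
rng = np.random.default_rng(0)
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
X1 = rng.multivariate_normal(mu1, np.eye(2), size=30)
X2 = rng.multivariate_normal(mu2, np.eye(2), size=30)

xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S_pooled = ((len(X1) - 1) * np.cov(X1, rowvar=False)
            + (len(X2) - 1) * np.cov(X2, rowvar=False)) / (len(X1) + len(X2) - 2)

x_new = np.array([0.2, 0.9])
label = 1 if w_statistic(x_new, xbar1, xbar2, S_pooled) >= 0 else 2
print("classified into population", label)
```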

2.
Every day we face all kinds of risks, and insurance is in the business of providing us with a means to transfer or share these risks, usually to eliminate or reduce the resulting financial burden, in exchange for a predetermined price or tariff. Actuaries are considered professional experts in the economic assessment of uncertain events, and equipped with many statistical tools for analytics, they help formulate a fair and reasonable tariff associated with these risks. An important part of the process of establishing fair insurance tariffs is risk classification, which involves the grouping of risks into various classes that share a homogeneous set of characteristics allowing the actuary to reasonably price discriminate. This article is a survey paper on the statistical tools for risk classification used in insurance. Because of recent availability of more complex data in the industry together with the technology to analyze these data, we additionally discuss modern techniques that have recently emerged in the statistics discipline and can be used for risk classification. While several of the illustrations discussed in the paper focus on general, or non-life, insurance, several of the principles we examine can be similarly applied to life insurance. Furthermore, we also distinguish between a priori and a posteriori ratemaking. The former is a process which forms the basis for ratemaking when a policyholder is new and insufficient information may be available. The latter process uses additional historical information about policyholder claims when this becomes available. In effect, the resulting a posteriori premium allows one to correct and adjust the previous a priori premium, making the price discrimination even more fair and reasonable.
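The a priori / a posteriori distinction can be made concrete with a Bühlmann-style credibility update. This is a generic textbook sketch, not one of the specific tariff models surveyed in the article, and the numbers are invented.

```python
# A minimal sketch of an a posteriori premium correction via Buhlmann credibility.
# The a priori premium mu comes from risk classification; the a posteriori
# premium shifts towards the policyholder's own claim experience as it accrues.
def credibility_premium(claims, mu, k):
    """claims: observed claims for one policyholder; mu: a priori (class)
    premium; k: credibility constant (within- over between-policyholder
    variance ratio)."""
    n = len(claims)
    if n == 0:
        return mu                      # new policyholder: a priori premium only
    z = n / (n + k)                    # credibility factor in [0, 1)
    xbar = sum(claims) / n
    return z * xbar + (1 - z) * mu     # a posteriori premium

print(credibility_premium([], mu=500.0, k=3.0))                    # 500.0
print(credibility_premium([200.0, 800.0, 950.0], mu=500.0, k=3.0)) # 575.0
```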

3.
Models for multiple-test screening data generally require the assumption that the tests are independent conditional on disease state. This assumption may be unreasonable, especially when the biological basis of the tests is the same. We propose a model that allows for correlation between two diagnostic test results. Since models that incorporate test correlation involve more parameters than can be estimated with the available data, posterior inferences will depend more heavily on prior distributions, even with large sample sizes. If we have reasonably accurate information about one of the two screening tests (perhaps the standard currently used test) or the prevalences of the populations tested, accurate inferences about all the parameters, including the test correlation, are possible. We present a model for evaluating dependent diagnostic tests and analyse real and simulated data sets. Our analysis shows that, when the tests are correlated, a model that assumes conditional independence can perform very poorly. We recommend that, if the tests are only moderately accurate and measure the same biological responses, researchers use the dependence model for their analyses.
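A small simulation can show why conditional independence fails when two tests share the same biological basis. The prevalence, test cut-offs and the shared-latent-severity mechanism below are invented for illustration and are not the model or data analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, prevalence = 100_000, 0.1
disease = rng.random(n) < prevalence

# Both tests respond to the same latent severity, which induces correlation
# conditional on disease status (the situation the dependence model targets).
severity = rng.normal(size=n) + 1.5 * disease
test1 = (severity + rng.normal(size=n)) > 1.0
test2 = (severity + rng.normal(size=n)) > 1.0

d = disease
p1, p2 = test1[d].mean(), test2[d].mean()        # marginal sensitivities
p12 = (test1[d] & test2[d]).mean()               # joint sensitivity
print(f"P(T1+,T2+|D) = {p12:.3f}  vs  P(T1+|D)P(T2+|D) = {p1 * p2:.3f}")
# Under conditional independence these would match; here the joint probability
# exceeds the product, so an independence model would be misspecified.
```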

4.
A new method of discrimination and classification based on a Hausdorff-type distance is proposed. For two groups, the distance is defined as the sum, over the two directions, of the furthest distance from the elements of one set to their nearest elements in the other. This distance has some useful properties and is exploited in developing a discriminant criterion between individual objects belonging to two groups based on a finite number of classification variables. The discrimination criterion is generalized to more than two groups in a couple of ways. Several data sets are analysed and their classification accuracy is compared to that obtained from the linear discriminant function, and the results are encouraging. The method is simple, lends itself to parallel computation and imposes less stringent conditions on the data.
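The summed directed Hausdorff distances can be computed with SciPy as sketched below. The classification step shown here (assigning a new object to the group closest to the singleton set containing it) is only one plausible reading of the criterion and is not taken from the paper; the data are synthetic.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_sum(A, B):
    """Sum of the two directed Hausdorff distances between point sets A and B
    (the 'sum' variant described in the abstract)."""
    return directed_hausdorff(A, B)[0] + directed_hausdorff(B, A)[0]

def classify(x, groups):
    """Hypothetical rule: assign x to the group whose point set is closest
    to the singleton {x} under the summed directed distances."""
    x = np.atleast_2d(x)
    dists = [hausdorff_sum(x, G) for G in groups]
    return int(np.argmin(dists)), dists

rng = np.random.default_rng(2)
G1 = rng.normal(loc=0.0, size=(25, 3))
G2 = rng.normal(loc=2.0, size=(25, 3))
print(classify(np.array([1.8, 2.1, 1.9]), [G1, G2]))
```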

5.
Data for studies of biological shape often consist of the locations of individually named points (landmarks) considered to be 'homologous' (to correspond biologically) from form to form. In 1917 D'Arcy Thompson introduced an elegant model of homology as deformation: the configuration of landmark locations for any one form is viewed as a finite sample from a smooth mapping representing its biological relationship to any other form of the data set. For data in two dimensions, multivariate statistical analysis of landmark locations may proceed unambiguously in terms of complex-valued shape coordinates (C − A)/(B − A) for sets of landmark triangles ABC. These are the coordinates of one vertex/landmark after scaling so that the remaining two vertices are at (0,0) and (1,0). Expressed in this fashion, the biological interpretation of the statistical analysis as a homology mapping would appear to depend on the triangulation. This paper introduces an analysis of landmark data and homology mappings using a hierarchy of geometric components of shape difference or shape change. Each component is a smooth deformation taking the form of a bivariate polynomial in the shape coordinates and is estimated in a manner nearly invariant with respect to the choice of a triangulation.
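The shape coordinates for one landmark triangle follow directly from complex arithmetic, as in the minimal sketch below; the landmark locations are made up and not taken from any data set in the paper.

```python
# Shape coordinates for a landmark triangle ABC: translate, rotate and scale
# so that A maps to (0,0) and B maps to (1,0); the image of C is then the
# complex number (C - A) / (B - A), whose real and imaginary parts are the
# two shape coordinates.
def shape_coordinates(A, B, C):
    A, B, C = complex(*A), complex(*B), complex(*C)
    z = (C - A) / (B - A)
    return z.real, z.imag

# Illustrative landmark locations.
print(shape_coordinates((1.0, 1.0), (4.0, 2.0), (2.0, 3.0)))   # (0.5, 0.5)
```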

6.
In spatial epidemiology, detecting areas with a high ratio of disease is important, as it may lead to identifying risk factors associated with the disease. This in turn may lead to further epidemiological investigations into the nature of the disease. Disease mapping studies have been widely performed considering only one disease in the estimated models. Simultaneous modelling of different diseases can also be a valuable tool, both from the epidemiological and from the statistical point of view. In particular, when we have several measurements recorded at each spatial location, one can consider multivariate models in order to handle the dependence among the multivariate components and the spatial dependence between locations. In this paper, spatial models that use multivariate conditionally autoregressive smoothing across the spatial dimension are considered. We study the patterns of incidence ratios and identify areas with consistently high ratio estimates as areas for further investigation. A hierarchical Bayesian approach using Markov chain Monte Carlo techniques is employed to simultaneously examine spatial trends of asthma visits by children and adults to hospital in the province of Manitoba, Canada, during 2000–2010.

7.
In a relapse clinical trial, patients who have recovered from some recurrent disease (e.g., ulcer or cancer) are examined at a number of predetermined times. A relapse can be detected either at one of these planned inspections or at a spontaneous visit initiated by the patient because of symptoms. In the first case the observation of the time to relapse, X, is interval-censored by two predetermined time-points. In the second case the upper endpoint of the interval is an observation of the time to symptoms, Y. To model the progression of the disease we use a partially observable Markov process. This approach results in a bivariate phase-type distribution for the joint distribution of (X,Y). It is a flexible model which contains several natural distributions for X, and allows the conditional distributions of the marginals to depend smoothly on each other. To estimate the distributions involved we develop an EM-algorithm. The estimation procedure is evaluated and compared with a non-parametric method in a couple of examples based on simulated data.

8.
The naïve Bayes rule (NBR) is a popular and often highly effective technique for constructing classification rules. This study examines the effectiveness of NBR as a method for constructing classification rules (credit scorecards) in the context of screening credit applicants (credit scoring). For this purpose, the study uses two real-world credit scoring data sets to benchmark NBR against linear discriminant analysis, logistic regression analysis, k-nearest neighbours, classification trees and neural networks. Of the two aforementioned data sets, the first one is taken from a major Greek bank whereas the second one is the Australian Credit Approval data set taken from the UCI Machine Learning Repository (available at http://www.ics.uci.edu/~mlearn/MLRepository.html). The predictive ability of scorecards is measured by the total percentage of correctly classified cases, the Gini coefficient and the bad rate amongst accepts. In each of the data sets, NBR is found to have a lower predictive ability than some of the other five methods under all measures used. Reasons that may negatively affect the predictive ability of NBR relative to that of alternative methods in the context of credit scoring are examined.
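A benchmarking exercise of this kind can be sketched with scikit-learn. The bank data used in the study are not reproduced here, so the sketch runs naive Bayes and logistic regression on a synthetic stand-in data set and reports accuracy and the Gini coefficient (computed as 2·AUC − 1).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a credit-scoring data set.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("naive Bayes", GaussianNB()),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    acc = accuracy_score(y_te, model.predict(X_te))
    gini = 2 * roc_auc_score(y_te, p) - 1      # Gini coefficient from the AUC
    print(f"{name:>20}: accuracy={acc:.3f}  Gini={gini:.3f}")
```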

9.
10.
Logistic regression is frequently used for classifying observations into two groups. Unfortunately there are often outlying observations in a data set and these might affect the estimated model and the associated classification error rate. In this paper, the authors study the effect of observations in the training sample on the error rate by deriving influence functions. They obtain a general expression for the influence function of the error rate, and they compute it for the maximum likelihood estimator as well as for several robust logistic discrimination procedures. Besides being of interest in their own right, the influence functions are also used to derive asymptotic classification efficiencies of different logistic discrimination rules. The authors also show how influential points can be detected by means of a diagnostic plot based on the values of the influence function.

11.
The location model is a familiar basis for discriminant analysis of mixtures of categorical and continuous variables. Its usual implementation involves second-order smoothing, using multivariate regression for the continuous variables and log-linear models for the categorical variables. In spite of the smoothing, these procedures still require many parameters to be estimated and this in turn restricts the categorical variables to a small number if implementation is to be feasible. In this paper we propose non-parametric smoothing procedures for both parts of the model. The number of parameters to be estimated is dramatically reduced and the range of applicability thereby greatly increased. The methods are illustrated on several data sets, and the performances are compared with a range of other popular discrimination techniques. The proposed method compares very favourably with all its competitors.

12.
Air quality control usually requires a monitoring system of multiple indicators measured at various points in space and time. Hence, the use of space–time multivariate techniques is of fundamental importance in this context, where decisions and actions regarding environmental protection should be supported by studies based on both inter-variable relations and spatial–temporal correlations. This paper describes how canonical correlation analysis can be combined with space–time geostatistical methods for analysing two spatially and temporally correlated aspects, such as air pollution concentrations and meteorological conditions. Hourly averages of three pollutants (nitric oxide, nitrogen dioxide and ozone) and three atmospheric indicators (temperature, humidity and wind speed) taken for two critical months (February and August) at several monitoring stations are considered and space–time variograms for the variables are estimated. Simultaneous relationships between such sample space–time variograms are determined through canonical correlation analysis. The most correlated canonical variates are used for describing synthetically the underlying space–time behaviour of the components of the two sets.
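The canonical correlation step can be illustrated with scikit-learn on two blocks of variables (pollutants versus meteorological indicators). The data below are synthetic and the geostatistical part (space–time variogram estimation) is not reproduced.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Synthetic hourly records: three pollutants vs. three meteorological
# indicators at matched space-time points (stand-in data only).
rng = np.random.default_rng(3)
n = 500
meteo = rng.normal(size=(n, 3))                 # temperature, humidity, wind
pollut = meteo @ rng.normal(size=(3, 3)) * 0.6 + rng.normal(size=(n, 3))

cca = CCA(n_components=2)
U, V = cca.fit_transform(pollut, meteo)         # canonical variates of each set
for k in range(2):
    r = np.corrcoef(U[:, k], V[:, k])[0, 1]
    print(f"canonical correlation {k + 1}: {r:.3f}")
```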

13.
In this study we are concerned with inference on the correlation parameter ρ of two Brownian motions, when only high-frequency observations from two one-dimensional continuous Itô semimartingales, driven by these particular Brownian motions, are available. Estimators for ρ are constructed in two situations: either when both components are observed (at the same time), or when only one component is observed and the other one represents its volatility process and thus has to be estimated from the data as well. In the first case it is shown that our estimator has the same asymptotic behaviour as the standard one for i.i.d. normal observations, whereas a feasible estimator can still be defined in the second framework, but with a slower rate of convergence.
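For the fully observed case, a natural estimator is the realized correlation of the high-frequency increments, sketched below on simulated Brownian paths; the second (latent-volatility) setting of the paper, and the exact estimator it proposes, are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n = 0.7, 10_000                    # true correlation, number of increments
dt = 1.0 / n

# Simulate two correlated Brownian motions on [0, 1] at high frequency.
dW1 = rng.normal(scale=np.sqrt(dt), size=n)
dZ = rng.normal(scale=np.sqrt(dt), size=n)
dW2 = rho * dW1 + np.sqrt(1 - rho**2) * dZ

# Realized-correlation estimator from the observed increments.
rho_hat = np.sum(dW1 * dW2) / np.sqrt(np.sum(dW1**2) * np.sum(dW2**2))
print(f"true rho = {rho},  estimate = {rho_hat:.3f}")
```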

14.
Although devised in 1936 by Fisher, discriminant analysis is still rapidly evolving, as the complexity of contemporary data sets grows exponentially. Our classification rules explore these complexities by modeling various correlations in higher-order data. Moreover, our classification rules are suitable for data sets where the number of response variables is comparable to or larger than the number of observations. We assume that the higher-order observations have a separable variance-covariance matrix and two different Kronecker product structures on the mean vector. In this article, we develop quadratic classification rules among g different populations where each individual has κth order (κ ≥ 2) measurements. We also provide the computational algorithms to compute the maximum likelihood estimates for the model parameters and eventually the sample classification rules.
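The computational benefit of a separable covariance is that, for Σ = A ⊗ B, the inverse is A⁻¹ ⊗ B⁻¹ and log|Σ| = q·log|A| + p·log|B| (with A of size p × p and B of size q × q). The sketch below evaluates a quadratic discriminant score under this structure; the mean structures, estimators and algorithms of the article are not reproduced, and the parameter values are invented.

```python
import numpy as np

def quadratic_score(x, mu, A, B, log_prior=0.0):
    """Quadratic discriminant score for one population when the covariance of
    the vectorized observation is separable: Sigma = kron(A, B)."""
    p, q = A.shape[0], B.shape[0]
    Sinv = np.kron(np.linalg.inv(A), np.linalg.inv(B))      # Sigma^{-1}
    logdet = q * np.linalg.slogdet(A)[1] + p * np.linalg.slogdet(B)[1]
    d = x - mu
    return -0.5 * d @ Sinv @ d - 0.5 * logdet + log_prior

# Illustrative two-population rule for matrix-valued (2nd-order) observations.
rng = np.random.default_rng(5)
p, q = 3, 4
A, B = np.eye(p), np.eye(q)
mu1, mu2 = np.zeros(p * q), np.ones(p * q)
x = rng.normal(size=p * q) + 0.8
label = 1 if quadratic_score(x, mu1, A, B) >= quadratic_score(x, mu2, A, B) else 2
print("classified into population", label)
```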

15.
The objective of this paper is to investigate through simulation the possible presence of the incidental parameters problem when performing frequentist model discrimination with stratified data. In this context, model discrimination amounts to considering a structural parameter taking values in a finite space, with k points, k≥2. This setting seems to have not yet been considered in the literature about the Neyman–Scott phenomenon. Here we provide Monte Carlo evidence of the severity of the incidental parameters problem also in the model discrimination setting and propose a remedy for a special class of models. In particular, we focus on models that are scale families in each stratum. We consider traditional model selection procedures, such as the Akaike and Takeuchi information criteria, together with the best frequentist selection procedure based on maximization of the marginal likelihood induced by the maximal invariant, or of its Laplace approximation. Results of two Monte Carlo experiments indicate that when the sample size in each stratum is fixed and the number of strata increases, correct selection probabilities for traditional model selection criteria may approach zero, unlike what happens for model discrimination based on exact or approximate marginal likelihoods. Finally, two examples with real data sets are given.

16.
In practice, when a principal component analysis is applied to a large number of variables the resultant principal components may not be easy to interpret, as each principal component is a linear combination of all the original variables. Selection of a subset of variables that contains, in some sense, as much information as possible and enhances the interpretation of the first few covariance principal components is one possible approach to tackle this problem. This paper describes several variable selection criteria and investigates which criteria are best for this purpose. Although some criteria are shown to be better than others, the main message of this study is that it is unwise to rely on only one or two criteria. It is also clear that the interdependence between variables and the choice of how to measure closeness between the original components and those using subsets of variables are both important in determining the best criteria to use.
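One of many possible closeness measures between the full-data components and those computed from a variable subset is the RV coefficient; the sketch below ranks subsets by it on synthetic data. This is only an illustrative criterion and subset size, not the specific criteria compared in the paper.

```python
import numpy as np
from itertools import combinations

def rv_coefficient(X, Y):
    """RV coefficient: one way to measure closeness between two sets of
    component scores computed on the same n observations."""
    Sx, Sy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
    return np.trace(Sxy @ Sxy.T) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

def pc_scores(X, k):
    """Scores on the first k covariance principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy data: rank every 3-variable subset of 6 variables by how closely its
# first 2 PCs reproduce the first 2 PCs of the full data.
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)    # built-in redundancy
full = pc_scores(X, 2)
best = max(combinations(range(6), 3),
           key=lambda s: rv_coefficient(full, pc_scores(X[:, list(s)], 2)))
print("best 3-variable subset:", best)
```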

17.
This paper studies the effect of autocorrelation on the smoothness of the trend of a univariate time series estimated by means of penalized least squares. An index of smoothness is deduced for the case of a time series represented by a signal-plus-noise model, where the noise follows an autoregressive process of order one. This index is useful for measuring the distortion of the amount of smoothness by incorporating the effect of autocorrelation. Different autocorrelation values are used to appreciate the numerical effect on smoothness for estimated trends of time series with different sample sizes. For comparative purposes, several graphs of two simulated time series are presented, where the estimated trend is compared with and without autocorrelation in the noise. Some findings are as follows. On the one hand, when the autocorrelation is negative (no matter how large) or positive but small, the estimated trend gets very close to the true trend. Even in this case, the estimation is improved by fixing the index of smoothness according to the sample size. On the other hand, when the autocorrelation is positive and large, the simulated and estimated trends lie far away from the true trend. This situation is mitigated by fixing an appropriate index of smoothness for the estimated trend in accordance with the sample size at hand. Finally, an empirical example serves to illustrate the use of the smoothness index when estimating the trend of Mexico’s quarterly GDP.
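The basic penalized least squares trend (a Whittaker–Henderson / Hodrick–Prescott type estimator) minimizes the sum of squared residuals plus a penalty on squared second differences, and has a closed-form solution. The sketch below applies it to a simulated signal-plus-AR(1)-noise series; the paper's smoothness index and its autocorrelation adjustment are not reproduced, and the trend, AR coefficient and penalty value are illustrative choices.

```python
import numpy as np

def pls_trend(y, lam):
    """Penalized least squares trend: minimize ||y - tau||^2 + lam*||D2 tau||^2,
    where D2 is the second-difference operator; solution
    tau = (I + lam * D2'D2)^{-1} y."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n second differences
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

# Signal-plus-noise example with AR(1) noise, as in the paper's set-up.
rng = np.random.default_rng(6)
t = np.arange(200)
trend = 0.001 * (t - 100) ** 2
phi = 0.6                                         # positive autocorrelation
noise = np.zeros(len(t))
for i in range(1, len(t)):
    noise[i] = phi * noise[i - 1] + rng.normal()
y = trend + noise
tau_hat = pls_trend(y, lam=1600.0)
print(np.round(tau_hat[:5], 2))
```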

18.
In many applications of generalized linear mixed models to clustered, correlated, or longitudinal data, we are often interested in testing whether a random effects variance component is zero. The usual asymptotic mixture of chi-square distributions of the score statistic for testing constrained variance components does not necessarily hold. In this article, the author proposes and explores a parametric bootstrap test that appears to be valid based on its estimated level of significance under the null hypothesis. Results from a simulation study indicate that the bootstrap test has a level much closer to the nominal one (the asymptotic test being conservative) and is more powerful than the usual asymptotic score test based on a mixture of chi-squares. The proposed bootstrap test is illustrated using two sets of real-life data obtained from clinical trials.
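The general bootstrap recipe (fit under the null, simulate, refit, compare) can be sketched for the simpler case of a linear random-intercept model with a likelihood-ratio statistic; the article works with generalized linear mixed models and a score-type statistic, so this is a simplified analogue, and the data layout and replicate count are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

def lrt_stat(y, X, groups):
    """Likelihood-ratio statistic for H0: random-intercept variance = 0."""
    ll0 = sm.OLS(y, X).fit().llf
    ll1 = sm.MixedLM(y, X, groups=groups).fit(reml=False).llf
    return max(2 * (ll1 - ll0), 0.0)

# Toy clustered data: 20 clusters of size 5, one covariate, generated under H0.
n_groups, m = 20, 5
groups = np.repeat(np.arange(n_groups), m)
X = sm.add_constant(rng.normal(size=n_groups * m))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n_groups * m)

obs = lrt_stat(y, X, groups)

# Parametric bootstrap: simulate from the fitted null model and recompute.
null_fit = sm.OLS(y, X).fit()
boot = []
for _ in range(200):                 # more replicates would be used in practice
    y_b = X @ null_fit.params + rng.normal(scale=np.sqrt(null_fit.scale),
                                           size=len(y))
    boot.append(lrt_stat(y_b, X, groups))
p_value = np.mean(np.array(boot) >= obs)
print(f"bootstrap p-value = {p_value:.3f}")
```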

19.
This paper deals with the analysis of datasets where the subjects are described by the estimated means of a p-dimensional variable. Classical statistical methods of data analysis do not treat measurements affected by intrinsic variability, as in the case of estimates, so that the heterogeneity induced among subjects by this condition is not taken into account. In this paper a way to solve the problem is suggested in the context of symbolic data analysis, whose specific aim is to handle data tables where single-valued measurements are replaced by complex data structures like frequency distributions, intervals, and sets of values. A principal component analysis is carried out according to this proposal, with a significant improvement in the treatment of information.

20.
In this article, a condition-based maintenance policy is proposed for a linear consecutive-k-out-of-n:F system. The failure times of components are assumed to be independent and identically distributed. It is assumed that the component states in the system can be known at any time and that system failure can be detected immediately. The preventive maintenance action is based on the number of working components in minimal cut sets of the system. If there is at least one minimal cut set consisting of only one working component, the system is maintained preventively after a certain time interval. The proposed policy is compared with corrective maintenance and age-based maintenance policies. As an extended case, it is assumed that the component states can only be known by inspection, but that system failure can be detected immediately. In this case, the system is inspected periodically and is also maintained preventively based on the system state at inspection. Numerical examples are studied to evaluate the performance of the proposed policy and to investigate the effects of cost parameters on the expected cost rate.
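The two structural checks the policy relies on, namely whether the system has failed and whether some minimal cut set has only one working component left, can be sketched as below for a linear consecutive-k-out-of-n:F system; the cost model and the choice of the maintenance interval are not reproduced, and the state vector shown is an arbitrary example.

```python
def system_failed(states, k):
    """A linear consecutive-k-out-of-n:F system fails when some window of k
    consecutive components contains only failed components (state 0)."""
    n = len(states)
    return any(sum(states[i:i + k]) == 0 for i in range(n - k + 1))

def pm_triggered(states, k):
    """Preventive-maintenance trigger used by the policy: at least one minimal
    cut set (a window of k consecutive components) has exactly one working
    component."""
    n = len(states)
    return any(sum(states[i:i + k]) == 1 for i in range(n - k + 1))

# 1 = working, 0 = failed; k = 3, n = 8 (illustrative values).
states = [1, 0, 0, 1, 1, 0, 1, 1]
print(system_failed(states, 3), pm_triggered(states, 3))   # False True
```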
