Similar Articles
20 similar articles retrieved (search time: 609 ms)
1.
Summary.  Non-hierarchical clustering methods are frequently based on the idea of forming groups around 'objects'. The main exponent of this class of methods is the k-means method, where these objects are points. However, clusters in a data set may often be due to certain relationships between the measured variables. For instance, we can find linear structures such as straight lines and planes, around which the observations are grouped in a natural way. These structures are not well represented by points. We present a method that searches for linear groups in the presence of outliers. The method is based on the idea of impartial trimming. We search for the 'best' subsample containing a proportion 1 − α of the data and the best k affine subspaces fitting those non-discarded observations, measuring discrepancies through orthogonal distances. The population version of the sample problem is also considered. We prove the existence of solutions for the sample and population problems together with their consistency. A feasible algorithm for solving the sample problem is described as well. Finally, some examples showing how the proposed method works in practice are provided.
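To make the impartial-trimming idea concrete, here is a minimal sketch (not the authors' algorithm, and for k lines in the plane only) of an alternating procedure: assign each point to its nearest line by orthogonal distance, discard the proportion α with the largest distances, and refit each line by PCA on its retained points. All function names and the initialization scheme are hypothetical.

```python
import numpy as np

def orth_dist(X, center, direction):
    # Orthogonal distance from each row of X to the affine line
    # {center + t * direction}, with direction a unit vector.
    D = X - center
    proj = np.outer(D @ direction, direction)
    return np.linalg.norm(D - proj, axis=1)

def trimmed_k_lines(X, k=2, alpha=0.1, n_iter=20, seed=0):
    # Sketch of trimmed k-lines: alternate assignment, trimming, refit.
    rng = np.random.default_rng(seed)
    n = len(X)
    keep = int(np.ceil((1 - alpha) * n))
    centers, dirs = [], []
    for _ in range(k):  # initialize each line from a random point pair
        i, j = rng.choice(n, size=2, replace=False)
        d = X[j] - X[i]
        dirs.append(d / np.linalg.norm(d))
        centers.append(X[i])
    for _ in range(n_iter):
        dists = np.stack([orth_dist(X, c, d) for c, d in zip(centers, dirs)])
        labels = dists.argmin(axis=0)
        best = dists.min(axis=0)
        retained = np.argsort(best)[:keep]  # impartial trimming step
        for g in range(k):                  # refit each line by PCA
            idx = retained[labels[retained] == g]
            if len(idx) < 2:
                continue
            c = X[idx].mean(axis=0)
            _, _, Vt = np.linalg.svd(X[idx] - c, full_matrices=False)
            centers[g], dirs[g] = c, Vt[0]
    return centers, dirs, labels, retained
```

The trimming step is "impartial" in the sense that the discarded proportion α is chosen by the data (the points with the largest orthogonal distances), not fixed in advance by the user pointing at suspect observations.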

2.
We consider regression analysis in generalized linear models when some covariates are incomplete. The incomplete covariates could be due to measurement error or be missing for some study subjects. We assume there exists a validation sample in which the data are complete and which is a simple random subsample of the whole sample. Based on the projection-solution method of Heyde (1997, Quasi-Likelihood and its Applications: A General Approach to Optimal Parameter Estimation. Springer, New York), a class of estimating functions is proposed to estimate the regression coefficients from the whole data. This method does not need a correctly specified parametric model for the incomplete covariates to yield a consistent estimate, and it avoids the 'curse of dimensionality' encountered in existing semiparametric methods. Simulation results show that the finite sample performance and efficiency of the proposed estimates are satisfactory. The approach is also computationally convenient, and hence can be applied in routine data analysis.

3.
We propose an approach that utilizes the Delaunay triangulation to identify a robust/outlier-free subsample. Given that the data structure of the non-outlying points is convex (e.g. of elliptical shape), this subsample can then be used to give a robust estimation of location and scatter (by applying the classical mean and covariance). The estimators derived from our approach are shown to have a high breakdown point. In addition, we provide a diagnostic plot to expand the initial subset in a data-driven way, further increasing the estimators’ efficiency.
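The paper's subset-identification procedure is more involved than this, but the intuition behind Delaunay-based outlier screening can be sketched as follows: points far from the convex bulk are connected to it only through unusually long Delaunay edges, so a large average incident edge length suggests an outlier. This is an illustrative sketch using SciPy, not the authors' method.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_outlier_scores(X):
    """Average incident Delaunay edge length per point.

    Points far from the bulk of the data are joined to it by long
    edges, so a large score flags a candidate outlier.
    """
    tri = Delaunay(X)
    edges = set()
    for simplex in tri.simplices:
        for i in range(len(simplex)):
            for j in range(i + 1, len(simplex)):
                a, b = sorted((simplex[i], simplex[j]))
                edges.add((a, b))
    sums = np.zeros(len(X))
    counts = np.zeros(len(X))
    for a, b in edges:
        d = np.linalg.norm(X[a] - X[b])
        sums[a] += d; sums[b] += d
        counts[a] += 1; counts[b] += 1
    return sums / counts
```

The points with the smallest scores would then form the initial robust subsample on which the classical mean and covariance are computed.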

4.
Sporting careers observed over a preset time interval can be partitioned into two distinct subsamples: individuals whose careers had already commenced at the start of the time interval (prevalent subsample) and individuals whose careers began during the time interval (incident subsample), together with the respective individual-level covariate data such as salary, height, weight, performance statistics, draft position, etc. Under the assumption of a proportional hazards model, we propose a partial likelihood estimator that models the effect of covariates on survival via an adjusted risk set sampling procedure for when the incident cohort data are used in conjunction with the prevalent cohort data. We use simulated failure time data to validate the combined cohort proportional hazards methodology and illustrate our model using an NBA data set for career durations measured between 1990 and 2008.

5.
We consider logistic regression with covariate measurement error. Most existing approaches require certain replicates of the error‐contaminated covariates, which may not be available in the data. We propose generalized method of moments (GMM) nonparametric correction approaches that use instrumental variables observed in a calibration subsample. The instrumental variable is related to the underlying true covariates through a general nonparametric model, and the probability of being in the calibration subsample may depend on the observed variables. We first take a simple approach adopting the inverse selection probability weighting technique using the calibration subsample. We then improve the approach based on the GMM using the whole sample. The asymptotic properties are derived, and the finite sample performance is evaluated through simulation studies and an application to a real data set.
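The first, simpler step described above can be sketched as solving an inverse-probability-weighted logistic score equation by Newton-Raphson, assuming the selection probabilities into the calibration subsample are known. The function name is hypothetical and this omits the measurement-error correction and the GMM improvement.

```python
import numpy as np

def ipw_logistic(X, y, pi, max_iter=50, tol=1e-10):
    """Weighted logistic regression solving the inverse-probability-
    weighted score equation  sum_i (1/pi_i) x_i (y_i - p_i) = 0  by
    Newton-Raphson.  X includes an intercept column; pi_i is the
    (known) probability that subject i enters the calibration subsample.
    """
    w = 1.0 / np.asarray(pi, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (w * (y - p))                     # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X      # weighted information
        step = np.linalg.solve(H, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

With equal selection probabilities the weights cancel and this reduces to the ordinary logistic MLE, which gives a simple sanity check.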

6.
In the present paper we develop second-order theory for the subsample bootstrap in the context of Pareto index estimation. We show that the bootstrap is not second-order accurate, in the sense that it fails to correct the first term describing departure from the limit distribution. Worse, even when the subsample size is chosen optimally, the error between the subsample bootstrap approximation and the true distribution is often an order of magnitude larger than that of the asymptotic approximation. To overcome this deficiency, we show that an extrapolation method, based quite literally on a mixture of asymptotic and subsample bootstrap methods, can lead to second-order correct confidence intervals for the Pareto index.
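The Pareto (tail) index is typically estimated by the Hill estimator, and the subsample bootstrap draws the estimator's distribution from subsamples of size b < n without replacement. A minimal sketch of both pieces follows (the extrapolation step itself is not shown); the function names are illustrative.

```python
import numpy as np

def hill_estimator(x, k):
    # Hill estimator of the tail index based on the k largest order statistics.
    xs = np.sort(x)[::-1]
    return 1.0 / np.mean(np.log(xs[:k] / xs[k]))

def subsample_dist(x, b, k_b, n_rep=500, seed=0):
    # Subsample bootstrap (size b < n, drawn without replacement)
    # distribution of the Hill estimator.
    rng = np.random.default_rng(seed)
    n = len(x)
    return np.array([
        hill_estimator(x[rng.choice(n, size=b, replace=False)], k_b)
        for _ in range(n_rep)
    ])
```

Quantiles of the subsample distribution would then be combined with the asymptotic normal approximation in the extrapolation step the abstract describes.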

7.
Errors in measurement frequently occur in observing responses. If case–control data are based on certain reported responses, which may not be the true responses, then we have contaminated case–control data. In this paper, we first show that ordinary logistic regression analysis based on contaminated case–control data can lead to seriously biased conclusions, as demonstrated by a theoretical argument, one example, and two simulation studies. We next derive the semiparametric maximum likelihood estimate (MLE) of the risk parameter of a logistic regression model when a validation subsample is available. The asymptotic normality of the semiparametric MLE is established, along with a consistent estimate of its asymptotic variance. Our example and two simulation studies show these estimates to have reasonable performance in finite sample situations.

8.
Parallel bootstrap is an extremely useful statistical method with good performance. In the present study, we introduce a working correlation matrix into the method, yielding what we call the parallel bootstrap matrix. We consider some of its properties and the optimal size of the subsample in smooth function models. We also present performance results for parallel bootstrap estimators and for subsample length selection on financial time series data.

9.
We propose a unified approach to the estimation of regression parameters under double-sampling designs, in which a primary sample consisting of data on the rough or proxy measures for the response and/or explanatory variables as well as a validation subsample consisting of data on the exact measurements are available. We assume that the validation sample is a simple random subsample from the primary sample. Our proposal utilizes a specific parametric model to extract the partial information contained in the primary sample. The resulting estimator is consistent even if such a model is misspecified, and it achieves higher asymptotic efficiency than the estimator based only on the validation data. Specific cases are discussed to illustrate the application of the estimator proposed.

10.
Clustered multinomial data with random cluster sizes commonly appear in health, environmental and ecological studies. Traditional approaches for analyzing clustered multinomial data contemplate two assumptions. One of these assumptions is that cluster sizes are fixed, whereas the other demands cluster sizes to be positive. Randomness of the cluster sizes may be the determinant of the within-cluster correlation and between-cluster variation. We propose a baseline-category mixed model for clustered multinomial data with random cluster sizes based on Poisson mixed models. Our orthodox best linear unbiased predictor approach to this model depends only on the moment structure of unobserved distribution-free random effects. Our approach also consolidates the marginal and conditional modeling interpretations. Unlike the traditional methods, our approach can accommodate both random and zero cluster sizes. Two real-life multinomial data examples, crime data and food contamination data, are used to illustrate our proposed methodology.

11.
"The [U.S.] Current Population Survey (CPS) reinterview sample consists of two subsamples: (a) a sample of CPS households is reinterviewed and the discrepancies between the reinterview responses and the original interview responses are reconciled for the purpose of obtaining more accurate responses..., and (b) a sample of CPS households, nonoverlapping with sample (a), is reinterviewed 'independently' of the original interview for the purpose of estimating simple response variance (SRV). In this article a model and estimation procedure are proposed for obtaining estimates of SRV from subsample (a) as well as the customary estimates of SRV from subsample (b).... Data from the CPS reinterview program for both subsamples (a) and (b) are analyzed both (1) to illustrate the methodology and (2) to check the validity of the CPS reinterview data. Our results indicate that data from subsample (a) are not consistent with the data from subsample (b) and provide convincing evidence that errors in subsample (a) are the source of the inconsistency."

12.
Often in applied econometric work, the sample of observations is split so that within each subsample the observations can reasonably be assumed to have the same parameter values. In this article I present a procedure for sample splitting that uses cluster analysis techniques. The procedure is illustrated with data from a cross-section of households obtaining subsamples with homogeneous demand parameters. The groups turn out to be determined, primarily, by occupation of the family head. Demand behavior is studied in each of the resulting groups.
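The general recipe above can be sketched in two steps: cluster observations on household characteristics, then estimate the regression separately within each cluster. This is a minimal illustration with a deterministic k-means and per-cluster OLS, not the article's procedure; all names are hypothetical.

```python
import numpy as np

def kmeans(Z, k, n_iter=50):
    # Minimal k-means on characteristics Z; deterministic init spreads
    # the initial centers along the first coordinate.
    idx = np.argsort(Z[:, 0])[np.linspace(0, len(Z) - 1, k).astype(int)]
    centers = Z[idx].astype(float).copy()
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for g in range(k):
            if np.any(labels == g):
                centers[g] = Z[labels == g].mean(axis=0)
    return labels

def split_sample_ols(X, y, labels):
    # One OLS fit per cluster; returns {cluster: coefficient vector}.
    return {g: np.linalg.lstsq(X[labels == g], y[labels == g], rcond=None)[0]
            for g in np.unique(labels)}
```

If the clusters really do have homogeneous parameters, the per-cluster coefficient vectors separate cleanly, which is the premise of the splitting procedure.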

13.
Abstract.  We consider classification of the realization of a multivariate spatial–temporal Gaussian random field into one of two populations with different regression mean models and factorized covariance matrices. Unknown means and common feature vector covariance matrix are estimated from training samples with observations correlated in space and time, assuming spatial–temporal correlations to be known. We present the first-order asymptotic expansion of the expected error rate associated with a linear plug-in discriminant function. Our results are applied to ecological data collected from the Lithuanian Economic Zone in the Baltic Sea.

14.
Projection techniques for nonlinear principal component analysis
Principal Components Analysis (PCA) is traditionally a linear technique for projecting multidimensional data onto lower dimensional subspaces with minimal loss of variance. However, there are several applications where the data lie in a lower dimensional subspace that is not linear; in these cases linear PCA is not the optimal method to recover this subspace and thus account for the largest proportion of variance in the data. Nonlinear PCA addresses the nonlinearity problem by relaxing the linear restrictions on standard PCA. We investigate both linear and nonlinear approaches to PCA both exclusively and in combination. In particular we introduce a combination of projection pursuit and nonlinear regression for nonlinear PCA. We compare the success of PCA techniques in variance recovery by applying linear, nonlinear and hybrid methods to some simulated and real data sets. We show that the best linear projection that captures the structure in the data (in the sense that the original data can be reconstructed from the projection) is not necessarily a (linear) principal component. We also show that the ability of certain nonlinear projections to capture data structure is affected by the choice of constraint in the eigendecomposition of a nonlinear transform of the data. Similar success in recovering data structure was observed for both linear and nonlinear projections.
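The reconstruction-error view of PCA used above is easy to demonstrate: for data lying on a parabola, a single linear component leaves substantial residual error even though the data are intrinsically one-dimensional. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    # Mean squared reconstruction error of a rank-n_components linear PCA.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]
    Xhat = (Xc @ V.T) @ V   # project onto the leading subspace and back
    return np.mean((Xc - Xhat) ** 2)
```

A nonlinear PCA that recovers the parabola's single curvilinear coordinate would reconstruct the same data with essentially zero error, which is the gap the paper's methods target.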

15.
When subjects are not found at home in a social survey, the question arises whether the subsample encountered at home, on the first or subsequent visits, is random or biased. A procedure is presented by which this question can be statistically tested, by comparing the decline rate in unfound subjects, over repeated visits, with that expected if the subsample were random or strongly biased. The randomness of the subsamples can be compared between the first and subsequent visits. The procedure can be carried out during a programme of revisits, to check quickly whether a satisfactory sample is being obtained.
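One simple version of the decline-rate comparison can be sketched as follows: if the at-home subsample is random with a constant at-home probability p, first-contact visit numbers are geometric, so the per-visit contact rate among the still-unfound should be roughly constant; a systematic trend suggests bias. This is an illustrative sketch, not the paper's test, and the function name is hypothetical.

```python
import numpy as np

def contact_rate_ratios(counts, n):
    """counts[k] = number of subjects first found at home on visit k+1,
    out of n subjects in total.  Under a random at-home subsample with
    constant at-home probability p, each ratio (first contacts on a
    visit, divided by subjects still unfound before it) estimates p."""
    counts = np.asarray(counts, dtype=float)
    unfound_before = n - np.concatenate(([0.0], np.cumsum(counts)[:-1]))
    return counts / unfound_before
```

In practice one would compare these ratios between the first and subsequent visits, which mirrors the comparison the abstract describes.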

16.
This paper focuses on bivariate kernel density estimation that bridges the gap between univariate and multivariate applications. We propose a subsampling-extrapolation bandwidth matrix selector that improves the reliability of the conventional cross-validation method. The proposed procedure combines a U-statistic expression of the mean integrated squared error and asymptotic theory, and can be used in both cases of diagonal bandwidth matrix and unconstrained bandwidth matrix. In the subsampling stage, one takes advantage of the reduced variability of estimating the bandwidth matrix at a smaller subsample size m (m < n); in the extrapolation stage, a simple linear extrapolation is used to remove the incurred bias. Simulation studies reveal that the proposed method reduces the variability of the cross-validation method by about 50% and achieves an expected integrated squared error that is up to 30% smaller than that of the benchmark cross-validation. It shows comparable or improved performance compared to other competitors across six distributions in terms of the expected integrated squared error. We prove that the components of the selected bivariate bandwidth matrix have an asymptotic multivariate normal distribution, and also present the relative rate of convergence of the proposed bandwidth selector.
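The core of the subsampling stage can be sketched as rescaling a bandwidth selected at subsample size m up to the full sample size n using the known convergence rate (h of order n^(-1/6) in the bivariate case). Here a normal-reference (Silverman-type) diagonal rule stands in for the paper's cross-validation selector; both function names are illustrative.

```python
import numpy as np

def silverman_bivariate(X):
    # Normal-reference diagonal bandwidths for a bivariate sample;
    # the rate n**(-1/6) is the standard bivariate KDE rate.
    n = len(X)
    return X.std(axis=0, ddof=1) * n ** (-1.0 / 6.0)

def extrapolate_bandwidth(h_m, m, n):
    # Subsampling-extrapolation step: a bandwidth selected on a
    # subsample of size m is rescaled to the full sample size n
    # using h ~ n**(-1/6).
    return h_m * (m / n) ** (1.0 / 6.0)
```

The gain is that selecting on the smaller subsample is much less variable, while the deterministic rescaling removes the size mismatch; the paper's linear extrapolation additionally corrects the bias this simple rescaling leaves behind.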

17.
We examine moving average (MA) filters for estimating the integrated variance (IV) of a financial asset price in a framework where high-frequency price data are contaminated with market microstructure noise. We show that the sum of squared MA residuals must be scaled to enable a suitable estimator of IV. The scaled estimator is shown to be consistent, first-order efficient, and asymptotically Gaussian distributed about the integrated variance under restrictive assumptions. Under more plausible assumptions, such as time-varying volatility, the MA model is misspecified. This motivates an extensive simulation study of the merits of the MA-based estimator under misspecification. Specifically, we consider nonconstant volatility combined with rounding errors and various forms of dependence between the noise and efficient returns. We benchmark the scaled MA-based estimator to subsample and realized kernel estimators and find that the MA-based estimator performs well despite the misspecification.
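Under the restrictive i.i.d. additive-noise case, observed returns follow an MA(1), r_t = e_t + θe_(t-1), and the residual sum of squares scaled by (1 + θ)^2 is consistent for IV (since IV = nσ_e²(1 + θ)²). The following is a minimal sketch of that scaling idea, with θ chosen by conditional least squares over a grid; it is not the authors' exact estimator and the function names are illustrative.

```python
import numpy as np

def ma1_residuals(r, theta):
    # Innovations from the MA(1) model r_t = e_t + theta * e_{t-1}.
    e = np.empty_like(r)
    prev = 0.0
    for t in range(len(r)):
        prev = r[t] - theta * prev
        e[t] = prev
    return e

def ma_iv_estimator(r, grid=np.linspace(-0.99, 0.99, 199)):
    """Scaled MA(1) estimator of integrated variance.

    theta minimizes the sum of squared innovations (conditional least
    squares); that residual sum of squares is then scaled by
    (1 + theta)**2, which is consistent for IV when the noise is
    i.i.d. additive so that returns are exactly MA(1)."""
    rss = [np.sum(ma1_residuals(r, th) ** 2) for th in grid]
    theta = grid[int(np.argmin(rss))]
    return (1.0 + theta) ** 2 * np.min(rss), theta
```

With bid-ask-type noise the first-order autocovariance of returns is negative, so the fitted θ is negative, and the (1 + θ)² factor shrinks the noise-inflated residual sum of squares back toward IV.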

19.
In biostatistical applications interest often focuses on the estimation of the distribution of time T between two consecutive events. If the initial event time is observed and the subsequent event time is only known to be larger or smaller than an observed monitoring time C, then the data conforms to the well-understood singly-censored current status model, also known as interval censored data, case I. Additional covariates can be used to allow for dependent censoring and to improve estimation of the marginal distribution of T. Assuming a wrong model for the conditional distribution of T, given the covariates, will lead to an inconsistent estimator of the marginal distribution. On the other hand, the nonparametric maximum likelihood estimator of F_T requires splitting the sample into several subsamples corresponding to particular values of the covariates, computing the NPMLE for every subsample and then taking an average. With a few continuous covariates the performance of the resulting estimator is typically miserable. In van der Laan and Robins (1996) a locally efficient one-step estimator is proposed for smooth functionals of the distribution of T, assuming nothing about the conditional distribution of T, given the covariates, but assuming a model for censoring, given the covariates. The estimators are asymptotically linear if the censoring mechanism is estimated correctly. The estimator also uses an estimator of the conditional distribution of T, given the covariates. If this estimate is consistent, then the estimator is efficient, and if it is inconsistent, then the estimator is still consistent and asymptotically normal. In this paper we show that the estimators can also be used to estimate the distribution function in a locally optimal way. Moreover, we show that the proposed estimator can be used to estimate the distribution based on interval censored data (T is now known to lie between two observed points) in the presence of covariates.
The resulting estimator also has a known influence curve so that asymptotic confidence intervals are directly available. In particular, one can apply our proposal to interval censored data without covariates. In Geskus (1992) the information bound for interval censored data with two uniformly distributed monitoring times (at the uniform distribution for T) has been computed. We show that the relative efficiency of our proposal w.r.t. this optimal bound equals 0.994, which is also reflected in finite sample simulations. Finally, the good practical performance of the estimator is shown in a simulation study. This revised version was published online in July 2006 with corrections to the Cover Date.
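For the covariate-free current status model mentioned above, the NPMLE of F_T at the ordered monitoring times is the isotonic regression of the ordered censoring indicators, computable by the pool-adjacent-violators algorithm. A minimal sketch (function names are illustrative):

```python
import numpy as np

def pava(y, w=None):
    # Pool-adjacent-violators: weighted isotonic (non-decreasing) fit.
    y = list(map(float, y))
    w = [1.0] * len(y) if w is None else list(map(float, w))
    vals, wts, cnts = [], [], []
    for yi, wi in zip(y, w):
        vals.append(yi); wts.append(wi); cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, w2, c2 = vals.pop(), wts.pop(), cnts.pop()
            v1, w1, c1 = vals.pop(), wts.pop(), cnts.pop()
            wn = w1 + w2
            vals.append((w1 * v1 + w2 * v2) / wn)
            wts.append(wn); cnts.append(c1 + c2)
    out = []
    for v, c in zip(vals, cnts):
        out.extend([v] * c)
    return np.array(out)

def npmle_current_status(C, delta):
    """NPMLE of F(t) = P(T <= t) from current status data: we observe
    the monitoring time C_i and delta_i = 1{T_i <= C_i} for each
    subject.  The NPMLE at the ordered C's is the isotonic regression
    of the correspondingly ordered delta's."""
    order = np.argsort(C)
    return np.asarray(C)[order], pava(np.asarray(delta)[order])
```

Averaging such NPMLEs over covariate subsamples is exactly the construction the abstract notes performs poorly with continuous covariates, motivating the one-step estimator.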

20.
Various approaches to obtaining estimates based on preliminary data are outlined. A case is then considered which frequently arises when selecting a subsample of units, the information for which is collected within a deadline that allows preliminary estimates to be produced. At the moment when these estimates have to be produced, it often occurs that, although the collection of data on subsample units is still not complete, information is available on a set of units which does not belong to the sample selected for the production of the preliminary estimates. An estimation method is proposed which allows all the data available on a given date to be used in full, and expressions for the expectation and variance of the estimator are derived. The proposal is based on two-phase sampling theory and on the hypothesis that the response mechanism is the result of random processes whose parameters can be suitably estimated. An empirical analysis of the performance of the estimator on the Italian Survey on building permits concludes the work. Sections 1-4 and the technical appendices were developed by Giorgio Alleva and Piero Demetrio Falorsi; Section 5 was done by Fabio Bacchini and Roberto Iannaccone. Piero Demetrio Falorsi is chief statistician at the Italian National Institute of Statistics (ISTAT); Giorgio Alleva is Professor of Statistics at University “La Sapienza” of Rome; Fabio Bacchini and Roberto Iannaccone are researchers at ISTAT.
