Similar Articles (20 results)
1.
ABSTRACT

Correlated bilateral data arise from stratified studies involving paired body organs within a subject. When inference on the scale of the risk difference is desired, one first needs to assess the assumption of homogeneity of the risk differences across strata. For testing this homogeneity, we propose eight methods derived, respectively, from weighted least squares (WLS), the Mantel-Haenszel (MH) estimator, the WLS method combined with the inverse hyperbolic tangent transformation, the test statistics based on their log-transformations, the modified score test statistic, and the likelihood ratio test statistic. Simulation results show that four of the tests perform well in general, with the tests based on the WLS method and the inverse hyperbolic tangent transformation always performing satisfactorily, even in small-sample designs. The methods are illustrated with a dataset.
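A minimal sketch of the weighted-least-squares idea behind such homogeneity tests (not the authors' exact eight statistics): stratum-specific risk-difference estimates are weighted by their inverse variances and compared against the pooled value with a chi-square statistic. The function name and the toy numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def wls_homogeneity_test(d, var_d):
    """WLS-style homogeneity test for stratum-specific risk differences d
    with estimated variances var_d (one entry per stratum)."""
    d, var_d = np.asarray(d, float), np.asarray(var_d, float)
    w = 1.0 / var_d                       # inverse-variance weights
    d_pooled = np.sum(w * d) / np.sum(w)  # pooled (common) risk difference
    q = np.sum(w * (d - d_pooled) ** 2)   # homogeneity statistic
    df = len(d) - 1
    return q, df, chi2.sf(q, df)

# toy example: risk differences and variances from three strata
q, df, p = wls_homogeneity_test([0.10, 0.14, 0.08], [0.002, 0.003, 0.0025])
print(q, df, p)
```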

2.
Abstract

This paper investigates the statistical analysis of grouped accelerated temperature cycling test data when the product lifetime follows a Weibull distribution. A log-linear acceleration equation is derived from the Coffin-Manson model. The problem is transformed into a constant-stress accelerated life test with grouped data and multiple acceleration variables. The Jeffreys prior and reference priors are derived. Maximum likelihood estimation and Bayesian estimation with objective priors are obtained by applying the technique of data augmentation. A simulation study shows that both methods perform well when the sample size is large, and that the Bayesian method performs better for small sample sizes.
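For context, the Coffin-Manson model relates cycles-to-failure to the temperature range roughly as N_f = A(ΔT)^(-β), so log lifetime is linear in log ΔT. The following hedged sketch fits that log-linear acceleration relation by ordinary least squares on made-up data; it does not reproduce the paper's grouped-data Weibull likelihood or the Bayesian data-augmentation scheme.

```python
import numpy as np

# Coffin-Manson-type relation: N_f = A * dT**(-beta), i.e.
# log(N_f) = log(A) - beta * log(dT) -- a log-linear acceleration equation.
dT = np.array([60.0, 80.0, 100.0, 120.0])        # temperature ranges (assumed data)
nf = np.array([9000.0, 4200.0, 2300.0, 1400.0])  # observed cycles to failure (assumed)

slope, intercept = np.polyfit(np.log(dT), np.log(nf), 1)
beta, logA = -slope, intercept
print("beta =", beta, "A =", np.exp(logA))

# extrapolate lifetime at a milder use condition, dT = 30
print("predicted cycles at dT=30:", np.exp(logA) * 30.0 ** (-beta))
```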

3.
Abstract

Cluster analysis is the partitioning of a data set into subsets (clusters) so that the data within each subset share some common trait according to a chosen distance measure. Unlike classification, clustering requires one first to decide the optimal number of clusters and then to assign the objects to the clusters. Solving such problems for a large number of high-dimensional data points is complicated, and most existing algorithms do not perform well. In the present work, a new clustering technique applicable to large data sets is used to cluster the spectra of 702,248 galaxies and quasars, each spectrum having 1,540 points over the wavelength range imposed by the instrument. The proposed technique successfully discovered five clusters in this 702,248 × 1,540 data matrix.

4.
The statistical inference problem on effect size indices is addressed using a series of independent two-armed experiments from k arbitrary populations. The effect size parameter simply quantifies the difference between two groups. It is a meaningful index to be used when data are measured on different scales. In the context of bivariate statistical models, we define estimators of the effect size indices and propose large-sample testing procedures to test the homogeneity of these indices. The null and non-null distributions of the proposed testing procedures are derived and their performance is evaluated via Monte Carlo simulation. Further, three types of interval estimation of the proposed indices are considered for both combined and uncombined data. Lower and upper confidence limits for the actual effect size indices are obtained and compared via bootstrapping. It is found that the length of the intervals based on the combined effect size estimator is almost half the length of the intervals based on the uncombined effect size estimators. Finally, we illustrate the proposed procedures for hypothesis testing and interval estimation using a real data set.
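One common effect-size index for two-armed experiments is the standardized mean difference. Below is a hedged sketch of estimating it per experiment and testing homogeneity across k experiments with a Q-type statistic, using the usual large-sample variance approximation; the paper's indices, tests and interval procedures may differ, and the function names and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def cohens_d(x1, x2):
    """Standardized mean difference with pooled SD, and its
    large-sample variance approximation."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    d = (np.mean(x1) - np.mean(x2)) / np.sqrt(sp2)
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return d, var_d

def homogeneity_Q(ds, vars_):
    """Cochran-type Q statistic for homogeneity of k effect sizes."""
    w = 1.0 / np.asarray(vars_)
    d_bar = np.sum(w * np.asarray(ds)) / np.sum(w)
    q = np.sum(w * (np.asarray(ds) - d_bar) ** 2)
    df = len(ds) - 1
    return q, df, chi2.sf(q, df)

rng = np.random.default_rng(0)
experiments = [(rng.normal(0.5, 1, 40), rng.normal(0.0, 1, 40)) for _ in range(4)]
ds, vs = zip(*(cohens_d(a, b) for a, b in experiments))
print(homogeneity_Q(ds, vs))
```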

5.
ABSTRACT

Despite the popularity of the general linear mixed model for data analysis, power and sample size methods and software are not generally available for commonly used test statistics and reference distributions. Statisticians resort to simulations with homegrown and uncertified programs or rough approximations which are misaligned with the data analysis. For a wide range of designs with longitudinal and clustering features, we provide accurate power and sample size approximations for inference about fixed effects in the linear models we call reversible. We show that under widely applicable conditions, the general linear mixed-model Wald test has noncentral distributions equivalent to well-studied multivariate tests. In turn, exact and approximate power and sample size results for the multivariate Hotelling–Lawley test provide exact and approximate power and sample size results for the mixed-model Wald test. The calculations are easily computed with a free, open-source product that requires only a web browser to use. Commercial software can be used for a smaller range of reversible models. Simple approximations allow accounting for modest amounts of missing data. A real-world example illustrates the methods. Sample size results are presented for a multicenter study on pregnancy. The proposed study, an extension of a funded project, has clustering within clinic. Exchangeability among the participants allows averaging across them to remove the clustering structure. The resulting simplified design is a single-level longitudinal study. Multivariate methods for power provide an approximate sample size. All proofs and inputs for the example are in the supplementary materials (available online).

6.
Statistics, 2012, 46(6): 1306-1328
ABSTRACT

In this paper, we consider testing the homogeneity of risk differences in independent binomial distributions, especially when data are sparse. Through theoretical and numerical studies, we point out drawbacks of existing tests in either controlling the nominal size or attaining power. The proposed test is designed to avoid these drawbacks. We present the asymptotic null distribution and the asymptotic power function of the proposed test. We also provide numerical studies, including simulations and real-data examples, showing that the proposed test gives reliable results compared with existing testing procedures.

7.
Classification of high-dimensional data sets is a major challenge for statistical learning and data mining algorithms. To apply classification methods effectively to high-dimensional data sets, feature selection is an indispensable pre-processing step of the learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets that have a small sample size and a large number of features. A novel feature selection approach, named Four-Staged Feature Selection, is proposed to overcome the high-dimensional data classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union and voting stages, respectively, to obtain the final feature subsets. Several statistical learning and data mining methods are carried out to verify the efficiency of the selected features. To test the adequacy of the proposed method, 10 different microarray data sets are employed owing to their high number of features and small sample size.
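A hedged sketch of the filtering-and-voting idea only (the semi-wrapper and union stages of the proposed method are omitted): each filter metric nominates its top-k features, and features nominated by both filters are retained. The scikit-learn scorers, the simulated data and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# many features, few samples -- the microarray-like regime described above
X, y = make_classification(n_samples=60, n_features=500, n_informative=10, random_state=1)

k = 30
filters = [f_classif, mutual_info_classif]
votes = np.zeros(X.shape[1], dtype=int)
for score_fn in filters:
    idx = SelectKBest(score_fn, k=k).fit(X, y).get_support(indices=True)
    votes[idx] += 1                     # each filter votes for its top-k features

selected = np.where(votes >= 2)[0]      # keep features chosen by both filters
print(len(selected), "features selected:", selected[:10])
```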

8.
Abstract

We propose a statistical method for clustering multivariate longitudinal data into homogeneous groups. This method relies on a time-varying extension of the classical K-means algorithm, where a multivariate vector autoregressive model is additionally assumed for modeling the evolution of clusters' centroids over time. Model inference is based on a least-squares method and on a coordinate descent algorithm. To illustrate our work, we consider a longitudinal dataset on human development. Three variables are modeled, namely life expectancy, education and gross domestic product.

9.
For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative hypothesis requires complex analytic approximations, and more importantly, these distributional approximations are feasible only for moderate dimension of the dependent variable, say p≤20. On the other hand, assuming that the data dimension p as well as the number q of regression variables are fixed while the sample size n grows, several asymptotic approximations have been proposed in the literature for Wilks' Λ, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension p and a large sample size n. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null hypothesis, and simulations demonstrate that the corrected LRT has very satisfactory size and power, certainly in the large-p, large-n context, but also for moderately large data dimensions such as p=30 or p=50. As a byproduct, we explain why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple-sample significance test in multivariate analysis of variance which is valid for high-dimensional data.
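For reference, here is a sketch of the classical (fixed, small-p) computation that the paper corrects: Wilks' Λ = det(E)/det(E+H) for one-way MANOVA, together with Bartlett's chi-square approximation. This is the approximation shown to fail in high dimensions; the random-matrix correction itself is not reproduced, and the names and simulated data are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def wilks_lambda_manova(groups):
    """Classical Wilks' Lambda for one-way MANOVA with Bartlett's
    chi-square approximation (valid for small, fixed p)."""
    X = np.vstack(groups)
    n, p = X.shape
    g = len(groups)
    grand = X.mean(axis=0)
    E = sum((grp - grp.mean(axis=0)).T @ (grp - grp.mean(axis=0)) for grp in groups)
    H = sum(len(grp) * np.outer(grp.mean(axis=0) - grand, grp.mean(axis=0) - grand)
            for grp in groups)
    lam = np.linalg.det(E) / np.linalg.det(E + H)
    stat = -(n - 1 - (p + g) / 2.0) * np.log(lam)   # Bartlett's correction factor
    df = p * (g - 1)
    return lam, stat, chi2.sf(stat, df)

rng = np.random.default_rng(2)
groups = [rng.normal(m, 1, size=(50, 5)) for m in (0.0, 0.0, 0.3)]
print(wilks_lambda_manova(groups))
```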

10.
Nearest Shrunken Centroid (NSC) classification has proven successful in ultra-high-dimensional classification problems involving thousands of features measured on relatively few individuals, such as in the analysis of DNA microarrays. The method requires the set of candidate classes to be closed. However, open-set classification is essential in many other applications including speaker identification, facial recognition, and authorship attribution. The authors review closed-set NSC classification, and then propose a diagnostic for whether open-set classification is needed. The diagnostic involves graphical and statistical comparison of posterior predictions of the test vectors to the observed test vectors. The authors propose a simple modification to NSC that allows the set of classes to be open. The open-set modification posits an unobserved class with a distribution of features just barely consistent with the test sample. A tuning constant reflects the combined considerations of power, specificity, multiplicity, number of features, and sample size. The authors illustrate and investigate properties of the diagnostic test and open-set NSC classification procedure using several example data sets. The diagnostic and the open-set NSC procedures are shown to be useful for identifying vectors that are not consistent with any of the candidate classes.
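A compact sketch of closed-set NSC along the lines of Tibshirani et al.'s shrunken centroids (soft-thresholding standardized centroid differences). The scaling convention and the omission of class priors are simplifying assumptions, and the paper's open-set modification and diagnostic are not reproduced.

```python
import numpy as np

def nsc_fit(X, y, delta, s0=1e-6):
    """Closed-set nearest shrunken centroids: shrink class centroids toward
    the overall centroid by soft-thresholding standardized differences."""
    classes = np.unique(y)
    n = len(y)
    overall = X.mean(axis=0)
    # pooled within-class standard deviation per feature
    s = np.sqrt(sum(((X[y == k] - X[y == k].mean(axis=0)) ** 2).sum(axis=0)
                    for k in classes) / (n - len(classes)))
    centroids = {}
    for k in classes:
        nk = (y == k).sum()
        mk = np.sqrt(1.0 / nk - 1.0 / n)        # one common standard-error convention
        d = (X[y == k].mean(axis=0) - overall) / (mk * (s + s0))
        d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)   # soft threshold
        centroids[k] = overall + mk * (s + s0) * d_shrunk
    return centroids, s + s0

def nsc_predict(x, centroids, s):
    """Assign x to the class with the smallest standardized distance
    to its shrunken centroid (class priors omitted)."""
    scores = {k: np.sum((x - c) ** 2 / s ** 2) for k, c in centroids.items()}
    return min(scores, key=scores.get)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 100)), rng.normal(0.8, 1, (20, 100))])
y = np.array([0] * 20 + [1] * 20)
cents, s = nsc_fit(X, y, delta=1.0)
print(nsc_predict(X[0], cents, s), nsc_predict(X[-1], cents, s))
```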

11.
ABSTRACT

Identifying homogeneous subsets of predictors in classification can be challenging in the presence of high-dimensional data with highly correlated variables. We propose a new method called cluster correlation-network support vector machine (CCNSVM) that simultaneously estimates clusters of predictors that are relevant for classification and coefficients of a penalized SVM. The new CCN penalty is a function of the well-known Topological Overlap Matrix, whose entries measure the strength of connectivity between predictors. CCNSVM implements an efficient algorithm that alternates between searching for predictors' clusters and optimizing a penalized SVM loss function using Majorization–Minimization tricks and a coordinate descent algorithm. Combining clustering and sparsity into a single procedure provides additional insight into the power of exploring dimension-reduction structure in high-dimensional binary classification. Simulation studies compare the performance of our procedure with its competitors. A practical application of CCNSVM to DNA methylation data illustrates its good behaviour.
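The CCN penalty is built from the Topological Overlap Matrix; below is a hedged sketch of the commonly used unsigned TOM computation for an adjacency matrix. The exact penalty and the SVM fitting algorithm are not reproduced, and the correlation-based adjacency construction here is only an illustrative assumption.

```python
import numpy as np

def topological_overlap(A):
    """Unsigned topological overlap for a symmetric adjacency matrix A with
    zero diagonal: TOM_ij = (L_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where L_ij = sum_u a_iu * a_uj and k_i is the connectivity of node i."""
    A = np.asarray(A, float)
    k = A.sum(axis=1)                      # node connectivities
    L = A @ A                              # shared-neighbour strength
    denom = np.minimum.outer(k, k) + 1.0 - A
    tom = (L + A) / denom
    np.fill_diagonal(tom, 1.0)
    return tom

# toy adjacency from absolute correlations of simulated predictors
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
A = np.abs(np.corrcoef(X, rowvar=False))
np.fill_diagonal(A, 0.0)
print(topological_overlap(A).round(2))
```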

12.
Abstract

In this paper, we propose maximum entropy in the mean methods for propensity score matching classification problems. We provide a new methodological approach and estimation algorithms to explicitly handle cases where data are available (i) in interval form; (ii) with bounded measurement or observational errors; or (iii) both as intervals and with bounded errors. We show that entropy-in-the-mean methods for these three cases generally outperform benchmark error-free approaches.

13.
Abstract

The homogeneity hypothesis is investigated in a location family of distributions. A moment-based test is introduced, based on data collected under a ranked set sampling scheme. The asymptotic distribution of the proposed test statistic is derived, and the performance of the test is studied via simulation. Furthermore, for small sample sizes, a bootstrap procedure is used to assess the homogeneity of the data. An illustrative example is also presented to explain the procedures proposed in this paper.

14.
Abstract

A new parametric hypothesis test of the mean interval for interval-valued data, which can handle the massive information contained in today's "big data" sets, is proposed. An approach using an orthogonal transformation is introduced to obtain an equivalent hypothesis test of the mean interval in terms of the mid-point and mid-range of the interval-valued variable. The new test is very efficient in small interval-valued sample scenarios. Simulation studies are conducted to investigate the sample size requirements and the power of the test. The performance of the proposed test is illustrated with two real-life examples.

15.
Abstract

The notions of (sample) mean, median and mode are common tools for describing the central tendency of a given probability distribution. In this article, we propose a new measure of central tendency, the sample monomode, which is related to the notion of sample mode. We also illustrate the computation of the sample monomode and propose a statistical test for discrete monomodality based on the likelihood ratio statistic.

16.
ABSTRACT

One main challenge for statistical prediction with data from multiple sources is that not all of the associated covariate data are available for many sampled subjects. Consequently, new statistical methodology is needed to handle this type of "fragmentary data", which has become increasingly common in recent years. In this article, we propose a novel method based on frequentist model averaging that fits candidate models using all available covariate data. The weights in the model averaging are selected by delete-one cross-validation based on the data from complete cases. The optimality of the selected weights is rigorously proved under some conditions. The finite-sample performance of the proposed method is confirmed by simulation studies. An example of personal income prediction based on real data from a leading e-community of wealth management in China is also presented for illustration.
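A hedged sketch of the weight-selection step only: candidate models are fitted on complete cases, delete-one (leave-one-out) cross-validated predictions are formed, and the averaging weights minimize the CV squared error over the simplex. The candidate models, simulated data and optimizer here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.optimize import minimize

# simulated "complete cases": response y and three covariates
rng = np.random.default_rng(5)
X = rng.normal(size=(80, 3))
y = 1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.5, size=80)

subsets = [[0], [0, 1], [0, 1, 2]]   # candidate models use nested covariate subsets

def loo_predictions(X, y, cols):
    """Delete-one predictions from an OLS fit using the given covariates."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        Z = np.column_stack([np.ones(keep.sum()), X[keep][:, cols]])
        beta, *_ = np.linalg.lstsq(Z, y[keep], rcond=None)
        preds[i] = np.concatenate(([1.0], X[i, cols])) @ beta
    return preds

P = np.column_stack([loo_predictions(X, y, c) for c in subsets])   # n x M matrix

M = len(subsets)
res = minimize(lambda w: np.mean((y - P @ w) ** 2),    # delete-one CV criterion
               np.full(M, 1.0 / M), method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
print("model-averaging weights:", res.x.round(3))
```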

17.
The k nearest neighbors (k-NN) classifier is one of the most popular methods for statistical pattern recognition and machine learning. In practice, the size k, the number of neighbors used for classification, is usually set arbitrarily to one or some other small number, or chosen by cross-validation. In this study, we propose a novel alternative approach to choosing k. Based on a k-NN-based multivariate multi-sample test, we assign each k a permutation-test-based Z-score, and set the number of neighbors to the k with the highest Z-score. This approach is computationally efficient because we derive formulas for the mean and variance of the test statistic under the permutation distribution for multiple sample groups. Several simulated and real-world data sets are analyzed to investigate the performance of our approach. The usefulness of our approach is demonstrated through the evaluation of prediction accuracies using the Z-score as a criterion to select k. We also compare our approach with the widely used cross-validation approaches. The results show that the k selected by our approach yields high prediction accuracy when informative features are used for classification, whereas the cross-validation approach may fail in some cases.
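A hedged sketch of the k-selection idea: for each candidate k, a simple multi-sample k-NN statistic (the average fraction of neighbors sharing the query's label) is standardized into a Z-score under label permutation, and the k with the largest Z-score is chosen. The paper derives closed-form permutation moments; the sketch below uses Monte Carlo permutations instead, and all names and simulated data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_statistic(X, y, k):
    """Average fraction of the k nearest neighbors (excluding the point
    itself) that share each point's label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neigh = idx[:, 1:]                 # drop the query point itself
    return np.mean(y[neigh] == y[:, None])

def z_score_for_k(X, y, k, n_perm=200, seed=0):
    """Permutation Z-score of the k-NN statistic (Monte Carlo version)."""
    rng = np.random.default_rng(seed)
    obs = knn_statistic(X, y, k)
    perm = np.array([knn_statistic(X, rng.permutation(y), k) for _ in range(n_perm)])
    return (obs - perm.mean()) / perm.std(ddof=1)

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(1.0, 1, (40, 5))])
y = np.array([0] * 40 + [1] * 40)
ks = list(range(1, 16, 2))
zs = [z_score_for_k(X, y, k) for k in ks]
print("selected k:", ks[int(np.argmax(zs))])
```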

18.
Summary. Genome-wide measurement of gene expression is a promising approach to the identification of subclasses of cancer that are currently not differentiable, but potentially biologically heterogeneous. This type of molecular classification gives hope for highly individualized and more effective prognosis and treatment of cancer. Statistically, the analysis of gene expression data from unclassified tumours is a complex hypothesis-generating activity, involving data exploration, modelling and expert elicitation. We propose a modelling framework that can be used to inform and organize the development of exploratory tools for classification. Our framework uses latent categories to provide both a statistical definition of differential expression and a precise, experiment-independent, definition of a molecular profile. It also generates natural similarity measures for traditional clustering and gives probabilistic statements about the assignment of tumours to molecular profiles.

19.
We study multiple-class classification problems; both ordinal and categorical labeled cases are discussed. Common approaches to multiple-class classification are built on binary classifiers, of which one-versus-one and one-versus-rest are typical. When the number of classes is large, these binary-classifier-based methods may suffer from either high computational cost or highly imbalanced sample sizes in their training stage. To alleviate the computational burden and the imbalanced-training-data issue in multiple-class classification problems, we propose a method that has competitive performance and retains the ease of model interpretation, which is essential for a prognostic/predictive model.

20.
ABSTRACT

A frequently encountered statistical problem is to determine whether the variability among k populations is heterogeneous. If the populations are measured on different scales, comparing variances may not be appropriate; in this case, coefficients of variation (CV) can be compared instead, because the CV is unitless. In this paper, a non-parametric test is introduced to test whether the CVs from k populations are different. Under the assumption that the populations are independent and normally distributed, the Miller test, the Feltz and Miller test, a saddlepoint-based test, the log likelihood ratio test and the proposed simulated Bartlett-corrected log likelihood ratio test are derived. Simulation results show that the simulated Bartlett-corrected log likelihood ratio test is extremely accurate if the model is correctly specified. If the model is mis-specified and the sample size is small, the proposed test still gives good results. However, with a mis-specified model and a large sample size, the non-parametric test is recommended.
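For reference, a hedged sketch of the Feltz-Miller asymptotic chi-square test for equality of k coefficients of variation, as it is commonly stated; the simulated Bartlett-corrected likelihood ratio test proposed in the paper is not reproduced, and the data are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def feltz_miller_test(samples):
    """Feltz-Miller asymptotic test for homogeneity of coefficients of
    variation across k independent, approximately normal samples."""
    m = np.array([len(s) - 1 for s in samples], dtype=float)        # df weights
    cv = np.array([np.std(s, ddof=1) / np.mean(s) for s in samples])
    cv_pooled = np.sum(m * cv) / np.sum(m)                          # weighted pooled CV
    stat = np.sum(m * (cv - cv_pooled) ** 2) / (cv_pooled ** 2 * (0.5 + cv_pooled ** 2))
    df = len(samples) - 1
    return stat, df, chi2.sf(stat, df)

rng = np.random.default_rng(7)
samples = [rng.normal(10, 2, 30), rng.normal(50, 10, 25), rng.normal(5, 1.2, 40)]
print(feltz_miller_test(samples))
```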
