Similar Articles
 20 similar articles found (search time: 31 ms)
1.
Testing for periodicity in microarray time series encounters the challenges of short series length, missing values, and the presence of non-Fourier frequencies. In this article, a test method for such series is proposed. The method is entirely simulation based and obtains p-values for the test of periodicity by fitting a Pearson Type VI distribution. Simulation results show that this method outperforms Fisher's g test across varying series lengths, frequencies, and error variances. The approach is applied to Caulobacter crescentus cell cycle data to demonstrate its practical performance.
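For orientation, the baseline the article compares against can be sketched directly: Fisher's g statistic is the largest periodogram ordinate divided by the sum of all ordinates, and a Monte Carlo p-value is obtained by re-simulating white noise. This is a minimal illustrative sketch of the simulation-based testing idea (function names are mine; the article's Pearson Type VI fitting step is not reproduced here).

```python
import numpy as np

def fishers_g(x):
    """Fisher's g statistic: largest periodogram ordinate over their sum."""
    n = len(x)
    # periodogram at the Fourier frequencies, excluding frequency 0 and Nyquist
    per = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
    per = per[1:(n - 1) // 2 + 1]
    return per.max() / per.sum()

def g_test_pvalue(x, n_sim=2000, seed=0):
    """Monte Carlo p-value for periodicity under a Gaussian white-noise null."""
    rng = np.random.default_rng(seed)
    g_obs = fishers_g(x)
    sims = np.array([fishers_g(rng.standard_normal(len(x)))
                     for _ in range(n_sim)])
    # add-one correction keeps the p-value strictly positive
    return (1 + np.sum(sims >= g_obs)) / (n_sim + 1)
```

A strongly periodic series yields a small p-value, while pure noise does not.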

2.
In this article, a sequential correction of two linear methods, linear discriminant analysis (LDA) and the perceptron, is proposed. The correction relies on sequentially joining additional features on which the classifier is trained. These new features are posterior probabilities determined by a basic classification method such as LDA or the perceptron. At each step, we add the probabilities obtained on a slightly different data set, because the vector of added probabilities varies from step to step. We therefore obtain many classifiers of the same type trained on slightly different data sets. Four sequential correction methods are presented, based on different combining schemes (e.g. the mean rule and the product rule). Experimental results on several data sets demonstrate that the corrections are effective and that this approach outperforms classical linear methods, providing a significant reduction in the mean classification error rate.
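A rough sketch of the core idea for the two-class case: fit a base classifier, append its posterior probability as a new feature, and refit. The LDA implementation, the small ridge term for numerical stability, and the three-step default are my own illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def lda_fit(X, y, ridge=1e-3):
    """Two-class LDA direction from a (slightly ridged) pooled covariance."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Xc = np.vstack([X[y == 0] - m0, X[y == 1] - m1])
    S = Xc.T @ Xc / (len(X) - 2) + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    b = -0.5 * (m0 + m1) @ w          # equal-prior decision boundary
    return w, b

def posterior(X, w, b):
    """Posterior probability of class 1 under the Gaussian LDA model."""
    return 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -30, 30)))

def sequential_correction(X, y, steps=3):
    """Repeatedly append the current LDA posterior as an extra feature,
    so each successive classifier is trained on a slightly different data set."""
    Z = X.astype(float).copy()
    for _ in range(steps):
        w, b = lda_fit(Z, y)
        Z = np.column_stack([Z, posterior(Z, w, b)])
    return Z, lda_fit(Z, y)
```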

3.
Process regression methodology is underdeveloped relative to the frequency with which pertinent data arise. In this article, the response is a binary indicator process representing the joint event of being alive and remaining in a specific state. The process is indexed by time (e.g., time since diagnosis) and observed continuously. Data of this sort occur frequently in the study of chronic disease. A general area of application involves a recurrent event with non-negligible duration (e.g., hospitalization and associated length of hospital stay) and subject to a terminating event (e.g., death). We propose a semiparametric multiplicative model for the process version of the probability of being alive and in the (transient) state of interest. Under the proposed methods, the regression parameter is estimated through a procedure that does not require estimating the baseline probability. Unlike the majority of process regression methods, the proposed methods accommodate multiple sources of censoring. In particular, we derive a computationally convenient variant of inverse probability of censoring weighting based on the additive hazards model. We show that the regression parameter estimator is asymptotically normal, and that the baseline probability function estimator converges to a Gaussian process. Simulations demonstrate that our estimators have good finite sample performance. We apply our method to national end-stage liver disease data. The Canadian Journal of Statistics 48: 222–237; 2020 © 2019 Statistical Society of Canada

4.
Recently developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas in practice it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large, or because only summary data are collected, as in DNA pooling experiments. In this article, we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward and related to a long history of the use of linear methods for estimating missing values (e.g. Kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that, besides being both fast and flexible (allowing new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context), these linear methods are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.
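The linear predictor described here has the familiar Kriging/conditional-mean form: predict the unobserved frequencies as the mean plus a covariance-weighted correction from the observed ones. The sketch below is generic; the simple ridge term stands in for the paper's population-genetics regularization, which is not reproduced.

```python
import numpy as np

def impute_untyped(mu, Sigma, obs_idx, f_obs, ridge=1e-6):
    """Kriging-style linear predictor:
    f_hat_u = mu_u + S_uo S_oo^{-1} (f_obs - mu_o).
    `ridge` is a placeholder for a proper covariance regularizer."""
    idx = np.asarray(obs_idx)
    un = np.setdiff1d(np.arange(len(mu)), idx)
    S_oo = Sigma[np.ix_(idx, idx)] + ridge * np.eye(len(idx))
    S_uo = Sigma[np.ix_(un, idx)]
    return un, mu[un] + S_uo @ np.linalg.solve(S_oo, f_obs - mu[idx])
```

When the variants are correlated, this prediction has much smaller error than simply using the marginal mean.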

5.
The usual covariance estimates for data X1, …, Xn from a stationary zero-mean stochastic process {Xt} are the sample covariances. Both direct and resampling approaches are used to estimate the variance of the sample covariances, and this paper compares the performance of these variance estimates. Using a direct approach, we show that a consistent windowed periodogram estimate of the spectrum is more effective than using the periodogram itself. A frequency domain bootstrap for time series is proposed and analyzed, and we introduce a frequency domain version of the jackknife that is shown to be asymptotically unbiased and consistent for Gaussian processes. Monte Carlo experiments show that the time domain jackknife and the subseries method cannot be recommended. For a Gaussian underlying series, a direct approach using a smoothed periodogram is best; for a non-Gaussian series the frequency domain bootstrap appears preferable. For small samples the bootstraps are dangerous: both the direct approach and the frequency domain jackknife are better.

6.
The spectral analysis of Gaussian linear time-series processes is usually based on uni-frequential tools because the spectral density functions of degree 2 and higher are identically zero and there is no polyspectrum in this case. In finite samples, such an approach does not allow the resolution of closely adjacent spectral lines, except by using autoregressive models of excessively high order in the method of maximum entropy. In this article, multi-frequential periodograms designed for the analysis of discrete and mixed spectra are defined and studied for their properties in finite samples. For a given vector of frequencies ω, the sum of squares of the corresponding trigonometric regression model fitted to a time series by unweighted least squares defines the multi-frequential periodogram statistic IM(ω). When ω is unknown, it follows from the properties of nonlinear models whose parameters separate (i.e., the frequencies and the cosine and sine coefficients here) that the least-squares estimator of the frequencies is obtained by maximizing IM(ω). The first-order, second-order and distribution properties of IM(ω) are established theoretically in finite samples, and are compared with those of Schuster's uni-frequential periodogram statistic. In the multi-frequential periodogram analysis, the least-squares estimator of the frequencies is proved to be theoretically unbiased in finite samples if the number of periodic components of the time series is correctly estimated. Here, this number is estimated at the end of a stepwise procedure based on pseudo-likelihood ratio tests. Simulations are used to compare the stepwise procedure involving IM(ω) with a stepwise procedure using Schuster's periodogram, to study an approximation of the asymptotic theory for the frequency estimators in finite samples in relation to the proximity and signal-to-noise ratio of the periodic components, and to assess the robustness of IM(ω) against autocorrelation in the analysis of mixed spectra.
Overall, the results show an improvement of the new method over the classical approach when spectral lines are adjacent. Finally, three examples with real data illustrate specific aspects of the method, and extensions (i.e., unequally spaced observations, trend modeling, replicated time series, periodogram matrices) are outlined.
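The statistic IM(ω) itself is straightforward to compute: it is the model sum of squares of a cosine–sine regression at the candidate frequencies, fitted by unweighted least squares. A minimal sketch (the frequency units, in cycles per sampling interval, and the function name are my own choices):

```python
import numpy as np

def multifreq_periodogram(x, freqs):
    """IM(w): regression (model) sum of squares of a trigonometric model
    with one cosine and one sine column per frequency in `freqs`,
    fitted by unweighted least squares."""
    t = np.arange(len(x))
    cols = [np.ones_like(t, dtype=float)]          # intercept
    for w in freqs:
        cols += [np.cos(2 * np.pi * w * t), np.sin(2 * np.pi * w * t)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, x, rcond=None)
    fitted = X @ beta
    return np.sum((fitted - x.mean()) ** 2)
```

Maximizing this quantity over a grid of frequency vectors gives the least-squares frequency estimates; two closely adjacent lines score far higher at the true frequency pair than at unrelated frequencies.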

7.
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher's discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to optimize it efficiently when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher's discriminant problem as a biconvex problem. We evaluate the performance of the resulting methods in a simulation study and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.

8.
This article demonstrates the application of classification trees (decision trees), logistic regression (LR), and linear discriminant analysis (LDA) to classify water quality data (i.e., whether the water is fit or unfit for drinking). The data on water quality were obtained from the Pakistan Council of Research in Water Resources (PCRWR) for two cities of Pakistan: one representing an industrial environment (Sialkot) and the other a non-industrial environment (Narowal). Three statistical tools were employed to classify the data: the decision tree methodology using the Gini index, LR, and LDA, implemented in R. The results obtained by these three techniques were compared using misclassification rates (a model with a smaller misclassification rate is better). LR performed better than the other two techniques, while decision trees and LDA performed equally well. For illustration purposes, however, the decision tree technique is comparatively easy to draw and interpret.

9.
In this article, we present explicit expressions for the higher-order moments and cumulants of the first-order random coefficient integer-valued autoregressive (RCINAR(1)) process. The spectral and bispectral density functions are also obtained, which characterize the RCINAR(1) process in the frequency domain. We use a frequency domain approach, the Whittle criterion, to estimate the parameters of the process. We propose a test statistic, based on the frequency domain approach, for the hypothesis test H0: α = 0 versus H1: 0 < α < 1, where α is the mean of the random coefficient in the process. The asymptotic distribution of the test statistic is obtained. We compare the proposed test statistic with other statistics that test serial dependence in time series of counts via numerical simulation, which indicates that our proposed test statistic has good power.
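For intuition, the RCINAR(1) recursion X_t = φ_t ∘ X_{t−1} + ε_t (where "∘" is binomial thinning and the coefficient φ_t is drawn afresh at each step) can be simulated directly. The Beta distribution for the random coefficient and the Poisson innovations below are illustrative choices, not the article's specification; with these choices the stationary mean is λ/(1 − E[φ]) and the lag-1 autocorrelation is E[φ].

```python
import numpy as np

def simulate_rcinar1(n, eps_mean=2.0, a=2.0, b=3.0, seed=0):
    """Simulate an RCINAR(1) path: binomial thinning with a random
    coefficient phi_t ~ Beta(a, b) i.i.d. and Poisson(eps_mean) innovations.
    Here E[phi] = a / (a + b)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n, dtype=int)
    for t in range(1, n):
        phi = rng.beta(a, b)                         # fresh coefficient each step
        x[t] = rng.binomial(x[t - 1], phi) + rng.poisson(eps_mean)
    return x
```

With the defaults, E[φ] = 0.4, so a long simulated path has mean near 2/(1 − 0.4) ≈ 3.33 and lag-1 autocorrelation near 0.4.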

10.
The main focus of our paper is to compare the performance of different model selection criteria used for multivariate reduced rank time series. We consider one of the most commonly used reduced rank models, the reduced rank vector autoregression (RRVAR(p, r)) introduced by Velu et al. [Reduced rank models for multiple time series. Biometrika. 1986;73(1):105–118]. Our study includes the most popular model selection criteria, divided into two groups: simultaneous selection criteria and two-step selection criteria. Methods from the former group select both the autoregressive order p and the rank r simultaneously, while in the case of two-step criteria, first an optimal order p is chosen (using model selection criteria intended for the unrestricted VAR model) and then an optimal rank r of the coefficient matrices is selected (e.g. by means of sequential testing). The model selection criteria considered include well-known information criteria (such as the Akaike information criterion, the Schwarz criterion, and the Hannan–Quinn criterion) as well as widely used sequential tests (e.g. the Bartlett test) and the bootstrap method. An extensive simulation study is carried out to investigate the efficiency of all model selection criteria included in our study. The analysis covers 34 methods: 6 simultaneous methods and 28 two-step approaches. To analyse carefully how different factors affect the performance of the model selection criteria, we consider over 150 simulation settings. In particular, we investigate the influence of the following factors: time series dimension, covariance structure, level of correlation among components, and level of noise (variance). Moreover, we analyse the prediction accuracy of the RRVAR model and compare it with results obtained for the unrestricted vector autoregression.
In this paper, we also present a real data application of model selection criteria for the RRVAR model using Polish macroeconomic time series data observed in the period 1997–2007.

11.
Many sparse linear discriminant analysis (LDA) methods have been proposed to overcome the major problems of the classic LDA in high‐dimensional settings. However, the asymptotic optimality results are limited to the case with only two classes. When there are more than two classes, the classification boundary is complicated and no explicit formulas for the classification errors exist. We consider the asymptotic optimality in the high‐dimensional settings for a large family of linear classification rules with arbitrary number of classes. Our main theorem provides easy‐to‐check criteria for the asymptotic optimality of a general classification rule in this family as dimensionality and sample size both go to infinity and the number of classes is arbitrary. We establish the corresponding convergence rates. The general theory is applied to the classic LDA and the extensions of two recently proposed sparse LDA methods to obtain the asymptotic optimality.

12.
Classification of gene expression microarray data is important in the diagnosis of diseases such as cancer, but often the analysis of microarray data presents difficult challenges because the gene expression dimension is typically much larger than the sample size. Consequently, classification methods for microarray data often rely on regularization techniques to stabilize the classifier for improved classification performance. In particular, numerous regularization techniques, such as covariance-matrix regularization, are available, which, in practice, lead to a difficult choice of regularization methods. In this paper, we compare the classification performance of five covariance-matrix regularization methods applied to the linear discriminant function using two simulated high-dimensional data sets and five well-known, high-dimensional microarray data sets. In our simulation study, we found the minimum distance empirical Bayes method reported in Srivastava and Kubokawa [Comparison of discrimination methods for high dimensional data, J. Japan Statist. Soc. 37(1) (2007), pp. 123–134], and the new linear discriminant analysis reported in Thomaz, Kitani, and Gillies [A Maximum Uncertainty LDA-based approach for Limited Sample Size problems – with application to Face Recognition, J. Braz. Comput. Soc. 12(1) (2006), pp. 1–12], to perform consistently well and often outperform three other prominent regularization methods. Finally, we conclude with some recommendations for practitioners.
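As a generic example of covariance-matrix regularization for the linear discriminant function, the sketch below shrinks the singular pooled covariance toward a trace-scaled identity target before inverting. This is a simple illustrative regularizer, not one of the specific estimators of Srivastava–Kubokawa or Thomaz et al. compared in the paper.

```python
import numpy as np

def shrunk_lda_direction(X0, X1, lam=0.5):
    """Discriminant direction w = S_lam^{-1} (m1 - m0), where
    S_lam = (1 - lam) * S + lam * (trace(S) / p) * I
    is a shrunken pooled covariance; S_lam is invertible even when p > n."""
    p = X0.shape[1]
    m0, m1 = X0.mean(0), X1.mean(0)
    Xc = np.vstack([X0 - m0, X1 - m1])
    S = Xc.T @ Xc / (len(Xc) - 2)           # pooled covariance (singular if p > n)
    S_lam = (1 - lam) * S + lam * (np.trace(S) / p) * np.eye(p)
    return np.linalg.solve(S_lam, m1 - m0)
```

With p = 50 features and 15 observations per class, the unregularized pooled covariance is singular, but the shrunken direction still classifies fresh data well above chance.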

13.
Bayesian classification of Neolithic tools
The classification of Neolithic tools by using cluster analysis enables archaeologists to understand the function of the tools and the technological and cultural conditions of the societies that made them. In this paper, Bayesian classification is adopted to analyse data which raise the question whether the observed variability, e.g. the shape and dimensions of the tools, is related to their use. The data present technical difficulties for the practitioner, such as the presence of mixed mode data, missing data and errors in variables. These complications are overcome by employing a finite mixture model and Markov chain Monte Carlo methods. The analysis uses prior information which expresses the archaeologist's belief that there are two tool groups that are similar to contemporary adzes and axes. The resulting mixing densities provide evidence that the morphological dimensional variability among tools is related to the existence of these two tool groups.

14.
Discrimination between two Gaussian time series is examined assuming that the important difference between the alternative processes is their covariance (spectral) structure. Using the likelihood ratio method in the frequency domain, a discriminant function is derived and its approximate distribution is obtained. It is demonstrated that, utilizing the Kullback-Leibler information measure, the frequencies or frequency bands that carry information for discrimination can be determined. Using this, it is shown that when the mean functions are equal, discrimination based on the frequency with the largest discrimination information is equivalent to the classification procedure based on the best linear discriminant. An application to seismology is described, including a discussion of the spectral ratio discriminant for distinguishing underground nuclear explosions from natural earthquakes, and is illustrated numerically using Rayleigh wave data from an underground and an atmospheric explosion.

15.
The authors extend the classical Cormack‐Jolly‐Seber mark‐recapture model to account for both temporal and spatial movement through a series of markers (e.g., dams). Survival rates are modeled as a function of (possibly) unobserved travel times. Because of the complex nature of the likelihood, they use a Bayesian approach based on the complete data likelihood, and integrate the posterior through Markov chain Monte Carlo methods. They test the model through simulations and apply it also to actual salmon data arising from the Columbia river system. The methodology was developed for use by the Pacific Ocean Shelf Tracking (POST) project.

16.
We consider detection of multiple changes in the distribution of periodic and autocorrelated data with known period. To account for periodicity we transform the sequence of vector observations by arranging them in matrices and thereby producing a sequence of independently and identically distributed matrix observations. We propose methods of testing the equality of matrix distributions and present methods that can be applied to matrix observations using the E-divisive algorithm. We show that periodicity and autocorrelation degrade existing change detection methods because they blur the changes that these procedures aim to discover. Methods that ignore the periodicity have low power to detect changes in the mean and the variance of periodic time series when the periodic effects overwhelm the true changes, while the proposed methods detect such changes with high power. We illustrate the proposed methods by detecting changes in the water quality of Lake Kasumigaura in Japan. The Canadian Journal of Statistics 48: 518–534; 2020 © 2020 Statistical Society of Canada
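The transformation into matrix observations is, in the simplest case, a reshape: with a known period, each consecutive block of one full period becomes one matrix observation. A minimal sketch (trailing partial periods are dropped; the function name is mine):

```python
import numpy as np

def to_matrix_observations(x, period):
    """Arrange a (possibly multivariate) time series into a sequence of
    matrix observations, one per full period, so that under exact
    periodicity consecutive matrices are identically distributed."""
    x = np.asarray(x)
    n_full = len(x) // period            # number of complete periods
    return x[:n_full * period].reshape(n_full, period, *x.shape[1:])
```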

17.
Combining p-values from statistical tests across different studies is the most commonly used approach in meta-analysis for evolutionary biology. The most commonly used p-value combination methods mainly comprise the z-transform tests (e.g., the un-weighted z-test and the weighted z-test) and the gamma-transform tests (e.g., the CZ method [Z. Chen, W. Yang, Q. Liu, J.Y. Yang, J. Li, and M.Q. Yang, A new statistical approach to combining p-values using gamma distribution and its application to genomewide association study, Bioinformatics 15 (2014), p. S3]). However, among these existing p-value combination methods, none is uniformly most powerful in all situations [Chen et al. 2014]. In this paper, we propose a meta-analysis method based on the gamma distribution, MAGD, which pools the p-values from independent studies. The newly proposed test allows flexible accommodation of different levels of heterogeneity of effect sizes across individual studies, and simultaneously retains the characteristics of both the z-transform tests and the gamma-transform tests. We also propose an easy-to-implement resampling approach for estimating the empirical p-values of MAGD in finite samples. Simulation studies and two data applications show that the proposed method MAGD is essentially as powerful as the z-transform tests (respectively the gamma-transform tests) under homogeneous (respectively heterogeneous) effect sizes across studies.
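The gamma-transform idea underlying this family of tests can be sketched as follows: map each p-value to an upper-tail Gamma quantile, sum, and refer the sum to the Gamma distribution of the sum under the null. This is a generic single-shape-parameter version, not MAGD's specific heterogeneity-adaptive construction.

```python
import numpy as np
from scipy import stats

def gamma_combine(pvals, shape=2.0):
    """Gamma-transform combination of independent p-values:
    T = sum_i F^{-1}(1 - p_i; Gamma(shape)), and under H0
    T ~ Gamma(k * shape), so the combined p-value is its upper tail.
    With shape = 1 this is exactly Fisher's method."""
    p = np.asarray(pvals, dtype=float)
    t = stats.gamma.isf(p, a=shape).sum()        # sum of upper-tail quantiles
    return stats.gamma.sf(t, a=len(p) * shape)   # combined p-value
```

Varying `shape` trades off sensitivity to a few very small p-values against sensitivity to many moderately small ones, which is the tuning axis this literature exploits.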

18.
We propose a new meta-analysis method to pool univariate p-values across independent studies, and we compare our method with those of Fisher, Stouffer, and George through simulations. We identify sub-spaces where each of these methods is optimal and propose a strategy for choosing the best meta-analysis method in different sub-spaces. We compare these meta-analysis approaches using p-values from periodicity tests of 4,940 S. Pombe genes from 10 independent time-course experiments, and show that our new approach ranks the periodic, conserved, and cycling genes much higher and detects at least as many genes among the top 1,000 genes compared with the other methods.
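The two classical baselines named here can be written with the standard library alone: Fisher refers −2 Σ log p to a chi-square with 2k degrees of freedom (whose survival function has a closed Erlang form), and Stouffer averages normal quantiles. The `weights` argument illustrates the weighted-Stouffer variant; the George method is not reproduced.

```python
import math
from statistics import NormalDist

def fisher_combine(pvals):
    """Fisher's method: T = -2 * sum(log p_i) ~ chi-square(2k) under H0."""
    k = len(pvals)
    t = -2.0 * sum(math.log(p) for p in pvals)
    x = t / 2.0
    # chi-square(2k) survival function in closed (Erlang) form
    return math.exp(-x) * sum(x**j / math.factorial(j) for j in range(k))

def stouffer_combine(pvals, weights=None):
    """Stouffer's method: Z = sum(w_i z_i) / sqrt(sum w_i^2),
    with z_i = Phi^{-1}(1 - p_i)."""
    nd = NormalDist()
    w = weights or [1.0] * len(pvals)
    z = sum(wi * nd.inv_cdf(1.0 - p) for wi, p in zip(w, pvals))
    return 1.0 - nd.cdf(z / math.sqrt(sum(wi * wi for wi in w)))
```

For example, combining two p-values of 0.5 gives about 0.597 under Fisher and exactly 0.5 under Stouffer, which already hints at the sub-spaces where each method behaves differently.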

19.
We present a new statistical framework for landmark curve-based image registration and surface reconstruction. The proposed method first elastically aligns geometric features (continuous, parameterized curves) to compute local deformations, and then uses a Gaussian random field model to estimate the full deformation vector field as a spatial stochastic process on the entire surface or image domain. The statistical estimation is performed using two different methods: maximum likelihood and Bayesian inference via Markov chain Monte Carlo sampling. The resulting deformations accurately match corresponding curve regions while also being sufficiently smooth over the entire domain. We present several qualitative and quantitative evaluations of the proposed method on both synthetic and real data. We apply our approach to two different tasks on real data: (1) multimodal medical image registration, and (2) anatomical and pottery surface reconstruction.

20.
Fundamental frequency (F0) patterns, which indicate the vibration frequency of the vocal cords, reflect developmental changes in infant spoken language. In previous studies of developmental psychology, however, F0 patterns were manually classified into subjectively specified categories. Furthermore, since F0 sequences contain sequential missing values and exhibit mean nonstationarity, classification that employs subsequent partitioning and conventional discriminant analysis based on stationary and locally stationary processes is inadequate. Consequently, we propose a classification method based on discriminant analysis of time series data with mean nonstationarity and sequential missing values, together with a measurement technique for investigating configuration similarities in the classification. Using the proposed procedures, we analyse a longitudinal database of recorded conversations between infants and parents over a five-year period. Various F0 patterns were automatically classified into appropriate pattern groups, and the classification similarities were calculated. These similarities gradually decreased with the infant's age in months until a large change occurred around 20 months. The results suggest that our proposed methods are useful for analysing large-scale data and can contribute to studies of infant spoken language acquisition.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号