Similar articles
 Found 20 similar articles; search took 15 ms
1.
2.
3.
4.
5.
6.
This article utilizes stochastic ideas for reasoning about association rule mining, and provides a formal statistical view of this discipline. A simple stochastic model is proposed, under which support and confidence are reasonable estimates of certain probabilities of the model. Statistical properties of the corresponding estimators, such as moments and confidence intervals, are derived, and items and itemsets are examined for correlations. After a brief review of measures of interest for association rules, with the main focus on interestingness measures motivated by statistical principles, two new measures are described. These measures, called α- and σ-precision, respectively, rely on statistical properties of the estimators discussed before. Experimental results demonstrate the effectiveness of both measures.
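As a rough illustration of the idea that support and confidence estimate probabilities, the sketch below computes both on a toy transaction list and attaches a standard normal-approximation (Wald) interval to the support. The function names, the toy data and the choice of the Wald interval are assumptions made for this example; the interval is not necessarily the one derived in the article, and it presumes independent, identically distributed transactions.

```python
import math

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) as a ratio of supports."""
    s_a = support(transactions, antecedent)
    if s_a == 0:
        raise ValueError("antecedent never occurs in the data")
    return support(transactions, set(antecedent) | set(consequent)) / s_a

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a proportion, clipped to [0, 1].

    Assumption: transactions are i.i.d., so the hit count is binomial.
    """
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

# Hypothetical toy data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(transactions, {"bread", "milk"})
c = confidence(transactions, {"bread"}, {"milk"})
lo, hi = wald_interval(s, len(transactions))
```

With only four transactions the interval is very wide, which is the practical point: a support value on its own says little about how precisely the underlying probability is known.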

7.
This article considers the case where two surveys collect data on a common variable, with one survey being much smaller than the other. The smaller survey collects data on an additional variable of interest, related to the common variable collected in the two surveys, and out-of-scope with respect to the larger survey. Estimation of the two related variables is of interest at domains defined at a granular level. We propose a multilevel model for integrating data from the two surveys, by reconciling survey estimates available for the common variable, accounting for the relationship between the two variables, and expanding estimation for the other variable, for all the domains of interest. The model is specified as a hierarchical Bayes model for domain-level survey data, and posterior distributions are constructed for the two variables of interest. A synthetic estimation approach is considered as an alternative to the hierarchical modelling approach. The methodology is applied to wage and benefits estimation using data from the National Compensation Survey and the Occupational Employment Statistics Survey, available from the Bureau of Labor Statistics, Department of Labor, United States.

8.
This work provides a set of macros performed with SAS (Statistical Analysis System) for Windows, which can be used to fit conditional models under intermittent missingness in longitudinal data. A formalized transition model, including random effects for individuals and measurement error, is presented. Model fitting is based on the missing completely at random or missing at random assumptions, and the separability condition. The problem translates to maximization of the marginal observed data density only, which for Gaussian data is again Gaussian, meaning that the likelihood can be expressed in terms of the mean and covariance matrix of the observed data vector. A simulation study is presented and misspecification issues are considered. A practical application is also given, where conditional models are fitted to the data from a clinical trial that assessed the effect of a Cuban medicine on a disease of the respiratory system.

9.
Intercropping is an important farming system, especially in tropical regions. A statistical model with competition coefficients and correlated error structure is suggested for the analysis of data from intercropping experiments involving two crop species. Data from an intercropping experiment with pearl millet and sorghum genotypes are used to illustrate the technique.

10.
11.
12.
13.
Real-time polymerase chain reaction (PCR) is a reliable quantitative technique in gene expression studies. The statistical analysis of real-time PCR data is crucial for analyzing and interpreting results. Statistical procedures for analyzing real-time PCR data try to determine the slope of the regression line and calculate the reaction efficiency. Mathematical functions have been used to calculate the expression of the target gene relative to the reference gene(s). Moreover, these statistical techniques compare Ct (threshold cycle) numbers between control and treatment groups. There are many different procedures in SAS for real-time PCR data evaluation. In this study, the efficiency-calibrated model and the delta delta Ct model have been statistically tested and explained. Several methods were tested to compare control with treatment means of Ct. The methods tested included the t-test (parametric test), the Wilcoxon test (non-parametric test) and multiple regression. Results showed that the applied methods led to similar results, and no significant difference was observed between the results of gene expression measurement by the relative method.
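The two relative-quantification models named above can be written down in a few lines. The following is a minimal sketch, assuming the usual textbook forms: the delta delta Ct fold change 2^(-ΔΔCt), and an efficiency-calibrated ratio in which target and reference genes each have their own amplification efficiency. The function names and the Ct values are hypothetical; this is not the SAS code evaluated in the study.

```python
from statistics import mean

def delta_delta_ct(target_treated, ref_treated, target_control, ref_control):
    """ΔΔCt = (Ct_target - Ct_ref) in treated minus the same in control."""
    d_treated = mean(target_treated) - mean(ref_treated)
    d_control = mean(target_control) - mean(ref_control)
    return d_treated - d_control

def fold_change(ddct, efficiency=2.0):
    """Relative expression, assuming both genes share one efficiency."""
    return efficiency ** (-ddct)

def efficiency_calibrated_ratio(e_target, e_ref, target_treated, ref_treated,
                                target_control, ref_control):
    """Ratio with separate efficiencies for target and reference genes."""
    d_target = mean(target_control) - mean(target_treated)
    d_ref = mean(ref_control) - mean(ref_treated)
    return (e_target ** d_target) / (e_ref ** d_ref)

# Hypothetical triplicate Ct values, for illustration only.
target_treated = [24.0, 24.2, 23.8]
ref_treated = [18.0, 18.1, 17.9]
target_control = [26.0, 26.1, 25.9]
ref_control = [18.0, 17.9, 18.1]

ddct = delta_delta_ct(target_treated, ref_treated, target_control, ref_control)
fc = fold_change(ddct)
ratio = efficiency_calibrated_ratio(2.0, 2.0, target_treated, ref_treated,
                                    target_control, ref_control)
```

Here ΔΔCt is about -2, so the fold change is about 4. When both efficiencies equal 2 the efficiency-calibrated ratio coincides with the delta delta Ct fold change, which is one way to see why the two models tend to agree when amplification is close to ideal.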

14.
When random variables do not take discrete values, observed data are often the rounded values of continuous random variables. Errors caused by rounding of data are often neglected by classical statistical theories. While some pioneers have identified the problem and made suggestions to rectify it, few suitable approaches have been proposed. In this paper, we propose an approximate MLE (AMLE) procedure to estimate the parameters and discuss the consistency and asymptotic normality of the estimates. For illustration, we consider the estimates of the parameters in AR(p) and MA(q) models for rounded data.

15.
Statistical inference for olfactometer data   Total citations: 1, self-citations: 0, citations by others: 1
Summary.  Olfactometer experiments are used to determine the effect of odours on the behaviour of organisms such as insects or nematodes, and typically result in data comprising many groups of small overdispersed counts. We develop a non-homogeneous Markov chain model for data from olfactometer experiments with parasitoid wasps and discuss a relation with the Dirichlet–multinomial distribution. We consider the asymptotic relative efficiencies of three different observation schemes and give an analysis of data intended to shed light on the effect of previous experience of odours in the wasps.

16.
In this paper we analyse a real e-learning dataset derived from the e-learning platform of the University of Pavia. The dataset concerns an online learning environment with in-depth teaching materials. The main aims of this paper are to supply a measure of the relative importance of the exercises (tests) at the end of each training unit, to build predictive models of students' performance, and finally to personalize the e-learning platform. The methodology employed is based on nonparametric statistical methods for kernel density estimation, and on generalized linear models and generalized additive models for predictive purposes.

17.
ApEn, approximate entropy, is a recently developed family of parameters and statistics quantifying regularity (complexity) in data, providing an information-theoretic quantity for continuous-state processes. We provide the motivation for ApEn development, and indicate the superiority of ApEn to the K-S entropy for statistical application, and for discrimination of both correlated stochastic and noisy deterministic processes. We study the variation of ApEn with input parameter choices, reemphasizing that ApEn is a relative measure of regularity. We study the bias in the ApEn statistic, and present evidence for asymptotic normality in the ApEn distributions, assuming weak dependence. We provide a new test for the hypothesis that an underlying time-series is generated by i.i.d. variables, which does not require distribution specification. We introduce randomized ApEn, which derives an empirical significance probability that two processes differ, based on one data set from each process.
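For readers who want to see what the statistic computes, here is a small sketch of ApEn(m, r) in its commonly stated form: Φ^m(r) − Φ^(m+1)(r), where Φ^k averages the log fraction of length-k templates lying within tolerance r (maximum norm, self-matches included). The function name, the default parameters and the test series are assumptions for this example, and r is treated as an absolute tolerance rather than a multiple of the sample standard deviation.

```python
import math
import random

def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) of a numeric sequence, using the max-norm on templates."""
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")

    def phi(k):
        # All overlapping windows of length k.
        templates = [series[i:i + k] for i in range(n - k + 1)]
        total = 0.0
        for a in templates:
            # Count templates within r of `a`; `a` always matches itself,
            # so the logarithm below is well defined.
            matches = sum(
                1 for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)

# A strictly periodic series versus seeded pseudo-random noise.
regular = approximate_entropy([1.0, 2.0] * 20)
rng = random.Random(0)
irregular = approximate_entropy([rng.random() for _ in range(100)])
constant = approximate_entropy([1.0] * 30)
```

The periodic series yields a value close to zero and the noise series a clearly larger one, in line with the abstract's reading of ApEn as a relative measure of regularity. Because the result depends on m, r and the series length, values are only comparable when those inputs are held fixed.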

18.
Time-to-event data such as time to death are broadly used in medical research and drug development to understand the efficacy of a therapeutic. For time-to-event data, right censoring (data only observed up to a certain point of time) is common and easy to recognize. Methods that use right censored data, such as the Kaplan–Meier estimator and the Cox proportional hazards model, are well established. Time-to-event data can also be left truncated, which arises when patients are excluded from the sample because their events occur before a specific milestone, potentially resulting in an immortal time bias. For example, in a study evaluating the association between biomarker status and overall survival, patients who did not live long enough to receive a genomic test were not observed in the study. Left truncation causes selection bias and often leads to an overestimate of survival time. In this tutorial, we used a nationwide electronic health record-derived de-identified database to demonstrate how to analyze left truncated and right censored data without bias using example code from SAS and R.
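The effect of left truncation on a Kaplan–Meier estimate can be seen in a hand-rolled product-limit calculation. The sketch below is not the tutorial's SAS or R code; it is a plain Python illustration, with hypothetical names and made-up data, of the standard risk-set adjustment: a subject counts as at risk at time t only if they entered the study before t and have not yet left it.

```python
def km_left_truncated(entry, exit_time, event):
    """Product-limit survival curve with delayed entry.

    entry[i]     : time subject i came under observation
    exit_time[i] : time of event or censoring (must exceed entry[i])
    event[i]     : True if the event was observed, False if censored
    Returns a list of (event time, survival estimate) pairs.
    """
    if any(a >= b for a, b in zip(entry, exit_time)):
        raise ValueError("each entry time must precede its exit time")
    times = sorted({b for b, e in zip(exit_time, event) if e})
    surv, curve = 1.0, []
    for t in times:
        # Risk set: already entered and still under observation at t.
        at_risk = sum(1 for a, b in zip(entry, exit_time) if a < t <= b)
        deaths = sum(1 for b, e in zip(exit_time, event) if e and b == t)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical data: two subjects enter at time 3 (e.g. at a test date).
entry = [0, 0, 3, 3]
exit_time = [2, 5, 4, 6]
event = [True, True, True, False]

adjusted = km_left_truncated(entry, exit_time, event)
# Ignoring truncation means pretending everyone was followed from time 0.
naive = km_left_truncated([0, 0, 0, 0], exit_time, event)
```

On this toy data the naive curve ends at 0.25 while the adjusted curve ends at about 0.167: subjects who entered late are wrongly counted as survivors of the early period when truncation is ignored, which matches the overestimation the abstract describes.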

19.
We introduce a process of non-intersecting convex particles by thinning a primary particle process such that the remaining particles are mutually non-intersecting and have maximum total volume among all such subsystems. This approach is based on the idea to construct hardcore processes by suitable dependent thinnings proposed by Matérn but generates packings with higher volume fractions than the known thinning models. Due to the enormous complexity of the computations involved, we develop a two-phase heuristic algorithm whose first phase turns out to yield a structure of Matérn III type. We focus mainly on the generation of packings with high volume fractions and present some simulation results for Poisson primary particle processes of equally sized balls in ℝ² and ℝ³. The results are compared with the well-known random sequential adsorption model and Matérn type models.
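To make the notion of a dependent thinning concrete, the sketch below implements the classical Matérn type II rule, one of the comparison models mentioned above, and not the maximum-volume thinning or the two-phase heuristic proposed in the article. Each primary point receives a random mark, and a point is kept only if no other point closer than the hard-core distance has a smaller mark. The function name and parameters are assumptions, and the primary pattern uses a fixed number of uniform points in the unit square instead of a Poisson-distributed count.

```python
import math
import random

def matern_ii(points, min_dist, rng):
    """Matérn type II thinning of a point pattern.

    For equally sized balls of radius rho centred at the points,
    min_dist = 2 * rho makes the retained balls non-intersecting.
    """
    marks = [rng.random() for _ in points]
    kept = []
    for i, p in enumerate(points):
        # A point is deleted if a nearby point carries a smaller mark;
        # the index breaks ties so that two close points never both survive.
        dominated = any(
            j != i
            and (marks[j], j) < (marks[i], i)
            and math.dist(p, q) < min_dist
            for j, q in enumerate(points)
        )
        if not dominated:
            kept.append(p)
    return kept

rng = random.Random(1)
# Assumption: 200 uniform points stand in for a Poisson primary process.
primary = [(rng.random(), rng.random()) for _ in range(200)]
min_dist = 0.1
packing = matern_ii(primary, min_dist, rng)
```

The rule is simple but wasteful: a point can be deleted by a neighbour that is itself deleted, which is part of the reason Matérn type models reach lower volume fractions than packings built to maximise retained volume.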

20.
Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data.
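The standard Centered Log Ratio transform, and the reason zeros are awkward for it, can be shown in a few lines. This is a minimal sketch with a hypothetical function name; it covers only the classical transform and does not attempt the Log Centered Log Ratio extension, whose definition is not given in the abstract, nor the coseq implementation.

```python
import math

def clr(composition):
    """Centered Log Ratio: log of each part minus the mean of the logs."""
    if any(x <= 0 for x in composition):
        # log(0) is undefined, so a profile with a silent gene or an
        # unused station cannot be transformed without further treatment.
        raise ValueError("CLR requires strictly positive parts")
    logs = [math.log(x) for x in composition]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]

profile = [0.2, 0.3, 0.5]
transformed = clr(profile)
# Rescaling the profile leaves the transform unchanged.
rescaled = clr([2.0, 3.0, 5.0])
```

The transformed values sum to zero and depend only on the ratios between parts, which is what makes Euclidean methods such as K-means more meaningful on them than on raw proportions. The failure on exact zeros is the gap the abstract identifies for profiles with zero or near-zero entries.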
