期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

This article utilizes stochastic ideas for reasoning about association rule mining, and provides a formal statistical view of this discipline. A simple stochastic model is proposed, based on which support and confidence are reasonable estimates for certain probabilities of the model. Statistical properties of the corresponding estimators, like moments and confidence intervals, are derived, and items and itemsets are observed for correlations. After a brief review of measures of interest of association rules, with the main focus on interestingness measures motivated by statistical principles, two new measures are described. These measures, called α- and σ-precision, respectively, rely on statistical properties of the estimators discussed before. Experimental results demonstrate the effectivity of both measures. 相似文献

7.

Statistical data integration using multilevel models to predict employee compensation

Andreea L. Erciulescu Jean D. Opsomer Benjamin J. Schneider 《Revue canadienne de statistique》2023,51(1):312-326

This article considers the case where two surveys collect data on a common variable, with one survey being much smaller than the other. The smaller survey collects data on an additional variable of interest, related to the common variable collected in the two surveys, and out-of-scope with respect to the larger survey. Estimation of the two related variables is of interest at domains defined at a granular level. We propose a multilevel model for integrating data from the two surveys, by reconciling survey estimates available for the common variable, accounting for the relationship between the two variables, and expanding estimation for the other variable, for all the domains of interest. The model is specified as a hierarchical Bayes model for domain-level survey data, and posterior distributions are constructed for the two variables of interest. A synthetic estimation approach is considered as an alternative to the hierarchical modelling approach. The methodology is applied to wage and benefits estimation using data from the National Compensation Survey and the Occupational Employment Statistics Survey, available from the Bureau of Labor Statistics, Department of Labor, United States. 相似文献

8.

Longitudinal conditional models with intermittent missingness: SAS code and applications

《Journal of Statistical Computation and Simulation》2012,82(4):753-780

This work provides a set of macros performed with SAS (Statistical Analysis System) for Windows, which can be used to fit conditional models under intermittent missingness in longitudinal data. A formalized transition model, including random effects for individuals and measurement error, is presented. Model fitting is based on the missing completely at random or missing at random assumptions, and the separability condition. The problem translates to maximization of the marginal observed data density only, which for Gaussian data is again Gaussian, meaning that the likelihood can be expressed in terms of the mean and covariance matrix of the observed data vector. A simulation study is presented and misspecification issues are considered. A practical application is also given, where conditional models are fitted to the data from a clinical trial that assessed the effect of a Cuban medicine on a disease of the respiratory system. 相似文献

9.

Statistical analysis of intercropping data using a correlated error structure *

M. Singh B. Gilliver 《Journal of applied statistics》1988,15(1):53-61

Intercropping is an important farming system, especially in tropical regions. A statistical model with competition coefficients and correlated error structure is suggested for the analysis of data from intercropping experiments involving two crop species. Data from an intercropping experiment with pearl millet and sorghum genotypes are used to illustrate the technique. 相似文献

10.

Using SAS for data management,statistical analysis and graphics

Jingyun Yang 《Pharmaceutical statistics》2012,11(4):346-346

相似文献

11.

Adaptive design theory and implementation using SAS and R

Andrew Robinson 《Journal of applied statistics》2009,36(6):701-702

相似文献

12.

Basic Statistics using SAS Enterprise Guide, a Primer

Alex Karagrigoriou 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2009,172(2):530-531

相似文献

13.

Evaluation of different statistical methods using SAS software: an in silico approach for analysis of real-time PCR data

Mohammadreza Nassiri Mahdi Elahi Torshizi Mohammad Doosti 《Journal of applied statistics》2018,45(2):306-319

Real-time polymerase chain reaction (PCR) is reliable quantitative technique in gene expression studies. The statistical analysis of real-time PCR data is quite crucial for results analysis and explanation. The statistical procedures of analyzing real-time PCR data try to determine the slope of regression line and calculate the reaction efficiency. Applications of mathematical functions have been used to calculate the target gene relative to the reference gene(s). Moreover, these statistical techniques compare C_t (threshold cycle) numbers between control and treatments group. There are many different procedures in SAS for real-time PCR data evaluation. In this study, the efficiency of calibrated model and delta delta C_t model have been statistically tested and explained. Several methods were tested to compare control with treatment means of C_t. The methods tested included t-test (parametric test), Wilcoxon test (non-parametric test) and multiple regression. Results showed that applied methods led to similar results and no significant difference was observed between results of gene expression measurement by the relative method. 相似文献

14.

Statistical analysis for rounded data

Zhidong Bai Shurong Zheng Baoxue Zhang Guorong Hu 《Journal of statistical planning and inference》2009

When random variables do not take discrete values, observed data are often the rounded values of continuous random variables. Errors caused by rounding of data are often neglected by classical statistical theories. While some pioneers have identified and made suggestions to rectify the problem, few suitable approaches were proposed. In this paper, we propose an approximate MLE (AMLE) procedure to estimate the parameters and discuss the consistency and asymptotic normality of the estimates. For our illustration, we shall consider the estimates of the parameters in AR(p)

AR (p)

and MA(q)

MA (q)

models for rounded data. 相似文献

15.

Statistical inference for olfactometer data 总被引：1，自引：0，他引：1

I. Ricard A. C. Davison 《Journal of the Royal Statistical Society. Series C, Applied statistics》2007,56(4):479-492

Summary. Olfactometer experiments are used to determine the effect of odours on the behaviour of organisms such as insects or nematodes, and typically result in data comprising many groups of small overdispersed counts. We develop a non-homogeneous Markov chain model for data from olfactometer experiments with parasitoid wasps and discuss a relation with the Dirichlet–multinomial distribution. We consider the asymptotic relative efficiencies of three different observation schemes and give an analysis of data intended to shed light on the effect of previous experience of odours in the wasps. 相似文献

16.

Statistical models for e-learning data

Silvia Figini Paolo Giudici 《Statistical Methods and Applications》2009,18(2):293-304

In this paper we analyse a real e-learning dataset derived from the e-learning platform of the University of Pavia. The dataset concerns an online learning environment with in-depth teaching materials. The main focus of this paper is to supply a measure of the relative importance of the exercises (test) at the end of each training unit; to build predictive models of student’s performance and finally to personalize the e-learning platform. The methodology employed is based on nonparametric statistical methods for kernel density estimation and generalized linear models and generalized additive models for predictive purposes. 相似文献

17.

Approximate entropy: Statistical properties and applications

Steven M. Pincus Wei-Min Huang 《统计学通讯:理论与方法》2013,42(11):3061-3077

ApEn, approximate entropy, is a recently developed family of parameters and statistics quantifying regularity (complexity) in data, providing an information-theoretic quantity for continuous-state processes. We provide the motivation for ApEn development, and indicate the superiority of ApEn to the K-S entropy for statistical application, and for discrimination of both correlated stochastic and noisy deterministic processes. We study the variation of ApEn with input parameter choices, reemphasizing that ApEn is a relative measure of regularity. We study the bias in the ApEn statistic, and present evidence for asymptotic normality in the ApEn distributions, assuming weak dependence. We provide a new test for the hypothesis that an underlying time-series is generated by i.i.d. variables, which does not require distribution specification. We introduce randomized ApEn, which derives an empirical significance probability that two processes differ, based on one data set from each process. 相似文献

18.

Left truncation in linked data: A practical guide to understanding left truncation and applying it using SAS and R

Yanling Jin Thanh G. N. Ton Devin Incerti Sylvia Hu 《Pharmaceutical statistics》2023,22(1):194-204

Time-to-event data such as time to death are broadly used in medical research and drug development to understand the efficacy of a therapeutic. For time-to-event data, right censoring (data only observed up to a certain point of time) is common and easy to recognize. Methods that use right censored data, such as the Kaplan–Meier estimator and the Cox proportional hazard model, are well established. Time-to-event data can also be left truncated, which arises when patients are excluded from the sample because their events occur before a specific milestone, potentially resulting in an immortal time bias. For example, in a study evaluating the association between biomarker status and overall survival, patients who did not live long enough to receive a genomic test were not observed in the study. Left truncation causes selection bias and often leads to an overestimate of survival time. In this tutorial, we used a nationwide electronic health record-derived de-identified database to demonstrate how to analyze left truncated and right censored data without bias using example code from SAS and R. 相似文献

19.

Books on data mining are reviewed

《Journal of Statistical Computation and Simulation》2012,82(1):107-110

We introduce a process of non-intersecting convex particles by thinning a primary particle process such that the remaining particles are mutually non-intersecting and have maximum total volume among all such subsystems. This approach is based on the idea to construct hardcore processes by suitable dependent thinnings proposed by Matérn but generates packings with higher volume fractions than the known thinning models. Due to the enormous complexity of the computations involved, we develop a two-phase heuristic algorithm whose first phase turns out to yield a structure of Matérn III type. We focus mainly on the generation of packings with high volume fractions and present some simulation results for Poisson primary particle processes of equally sized balls in ?² and ?³. The results are compared with the well-known random sequential adsorption model and Matérn type models. 相似文献

20.

Clustering transformed compositional data using K-means,with applications in gene expression and bicycle sharing system data

Antoine Godichon-Baggioni Cathy Maugis-Rabusseau Andrea Rau 《Journal of applied statistics》2019,46(1):47-65

Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data. 相似文献