Similar Articles
Found 20 similar articles (search time: 265 ms)
1.
The purpose of this article is to review two text mining packages, namely, WordStat and SAS TextMiner. WordStat is developed by Provalis Research. SAS TextMiner is a product of SAS. We review the features offered by each package on each of the following key steps in analyzing unstructured data: (1) data preparation, including importing and cleaning; (2) performing association analysis; and (3) presenting the findings, including illustrative quotes and graphs. We also evaluate each package on its ability to help researchers extract major themes from a dataset. Both packages offer a variety of features that effectively help researchers run associations and present results. However, in extracting themes from unstructured data, both packages were only marginally helpful. The researcher still needs to read the data and make all the difficult decisions. This finding stems from the fact that the software can search only for specific terms in documents or categorize documents based on common terms. Respondents, however, may use the same term or combination of terms to mean different things. This implies that a text mining approach, which is based on analysis units other than terms, may be more powerful in extracting themes, an idea we touch upon in the conclusion section.
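The term-matching limitation described here can be made concrete with a small sketch; neither WordStat nor SAS TextMiner is involved, and the responses and keyword lists below are invented.

```python
# Minimal sketch of term-based document categorization, the approach the
# review finds only marginally helpful for theme extraction. All responses
# and keyword lists are invented for illustration.

def categorize(document, themes):
    """Assign every theme whose keyword list matches a term in the document."""
    tokens = set(document.lower().split())
    return [name for name, keywords in themes.items()
            if tokens & set(keywords)]

themes = {
    "price":   {"cost", "price", "expensive"},
    "quality": {"cheap", "sturdy", "broke"},
}

# The same term ("cheap") triggers the "quality" theme in both responses,
# even though one respondent means low price and the other poor build;
# exactly the ambiguity the reviewers describe.
r1 = categorize("the product was cheap and arrived fast", themes)
r2 = categorize("it felt cheap and broke in a week", themes)
```

Both responses land in the same category, even though a human reader would assign them to different themes.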

2.
3.
We review five packages for estimating finite mixtures: BINOMIX, C.A. MAN, MIX, and the maximum likelihood routines of BMDP and STATA. The focus of the review is on numerical issues rather than matters such as user interface, because the success or failure of an algorithm to yield a mixture model is likely to be the most important issue facing a researcher. The problem of suitable initial values is discussed throughout.

4.
Statistical database management systems keep raw, elementary and/or aggregated data and include query languages with facilities to calculate various statistics from this data. In this article we examine statistical database query languages with respect to the criteria identified and taxonomy developed in Ozsoyoglu and Ozsoyoglu (1985b). The criteria include statistical metadata and objects, aggregation features and interface to statistical packages. The taxonomy of statistical database query languages classifies them with respect to the data model used, the type of user interface and method of implementation. Temporal databases are rich sources of data for statistical analysis. Aggregation features of temporal query languages, as well as the issues in calculating aggregates from temporal data, are also examined.

5.
Many different models for the analysis of high-dimensional survival data have been developed over the past years. While some of the models and implementations come with an internal parameter tuning automatism, others require the user to accurately adjust defaults, which often feels like a guessing game. Exhaustively trying out all model and parameter combinations will quickly become tedious or infeasible in computationally intensive settings, even if parallelization is employed. Therefore, we propose to use modern algorithm configuration techniques, e.g. iterated F-racing, to efficiently move through the model hypothesis space and to simultaneously configure algorithm classes and their respective hyperparameters. In our application we study four lung cancer microarray data sets. For these we configure a predictor based on five survival analysis algorithms in combination with eight feature selection filters. We parallelize the optimization and all comparison experiments with the BatchJobs and BatchExperiments R packages.
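The racing idea behind configuration methods such as iterated F-racing can be sketched in a few lines: evaluate all surviving candidate configurations on one problem instance at a time and repeatedly discard the worse half. This is a heavily simplified illustration, not the irace algorithm; the toy loss, configurations, and elimination schedule are invented, and real racing uses statistical tests on noisy performance rather than a fixed cut.

```python
def race(configs, evaluate, n_instances=16, keep_frac=0.5, min_survivors=1):
    """Racing sketch: score all surviving configurations on one instance at
    a time, then drop the worse half, until one configuration remains."""
    survivors = list(configs)
    scores = {c: 0.0 for c in survivors}
    for i in range(n_instances):
        for c in survivors:
            scores[c] += evaluate(c, i)
        if len(survivors) > min_survivors:
            survivors.sort(key=lambda c: scores[c])  # lower loss is better
            survivors = survivors[:max(min_survivors,
                                       int(len(survivors) * keep_frac))]
    return survivors[0]

def toy_loss(c, instance):
    # Deterministic toy loss with optimum at c = 0.3; a real benchmark
    # would be noisy, which is why irace races with statistical tests.
    return (c - 0.3) ** 2 * (1.0 + 0.1 * (instance % 3))

best = race([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8], toy_loss)
```

The budget saved by early elimination is what makes exhaustive grids unnecessary in settings like the one the abstract describes.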

6.
Eight statistical software packages for general use by non-statisticians are reviewed. The packages are GraphPad Prism, InStat, ISP, NCSS, SigmaStat, Statistix, Statmost, and Winks. Summary tables of statistical capabilities and “usability” features are followed by discussions of each package. Discussions include system requirements, data import capabilities, statistical capabilities, and user interface. Recommendations, based on user needs and sophistication, are presented following the reviews.

7.
In modelling repeated count outcomes, generalized linear mixed-effects models are commonly used to account for within-cluster correlations. However, inconsistent results are frequently generated by various statistical R packages and SAS procedures, especially in the case of a moderate or strong within-cluster correlation or overdispersion. We investigated the underlying numerical approaches and statistical theories on which these packages and procedures are built. We then compared the performance of these statistical packages and procedures by simulating both Poisson-distributed and overdispersed count data. The SAS NLMIXED procedure outperformed the other procedures in all settings.
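A minimal sketch of the kind of data-generating process such comparisons simulate: counts with a cluster-level random effect on the log mean, which induces within-cluster correlation and marginal overdispersion relative to Poisson. The parameter values are invented for illustration.

```python
import numpy as np

# Repeated counts with a cluster (random) effect: conditionally Poisson,
# marginally overdispersed because the cluster effect varies.
rng = np.random.default_rng(42)

n_clusters, n_per_cluster = 200, 5
beta0 = 1.0                    # fixed intercept on the log scale
sigma_b = 0.8                  # random-effect standard deviation

b = rng.normal(0.0, sigma_b, size=n_clusters)        # cluster effects
mu = np.exp(beta0 + np.repeat(b, n_per_cluster))     # conditional means
y = rng.poisson(mu)

# Marginal variance exceeds the marginal mean: overdispersion.
overdispersion_ratio = y.var() / y.mean()
```

Fitting a GLMM to data like `y` is exactly the step on which the compared packages disagree.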

8.
Five statistical software packages for epidemiology and clinical trials are reviewed. The five packages are EPI INFO, EPICURE, EPILOG PLUS, STATA, and TRUE EPI-STAT. Only DOS versions of these packages are compared and rated (Windows versions are discussed but not rated). Although the packages differ in their target audiences, interfaces, capabilities, and approaches, they are examined according to criteria that are of most interest to epidemiologists, biostatisticians, and others involved in epidemiologic and clinical research. A general discussion with recommendations follows the review of the statistical packages.

9.
Expectile regression [Newey W, Powell J. Asymmetric least squares estimation and testing. Econometrica. 1987;55:819–847] is a useful tool for estimating the conditional expectiles of a response variable given a set of covariates. Expectile regression at the 50% level is classical conditional mean regression. In many real applications, having multiple expectiles at different levels provides a more complete picture of the conditional distribution of the response variable. The multiple linear expectile regression model has been well studied [Newey W, Powell J. Asymmetric least squares estimation and testing. Econometrica. 1987;55:819–847; Efron B. Regression percentiles using asymmetric squared error loss. Stat Sin. 1991;1:93–125], but it can be too restrictive for many real applications. In this paper, we derive a regression tree-based gradient boosting estimator for nonparametric multiple expectile regression. The new estimator, referred to as ER-Boost, is implemented in the R package erboost, publicly available at http://cran.r-project.org/web/packages/erboost/index.html. We use two homoscedastic/heteroscedastic random-function-generator models in simulation to show the high predictive accuracy of ER-Boost. As an application, we apply ER-Boost to analyse North Carolina county crime data. From the nonparametric expectile regression analysis of this dataset, we draw several interesting conclusions that are consistent with the previous study using the economic model of crime. This real data example also demonstrates some attractive features of ER-Boost, such as its ability to handle different types of covariates and its model interpretation tools.
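The asymmetric squared-error loss underlying expectile regression can be illustrated with a small sketch. This computes a plain sample expectile by a weighted-mean fixed point; it is not the tree-based ER-Boost estimator, and the data are invented.

```python
import numpy as np

def expectile(x, tau, n_iter=100):
    """Sample tau-expectile: the minimizer of the asymmetric squared loss
    sum_i w_i(mu) * (x_i - mu)^2, with w_i = tau if x_i > mu else 1 - tau.
    Solved by iterating the weighted-mean fixed point."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    for _ in range(n_iter):
        w = np.where(x > mu, tau, 1.0 - tau)
        mu = np.average(x, weights=w)
    return mu

data = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
```

At tau = 0.5 the weights are symmetric and the expectile is the ordinary mean; higher tau values pull the estimate toward the upper tail, which is what a set of expectiles at several levels exploits to describe the conditional distribution.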

10.
Power analysis for multi-center randomized controlled trials is quite difficult to perform for non-continuous responses when site differences are modeled by random effects using the generalized linear mixed-effects model (GLMM). First, it is not possible to construct power functions analytically, because of the extreme complexity of the sampling distribution of parameter estimates. Second, Monte Carlo (MC) simulation, a popular option for estimating power for complex models, does not work within the current context because of a lack of methods and software packages that provide reliable estimates when fitting such GLMMs; at the time of writing, even packages from software giants like SAS did not provide reliable estimates. Another major limitation of MC simulation is the lengthy running time for complex models such as the GLMM, particularly when estimating power for multiple scenarios of interest. We present a new approach to address these limitations. The proposed approach defines a marginal model to approximate the GLMM and estimates power without relying on MC simulation. The approach is illustrated with both real and simulated data, with the simulation study demonstrating good performance of the method.
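For contrast, here is a minimal sketch of the Monte Carlo power estimation the authors seek to avoid: simulate the trial many times and count rejections. Everything below is simplified for illustration (a two-arm comparison of means with known variance and no site random effects, so it sidesteps the GLMM fitting difficulties the abstract describes).

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_power(delta, n_per_arm, sigma=1.0, n_sim=2000, z_crit=1.96):
    """Monte Carlo power: fraction of simulated trials in which a two-sided
    z-test on the difference in means rejects at the 5% level."""
    rejections = 0
    for _ in range(n_sim):
        x = rng.normal(0.0, sigma, n_per_arm)      # control arm
        y = rng.normal(delta, sigma, n_per_arm)    # treatment arm
        se = sigma * np.sqrt(2.0 / n_per_arm)
        if abs((y.mean() - x.mean()) / se) > z_crit:
            rejections += 1
    return rejections / n_sim

power = mc_power(delta=0.5, n_per_arm=64)
```

Even this trivial model needs thousands of replications per scenario; with a GLMM fitted inside every replication, the running time and fitting failures the abstract mentions become the binding constraint, which motivates the marginal-model approximation.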

11.
Three situations are cited when caution is needed in using statistical computing packages: (a) when analyzing data and having insufficient statistical knowledge to completely understand the output; (b) when teaching the use of packages in a statistics course, to the exclusion of teaching statistics; and (c) when using packages in subject-matter teaching, without teaching the statistical methods underlying the packages.

12.
We propose a parametric nonlinear time-series model, namely the autoregressive stochastic volatility with threshold (AR-SVT) model with a mean equation, for forecasting both level and volatility. A methodology for estimating the parameters of this model is developed by first obtaining the recursive Kalman filter time-update equation and then employing the unrestricted quasi-maximum likelihood method. Furthermore, optimal one-step and two-step-ahead out-of-sample forecast formulae, along with forecast error variances, are derived analytically by recursive use of conditional expectation and variance. As an illustration, volatile all-India monthly spice exports during the period January 2006 to January 2012 are considered. The entire data analysis is carried out using the EViews and MATLAB software packages. The AR-SVT model is fitted and interval forecasts for 10 hold-out data points are obtained. For the data under consideration, this model is shown to describe and forecast volatility better than competing models, namely the AR-GARCH (generalized autoregressive conditional heteroscedastic), AR-EGARCH (exponential GARCH), AR-TGARCH (threshold GARCH), and AR-stochastic volatility models. Finally, for the AR-SVT model, optimal out-of-sample forecasts along with forecasts of one-step-ahead variances are obtained.

13.
For academic libraries, because budgetary pressures are nearly universal, it is imperative to evaluate journal packages regularly. This article presents an overview of the data and methods that the NC State University Libraries traditionally uses to evaluate journal packages and presents additional methods to expand our evaluation of publishing and editorial activity. We describe methods for downloading and analyzing Web of Science citation data to identify the most common publishers for NC State affiliated authors as well as the journals in which NC State authors publish most frequently. This article also demonstrates a custom Python web scraping application to harvest NC State affiliated editor data from publishers’ websites. Finally, this article discusses how these data elements are combined to provide a more comprehensive evaluative strategy for our journal investments.
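The editor-harvesting step can be sketched with the Python standard library alone. The article's actual scraper is not shown here; the page fragment and the `editor-name` class below are invented, and a real publisher site would need its own parsing rules (and attention to robots.txt and terms of use).

```python
from html.parser import HTMLParser

# Invented editorial-board page fragment; a real scraper would fetch the
# page over HTTP and adapt the selectors to each publisher's markup.
SAMPLE_PAGE = """
<div class="board">
  <span class="editor-name">A. Smith (NC State University)</span>
  <span class="editor-name">B. Jones (Duke University)</span>
  <span class="editor-name">C. Lee (NC State University)</span>
</div>
"""

class EditorParser(HTMLParser):
    """Collect the text of every <span class="editor-name"> element."""
    def __init__(self):
        super().__init__()
        self.in_editor = False
        self.editors = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "editor-name") in attrs:
            self.in_editor = True

    def handle_data(self, data):
        if self.in_editor:
            self.editors.append(data.strip())
            self.in_editor = False

parser = EditorParser()
parser.feed(SAMPLE_PAGE)
ncsu_editors = [e for e in parser.editors if "NC State" in e]
```

Filtering the harvested names by affiliation is what links the scraped editor data back to the institution's journal-package evaluation.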

14.
The increase in statistical software applications for PCs has been driven by decreasing hardware costs and dramatically enhanced PC performance. Whereas in the past the domain of statistical computing was reserved for mainframe solutions, a great number of new software packages for PCs have come out in the last five years, and the producers of established mainframe software have therefore been forced to offer PC-based solutions as well. By limiting a market analysis to products with a medium-sized set of well-known statistical methods, the immense number of available products is reduced to about fifty systems. We ordered evaluation copies of these systems to test their numerical quality, speed, and the performance of several procedures. Seventeen packages were made available for extensive examination. This paper (1) discusses the problems of, and solutions for, obtaining a complete and correct data matrix that describes the entire market, and (2) presents the results of a comparative market analysis.

15.
We find that existing multiple imputation procedures that are currently implemented in major statistical packages, and that are available to the wide majority of data analysts, are limited with regard to handling incomplete panel data. We review various missing data methods that we deem useful for the analysis of incomplete panel data and discuss how some of the shortcomings of existing procedures can be overcome. In a simulation study based on real panel data, we illustrate these procedures' quality and outline fruitful avenues of future research.

16.
In real-life situations, we often encounter data sets containing missing observations. Statistical methods that address missingness have been extensively studied in recent years. One of the more popular approaches involves imputation of the missing values prior to the analysis, thereby rendering the data complete. Imputation broadly encompasses an entire scope of techniques that have been developed to make inferences about incomplete data, ranging from very simple strategies (e.g. mean imputation) to more advanced approaches that require estimation, for instance, of posterior distributions using Markov chain Monte Carlo methods. Additional complexity arises when the number of missingness patterns increases and/or when both categorical and continuous random variables are involved. Implementations of routines, procedures, or packages capable of generating imputations for incomplete data are now widely available. We review some of these in the context of a motivating example, as well as in a simulation study, under two missingness mechanisms (missing at random and missing not at random). Thus far, evaluations of existing implementations have frequently centred on the resulting parameter estimates of the prescribed model of interest after imputing the missing data. In some situations, however, interest may very well be in the quality of the imputed values at the level of the individual, an issue that has received relatively little attention. In this paper, we focus on the latter to provide further insight into the performance of the different routines, procedures, and packages in this respect.
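The distinction between parameter-level and value-level evaluation can be made concrete with a small sketch: mask values completely at random, mean-impute them, and score the imputed entries against the held-out truth. The data and masking rate below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

x_true = rng.normal(10.0, 2.0, size=500)
mask = rng.random(500) < 0.2          # ~20% missing, MCAR

x_obs = x_true.copy()
x_obs[mask] = np.nan

x_imp = x_obs.copy()
x_imp[mask] = np.nanmean(x_obs)       # single mean imputation

# Value-level quality: RMSE of imputed entries against the held-out truth.
rmse = np.sqrt(np.mean((x_imp[mask] - x_true[mask]) ** 2))

# Mean imputation recovers the location but shrinks variability.
var_ratio = x_imp.var() / x_true.var()
```

A downstream estimate of the mean would look fine here, while the value-level RMSE and the deflated variance reveal how poor the individual imputations are; this is exactly the gap the abstract highlights.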

17.
Multiple imputation is widely accepted as the method of choice to address item nonresponse in surveys. Nowadays most statistical software packages include features to multiply impute missing values in a dataset. Nevertheless, applying these features to real data poses many implementation problems. Defining useful imputation models for a dataset that consists of categorical and possibly skewed continuous variables, and that contains skip patterns and all sorts of logical constraints, is a challenging task. Moreover, in most applications little attention is paid to evaluating the underlying assumptions behind the imputation models.

18.
This article offers a review of three software packages that estimate directed acyclic graphs (DAGs) from data. The three packages, MIM, Tetrad and WinMine, can help researchers discover underlying causal structure. Although each package uses a different algorithm, the results are to some extent similar. All three packages are free and easy to use. They are likely to be of interest to researchers who do not have strong theory regarding the causal structure in their data. DAG modeling is a powerful analytic tool to consider in conjunction with, or in place of, path analysis, structural equation modeling, and other statistical techniques.

19.
Following on from the work of O'Quigley & Flandre (1994) and, more recently, O'Quigley & Xu (2000), we develop a measure, R2, of the predictive ability of a stratified proportional hazards regression model. The extension of this earlier work to the stratified case is relatively straightforward, both conceptually and in its practical implementation. The extension is nonetheless important, in that the stratified model makes weaker assumptions than the full multivariate model. Formulae are given that can be readily incorporated into standard software routines, since the component parts of the calculations are routinely provided by most packages. We give examples on the predictability of survival in breast cancer data, modelled via proportional hazards and stratified proportional hazards models, the latter being necessary in view of the non-proportional nature of some covariate effects.

20.
Recent evidence indicates that using multiple forward rates sharply predicts future excess returns on U.S. Treasury bonds, with R2 values around 30%. The projection coefficients in these regressions exhibit a distinct pattern related to the maturity of the forward rate. These features of the data, in conjunction with the transition dynamics of bond yields, pose a serious challenge to term structure models. In this article we show that a regime-shifting term structure model can empirically account for these challenging data features, whereas alternative models, such as affine specifications, fail to do so. We find that regimes in the model are intimately related to bond risk premia and real business cycles.
