Similar Documents
A total of 20 similar documents were retrieved.
1.
Communications in Statistics: Theory and Methods, 2012, 41(16-17): 3179-3197
Text clustering is an unsupervised process of grouping texts and words into different clusters. In the literature, many algorithms use a bag-of-words model to represent texts and classify their content. The bag-of-words model assumes that word order has no significance. The aim of this article is to propose a new method of text clustering that considers the links between terms and documents. We use centrality measures to assess word/text importance in a corpus and to sequentially classify documents.
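The abstract does not say which centrality measure is used, so the toy sketch below is an illustration rather than the authors' procedure: it builds a term co-occurrence graph from a few documents and ranks terms by degree centrality with networkx.

```python
# A minimal sketch (not the authors' exact procedure): build a term graph from
# co-occurrence within documents and rank terms by degree centrality.
from itertools import combinations
import networkx as nx

docs = [
    "bayesian inference for text clustering",
    "graph centrality measures for text mining",
    "clustering documents with graph centrality",
]

G = nx.Graph()
for doc in docs:
    terms = set(doc.split())
    for u, v in combinations(sorted(terms), 2):
        # accumulate co-occurrence counts as edge weights
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

centrality = nx.degree_centrality(G)      # importance of each term in the corpus
top_terms = sorted(centrality, key=centrality.get, reverse=True)[:5]
print(top_terms)
```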

2.
The Yule–Simon distribution has so far remained off the radar of the Bayesian community. In this note, we propose an explicit Gibbs sampling scheme when a Gamma prior is chosen for the shape parameter. The performance of the algorithm is illustrated with simulation studies, including count data regression, and a real data application to text analysis. We compare our proposal with its frequentist counterparts and show that our algorithm performs better when the sample size is small.
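The abstract does not restate the model, so the block below is a standard setup rather than the paper's exact sampler: the Yule–Simon pmf, its exponential mixture representation, and the conjugate update for the shape parameter $\rho$ that such an augmentation would give in a Gibbs scheme (the conditional of the latent $w_i$ is nonstandard and is omitted here).

$$
p(k\mid\rho)=\rho\,B(k,\rho+1),\qquad k=1,2,\dots
$$

If $W\sim\mathrm{Exp}(\rho)$ and $K\mid W=w\sim\mathrm{Geom}(e^{-w})$ on $\{1,2,\dots\}$, then $K$ has this pmf. With $\rho\sim\mathrm{Gamma}(a,b)$ and latent draws $w_1,\dots,w_n$,

$$
\rho\mid w_{1:n}\sim\mathrm{Gamma}\Big(a+n,\;b+\sum_{i=1}^{n}w_i\Big).
$$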

3.
Classical statistical approaches for multiclass probability estimation are typically based on regression techniques such as multiple logistic regression, or density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions on the form of the probability functions or on the underlying distributions of the subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme solves a series of weighted large-margin classifiers and then systematically extracts the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied to a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish the asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate the competitive performance of the new approach and to compare it with several other existing methods.

4.
Zipf's experimental law states that, for a given large piece of text, the product of the relative frequency of a word and its rank in descending frequency order is a constant, shown to be equal to 1 divided by the natural logarithm of the number of different words. The law is shown to be approximately equivalent to Benford's logarithmic distribution of first significant digits in tables of numbers. Eleven samples allow comparison of observed and theoretical frequencies.
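Stated compactly, with $V$ denoting the number of distinct words and $p_r$ the relative frequency of the word of rank $r$ (notation introduced here), the two laws referred to above are

$$
p_r\,r\;\approx\;\frac{1}{\ln V},
\qquad
P(\text{first significant digit}=d)=\log_{10}\!\Big(1+\frac{1}{d}\Big),\quad d=1,\dots,9.
$$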

5.
In this paper, the Pitman closeness criterion is used to compare how near record values and order statistics from two independent samples are to a specific population quantile of the parent distribution, when the two underlying distributions are the same. General expressions for the associated Pitman closeness probability are obtained when the support of the parent distribution is bounded and also when it is unbounded. Some distribution-free results are achieved for symmetric distributions. The exponential and uniform distributions are considered for illustrative purposes, and exact expressions are obtained in each case.
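For reference, the generic Pitman closeness criterion compares two competitors $T_1$ and $T_2$ of a target quantity $\theta$ (here $\theta$ is a population quantile $\xi_p$, and $T_1$, $T_2$ are a record value and an order statistic; the notation is introduced here):

$$
\pi(T_1,T_2;\theta)=P\big(|T_1-\theta|<|T_2-\theta|\big),
$$

with $T_1$ declared Pitman-closer to $\theta$ than $T_2$ whenever this probability exceeds $1/2$.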

6.
The generalized exponential distribution is one of the most commonly used distributions for analyzing lifetime data. It has several desirable properties and can be used quite effectively to analyze skewed lifetime data. The main aim of this paper is to introduce an absolutely continuous bivariate generalized exponential distribution using the method of Block and Basu (1974); in effect, the Block and Basu bivariate exponential distribution is extended to the generalized exponential case. We call the new model the Block and Basu bivariate generalized exponential distribution and discuss its different properties. In this case the joint probability density function and the joint cumulative distribution function can be expressed in compact forms. The model has four unknown parameters, and the maximum likelihood estimators cannot be obtained in explicit form; computing them directly requires solving a four-dimensional optimization problem. An EM algorithm is therefore proposed to compute the maximum likelihood estimates of the unknown parameters. One data analysis is provided for illustrative purposes. Finally, we propose some generalizations of the model and compare them with each other.

7.
A probabilistic expert system provides a graphical representation of a joint probability distribution that enables local computations of probabilities. Dawid (1992) provided a flow-propagation algorithm for finding the most probable configuration of the joint distribution in such a system. This paper analyses that algorithm in detail and shows how it can be combined with a clever partitioning scheme to formulate an efficient method for finding the M most probable configurations. The algorithm is a divide-and-conquer technique that iteratively identifies the M most probable configurations.

8.
An efficient simulation algorithm for random sequential adsorption of spheres with radii chosen from a (prior) probability distribution is implemented. The algorithm is based on dividing the whole domain into small subcubes of different edge lengths. Samples obtained by this algorithm satisfy the jamming limit property, i.e., no further sphere can be placed in the final configuration without overlapping. Samples for both discrete and continuous radii distributions are simulated and analyzed, in particular the jamming coverage, pair correlation functions and posterior radii distributions of the obtained sphere configurations.
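A naive version of random sequential adsorption is easy to sketch; the version below uses brute-force overlap checks and a long streak of failed placements as a crude jamming proxy, whereas the paper's subcube decomposition is exactly what removes the quadratic cost. The radii values and stopping rule here are illustrative choices.

```python
# A naive sketch of random sequential adsorption in the unit domain
# (O(n^2) overlap checks; the paper's subcube decomposition makes this efficient).
import numpy as np

rng = np.random.default_rng(0)
dim, max_failures = 2, 5_000           # 2-D discs here for speed; same idea in 3-D
centres, radii = [], []

failures = 0
while failures < max_failures:          # long failure streak ~ proxy for jamming
    r = rng.choice([0.03, 0.05])        # radius drawn from a (prior) distribution
    c = rng.uniform(r, 1 - r, size=dim) # keep the disc inside the domain
    ok = all(np.linalg.norm(c - c2) >= r + r2 for c2, r2 in zip(centres, radii))
    if ok:
        centres.append(c); radii.append(r); failures = 0
    else:
        failures += 1

coverage = sum(np.pi * r**2 for r in radii)   # area fraction covered (2-D)
print(len(radii), "discs placed, coverage ~", round(coverage, 3))
```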

9.
The problem posed by exact confidence intervals (CIs) that can be either all-inclusive or empty for a nonnegligible set of sample points is known to have no solution within CI theory. Confidence belts causing improper CIs can be modified by using margins of error from the renewed theory of errors initiated by J. W. Tukey, briefly described in the article, for which an extended version of Fraser's frequency interpretation is given. This approach is consistent with Kolmogorov's axiomatization of probability, in which a probability and an error measure obey the same axioms, although the connotations of the two words differ. An algorithm capable of producing a margin of error for any parameter derived from the five parameters of the bivariate normal distribution is provided. Margins of error correcting Fieller's CIs for a ratio of means are obtained, as are margins of error replacing Jolicoeur's CIs for the slope of the major axis. Margins of error using Dempster's conditioning that can correct optimal, but improper, CIs for the noncentrality parameter of a noncentral chi-square distribution are also given.

10.
Some partial orderings that compare probability distributions with the exponential distribution are found to be very useful for understanding the phenomenon of ageing. Here, we introduce some new generalized partial orderings which describe the same kind of phenomenon for some generalized ageing classes. We give equivalent conditions for each of the orderings, and the inter-relations among the generalized orderings are also discussed.
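As a classical reference point (not one of the new orderings proposed in the paper), the convex transform order compares a distribution with the exponential as follows: for distribution functions $F$ and $G$ of $X$ and $Y$,

$$
X\le_{c}Y \iff G^{-1}\!\big(F(x)\big)\ \text{is convex on the support of }F,
$$

and taking $Y$ exponential, $X\le_{c}Y$ holds exactly when $X$ belongs to the IFR (increasing failure rate) ageing class.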

11.
Parameters of a finite mixture model are often estimated by the expectation–maximization (EM) algorithm, in which the observed-data log-likelihood function is maximized. This paper proposes an alternative approach for fitting finite mixture models. Our method, called iterative Monte Carlo classification (IMCC), is also an iterative fitting procedure. Within each iteration, it first estimates the membership probabilities for each data point, namely the conditional probability that a data point belongs to a particular mixture component given its observed value; it then classifies each data point into a component distribution using the estimated conditional probabilities and the Monte Carlo method; it finally updates the parameters of each component distribution based on the classified data. Simulation studies were conducted to compare IMCC with some other algorithms for fitting mixtures of normal and of t densities.
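The three steps described above can be sketched for a two-component normal mixture as follows; this is an illustration of the idea, not the authors' implementation, and the data, starting values and iteration count are arbitrary choices.

```python
# Sketch of the three IMCC-style steps for a two-component normal mixture.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # 1) membership probabilities P(component k | x_i)
    dens = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sd)])
    resp = dens / dens.sum(axis=0)
    # 2) Monte Carlo classification: sample a component label for each point
    labels = (rng.uniform(size=x.size) > resp[0]).astype(int)
    # 3) update each component's parameters from its classified data
    for k in range(2):
        xk = x[labels == k]
        pi[k], mu[k], sd[k] = xk.size / x.size, xk.mean(), xk.std(ddof=1)

print(np.round(pi, 2), np.round(mu, 2), np.round(sd, 2))
```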

12.
We consider the change-point problem in a classical framework while assuming a probability distribution for the change-point. An EM algorithm is proposed to estimate the distribution of the change-point. A change-point model for multiple profiles is also proposed, and an EM algorithm is presented to estimate that model. Two examples, Illinois traffic data and the Dow Jones Industrial Average, are used to demonstrate the proposed methods.
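One way to see how an EM algorithm can target the distribution of a change-point is to treat the change-point as a latent variable with a discrete prior. The sketch below does this for a simple normal mean-shift model with known unit variance; it is a toy special case written for illustration, not necessarily the paper's model.

```python
# EM sketch for a normal mean-shift model with a latent change-point tau having a
# discrete (uniform) prior -- a simple special case, not necessarily the paper's model.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 60), rng.normal(2, 1, 40)])   # true change at 60
n, sigma = x.size, 1.0
prior = np.ones(n - 1) / (n - 1)        # tau in {1, ..., n-1}; first segment is x[:tau]
mu1, mu2 = x[: n // 2].mean(), x[n // 2:].mean()

for _ in range(100):
    # E-step: posterior P(tau = t | data) for every candidate change-point
    ll = np.array([norm.logpdf(x[:t], mu1, sigma).sum() +
                   norm.logpdf(x[t:], mu2, sigma).sum() for t in range(1, n)])
    w = np.exp(ll - ll.max()) * prior
    w /= w.sum()
    # M-step: P(observation i lies before the change) = P(tau > i) = sum_{t > i} w[t-1]
    p_before = np.array([w[i:].sum() for i in range(n)])
    mu1 = np.sum(p_before * x) / p_before.sum()
    mu2 = np.sum((1 - p_before) * x) / (1 - p_before).sum()

print(round(mu1, 2), round(mu2, 2), "MAP change-point:", int(np.argmax(w)) + 1)
```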

13.
A Wald test-based approach for power and sample size calculations has recently been presented for logistic and Poisson regression models using the asymptotic normal distribution of the maximum likelihood estimator; it is applicable to tests of a single parameter. Unlike previous procedures involving the use of score and likelihood ratio statistics, there is no simple and direct extension of this approach to tests of more than a single parameter. In this article, we present a method for computing sample size and statistical power employing the discrepancy between the noncentral and central chi-square approximations to the distribution of the Wald statistic with unrestricted and restricted parameter estimates, respectively. The distinguishing features of the proposed approach are the accommodation of tests about multiple parameters, the flexibility of covariate configurations and the generality of overall response levels within the framework of generalized linear models. The general procedure is illustrated with some special situations that have motivated this research. Monte Carlo simulation studies are conducted to assess its accuracy and compare it with existing approaches under several model specifications and covariate distributions.
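Once a noncentrality parameter is available, the power calculation itself is a one-liner against the (non)central chi-square distributions. The sketch below assumes the noncentrality parameter is already given (in the article it would come from the discrepancy described above) and scales it linearly with the sample size, which is an illustrative simplification.

```python
# Power of a Wald test of q parameters via the (non)central chi-square approximation;
# the noncentrality parameter ncp is assumed given here.
from scipy.stats import chi2, ncx2

def wald_power(ncp, q, alpha=0.05):
    crit = chi2.ppf(1 - alpha, df=q)          # central chi-square critical value
    return ncx2.sf(crit, df=q, nc=ncp)        # P(noncentral chi-square exceeds it)

def sample_size(ncp_per_subject, q, target=0.8, alpha=0.05):
    n = 1
    while wald_power(n * ncp_per_subject, q, alpha) < target:
        n += 1
    return n

print(wald_power(ncp=10.0, q=2))              # roughly 0.88 for ncp = 10, 2 df
print(sample_size(ncp_per_subject=0.05, q=2)) # smallest n reaching 80% power
```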

14.
We study the properties of truncated gamma distributions and derive simulation algorithms which dominate the standard algorithms for these distributions. For the right-truncated gamma distribution, an optimal accept–reject algorithm is based on the fact that its density can be expressed as an infinite mixture of beta densities. For integer values of the shape parameter, the density of the left-truncated distribution can be rewritten as a mixture which can easily be generated; we give an optimal accept–reject algorithm for the other values of the parameter. We compare the efficiency of our algorithms with previous methods and show the improvement in terms of minimum acceptance probability. The algorithm proposed here has an acceptance probability greater than e/4.
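For comparison, a right-truncated gamma variate can also be generated exactly by inverting the regularized incomplete gamma function on the truncated range; the sketch below is that baseline, not the paper's beta-mixture accept–reject algorithm.

```python
# Baseline sampler for Gamma(a, 1) right-truncated at t, via the inverse of the
# regularized incomplete gamma function (NOT the paper's beta-mixture algorithm).
import numpy as np
from scipy.special import gammainc, gammaincinv

def right_truncated_gamma(a, t, size, rng=np.random.default_rng()):
    Ft = gammainc(a, t)              # CDF value at the truncation point
    u = rng.uniform(0, Ft, size)     # uniform draw on (0, F(t))
    return gammaincinv(a, u)         # invert the CDF

x = right_truncated_gamma(a=2.5, t=1.0, size=5)
print(x, (x <= 1.0).all())
```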

15.
Tang Xiaobin et al., 《统计研究》 (Statistical Research), 2021, 38(8): 146-160
This paper innovatively combines the semi-supervised interactive keyword-extraction algorithm Term Frequency-Inverse Document Frequency (TF-IDF) with the Bidirectional Encoder Representations from Transformers (BERT) model to design a text-mining technique that expands the seed keywords used for CPI forecasting. The interactive TF-IDF algorithm first broadens the original set of CPI-forecasting seed keywords; on this basis, a BERT "two-stage" retrieval-and-filtering model mines the texts in depth and matches keywords, deepening the keyword expansion and thereby building a keyword lexicon for CPI forecasting. The paper then fits forecasting models with the keywords before and after this text-mining feature expansion and compares them. The study shows that, compared with traditional keyword-extraction algorithms, the interactive TF-IDF algorithm requires no external corpus and, moreover, allows seed words to be supplied as input. Meanwhile, the BERT model is fine-tuned through transfer learning to acquire domain-specific knowledge, and it handles language representation, semantic expansion and human-machine interaction well in the CPI-forecasting problem. Relative to traditional text-mining techniques, the technique designed here has stronger generalization and representation ability: starting from 84 key CPI-forecasting seed words, the expanded keyword set yields higher forecasting accuracy for the CPI and more complete interpretability. The text-mining technique designed here for the CPI-forecasting problem also provides a new research approach and reference value for building keyword lexicons for other macroeconomic indicators.
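To make the first stage concrete, the sketch below scores candidate terms by TF-IDF and keeps high-scoring terms that co-occur with the seed keywords. It is only a plain TF-IDF illustration with made-up documents and seeds; the interactive algorithm and the BERT retrieval-and-filtering stage described above are not reproduced.

```python
# Sketch of the TF-IDF stage only: expand seed keywords by TF-IDF score and
# co-occurrence with the seeds (toy documents; not the paper's interactive or BERT stages).
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = [
    "pork prices and vegetable prices pushed consumer inflation higher",
    "fuel costs eased while rent and food prices stayed elevated",
    "the consumer price index rose on higher food and energy costs",
]
seeds = {"prices", "inflation"}

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                      # documents x terms TF-IDF matrix
terms = np.array(vec.get_feature_names_out())
scores = np.asarray(X.mean(axis=0)).ravel()      # average TF-IDF weight per term

# keep high-scoring terms appearing in at least one document that contains a seed word
seed_docs = [i for i, d in enumerate(docs) if seeds & set(d.split())]
in_seed_docs = np.asarray((X[seed_docs] > 0).sum(axis=0)).ravel() > 0
expanded = sorted(terms[in_seed_docs & (scores > np.median(scores))])
print(expanded)
```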

16.
This paper describes a new program, CORRECT, which takes words rejected by the Unix® SPELL program, proposes a list of candidate corrections, and sorts them by probability score. The probability scores are the novel contribution of this work. They are based on a noisy channel model. It is assumed that the typist knows what words he or she wants to type but some noise is added on the way to the keyboard (in the form of typos and spelling errors). Using a classic Bayesian argument of the kind that is popular in recognition applications, especially speech recognition (Jelinek, 1985), one can often recover the intended correction, c, from a typo, t, by finding the correction c that maximizes Pr(c) Pr(t|c). The first factor, Pr(c), is a prior model of word probabilities; the second factor, Pr(t|c), is a model of the noisy channel that accounts for spelling transformations on letter sequences (insertions, deletions, substitutions and reversals). Both sets of probabilities were estimated using data collected from the Associated Press (AP) newswire over 1988 and 1989 as a training set. The AP generates about 1 million words and 500 typos per week. In evaluating the program, we found that human judges were extremely reluctant to cast a vote given only the information available to the program, and that they were much more comfortable when they could see a concordance line or two. The second half of this paper discusses some very simple methods of modeling the context using n-gram statistics. Although n-gram methods are much too simple (compared with much more sophisticated methods used in artificial intelligence and natural language processing), we have found that even these very simple methods illustrate some very interesting estimation problems that will almost certainly come up when we consider more sophisticated models of context. The problem is how to estimate the probability of a context that we have not seen. We compare several estimation techniques and find that some are useless. Fortunately, we have found that the Good-Turing method provides an estimate of contextual probabilities that produces a significant improvement in program performance. Context is helpful in this application, but only if it is estimated very carefully. At this point, we have a number of different knowledge sources (the prior, the channel and the context), and there will certainly be more in the future. In general, performance will be improved as more and more knowledge sources are added to the system, as long as each additional knowledge source provides some new (independent) information. As we shall see, it is important to think more carefully about combination rules, especially when there are a large number of different knowledge sources.
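A minimal version of the noisy-channel argument is easy to sketch: generate candidates within one edit of the typo and pick the one maximizing Pr(c) Pr(t|c). The word-frequency table and the constant-per-edit channel model below are placeholders, not the AP-trained priors and confusion matrices used by CORRECT.

```python
# Minimal noisy-channel sketch: choose the candidate c maximizing Pr(c) * Pr(t|c).
# Tiny toy prior and a crude channel model; not CORRECT's trained models.
from collections import Counter

word_freq = Counter({"the": 500, "there": 120, "their": 110, "then": 90, "than": 70})
total = sum(word_freq.values())

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def correct(typo):
    candidates = [c for c in edits1(typo) if c in word_freq] or [typo]
    # Pr(t|c) is crudely taken as constant across all single-edit candidates
    return max(candidates, key=lambda c: word_freq[c] / total)

print(correct("thre"))   # picks the highest-prior single-edit candidate
```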

17.
In this article, we compare the zero-inflated Poisson (ZIP) and negative binomial (NB) distributions based on the three most important criteria: the probability of zero, the mean value, and the variance. Our results show that with the same mean value and variance, the ZIP distribution always has a larger probability of zero; with the same mean value and probability of zero, the NB distribution always has a larger variance; and with the same variance and probability of zero, the ZIP distribution always has a larger mean value. We also study the properties of the Vuong test for model selection in these three cases by simulation.
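Under the usual parameterizations (which may differ from the article's notation), the three criteria are, for ZIP$(\pi,\lambda)$ and for an NB variable with mean $\mu$ and dispersion $k$:

$$
P(X=0)=\pi+(1-\pi)e^{-\lambda},\qquad E[X]=(1-\pi)\lambda,\qquad \mathrm{Var}(X)=(1-\pi)\lambda(1+\pi\lambda),
$$

$$
P(X=0)=\Big(\frac{k}{k+\mu}\Big)^{k},\qquad E[X]=\mu,\qquad \mathrm{Var}(X)=\mu+\frac{\mu^{2}}{k}.
$$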

18.
Clinical trials are often designed to compare several treatments with a common control arm in pairwise fashion. In this paper, we study optimal designs for such studies, based on minimizing the total number of patients required to achieve a given level of power. A common approach when designing studies to compare several treatments with a control is to achieve the desired power for each individual pairwise treatment comparison. However, it is often more appropriate to characterize power in terms of the family of null hypotheses being tested, and to control the probability of rejecting all, or alternatively any, of these individual hypotheses. While all approaches lead to unbalanced designs with more patients allocated to the control arm, it is found that the optimal design and the required number of patients can vary substantially depending on the chosen characterization of power. The methods make allowance for both continuous and binary outcomes and are illustrated with reference to two clinical trials, one involving multiple doses compared with placebo and the other involving combination therapy compared with monotherapies. In one example, a 55% reduction in sample size is achieved through an optimal design combined with the appropriate characterization of power.
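The distinction between the two characterizations of power is easy to see by simulation. The sketch below estimates the probability of rejecting all versus any of the pairwise comparisons of k treatments with a shared control under normal outcomes with known standard deviation; the allocation, effect size and unadjusted one-sided z-tests are illustrative choices, not the paper's optimal design.

```python
# Simulation sketch: "reject all" vs "reject any" power in a multi-arm trial
# with a shared control arm (normal outcomes, known SD, unadjusted z-tests).
import numpy as np
from scipy.stats import norm

def power(n_ctrl, n_trt, k=2, delta=0.5, sd=1.0, alpha=0.025, reps=20_000, seed=3):
    rng = np.random.default_rng(seed)
    crit = norm.ppf(1 - alpha)
    ctrl = rng.normal(0.0, sd, (reps, n_ctrl)).mean(axis=1)
    trts = rng.normal(delta, sd, (reps, k, n_trt)).mean(axis=2)
    se = sd * np.sqrt(1 / n_ctrl + 1 / n_trt)
    z = (trts - ctrl[:, None]) / se                 # one z-statistic per comparison
    reject = z > crit
    return reject.all(axis=1).mean(), reject.any(axis=1).mean()

print(power(n_ctrl=90, n_trt=64))   # (power to reject all, power to reject any)
```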

19.
A large number of different definitions of sample quantiles are used in statistical computer packages. Often, within the same package, one definition will be used to compute a quantile explicitly, while other definitions may be used when producing a boxplot, a probability plot, or a QQ plot. We compare the most commonly implemented sample quantile definitions by writing them in a common notation and investigating their motivation and some of their properties. We argue that there is a need to adopt a standard definition for sample quantiles so that the same answers are produced by different packages and within each package. We conclude by recommending that the median-unbiased estimator be used, because it has most of the desirable properties of a quantile estimator and can be defined independently of the underlying distribution.
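The practical point is easy to demonstrate: the same data and probability level give different quantiles under different definitions. Recent numpy versions (1.22 and later, to the best of my knowledge) expose several of the classical definitions through the `method` argument, including a median-unbiased one.

```python
# Different sample-quantile definitions give different answers on the same data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
for method in ("linear", "lower", "nearest", "median_unbiased", "normal_unbiased"):
    print(f"{method:>16}: {np.quantile(x, 0.9, method=method):.3f}")
```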

20.

Ordinal data are often modeled using a continuous latent response distribution, which is partially observed through windows of adjacent intervals defined by cutpoints. In this paper we propose the beta distribution as a model for the latent response. The beta distribution has several advantages over the other commonly used distributions, e.g., the normal and the logistic. In particular, it enables separate modeling of location and dispersion effects, which is essential in the Taguchi method of robust design. First, we study the problem of estimating the location and dispersion parameters of a single beta distribution (representing a single treatment) from ordinal data, assuming known equispaced cutpoints. Two methods of estimation are compared: the maximum likelihood method and the method of moments. Two ways of treating the data are considered: in raw discrete form and in smoothed continuousized form. A large-scale simulation study is carried out to compare the different methods. The mean square errors of the estimates are obtained under a variety of parameter configurations, and comparisons are made based on the ratios of the mean square errors (the relative efficiencies). No method is universally the best, but the maximum likelihood method using continuousized data is found to perform generally well, especially for estimating the dispersion parameter. This method is also computationally much faster than the other methods and does not experience convergence difficulties in the case of sparse or empty cells. Next, the problem of estimating unknown cutpoints is addressed. Here the multiple-treatment setup is considered since, in an actual application, cutpoints are common to all treatments and must be estimated from all the data. A two-step iterative algorithm is proposed for estimating the location and dispersion parameters of the treatments, and the cutpoints. The proposed beta model and McCullagh's (1980) proportional odds model are compared by fitting them to two real data sets.
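The single-treatment, known-cutpoint case can be sketched directly: cell probabilities are increments of the Beta CDF at the equispaced cutpoints, and (alpha, beta) can be estimated by maximizing the multinomial log-likelihood of the raw counts. This corresponds to the raw-data maximum likelihood variant only; the counts and parameterization below are illustrative.

```python
# Latent-beta model for ordinal data with known equispaced cutpoints:
# P(Y = j) = F_beta(c_j) - F_beta(c_{j-1}); fit (alpha, beta) by raw-data ML.
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize

cutpoints = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])   # 5 ordinal categories on (0, 1)
counts = np.array([5, 12, 30, 28, 25])                  # observed category frequencies

def negloglik(params):
    a, b = np.exp(params)                                # log-parameterization keeps a, b > 0
    cell_probs = np.diff(beta.cdf(cutpoints, a, b))      # increments of the Beta CDF
    return -np.sum(counts * np.log(cell_probs + 1e-300))

fit = minimize(negloglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print(np.exp(fit.x))                                     # estimated (alpha, beta)
```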
