Similar Articles
20 similar articles found (search time: 515 ms)
1.
Sepsis is one of the biggest risks to patient safety, with a natural mortality rate between 25% and 50%. It is difficult to diagnose, and no validated standard for diagnosis currently exists. A commonly used scoring criterion is the quick sequential organ failure assessment (qSOFA); however, it demonstrates very low specificity in ICU populations. We develop a method to personalize the thresholds in qSOFA by incorporating easily measured patient baseline characteristics. We compare the personalized threshold method to qSOFA, to five previously published methods that obtain an optimal constant threshold for a single biomarker, and to machine learning algorithms based on logistic regression and AdaBoost, using patient data from the MIMIC-III database. The personalized threshold method achieves higher accuracy than qSOFA and the five published methods and has performance comparable to the machine learning methods. Personalized thresholds, however, are much easier to adopt in real-life monitoring than machine learning methods: they are computed once per patient and used in the same way as qSOFA, whereas the machine learning methods are harder to implement and interpret.
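As a rough illustration of the difference between a constant cutoff and a personalized one, the sketch below compares the two rules on simulated vital-sign data. All names, the data-generating process, and the cutoff values are illustrative assumptions, not taken from the MIMIC-III study.

```python
# A minimal sketch (not the authors' code) contrasting a fixed decision
# threshold with a per-patient personalized threshold on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
baseline_rr = rng.normal(16, 2, n)           # patient baseline respiratory rate
rr_now = baseline_rr + rng.normal(0, 3, n)   # current measurement
# noisy illustrative label: deterioration relative to baseline, with 10% label noise
sepsis = (rr_now - baseline_rr > 4).astype(int) ^ (rng.random(n) < 0.1)

fixed_alarm = rr_now >= 22                    # qSOFA-style constant cutoff
personal_alarm = rr_now >= baseline_rr + 5    # personalized cutoff

for name, alarm in [("fixed", fixed_alarm), ("personalized", personal_alarm)]:
    acc = (alarm == sepsis.astype(bool)).mean()
    print(f"{name:12s} accuracy: {acc:.3f}")
```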

2.
China's registered urban unemployment rate has held steady at around 4%, making it a poor gauge of employment dynamics, while the labour force survey's limited sample size means the surveyed urban unemployment rate is not representative for administrative regions below the provincial level. This paper combines machine learning algorithms designed for big data with the accounting principles used for traditional statistics. Using full-sample administrative big data covering 2016-2018 from a city of four million people, we apply machine learning to predict the monthly employment status of every urban resident, and then use statistical accounting methods to estimate the city's unemployment rate. At the individual level, our model reaches 96.7% accuracy on the held-out test set. After statistical aggregation, the estimated local unemployment rate falls within a reasonable range and exhibits clear cyclical features, tracking dynamic changes in employment conditions far better than the locally published once-a-year registered unemployment figures. Building on the individual-level predictions, we further examine the gender and educational profile of the local unemployed population, as well as the timing patterns of re-employment. The paper proposes a new paradigm for using administrative big data to support economic decision-making, with implications for understanding the economy and making policy in the big-data era.
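The two-stage idea (classify each resident's monthly employment status, then aggregate the predictions into an unemployment rate) can be sketched as follows. The features, the gradient-boosting model, and the simulated labels are placeholders, not the paper's administrative data or model.

```python
# Sketch of the predict-then-aggregate pipeline on simulated data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=(n, 5))                  # stand-ins for administrative features
p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
employed = (rng.random(n) < p).astype(int)   # 1 = employed, 0 = unemployed

X_tr, X_te, y_tr, y_te = train_test_split(X, employed, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("individual-level accuracy:", (pred == y_te).mean())
# accounting-style aggregation: unemployment rate = unemployed / labour force
print("estimated unemployment rate:", 1 - pred.mean())
```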

3.
Many seemingly different problems in machine learning, artificial intelligence, and symbolic processing can be viewed as requiring the discovery of a computer program that produces some desired output for particular inputs. When viewed in this way, the process of solving these problems becomes equivalent to searching a space of possible computer programs for a highly fit individual computer program. The recently developed genetic programming paradigm described herein provides a way to search the space of possible computer programs for a highly fit individual computer program to solve (or approximately solve) a surprising variety of different problems from different fields. In genetic programming, populations of computer programs are genetically bred using the Darwinian principle of survival of the fittest and using a genetic crossover (sexual recombination) operator appropriate for genetically mating computer programs. Genetic programming is illustrated via an example of machine learning of the Boolean 11-multiplexer function and symbolic regression of the econometric exchange equation from noisy empirical data.

Hierarchical automatic function definition enables genetic programming to define potentially useful functions automatically and dynamically during a run, much as a human programmer writing a complex computer program creates subroutines (procedures, functions) to perform groups of steps which must be performed with different instantiations of the dummy variables (formal parameters) in more than one place in the main program. Hierarchical automatic function definition is illustrated via the machine learning of the Boolean 11-parity function.
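A toy version of the genetic-programming loop described above (random program trees, fitness-sorted survival, regeneration from survivors) is sketched below for a simple symbolic-regression target. Crossover is omitted for brevity, so this illustrates the search framework rather than Koza's full paradigm.

```python
# Toy genetic programming for symbolic regression. Target function: x**2 + x.
import random
random.seed(0)

OPS = [("add", lambda a, b: a + b), ("sub", lambda a, b: a - b),
       ("mul", lambda a, b: a * b)]

def random_tree(depth=3):
    """A random program: a terminal ('x' or a constant) or an operator node."""
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.7 else random.uniform(-2, 2)
    name, _ = random.choice(OPS)
    return (name, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    name, left, right = tree
    return dict(OPS)[name](evaluate(left, x), evaluate(right, x))

def fitness(tree):
    xs = [i / 10 for i in range(-10, 11)]
    return -sum((evaluate(tree, x) - (x * x + x)) ** 2 for x in xs)

def mutate(tree):
    # replace the whole tree with a fresh random one with prob 0.2, else copy
    return random_tree() if random.random() < 0.2 else tree

pop = [random_tree() for _ in range(200)]
for gen in range(30):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:50]                       # survival of the fittest
    pop = survivors + [mutate(t) for t in random.choices(survivors, k=150)]
print("best fitness (negative squared error):", fitness(max(pop, key=fitness)))
```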

4.
Selective ensemble algorithms are a current focus of machine learning research. Based on a case study of algae reproduction, this paper proposes a fast selective Bagging Trees ensemble algorithm built on k-means clustering. Compared with traditional statistical methods and several commonly used machine learning methods, the proposed algorithm achieves a smaller generalization error and higher predictive accuracy, while also running considerably more efficiently.
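One plausible reading of the selective step, sketched under stated assumptions, is to train a bagged ensemble, cluster the members' validation-set prediction vectors with k-means, and keep one representative tree per cluster; the paper's exact selection rule may differ.

```python
# Sketch of selective bagging: cluster base learners by their predictions
# and keep one tree per cluster. Synthetic regression data throughout.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.3, 600)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

trees, preds = [], []
for _ in range(50):                                  # plain bagging
    idx = rng.integers(0, len(X_tr), len(X_tr))      # bootstrap sample
    t = DecisionTreeRegressor(max_depth=5).fit(X_tr[idx], y_tr[idx])
    trees.append(t)
    preds.append(t.predict(X_val))

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(np.array(preds))
selected = [trees[np.where(km.labels_ == c)[0][0]] for c in range(8)]

full = np.mean([t.predict(X_val) for t in trees], axis=0)
sel = np.mean([t.predict(X_val) for t in selected], axis=0)
print("full-ensemble MSE:     ", np.mean((full - y_val) ** 2))
print("selective-ensemble MSE:", np.mean((sel - y_val) ** 2))
```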

5.
Communications in Statistics: Theory and Methods, 2012, 41(16-17): 3233-3243
In the literature there are several studies on the performance of Bayesian network structure learning algorithms. The focus of these studies is almost always the heuristics on which the learning algorithms are based, i.e., the maximization algorithms (in score-based algorithms) or the techniques for learning the dependencies of each variable (in constraint-based algorithms). In this article, we investigate how the use of permutation tests instead of parametric ones affects the performance of Bayesian network structure learning from discrete data. Shrinkage tests are also covered to provide a broad overview of the techniques developed in the current literature.
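A minimal sketch of the core substitution studied here: replace a parametric independence test with a permutation test on discrete data, using the G² (log-likelihood ratio) statistic. The simulated variables and the number of permutations are arbitrary illustrative choices.

```python
# Permutation test of independence for two discrete variables.
import numpy as np

rng = np.random.default_rng(4)
x = rng.integers(0, 3, 200)
y = (x + rng.integers(0, 2, 200)) % 3          # weakly dependent on x

def g2(a, b):
    table = np.zeros((3, 3))
    np.add.at(table, (a, b), 1)                # contingency table
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / len(a)
    mask = table > 0
    return 2 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))

observed = g2(x, y)
perms = [g2(x, rng.permutation(y)) for _ in range(500)]
p_perm = (1 + sum(s >= observed for s in perms)) / (len(perms) + 1)
print("permutation p-value:", p_perm)
```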

6.
Chen Kai (陈凯). Statistics Education (《统计教育》), 2008(12): 3-7
Ensemble learning has become a major focus of machine learning research, and many improved ensemble algorithms have been proposed. This paper proposes a selective ensemble learning algorithm, SE-BagBoosting Trees, which combines characteristics of the Boosting and Bagging algorithms. A comparative study against several commonly used machine learning algorithms shows that the proposed algorithm tends to achieve a smaller generalization error and higher predictive accuracy than the alternatives.
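A rough sketch of combining the two paradigms (bagging models that are themselves boosted) is shown below; it omits the selective step, so it illustrates the Bag+Boost composition rather than the exact SE-BagBoosting Trees procedure, and all hyperparameters are illustrative.

```python
# Bagging an ensemble of small boosted-tree models.
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 0.2, 500)

# first positional argument = the base estimator to be bagged
model = BaggingRegressor(GradientBoostingRegressor(n_estimators=50, max_depth=2),
                         n_estimators=10, random_state=0).fit(X, y)
print("training MSE:", np.mean((model.predict(X) - y) ** 2))
```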

7.
Statistical learning is emerging as a promising field in which a number of algorithms from machine learning are interpreted as statistical methods and vice versa. Owing to its good practical performance, boosting is one of the most studied machine learning techniques. We propose algorithms for multivariate density estimation and classification, generated by using traditional kernel techniques as weak learners in boosting algorithms. Our algorithms take the form of multistep estimators whose first step is a standard kernel method. Some strategies for bandwidth selection are also discussed, with regard both to the standard kernel density classification problem and to our 'boosted' kernel methods. Extensive experiments, using real and simulated data, show the encouraging practical relevance of the findings: standard kernel methods are often outperformed by the first boosting iterations, across a range of bandwidth values. In addition, the practical effectiveness of our classification algorithm is confirmed by a comparative study on two real datasets, the competitors being tree-based methods, including AdaBoost with trees.
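A loose sketch of the boosted-kernel idea: start from a standard Gaussian KDE as the weak learner, then reweight observations where the leave-one-out fit is poor. The multiplicative update below is an illustrative assumption; the article's exact update rule may differ.

```python
# Reweighted ("boosted") kernel density estimation on a bimodal sample.
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 1.0, 150)])
h = 0.4                                         # bandwidth; would normally be selected
w = np.full(x.size, 1 / x.size)                 # uniform weights -> standard KDE

def kde(grid, data, weights):
    z = (grid[:, None] - data[None, :]) / h
    return (weights * np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))).sum(axis=1)

for _ in range(3):                              # a few boosting iterations
    loo = kde(x, x, w) - w / (h * np.sqrt(2 * np.pi))   # drop each point's own kernel
    w *= 1 / np.sqrt(np.maximum(loo, 1e-12))    # upweight poorly fitted points
    w /= w.sum()

grid = np.linspace(-5, 5, 201)
print("boosted KDE integrates to ~1:", kde(grid, x, w).sum() * (grid[1] - grid[0]))
```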

8.
Artificial neural networks have been successfully applied to a variety of machine learning tasks, including image recognition, semantic segmentation, and machine translation. However, few studies have fully investigated ensembles of artificial neural networks. In this work, we investigated multiple widely used ensemble methods, including unweighted averaging, majority voting, the Bayes Optimal Classifier, and the (discrete) Super Learner, for image recognition tasks, with deep neural networks as candidate algorithms. We designed several experiments, with the candidate algorithms being the same network structure with different model checkpoints within a single training process, networks with the same structure but trained multiple times stochastically, and networks with different structures. In addition, we further studied the overconfidence phenomenon of neural networks and its impact on the ensemble methods. Across all of our experiments, the Super Learner achieved the best performance among all the ensemble methods in this study.
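Two of the simpler ensemble rules studied (unweighted averaging of class probabilities and majority voting over hard labels) can be sketched directly; the random probability matrices below stand in for the networks' softmax outputs.

```python
# Unweighted averaging vs. majority voting over model outputs.
import numpy as np

rng = np.random.default_rng(6)
n_models, n_samples, n_classes = 5, 8, 3
probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))

avg_pred = probs.mean(axis=0).argmax(axis=1)      # unweighted averaging

votes = probs.argmax(axis=2)                      # hard labels per model
majority = np.array([np.bincount(votes[:, i], minlength=n_classes).argmax()
                     for i in range(n_samples)])

print("averaging:", avg_pred)
print("voting:   ", majority)
```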

9.
Estimating the generalization performance of learning algorithms is one of the main purposes of theoretical research in machine learning. Previous results describing the generalization ability of the Tikhonov regularization algorithm are almost all based on independent and identically distributed (i.i.d.) samples. In this paper we go far beyond this classical framework by establishing bounds on the generalization ability of the Tikhonov regularization algorithm with geometrically beta-mixing observations. We first establish two refined probability inequalities for geometrically beta-mixing sequences; we then obtain generalization bounds for the Tikhonov regularization algorithm with geometrically beta-mixing observations and show that the algorithm is consistent in this setting. The obtained bounds on learning performance are proved to apply to geometrically ergodic Markov chain samples and to hidden Markov models.
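For context, in its standard learning-theory form (with least-squares loss; the paper may treat a more general loss), the Tikhonov regularization algorithm over a reproducing kernel Hilbert space selects

```latex
f_{\mathbf{z},\lambda} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{H}_K}
  \left\{ \frac{1}{m} \sum_{i=1}^{m} \bigl( f(x_i) - y_i \bigr)^2
          + \lambda \lVert f \rVert_K^2 \right\},
```

where H_K is the reproducing kernel Hilbert space of the kernel K, z = {(x_i, y_i)}_{i=1}^m is the observed sample (here beta-mixing rather than i.i.d.), and λ > 0 is the regularization parameter.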

10.
Panel datasets have been increasingly used in economics to analyze complex economic phenomena. Panel data form a two-dimensional array that combines cross-sectional and time-series data. By constructing a panel data matrix, clustering methods can be applied to panel data analysis; clustering resolves the heterogeneity of the dependent variable before the analysis. Clustering is a widely used statistical tool for determining subsets of a given dataset. In this article, we cluster a mixed panel dataset using agglomerative hierarchical algorithms based on Gower's distance and using k-prototypes. The performance of these algorithms is studied on panel data with mixed numerical and categorical features, and their effectiveness is compared using cluster accuracy. An experimental analysis is illustrated on a real dataset using Stata and R software.
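A hand-rolled sketch of the mixed-data pipeline: compute Gower's distance (range-scaled L1 for numeric columns, simple matching for categorical ones) and feed it to agglomerative clustering. The synthetic data and the choice of three clusters are illustrative assumptions.

```python
# Gower's distance for mixed data + average-linkage hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
num = rng.normal(size=(30, 2))                        # numeric features
cat = rng.integers(0, 3, size=(30, 1))                # categorical feature

num_scaled = (num - num.min(0)) / (num.max(0) - num.min(0))   # range scaling
d_num = np.abs(num_scaled[:, None, :] - num_scaled[None, :, :]).sum(axis=2)
d_cat = (cat[:, None, :] != cat[None, :, :]).sum(axis=2)      # simple matching
gower = (d_num + d_cat) / (num.shape[1] + cat.shape[1])       # average per feature

Z = linkage(squareform(gower, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```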

11.
This paper presents a new Metropolis-adjusted Langevin algorithm (MALA) that uses convex analysis to simulate efficiently from high-dimensional densities that are log-concave, a class of probability distributions widely used in modern high-dimensional statistics and data analysis. The method is based on a new first-order approximation for Langevin diffusions that exploits log-concavity to construct Markov chains with favourable convergence properties. This approximation is closely related to Moreau-Yosida regularisations for convex functions and uses proximity mappings instead of gradient mappings to approximate the continuous-time process. The proposed method complements existing MALA methods in two ways. First, the method is shown to have very robust stability properties and to converge geometrically for many target densities for which other MALA algorithms do not converge geometrically, or do so only if the step size is sufficiently small. Second, the method can be applied to high-dimensional target densities that are not continuously differentiable, a class of distributions increasingly used in image processing and machine learning that is beyond the scope of existing MALA and HMC algorithms. To use this method it is necessary to compute, or to approximate efficiently, the proximity mappings of the logarithm of the target density. For several popular models, including many Bayesian models used in modern signal and image processing and machine learning, this can be achieved with convex optimisation algorithms and with approximations based on proximal splitting techniques, which can be implemented in parallel. The proposed method is demonstrated on two challenging high-dimensional and non-differentiable models, related to image resolution enhancement and low-rank matrix estimation, that are not well addressed by existing MCMC methodology.
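A minimal one-dimensional sketch of a proximal MALA step for a non-differentiable log-concave target pi(x) ∝ exp(-lam*|x|): the gradient in the usual Langevin proposal is replaced by the proximity mapping of -log pi, which here is soft thresholding. The step size and target are illustrative choices, not those of the paper.

```python
# 1-D proximal MALA for a Laplace target (E|x| = 1/lam).
import numpy as np

rng = np.random.default_rng(8)
lam, delta = 1.0, 0.5

def log_pi(x):
    return -lam * abs(x)

def prox(x):                                   # prox_{delta/2} of lam*|x|
    return np.sign(x) * max(abs(x) - lam * delta / 2, 0.0)

def log_q(x_to, x_from):                       # proposal N(prox(x_from), delta)
    return -0.5 * (x_to - prox(x_from)) ** 2 / delta

x, samples = 0.0, []
for _ in range(20000):
    prop = prox(x) + np.sqrt(delta) * rng.normal()
    log_alpha = log_pi(prop) + log_q(x, prop) - log_pi(x) - log_q(prop, x)
    if np.log(rng.random()) < log_alpha:       # Metropolis-Hastings correction
        x = prop
    samples.append(x)

print("sample mean (should be ~0):    ", np.mean(samples))
print("mean |x| (should be ~1/lam = 1):", np.mean(np.abs(samples)))
```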

12.
In applications of multivariate finite mixture models, estimating the number of unknown components is often difficult. We propose a bootstrap information criterion, whereby we calculate the expected log-likelihood at the maximum a posteriori estimates for model selection. Accurate estimation using the bootstrap requires a large number of bootstrap replicates. We accelerate this computation by employing parallel processing with graphics processing units (GPUs) on the Compute Unified Device Architecture (CUDA) platform. We conducted a runtime comparison between implementations of the proposed algorithms on a GPU and on a CPU. The results showed significant performance gains for the CUDA algorithms over multithreaded CPU implementations.
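The bootstrap bias correction at the heart of such criteria can be sketched for a toy univariate normal model; the loop over replicates is embarrassingly parallel, which is what the article offloads to the GPU. The model and B = 2000 replicates are illustrative assumptions.

```python
# Bootstrap bias correction of the maximized log-likelihood (EIC-style sketch).
import numpy as np

rng = np.random.default_rng(9)
data = rng.normal(1.0, 2.0, 300)
B = 2000

def loglik(x, mu, sigma):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

bias = np.empty(B)
for b in range(B):                             # embarrassingly parallel across b
    boot = rng.choice(data, size=data.size, replace=True)
    mu_b, sigma_b = boot.mean(), boot.std()    # MLEs on the bootstrap sample
    # bias = fit-on-bootstrap minus fit-evaluated-on-original data
    bias[b] = loglik(boot, mu_b, sigma_b) - loglik(data, mu_b, sigma_b)

criterion = loglik(data, data.mean(), data.std()) - bias.mean()
print("bias-corrected expected log-likelihood:", criterion)
```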

13.
The main models of machine learning are briefly reviewed and considered for building a classifier to identify Fragile X Syndrome (FXS). We analyzed 172 patients potentially affected by FXS in Andalusia (Spain); by means of a DNA test, each member of the dataset is known to belong to one of two classes: affected or not affected. Both the whole predictor set, formed by 40 variables, and a reduced set with only nine predictors significantly associated with the response are considered. Four alternative base classification models are investigated: logistic regression, classification trees, multilayer perceptron, and support vector machines. For both predictor sets, the best accuracy, considering both the mean and the standard deviation of the test error rate, is achieved by the support vector machines, confirming the increasing importance of this learning algorithm. Three ensemble methods (bagging, random forests, and boosting) are also considered, amongst which the bagged versions of support vector machines stand out, especially when they are constructed with the reduced set of predictor variables. The analysis of the sensitivity, the specificity, and the area under the ROC curve agrees with the main conclusions drawn from the accuracy results. All of these models can be fitted with free R software.
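The comparison itself is routine to reproduce in outline; a sketch with scikit-learn follows, using synthetic two-class data in place of the FXS patient records and default hyperparameters as assumptions.

```python
# Cross-validated accuracy for the base classifiers and a bagged SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=9, random_state=0)
models = {
    "logistic":   LogisticRegression(max_iter=1000),
    "tree":       DecisionTreeClassifier(max_depth=4),
    "mlp":        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000),
    "svm":        SVC(),
    "bagged svm": BaggingClassifier(SVC(), n_estimators=25),
}
for name, m in models.items():
    scores = cross_val_score(m, X, y, cv=5)
    print(f"{name:10s} accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```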

14.
15.
A Survey of Machine Learning and Related Algorithms
Ever since the computer was invented, people have wondered whether it can learn. Machine learning is inherently a multidisciplinary field, drawing on results from artificial intelligence, probability and statistics, computational complexity theory, control theory, information theory, philosophy, physiology, neurobiology, and other disciplines. From the standpoint of the foundations of statistical learning, this article briefly reviews the development of machine learning and introduces some of its commonly used algorithms.

16.
This is a comparative study of various clustering and classification algorithms applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature-selection step prior to applying a machine learning tool. A natural and common choice of feature-selection tool is the collection of marginal p-values obtained from t-tests for the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study how the choice of cutoff, in terms of overall Type 1 error rate control, affects the performance of the clustering and classification algorithms that use the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by Breiman's Random Forest algorithm. Using a dataset of proteomic analyses of serum from ovarian cancer patients and from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm, feature-selection tool, and cutoff criterion on performance, as measured by an appropriate error rate.
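A sketch of the pipeline (marginal t-tests per feature, a p-value cutoff, then classification with the retained features) on synthetic "spectra"; the effect sizes, cutoffs, and Random Forest classifier are illustrative stand-ins.

```python
# t-test feature selection followed by classification, for several cutoffs.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
n, p = 120, 500
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n)
X[y == 1, :20] += 0.8                          # 20 truly informative "m/z" features

pvals = ttest_ind(X[y == 0], X[y == 1], axis=0).pvalue
for cutoff in (0.001, 0.01, 0.05):             # Type 1 error control choices
    keep = pvals < cutoff
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          X[:, keep], y, cv=5).mean()
    print(f"cutoff {cutoff}: {keep.sum():3d} features kept, CV accuracy {acc:.3f}")
```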

17.
In computational sciences, including computational statistics, machine learning, and bioinformatics, articles presenting new supervised learning methods often claim that the new method performs better than existing methods on real data, for instance in terms of error rate. However, these claims are often not based on proper statistical tests and, even when such tests are performed, the tested hypothesis is not clearly defined and little attention is devoted to the Type I and Type II errors. In the present article, we aim to fill this gap by providing a proper statistical framework for hypothesis tests that compare the performances of supervised learning methods based on several real datasets with unknown underlying distributions. After giving a statistical interpretation of the ad hoc tests commonly performed by computational researchers, we devote special attention to power issues and outline a simple method of determining the number of datasets to include in a comparison study to reach adequate power. These methods are illustrated through three comparison studies from the literature and an exemplary benchmarking study using gene expression microarray data. All our results can be reproduced using R code and datasets available from the companion website http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/compstud2013.
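The kind of cross-dataset test the article formalizes can be sketched as a paired test over per-dataset error rates; the simulated errors below stand in for real benchmark results, and the Wilcoxon signed-rank test is one reasonable choice among those the article discusses.

```python
# Paired comparison of two learners across several datasets.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(11)
n_datasets = 15                                # the "sample size" is datasets, not rows
err_new = rng.normal(0.18, 0.03, n_datasets)   # hypothetical new method's error rates
err_old = err_new + rng.normal(0.01, 0.02, n_datasets)  # slightly worse baseline

stat, p = wilcoxon(err_new, err_old)
print(f"Wilcoxon signed-rank p-value over {n_datasets} datasets: {p:.4f}")
# Power grows with n_datasets; the article shows how to choose it in advance.
```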

18.
Ensemble algorithms have become a major focus of machine learning research, and many improved ensemble algorithms have been proposed, but ensemble studies of "ill-conditioned" data remain uncommon. Through a case study of algae reproduction, this paper proposes an ensemble algorithm based on the block bootstrap technique and compares it with several commonly used ensemble algorithms, finding that for some "ill-conditioned" data the proposed algorithm tends to achieve a smaller generalization error and higher predictive accuracy.
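A sketch of the moving block bootstrap that underlies the proposed ensemble: resampling contiguous blocks preserves short-range dependence within each replicate. The series, block length, and replicate count are illustrative.

```python
# Moving block bootstrap for a dependent series.
import numpy as np

rng = np.random.default_rng(12)
series = np.cumsum(rng.normal(size=200))       # autocorrelated toy series
block_len = 10

def block_bootstrap(x, b):
    # draw random block start positions, then stitch the blocks together
    starts = rng.integers(0, len(x) - b + 1, size=len(x) // b)
    return np.concatenate([x[s:s + b] for s in starts])

replicates = [block_bootstrap(series, block_len) for _ in range(100)]
print("number of replicates:", len(replicates), "| replicate length:", len(replicates[0]))
```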

19.
Software packages usually report the results of statistical tests using p-values. Users often interpret these values by comparing them with standard thresholds, for example 0.1%, 1%, and 5%, which is sometimes reinforced by a star rating (***, **, and *, respectively). We consider an arbitrary statistical test whose p-value p is not available explicitly but can be approximated by Monte Carlo sampling, for example by bootstrap or permutation tests. The standard implementation of such tests usually draws a fixed number of samples to approximate p. However, the probability that the exact and the approximated p-value lie on different sides of a threshold (the resampling risk) can be high, particularly for p-values close to a threshold. We present a method to overcome this. We consider a finite set of user-specified intervals that cover [0, 1] and that may overlap; we call these p-value buckets. We present algorithms that, with arbitrarily high probability, return a p-value bucket containing p. We prove that achieving both a bounded resampling risk and a finite runtime requires overlapping buckets, and that our methods deliver both for such buckets. To interpret decisions with overlapping buckets, we propose an extension of the star rating system. We demonstrate that our methods are suitable for use in standard software, including for the low p-value thresholds occurring in multiple-testing settings, and that they can be computationally more efficient than standard implementations.
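A simplified sketch of the bucket idea: keep drawing Monte Carlo samples until a confidence interval for the unknown p-value fits inside one of the (overlapping) buckets. The Clopper-Pearson-style loop below carries no formal guarantee of bounded resampling risk; the article's algorithms are what provide that.

```python
# Sample until a CI for p lies inside some user-specified p-value bucket.
import numpy as np
from scipy.stats import beta

buckets = [(0.0, 0.01), (0.005, 0.05), (0.03, 0.1), (0.08, 1.0)]
rng = np.random.default_rng(13)
true_p = 0.032                                 # unknown in practice

hits, n = 0, 0
while True:
    hits += rng.random() < true_p              # one resample exceeds the statistic
    n += 1
    # 99.9% Clopper-Pearson interval for the exact p-value
    lo = beta.ppf(0.0005, hits, n - hits + 1) if hits else 0.0
    hi = beta.ppf(0.9995, hits + 1, n - hits) if hits < n else 1.0
    inside = [bkt for bkt in buckets if bkt[0] <= lo and hi <= bkt[1]]
    if inside or n >= 10**6:                   # stop once a bucket contains the CI
        break
print(f"n = {n}, reported bucket: {inside[0] if inside else 'none'}")
```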

20.
In this work we present a study on the analysis of a large dataset from seismology. A set of large-margin classifiers based on the well-known support vector machine (SVM) algorithm is used to classify the data into two classes based on their magnitude on the Richter scale. Due to the natural imbalance between the two classes, reweighting techniques are applied, demonstrating the importance of reweighting algorithms. Moreover, we present an incremental algorithm to explore the possibility of predicting the strength of an earthquake with incremental techniques.
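The reweighting step can be sketched with scikit-learn's class_weight option, which rescales the SVM penalty inversely to class frequencies; the imbalanced synthetic data below stands in for the seismology records.

```python
# Effect of class reweighting on an imbalanced SVM classification task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = SVC(class_weight=cw).fit(X_tr, y_tr)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"class_weight={cw}: balanced accuracy {score:.3f}")
```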
