Learning robust machine learning models is still a challenging issue for classifying biomedical data. To deal with high dimensionality, low sample size, and imbalanced classes, Random Forests (RF) have been widely adopted in this field. RF consist in building a classifier ensemble, using randomization to produce a diverse pool of tree-based classifiers. Since their introduction in 2001 by Leo Breiman, RF have been extensively studied, both theoretically and experimentally, and have shown competitive performance with state-of-the-art classifiers. However, only a few studies have addressed the issues raised by the choice of the hyper-parameters and their influence on RF performance. This talk will first address our attempts to better understand and explain the performance of RF through their hyper-parameters. These attempts have led us to propose different variants of RF, namely Forest-RK and Dynamic Random Forests, that are less sensitive to the choice of parametrization, which is sometimes critical for generalization performance. In the second part, I will illustrate the use of RF on two medical applications: the classification of endomicroscopic images of the lungs, and cancer stage/patient prediction with Radiomics, a domain which is attracting increasing attention. When dealing with medical data, it may happen that only data of one class (e.g. healthy patients) is available for training. This is typically the case for endomicroscopic images of the lungs, and we have proposed an original approach to deal with outliers in medical image classification, namely One Class Random Forests, which has been shown to be effective for our problem and competitive with other state-of-the-art one-class classifiers.
The second application of RF is Radiomics, a new (2012) concept which refers to the analysis of large amounts of quantitative tumor features, extracted from multimodal medical images and other information such as clinical data and gene or protein data, to predict the patient's evolution and/or survival rate. In this case, data are both high-dimensional and heterogeneous. As part of ongoing work, we have proposed a dissimilarity-based multi-view learning model with random forests, in which each data view (or group of features) is processed separately so that the data dimension is smaller in each view. By combining the different views, we can take advantage of the heterogeneity between views while avoiding conventional feature selection methods for reducing the high dimensionality of the data.
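To make the multi-view idea concrete, here is a minimal sketch in Python with scikit-learn. It is not the model presented in the talk, only an illustration under stated assumptions: the dissimilarity between two samples is taken to be one minus the fraction of trees in which they fall in the same leaf (the usual RF proximity), the per-view dissimilarities are combined by a plain average, and the data are synthetic. The helper names rf_dissimilarity, joint_dissimilarity and fit_multiview are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rf_dissimilarity(forest, X_a, X_b):
    """1 - fraction of trees in which a sample of X_a and one of X_b share a leaf."""
    leaves_a = forest.apply(X_a)  # shape (n_a, n_trees), leaf index per tree
    leaves_b = forest.apply(X_b)  # shape (n_b, n_trees)
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return 1.0 - same_leaf.mean(axis=2)


def joint_dissimilarity(forests, views_a, views_b):
    """Average the per-view dissimilarity matrices (assumed combination rule)."""
    mats = [rf_dissimilarity(f, Xa, Xb)
            for f, Xa, Xb in zip(forests, views_a, views_b)]
    return np.mean(mats, axis=0)


def fit_multiview(views_train, y_train, n_trees=50, seed=0):
    """Fit one forest per view, then a final forest on the joint dissimilarities."""
    forests = [
        RandomForestClassifier(n_estimators=n_trees, random_state=seed + i).fit(X, y_train)
        for i, X in enumerate(views_train)
    ]
    joint = joint_dissimilarity(forests, views_train, views_train)
    final = RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(joint, y_train)
    return forests, final


# Synthetic stand-in for heterogeneous data: 60 samples, 40 features,
# arbitrarily split into two "views" of 25 and 15 features.
X, y = make_classification(n_samples=60, n_features=40, n_informative=10,
                           random_state=0)
views = [X[:, :25], X[:, 25:]]
train, test = slice(0, 40), slice(40, 60)
views_train = [v[train] for v in views]
views_test = [v[test] for v in views]

forests, final = fit_multiview(views_train, y[train])

# Every sample is described by its dissimilarity to each training sample.
D_train = joint_dissimilarity(forests, views_train, views_train)
D_test = joint_dissimilarity(forests, views_test, views_train)
predictions = final.predict(D_test)
print(D_train.shape, D_test.shape, predictions[:5])
```

In this sketch each forest only ever sees the features of its own view, and the final classifier works in a space whose dimension is the number of training samples rather than the number of original features, which is how the high dimensionality is sidestepped without an explicit feature selection step.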
Laurent Heutte (b. 1964) is a Full Professor in Computer Engineering and Control System at the University of Rouen. He is a member of the LITIS research lab and a member of NormaSTIC, FR CNRS 3638. From 2006 to 2015, he was head of the "Document and Machine Learning" research group in LITIS, and he is currently co-director of the lab. His research interests cover all areas of pattern recognition and machine learning, with a focus on statistical techniques, classifier ensembles and random forests, applied to handwriting recognition and analysis, historical document image analysis and retrieval, and biomedical data classification. He has managed several European and international research projects in these fields. He has authored or co-authored more than 180 peer-reviewed papers. Prof. Heutte was a Governing Board member of IAPR from 2006 to 2010 and has been actively involved in the organizing committees or program committees of many IAPR international conferences (ICPR, ICDAR, SSPR, ICFHR, ICPRAM, CIARP, MCPR...). He is a Field Chief Editor of "Statistics, Optimization & Information Computing", International Academic Press, an Editorial Board member of "Pattern Recognition" and "Pattern Recognition Letters", Elsevier, and an Editorial Advisory Board member of "Recent Patents on Computer Science", Bentham Science.