Tanagra - Data Mining and Data Science Tutorials

Automatic translation of tutorials

For nearly 10 years, I constantly translated my tutorials into English because I found that machine translation tools were not very effective. This work was very tedious, but I convinced myself that I had to do it. For a year now, I have realized that these tools provide good translations, so much so that I used them directly for the latest English documents I produced. Under these conditions, it seems more appropriate to direct you to my tutorials in French and to advise you to use online machine translation software.

Thank you very much for all these years of following the publications I have made on this blog. The adventure is not over yet: I will continue, for a long time I hope, to produce course and tutorial materials for researchers and students.

Lyon, May 5th, 2019.

Tanagra website statistics for 2017

The year 2017 ends, 2018 begins. I wish you all a very happy year 2018.

Here is a small report on the website statistics for 2017. All the sites (Tanagra, course materials, e-books, tutorials) have been visited 222,293 times this year, i.e. 609 visits per day.

Since February 1st, 2008, the date on which I installed the Google Analytics counter, there have been 2,33,371 visits (644 daily visits).

Who are you? The majority of visits come from France and the Maghreb, followed by a large share of other French-speaking countries, notably because some pages are available only in French. Among the non-francophone countries, we observe mainly the United States, India, the UK and Germany.

39 new course materials and tutorials were posted online this year: 18 in French and 21 in English.

The pages containing course materials about Data Science and Programming (R and Python) are the most popular ones. This is not really surprising.

Ricco.

Slideshow: Website statistics for 2017

Sparse data file format

The data to be processed with machine learning algorithms are increasing in size, especially when we need to process unstructured data. The data preparation (e.g. the use of a bag-of-words representation in text mining) leads to the creation of large data tables where, often, the number of columns (descriptors) is higher than the number of rows (observations), with the singularity that the table contains many zero values. In this context, storing all these zero values in the data file is not appropriate. A data compression strategy without loss of information must be implemented, and it must remain simple enough that the file stays readable with a text editor.

In this tutorial, we describe the use of the sparse data file format handled by Tanagra (from version 1.4.4). It is based on the file format processed by well-known machine learning libraries (svmlight, libsvm, libcvm). We show its use in a text categorization process applied to the Reuters database, well known in data mining. We will observe that the use of this kind of sparse format dramatically reduces the size of the data file.
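To make the format concrete, here is a small illustration with an invented toy matrix (not the Reuters data): each row of a dense table becomes a line "<label> <column>:<value> ...", and the zero cells are simply not written. The R snippet below builds such sparse lines by hand, without any dedicated package.

# Toy dense table: 3 observations, 4 descriptors, many zero values
dense <- matrix(c(2, 0, 0, 4,
                  0, 1, 0, 0,
                  0, 0, 5, 1), nrow = 3, byrow = TRUE)
labels <- c(1, -1, 1)

# For each row, keep only the non-zero columns and write "column:value" pairs
sparse_lines <- sapply(seq_len(nrow(dense)), function(i) {
  j <- which(dense[i, ] != 0)
  paste(labels[i], paste(j, dense[i, j], sep = ":", collapse = " "))
})
cat(sparse_lines, sep = "\n")
# 1 1:2 4:4
# -1 2:1
# 1 3:5 4:1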

Keywords: sparse dataset, dense dataset, attribute-value table, support vector machine, svm, libsvm, c-svc, logistic regression, tr-irls, scoring, roc curve, auc, area under curve
Components: VIEW DATASET, CONT TO DISC, UNIVARIATE DISCRETE STAT, SELECT FIRST EXAMPLES, C-SVC, SCORING, ROC CURVE
Tutorial: en_Tanagra_Sparse_File_Format.pdf
Dataset: reuters.data.zip
References:
T. Joachims, "SVMlight: Support Vector Machine".
UCI Repository, "Reuters-21578 Text Categorization Collection".

Configuration of a multilayer perceptron

The multilayer perceptron is one of the most popular neural network approaches for supervised learning. It can be very effective, provided that we know how to determine the right number of neurons in the hidden layers.

In this tutorial, we will try to explain the role of neurons in the hidden layer of the multilayer perceptron (when we have one hidden layer). Using an artificial toy dataset, we show the behavior of the classifier when we modify the number of neurons.

We work with Tanagra in a first step. Then, we use R (nnet package) to write a program which automatically determines the right number of neurons in the hidden layer.
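As a rough sketch of that idea (a simplified stand-in, not the program of the tutorial: the artificial dataset and the grid of candidate sizes below are invented for illustration), one can fit nnet models with an increasing number of hidden neurons and keep the size that gives the lowest test error rate.

library(nnet)

set.seed(1)
n <- 500
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
y  <- factor(ifelse(x1 * x2 > 0, "pos", "neg"))   # artificial 2D problem
d  <- data.frame(x1, x2, y)
train <- sample(n, 300)

sizes <- c(1, 2, 3, 5, 10)
err <- sapply(sizes, function(h) {
  m <- nnet(y ~ x1 + x2, data = d[train, ], size = h, maxit = 500, trace = FALSE)
  pred <- predict(m, newdata = d[-train, ], type = "class")
  mean(pred != d$y[-train])                       # test error rate
})
names(err) <- sizes
print(err)
print(names(which.min(err)))                       # selected number of hidden neurons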

Keywords: neural network, perceptron, multilayer perceptron, MLP
Components: MULTILAYER PERCEPTRON, FORMULA
Tutorial: Configuration of a MLP
Dataset: artificial2d.zip
References:
Tanagra Tutorials, "Single layer and multilayer perceptron (slides)", September 2014.
Tanagra Tutorials, "Multilayer perceptron - Software comparison", November 2008.

CDF and PPF in Excel, R and Python

 How to compute the cumulative distribution functions and the percent point functions of various commonly used distributions in Excel, R and Python.

I use Excel (in conjunction with Tanagra or Sipina), R and Python for the practical classes of my courses about data mining and statistics at the University. Often, I ask students to perform hypothesis tests or to calculate confidence intervals, etc.

Since we work on computers, it is obviously out of the question to use statistical tables to obtain the quantiles or the p-values of the commonly used distributions. In this tutorial, I present the main functions for the normal distribution, Student's t-distribution, the chi-squared distribution and the Fisher-Snedecor distribution. I have realized that students sometimes find it difficult to relate the reading of statistical tables to the corresponding functions, which they have trouble identifying in the software. It is also an opportunity to verify the equivalences between the functions proposed by Excel, R (stats package) and Python (scipy package). Whew! At least on the few illustrative examples given in our document, the results are consistent.
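For illustration, here are the main R (stats) calls for those CDF/PPF pairs, with the corresponding scipy.stats and Excel names recalled in the comments; the numerical values indicated are only approximate orders of magnitude, not results quoted from the document.

pnorm(1.96)          # standard normal CDF, ~0.975  (scipy: norm.cdf, Excel: NORM.S.DIST)
qnorm(0.975)         # standard normal PPF, ~1.96   (scipy: norm.ppf, Excel: NORM.S.INV)

pt(2.0, df = 30)     # Student's t CDF, 30 d.f.      (scipy: t.cdf)
qt(0.975, df = 30)   # Student's t quantile, 30 d.f. (scipy: t.ppf)

qchisq(0.95, df = 10)        # chi-squared 95% quantile, 10 d.f.   (scipy: chi2.ppf)
qf(0.95, df1 = 5, df2 = 30)  # Fisher F 95% quantile, (5, 30) d.f. (scipy: f.ppf)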

Keywords: excel, r, stats package, python, scipy package, p-value, quantile, cdf, cumulative distribution function, ppf, percent point function, quantile function
Tutorial: CDF and PPF

The "compiler" package for R

It is widely agreed that R is not a fast language, notably because it is an interpreted language. To overcome this issue, there are solutions which allow functions written in R to be compiled. The gains in computation time can be considerable, but they depend on our ability to write code that can benefit from these tools.

In this tutorial, we study the efficiency of Luke Tierney's "compiler" package, which is provided in the base distribution of R. We program two standard data analysis treatments, each (1) with and (2) without loops: the scaling of the variables of a data frame, and the calculation of a correlation matrix by matrix product. We then compare the efficiency of the non-compiled and compiled versions of these functions.

We observe that the gain is dramatic for the compiled version of the loop-based functions, but negligible for the second variant. We also note that, in the R 3.4.2 version used here, it is not necessary to compile the loop-based functions explicitly, because a JIT (just-in-time compilation) mechanism already ensures maximal performance for our code.
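As a minimal sketch of the mechanism (a simplified element-wise scaling function and base R timings, not the benchmark code of the tutorial), one can byte-compile a loop-based function with cmpfun and compare the elapsed times.

library(compiler)

# Deliberately loop-based scaling of the columns of a data frame
scale_loop <- function(df) {
  for (j in seq_along(df)) {
    x <- df[[j]]
    m <- mean(x); s <- sd(x)
    for (i in seq_along(x)) {
      x[i] <- (x[i] - m) / s
    }
    df[[j]] <- x
  }
  df
}
scale_loop_cmp <- cmpfun(scale_loop)   # byte-compiled version of the same function

d <- as.data.frame(matrix(rnorm(1e6), ncol = 10))
system.time(scale_loop(d))
system.time(scale_loop_cmp(d))
# With the JIT compiler enabled by default in recent R versions, both timings
# are often very close, which is precisely the point made above.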

Keywords: package compiler, cmpfun, byte code, package rbenchmark, benchmark, JIT, just in time
Tutorial: en_Tanagra_R_compiler_package.pdf
Program: compilation_r.zip
References:
Luke Tierney, "A Byte Code Compiler for R", Department of Statistics and Actuarial Science, University of Iowa, March 30, 2012.
Package 'compiler' - "Byte Code Compiler".

Regression analysis in Python

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and exploring statistical data.

In this tutorial, we will try to identify the potential of StatsModels by conducting a case study in multiple linear regression. We will discuss: the estimation of the model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, and the calculation of the prediction interval for a new instance.

Keywords: regression, statsmodels, pandas, matplotlib
Tutorial: en_Tanagra_Python_StatsModels.pdf
Dataset and program: en_python_statsmodels.zip
References:
StatsModels: Statistics in Python

Document classification in Python
