Category talk:Book:Data Mining Algorithms In R

I have 2 problems with this book project:

1. As you know, real-life problems are far from simple. In particular, input data is often bad or was badly collected. So it would be great to have more info on:

a- Preprocessing (removing bad columns, the various ways to treat NAs, removing constant columns, "correcting" class imbalance in the collected data, etc.).

b- Cost sensitivity, a very important point that the book lacks. In real-life problems (particularly in medicine, but not only there), wrongly assigning class A to a new example may be more costly than wrongly assigning it class B. So it is important to take the cost of an error into account in the learning process.

c- Visualization techniques, which help with manual preprocessing. More info on how to implement some of these techniques in R would be interesting.
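To make point a more concrete, here is a minimal sketch of two of the preprocessing steps mentioned: dropping constant columns and filling in missing values. It is written in plain Python rather than R, only to pin down the logic. The function names, the toy data, the use of `None` in place of R's `NA`, and the choice of median imputation are all assumptions made for illustration; median imputation is just one of the "various ways to treat NAs".

```python
from statistics import median


def drop_constant_columns(rows):
    """Remove columns whose non-missing values are all identical."""
    keep = []
    for col in rows[0].keys():
        observed = {row[col] for row in rows if row[col] is not None}
        if len(observed) > 1:  # at least two distinct values: informative
            keep.append(col)
    return [{col: row[col] for col in keep} for row in rows]


def impute_median(rows, col):
    """Replace missing values (None) in a numeric column by the column median."""
    fill = median(row[col] for row in rows if row[col] is not None)
    return [{**row, col: fill if row[col] is None else row[col]} for row in rows]


# Hypothetical toy data set; None plays the role of R's NA.
rows = [
    {"age": 40, "site": "A", "label": "sick"},
    {"age": None, "site": "A", "label": "healthy"},
    {"age": 60, "site": "A", "label": "healthy"},
    {"age": 50, "site": "A", "label": "healthy"},
]

# "site" is constant and is dropped; the missing age becomes the median.
cleaned = impute_median(drop_constant_columns(rows), "age")
```

Class imbalance is deliberately left out of this sketch, since the right correction (resampling, reweighting, or cost-based) depends on the problem.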

Unfortunately, I am far from an expert in these domains (that's why I was looking for info and found this book). But there might be interesting hints in:

- The Weka MOOC
- "Types of Cost in Inductive Concept Learning" by Peter Turney
- The ML book by Witten & Frank (Chapter 5, "Credibility: Evaluating What’s Been Learned", section "Cost-Sensitive Classification": https://aml.media.mit.edu/EvaluationWitten&Frank.pdf )
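As a rough illustration of the cost-sensitivity point (1b), here is a minimal sketch of one common approach: instead of predicting the most probable class, predict the class with the lowest expected cost. It is again in plain Python rather than R. The cost values, the class names, and the function names are hypothetical, and the probabilities are assumed to come from some already-trained classifier.

```python
def expected_costs(probs, cost):
    """Expected cost of each possible prediction.

    probs[actual] is the estimated probability that the example belongs
    to class `actual`; cost[actual][predicted] is the cost of predicting
    `predicted` when the true class is `actual`.
    """
    classes = list(cost.keys())
    return {
        predicted: sum(probs[actual] * cost[actual][predicted] for actual in classes)
        for predicted in classes
    }


def min_cost_class(probs, cost):
    """Return the prediction with the lowest expected cost."""
    costs = expected_costs(probs, cost)
    return min(costs, key=costs.get)


# Hypothetical cost matrix: missing a sick patient costs 10,
# a false alarm costs 1, correct predictions cost nothing.
cost = {
    "sick": {"sick": 0.0, "healthy": 10.0},
    "healthy": {"sick": 1.0, "healthy": 0.0},
}

# Assumed classifier output for one new example.
probs = {"sick": 0.2, "healthy": 0.8}

most_probable = max(probs, key=probs.get)
cheapest = min_cost_class(probs, cost)
```

With these made-up numbers the most probable class is "healthy", but predicting "sick" has the lower expected cost, which is exactly the situation described for medical data.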

2. It is currently impossible to generate a PDF from the pages in their present state. See the bug report here: https://phabricator.wikimedia.org/T88890