
Matt Briggs, whom I quote below, found this graph on reddit. The comments there note, correctly, that the fitted model is stupid: it does not fit reality. Since I spent much of the afternoon looking at a paper based on data mining first childbirth versus first abortion, the idea of models trumping reality was painfully apparent.

To quote Briggs, on that graph:

Problem #1: The Deadly Sin of Reification! The mortal sin of statistics. The blue line did not happen. The gray envelope did not happen. What happened were those teeny tiny, too-small black dots, dots which fade into obscurity next to the majesty of the statistical model. Reality is lost, reality is replaced. The model becomes realer than reality.

You cannot help but be drawn to the continuous sweet blue line, with its guardian gray helpers, and think to yourself “What smooth progress!” The black dots become a distraction, an impediment, even. They soon disappear.

Problem #1 leads to Rule #1: If you want to show what happened, show what happened. The model did not happen. Reality happened. Show reality. Don’t show the model.

William S. Briggs
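As a minimal illustration of Rule #1 (this is my sketch, not Briggs’, and the data below are invented), here is one way to draw such a plot so that the observations stay in charge and any smoother stays in the background:

```python
# A minimal sketch of "show what happened": plot the observed points
# prominently; if a smoother is overlaid at all, keep it visually
# subordinate. The data here are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.arange(50)
y = 0.05 * x + rng.normal(0, 1.0, size=x.size)   # fake observations

fig, ax = plt.subplots()
ax.scatter(x, y, color="black", s=25, zorder=3, label="what happened")

# Optional: a simple moving-average smoother, drawn faintly behind the data.
window = 7
smooth = np.convolve(y, np.ones(window) / window, mode="same")
ax.plot(x, smooth, color="steelblue", alpha=0.3, zorder=1, label="a model (faint)")

ax.legend()
plt.show()
```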

If you are going to use a model (and often we do, particularly when dealing with missing data) then you had better know what the assumptions within that model are and be able to state why the model fits.
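What that looks like in practice is something like the following hedged sketch (the straight-line model and the data are my own illustration, nothing from the post): fit the model, then interrogate the residuals it leaves behind before believing it.

```python
# A sketch of "know the model's assumptions": fit a straight line, then
# actually check the residuals instead of trusting the fit. Data invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = 2.0 + 0.5 * x + rng.normal(0, 0.4, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Two of the assumptions a straight-line fit quietly makes:
# roughly normal residuals, and no trend left in them.
print("Shapiro-Wilk p-value (normality):", stats.shapiro(residuals).pvalue)
print("Correlation of residuals with x :", np.corrcoef(x, residuals)[0, 1])
```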

One of the frustrating things about doing statistics from the command line is that you have to understand what a command means before you can interpret it. But this forces you to be parsimonious, and not to over-correct the facts on the ground.

This kind of package scares me. Random forest plots? Why?

Nor, as happened in the discussion today, should we concentrate on the increased odds ratio for antidepressant use some years post-childbirth while ignoring correlations that were twice as strong, such as being on other psychiatric medications or seeing a psychiatrist.
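To make the comparison concrete, here is a purely illustrative sketch; the counts and the helper function are invented, not taken from the paper under discussion. The point is simply that an odds ratio only means something when read alongside the other, stronger ones in the same analysis.

```python
# Purely illustrative: compare odds ratios from 2x2 counts side by side,
# rather than reporting the headline one in isolation. Counts are invented.
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a/b exposed cases/non-cases, c/d unexposed."""
    return (a / b) / (c / d)

# Hypothetical associations with later antidepressant use:
or_headline     = odds_ratio(30, 170, 20, 180)   # the association being emphasised
or_other_meds   = odds_ratio(60, 140, 20, 180)   # a clearly stronger one
or_psych_visits = odds_ratio(55, 145, 20, 180)   # another stronger one

print(f"headline OR:         {or_headline:.2f}")
print(f"other psych meds OR: {or_other_meds:.2f}")
print(f"psychiatrist OR:     {or_psych_visits:.2f}")
```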

For the model should never contradict the findings.

  1. A company I used to work for died a slow, painful death after a colleague of mine decided that the consistency of the model used for various vendors of a key part was more important than the consistency of that model with the data. More or less, the target spec was about 200 dppm after five years, but the data were actually suggesting about 10,000 dppm. About half a year after political pressure was applied to choose my colleague’s more optimistic model over my more pessimistic one (log-normal tails vs. Weibull), the biggest customer dropped them because, lo and behold, the data they observed were in the same place as the data I had processed. Actually, I think they were worse, because my analysis wasn’t able to capture everything about where those data were heading.

    Sadly, hundreds of people, myself included, lost their jobs because of that SNAFU.
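The log-normal-versus-Weibull point in that comment can be sketched with invented numbers: fit both distributions to the same failure times and compare the five-year dppm each one implies. Everything below is illustrative, and real reliability work would also have to handle censored (still-surviving) units.

```python
# Hypothetical sketch: the same failure-time data, two distributional choices,
# very different extrapolated 5-year dppm. Data are invented, no censoring.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Pretend these are life-test failure times in years.
failures = rng.lognormal(mean=np.log(30.0), sigma=0.5, size=40)

t = 5.0  # the spec horizon in years

# Log-normal fit and the fraction it predicts will have failed by 5 years.
s, loc, scale = stats.lognorm.fit(failures, floc=0)
dppm_lognormal = stats.lognorm.cdf(t, s, loc=loc, scale=scale) * 1e6

# Weibull fit to the very same data.
c, loc, scale = stats.weibull_min.fit(failures, floc=0)
dppm_weibull = stats.weibull_min.cdf(t, c, loc=loc, scale=scale) * 1e6

print(f"log-normal 5-year dppm: {dppm_lognormal:,.0f}")
print(f"Weibull 5-year dppm:    {dppm_weibull:,.0f}")
```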
