Author: LorenCobb
Subject: Datamining vs. The Hunt
Date: 10/3/2000
Recommendations: 215
Warning: long, prolix, and highly idiosyncratic.

I don't normally like to step into the deep waters of religious belief, but once in a while I am overcome with the urge to plunge in. This issue of datamining is clearly more religious than scientific or statistical, but I suspect that some clarity can be brought to this topic, with the aid of some careful thought and carefully chosen words. In any event, I propose to try.

First, some disclosures. I am first and foremost a mathematician, as opposed to a statistician, scientist, or investor. All of my instincts and thoughts are grounded in a lifetime of logic and mathematics. I always look for structure, and especially so within my specialty, random processes. This has led me into some strange and wonderful areas, such as stock markets, psychology, politics, epidemiology, and developmental economics; these are areas that most mathematicians avoid like the plague. It has also led me into statistics, and rather deeper than is really good for any mortal soul.

On Mathematics and Statistics

The philosophical chasm that separates statistics from mathematics is very deep, and completely unrecognized by most people. This "chasm" is a disconnect that profoundly affects every aspect of our discourse on datamining. If you haven't perceived it yet, then it is my honor and privilege to be the first to show it to you. It exists because there are two very different ways of looking at any one equation. Let me explain.

Let's start with some common knowledge about mathematics. In ordinary math we describe our ideas with "models". A model is just an equation that specifies shape or behavior. In the simple linear case, a model is often nothing more than an equation like y = a + bx. In this equation there are "variables" and "parameters". The variables are by ancient tradition written with letters chosen from the end of the alphabet.
They can take many values, and the equation itself expresses a relationship between the values of the two variables. The parameters, on the other hand, are by tradition written with letters from the front of the alphabet. They have values that are considered fixed. From their fixed values we discover the actual shape or behavior of the model itself. In the linear example, the parameters a and b give us the intercept and slope of the line described by the model.

Now consider the same model from a statistical point of view. It is dramatically different! Now the variables are fixed by observation, and we have a lot of these observations, not all in agreement with each other. We are most emphatically not allowed to vary these values: that would be falsifying the data. The parameters, however, are now variable! They have unknown values that we need to estimate by suitably manipulating the data, our observationally fixed variables. In the linear example, we have (say) N different observations of x and y, and we need to discover the values of the parameters a and b. Note how the two parts of the model, the variables and parameters, exchange roles when considered statistically. This is the essential shift in perspective that we unconsciously make when we adopt the statistical point of view.

Thus what is fixed and immutable to a mathematician is unknown and variable to a statistician. Contrariwise, what is fixed as data for the statistician is freely and beautifully variable to the mathematician. Worse, neither understands nor appreciates the other's point of view.

The Psychology and Culture of Statistics

It has been my consistent observation that most human beings are much more like mathematicians than statisticians, even those of us who never touch an equation from one year to the next.
In other words, most of us naturally form a fixed mental model of whatever we happen to study, and we don't normally concern ourselves with the variability of our model parameters as a function of our observations. This is the psychological reason why statistics as a field of endeavor is so infernally counterintuitive. The appealing aspect of every mental model is its shape and behavior, not the hypothetical variability of its parameters.

There is another problem with statistics, this one cultural. Most professional statisticians grow up (educationally speaking) with only the most cursory exposure to real mathematics, by which I mean the meat and potatoes of mathematical models: geometry, differential equations, topology, modern algebra, etc. To be blunt, their concept of what a model can be is seriously and sometimes catastrophically deficient. A quadratic form is for them a very sophisticated model. Many, and perhaps most, statisticians work their entire careers without using any model more complicated than a flat plane, or more dynamic than a change over just two points in time. Yet these same individuals function as the high priests of science, blessing or damning every quantitative study even before it is submitted to a journal for peer review. I know about this from direct personal experience: I was twice employed as a biostatistician in a medical school.

Research or Datamining?

When we investors backtest our stock screening ideas, we are functioning in mathematical mode: looking for structure. An accusation of datamining, however, comes from the other side of the chasm: it is a statistical claim. Perhaps the reason these accusations seem so raucous is that the statisticians have to scream to make themselves heard by all of us on the other side.

Datamining itself is a very peculiar concept, thrice removed from reality. The mental model that we form from experience and observation is the first level of abstraction from reality.
Statistical theory teaches us to test a formal hypothesis concerning each parameter of that model, to see whether it is truly different from zero. This is the second level of abstraction. We pursue this approach happily, or at least without too much complaint, sometimes for years at a stretch. Then along comes a young troublemaker who beats us over the head with a big stick named Datamining. Ouch! Now it seems that either (a) we are making too many tests of hypotheses, so of course something has to be significant by chance alone, or (b) our models have become so complicated that they can fit anything. The claim is that we have not discovered a structure, we have merely fooled ourselves into believing that it exists. If the claim is true then we have been deluding ourselves, but if it is false then we may be about to throw away a potential fortune.

To me, all of this thunder and lightning about datamining seems to miss the point. Datamining works as a concept only within the impoverished world of statistical models. Outside of that domain there are gorgeously complex mathematical models whose validity derives from well-understood mechanisms of cause and effect. When causation is poorly understood, as for example in the stock market, then our models are necessarily simpler and more ad hoc, and the cry of Datamining! has at least some semblance of applicability. But is it helpful? I think not. We need better models of cause and effect, not purer statistics. In the case of Mechanical Investing, we should, I submit, be thinking about why certain screening and filtering ideas work and others don't.
If we don't ask this question, if we simply try endless variations on screen parameters without theory or theme, then we shall have wandered into the dusty wasteland of statistical empiricism, and we shall deserve every shriek of datamining that comes our way.

The search for causation and systematic structure does not require statistical analysis, though it sometimes helps, nor should we ever be deterred from this search by accusations of datamining, though these accusations may sometimes be on target. These things have a way of correcting themselves: false hypotheses and fantasized relationships reveal themselves willy-nilly in the cold light of tomorrow's market. Meanwhile the fun is in the hunt, the thrill of a stock that rises "like a homesick angel" (to borrow a phrase from RayVT), the shock of a stock that craters ignominiously, and the glory of a theory that really works.

Loren
(whose trip to Ecuador was postponed, so he now has actual hours of magnificently free time)