If I proposed 12 independent variables, built a model, and made a prediction using only 10 data points as my training set, that would be overfitting.
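To make that concrete, here is a minimal sketch using made-up random numbers (not real pitcher data). Everything in it is an assumption for illustration: the `solve` helper, the seed, and the choice to fit only the first 10 of the 12 variables and leave the other two coefficients at zero. With at least as many variables as data points, the model can match the training set exactly even though the "outcome" is pure noise, and then it does badly on new data.

```python
import math
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

random.seed(0)
n_points, n_vars = 10, 12

# Hypothetical training set: 10 "players", 12 random variables, random outcome.
X = [[random.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_points)]
y = [random.gauss(0, 1) for _ in range(n_points)]

# 10 variables are already enough to hit all 10 points exactly;
# the remaining 2 coefficients are simply left at zero.
coef = solve([row[:n_points] for row in X], y) + [0.0] * (n_vars - n_points)

def predict(row):
    return sum(c * v for c, v in zip(coef, row))

# Largest miss on the training data: essentially zero (floating-point noise).
train_err = max(abs(predict(r) - t) for r, t in zip(X, y))

# Fresh data drawn the same way: the "perfect" model has learned nothing.
X_new = [[random.gauss(0, 1) for _ in range(n_vars)] for _ in range(200)]
y_new = [random.gauss(0, 1) for _ in range(200)]
test_rmse = math.sqrt(
    sum((predict(r) - t) ** 2 for r, t in zip(X_new, y_new)) / len(y_new)
)

print("max training error:", train_err)
print("RMSE on new data:  ", test_rmse)
```

The perfect training fit says nothing about the variables; it only says there were enough knobs to turn.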
Proposing 5 different variables with ranges was silly, and more cherry-picking than overfitting (I wasn't trying to make a prediction). I should have thought more carefully. With tight enough ranges I could get as small a group as I wanted using only 1 variable (say, fastball speed 90-90.01).
My hypothesis was that Sanchez was an outlier. I'm no longer convinced he's an "outlier," so to speak. He's near the edge of the 3D distribution of age, ground ball rate, and fastball speed. So are a lot of other guys. It's a hypersphere. RA Dickey and Mark Buehrle are on the edge somewhere.
I don't know whether that is meaningful. Are predictions for guys on the edge of the cluster as accurate as for guys in the middle of the cluster?
I also realize that the more variables I use, the more guys there will be on the edge. In 1D there will only be 2 guys on the edges, in 2D more, and so on, until, as you indicate, you can get everybody on the edge if you use enough variables.
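A rough way to see this, sketched with made-up random data (the 30 players, the 40 variables, and the `extreme_players` helper are all assumptions for illustration): count a player as "on the edge" if he has the lowest or highest value in at least one variable. That is a simplification of the real edge of the cluster and if anything undercounts it, since it ignores guys who are on the boundary without leading any single category.

```python
import random

def extreme_players(data, n_vars):
    """Indices of rows holding the min or max of at least one of the first n_vars columns."""
    on_edge = set()
    for j in range(n_vars):
        column = [row[j] for row in data]
        on_edge.add(column.index(min(column)))
        on_edge.add(column.index(max(column)))
    return on_edge

random.seed(1)
n_players, max_vars = 30, 40

# Hypothetical players, each with 40 unrelated random "stats".
data = [[random.random() for _ in range(max_vars)] for _ in range(n_players)]

# counts[d - 1] = number of players on the edge when using the first d variables.
counts = [len(extreme_players(data, d)) for d in range(1, max_vars + 1)]

for d in (1, 2, 3, 5, 10, 20, 40):
    print(f"{d:2d} variables: {counts[d - 1]} of {n_players} players on the edge")
```

With 1 variable the count is exactly 2. Each added variable can put at most 2 more players on the edge and never takes anyone off, so the count only climbs toward the full group.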