by Topper » Fri Mar 23, 2012 2:26 pm
I spend several hours a day working with databases and using stats to identify trends and detect anomalies in 2D, 3D and at times 4D and 5D. I would not call myself a statistician, but I can work with numbers fairly well. I have also long been a fan of the work Bill James has done with baseball numbers. I still have copies of his annual Baseball Abstract from the early eighties sitting on my bookshelf.
Bill's greatest achievement was demonstrating the correlation between minor and major league performance and then using minor league stats to accurately project major league performance. One paper he wrote on projecting team performance based upon the number of players having career years the previous season is a work of art.
I am quite skeptical when I see people throwing numbers around. There are a few simple things to note when you work with stats. Cam Charon's (sic?) chances for/against model is an exceptional example of how not to compile stats.
First is data verification. Are the numbers collected any good, how were they collected, and what biases are there in the data collection? Run some simple stats, such as the mean and standard deviation, and plot them. This will highlight the variability in the data and give an indication of how useful it is. If you have highly variable data, you quite likely have too small a sample size, and you should use a +/- factor (commonly 2 standard deviations) when working with the numbers. Of note is that this +/- factor accumulates when you perform calculations with the stats and can very quickly become large, even larger than the stat value it is attached to.
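As a rough sketch of that check, the Python below computes a mean and a 2 standard deviation band for some per-game counts, then shows how the band grows when two stats are subtracted. The per-game numbers are made up for illustration, the helper names are hypothetical, and combining the bands in quadrature assumes the two stats are independent.

```python
import math
import statistics

def summarize(samples):
    """Return (mean, two_sd_band) for a list of per-game values."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (n - 1)
    return mean, 2 * sd

def combine_independent(*bands):
    """Combine the +/- bands of independent stats that are added or subtracted."""
    return math.sqrt(sum(b * b for b in bands))

# Hypothetical per-game chance counts, NOT real data.
chances_for = [3, 10, 5, 13, 4, 9, 6, 11]
chances_against = [6, 4, 7, 5, 8, 3, 6, 5]

mean_for, band_for = summarize(chances_for)
mean_against, band_against = summarize(chances_against)

# The derived stat (a per-game differential) carries both bands with it.
diff = mean_for - mean_against
diff_band = combine_independent(band_for, band_against)

print(f"for:     {mean_for:.2f} +/- {band_for:.2f}")
print(f"against: {mean_against:.2f} +/- {band_against:.2f}")
print(f"diff:    {diff:.2f} +/- {diff_band:.2f}")
```

With these made-up numbers the band on the differential comes out wider than the differential itself, which is the situation described above.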
Second, look at correlation coefficients to identify data linkages. I looked at Cam's data when Spud linked to it early in the season and saw that chances for had a direct correlation to TOI. What was interesting was that chances against did not correlate to TOI. At that point it was obvious that Cody was playing sheltered minutes.
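For anyone wanting to try the correlation check themselves, here is a minimal sketch of a Pearson correlation coefficient in plain Python. The TOI and chance numbers are invented to mimic the pattern described (chances for tracking TOI, chances against not), and the function name is my own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-game values for one player, NOT real data.
toi = [10, 12, 14, 16, 18, 20]           # minutes of ice time
chances_for = [2, 3, 3, 4, 5, 5]
chances_against = [3, 1, 4, 2, 3, 2]

r_for = pearson(toi, chances_for)
r_against = pearson(toi, chances_against)
print(f"TOI vs chances for:     r = {r_for:.2f}")
print(f"TOI vs chances against: r = {r_against:.2f}")
```

A value near +1 or -1 indicates a strong linear link, and a value near 0 indicates little or none. With a handful of games the coefficient itself is noisy, so it is a prompt for a closer look rather than proof.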
In the case of the chances for/against model, the obvious problem was the home plate area for defining chances. Point shots and even shots from the deep slot would not count. Another issue is the definition of a shot. I could not find one on the blog. Is it a shot on net or a shot directed at the net? I would hope they are using shots directed at the net, but how does the data collection deal with, say, passes to a player on the backdoor with a yawning cage that are either tipped or fanned on?
I also noted that at that point in the season, the data was subject to a fair bit of variability. I recall the Sedins had games with about 3 chances and other games with 10-13 chances. Without the raw data it is impossible to calculate means and standard deviations, but that wide variability would lead to a large 2 standard deviation band. When they chose to represent the chances for/against as a percentage, they should have given it a +/- of three times that 2 standard deviation figure. That amount could be very significant and throw much of the derived results into question. It was quite conceivable to see a player with a %Chances for/against value of 67% +/-12%. When the team sits in a group with a range of 30% between highest and lowest, that +/-12% is huge.
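To put a rough number on that, here is one way to attach a +/- to a chances-for percentage. This is my own simplification and not how the blog computed anything: it treats every chance as an independent coin flip (a binomial model with a normal approximation) and uses 2 standard errors as the band. The counts are hypothetical.

```python
import math

def share_with_band(chances_for, chances_against, z=2.0):
    """Return (share, band), where share is for / (for + against).

    Assumes each chance is an independent trial (binomial model) and
    uses z standard errors as the +/- band.
    """
    total = chances_for + chances_against
    if total <= 0:
        raise ValueError("need at least one chance")
    share = chances_for / total
    std_err = math.sqrt(share * (1 - share) / total)
    return share, z * std_err

# Hypothetical totals early in a season, NOT real data.
share, band = share_with_band(40, 20)
print(f"{share:.0%} +/- {band:.0%}")  # prints "67% +/- 12%"
```

Even under this generous assumption, 60 total chances only pins the percentage down to about +/-12%. Real game-to-game swings like the ones mentioned above would likely make the band wider, not narrower.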
I suspect that teams use a GIS system to track scoring attempts. For an investment of a few tens of thousands of dollars, a system could be adopted that would allow touch screen input of the location on the ice of the origin of the shot/pass and then input of what happened to the shot/pass. Was it tipped, deflected, fanned on or blocked? Was it a goal, a save, off the post, wide (and by how much) or a rebound?
We use similar systems for our field mapping and data collection; simplistic systems are available as phone apps.
Using such a system, the pass to create a chance that someone mentioned Tippett uses could be easily tracked.
This is not much different from what is done for charting shooting tendencies or goaltenders' save/weakness tendencies; it just adds some additional detail.
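To make the idea concrete, here is a sketch of what one record in such a tracking system might look like. Every name, field and coordinate convention is hypothetical; I have no knowledge of what any team actually uses. It only captures the items listed above: where the attempt started, whether it was a shot or a pass, what happened to it on the way, how it ended, and which pass set it up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Kind(Enum):
    SHOT = "shot"
    PASS = "pass"

class Touch(Enum):
    NONE = "none"
    TIPPED = "tipped"
    DEFLECTED = "deflected"
    FANNED_ON = "fanned on"
    BLOCKED = "blocked"

class Result(Enum):
    GOAL = "goal"
    SAVE = "save"
    POST = "post"
    WIDE = "wide"
    REBOUND = "rebound"

@dataclass
class Attempt:
    x: float                                 # origin on the ice (units are an assumption)
    y: float
    kind: Kind
    touch: Touch = Touch.NONE
    result: Optional[Result] = None          # passes may have no shot result
    wide_by: Optional[float] = None          # only meaningful when result is WIDE
    set_up_by: Optional["Attempt"] = None    # the pass that created this attempt

def setup_passes(attempts):
    """Return the passes that directly created a shot."""
    return [a.set_up_by for a in attempts
            if a.kind is Kind.SHOT and a.set_up_by is not None]

# A backdoor pass that is fanned on, then a point shot tipped in.
backdoor = Attempt(x=80.0, y=-10.0, kind=Kind.PASS, touch=Touch.FANNED_ON)
feed = Attempt(x=70.0, y=20.0, kind=Kind.PASS)
point_shot = Attempt(x=60.0, y=0.0, kind=Kind.SHOT, touch=Touch.TIPPED,
                     result=Result.GOAL, set_up_by=feed)

print(len(setup_passes([backdoor, feed, point_shot])))
```

Because the fanned-on backdoor pass is stored as its own record, it does not vanish from the data just because no shot was registered.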
I read with amusement when the folks who came up with the Chances for/against blog used the movie Moneyball as a validation of their work. When I watched the movie I thought it was more a condemnation of their simplistic approach. Tellingly, there is a scene where the heavyset analyst is charting pitches in a strike zone displayed on his computer. The shot shows multicoloured dots representing balls, strikes, curve balls, fast balls and whatever, and serves to highlight how overly simplistic Cam's work is.
It is funny watching people use these stats to justify where a team is in the here and now. The true power of these advanced stats was shown by Bill James in projecting a player's value in the future.
Corsi is an interesting beast and appears to steal heavily from the work done in tracking bowler and batter stats for cricket rankings. In that work, not only is strength of opponent tracked, but also the difficulty of the cricket ground being played on. The stats also carry a weighting based upon time, with the most recent performance given more credence than past work. This gives a very useful trend line for a player's performance.
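To show the time-weighting idea, here is a minimal sketch of a rating where recent games count for more and each game is scaled by opponent strength and venue difficulty. This is NOT the actual cricket ranking formula, which I have not reproduced. The half-life decay, the multipliers and every name below are assumptions made purely for illustration.

```python
def weighted_rating(games, half_life=10.0):
    """Time-weighted average of adjusted game scores.

    games: list of (age, score, opponent, venue) tuples, where
      age      = how many games ago it was played (0 = most recent)
      score    = raw performance number for that game
      opponent = multiplier for opponent strength (1.0 = average)
      venue    = multiplier for venue difficulty (1.0 = average)
    A game's weight halves every `half_life` games (an assumption).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for age, score, opponent, venue in games:
        weight = 0.5 ** (age / half_life)
        weighted_sum += weight * score * opponent * venue
        weight_total += weight
    if weight_total == 0:
        raise ValueError("no games to rate")
    return weighted_sum / weight_total

# Hypothetical player: poor ten games ago, strong in the latest game.
games = [
    (0, 10.0, 1.0, 1.0),   # latest game, full weight
    (10, 0.0, 1.0, 1.0),   # ten games ago, half weight
]
rating = weighted_rating(games)
plain_average = sum(g[1] for g in games) / len(games)
print(f"weighted: {rating:.2f}  plain: {plain_average:.2f}")
```

Recomputing the rating after every game gives the trend line: a player on the rise sits above his plain average, and a player in decline sits below it.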