Pitcher WAR

turtle4499 · 01-27-2025, 10:18 PM

Pitcher WAR is not correct at all. Currently this results in it being a complete noob trap and absolutely crushes the AI's ability to play the game.

I wrote a long ranty post about it on reddit because people keep recommending it to new players.

List of issues I know about:

Uses non-FIP park factors for park adjustment
- No idea why this is a thing, but it is easy to verify: change your park's doubles rate and WAR, FIP-, and ERA+ all change. Only ERA+ should be changing.
Uses park adjustment at the season level instead of the game level
- Seriously, go look at the home vs. away splits for Rockies players. It is completely insane.
- Editing your park makes your players' home and away WAR change. This is not done in real life; you can inspect any player's splits on FanGraphs and easily verify this. I guess it may have been done 20 years ago in 2006.
- This may be close enough for hitters, but it creates systematic bias for pitchers.
Uses a FIP calculation that ignores the in-game calculation method
- FIP dramatically overrepresents the effect of K% on outcomes. In real life, Ks prevent HRs; in OOTP they do not.
- Side note: can we please let K% come before the HR calc in the order of operations? It is so odd that you can hit a HR without K% being checked at all.
Uses FIP
- The game allows pitchers to impact BABIP, so why is the game using a FIP-based WAR? It makes no sense that both of these things are true.
Doesn't use the full WAR formula for the dynamic run environment that starters have
- (May be wrong on this one, so please correct me if I am.)
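
For anyone who wants to see how heavily FIP leans on strikeouts, here is a minimal sketch of the textbook public FIP formula. It is not OOTP's code, and nothing in this thread confirms the game uses these exact weights. The 3.10 constant and both stat lines are made-up placeholders.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Textbook FIP: (13*HR + 3*(BB+HBP) - 2*K) / IP + constant.

    The constant is league- and season-specific; 3.10 is only a placeholder.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Two hypothetical 180-inning pitchers, identical except for strikeouts.
low_k = fip(hr=20, bb=50, hbp=5, k=120, ip=180)
high_k = fip(hr=20, bb=50, hbp=5, k=200, ip=180)

# Every extra strikeout lowers FIP by 2/IP, no matter how the sim engine
# actually resolved the plate appearance.
print(round(low_k, 2), round(high_k, 2), round(low_k - high_k, 2))
```

The 80 extra strikeouts are worth almost 0.9 runs of FIP here, which is the credit the formula hands out whether or not strikeouts suppress HRs or BABIP in the engine.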

This and hitter WAR have no excuse to be so broken in a game that is about stats at its core. It is extremely problematic to have them be so wrong when they affect player happiness and contract demands.

jaa36 · 01-27-2025, 11:37 PM

Valid critiques! I really wish that OOTP would make their methods of calculating their various WARs, park factors and other stats less opaque, including what the linear weights are for offensive events, which would in turn feed into wOBA and offensive WAR. And for pitchers, it would be great if the game had a "home-grown" WAR calculation that incorporated the effects of BABIP. One potential way to do this (not perfect by any means) would be to base it on "pitcher wOBA allowed" vs league average (corrected with appropriate park factors), then apply the dynamic runs-per-win calculation. For the linear weights, you'd probably want to use a catch-all weight for all non-HR hits, as (to my knowledge) an OOTP pitcher has no control over what type of hit he allows, if it's not a home run.

mytreds · 01-28-2025, 12:38 PM

And thus the inherent problem with WAR. It can be whatever the person making the formula wants it to be.

uruguru · 01-28-2025, 02:58 PM

the introduction of pitcher BABIP was a godsend for simulation accuracy, but yes it greatly undercuts the entire premise of FIP.

It is WAR that is a noob trap. No one should be relying on any version of it, real or in OOTP, to evaluate single-season performance. I would be shocked if anyone working with MLB teams relies on it -- its error bars are far too wide. It is best used as a career ranking stat for HOF voters, where all of those error bars can hopefully cancel out over long careers.

Brad K · 01-28-2025, 04:25 PM

Quote:

Originally Posted by uruguru

No one should be relying on any version of it, real or in OOTP, to evaluate single-season performance.

Maybe not career performance either. This video is a shocker.

https://www.youtube.com/watch?v=4OvLU1DSrmw

uruguru · 01-28-2025, 06:22 PM

Quote:

Originally Posted by Brad K

Maybe not career performance either. This video is a shocker.

https://www.youtube.com/watch?v=4OvLU1DSrmw

That same guy made a video responding to a critical comment I made about the stat a few years ago. He listed other comments after mine, but mine was the main substantive one. He absolutely nailed my voice when he read my comment.

He then went through the trouble of manually calculating WAR and came across all of the same systemic flaws in the calculation that I have critiqued before. And then he ultimately shrugged his shoulders and said basically, "I guess it's not that good but it's still the best we have"

https://www.youtube.com/watch?v=ipD053CE3PI

Brad K · 01-28-2025, 06:31 PM

Good job. It's tough to get people to change their minds.

Brad K · 01-28-2025, 06:41 PM

He talks too fast. I attended a wedding last year in which only one of the speakers was truly understandable. I commented to him about it and he said the key is to talk slowly, so slowly that he as the speaker thinks it's too slow.

jaa36 · 01-28-2025, 07:12 PM

Quote:

Originally Posted by uruguru

the introduction of pitcher BABIP was a godsend for simulation accuracy, but yes it greatly undercuts the entire premise of FIP.

It is WAR that is a noob trap. No one should be relying on any version of it, real or in OOTP, to evaluate single-season performance. I would be shocked if anyone working with MLB teams relies on it -- its error bars are far too wide. It is best used as a career ranking stat for HOF voters, where all of those error bars can hopefully cancel out over long careers.

Well, let's not throw out the baby with the bathwater. While the OOTP implementation of WAR remains far from perfect, and would benefit from visibility about how it's calculated, it is still a useful measure of retrospective player value in most cases. Of course, like anything else, you'd want to look at the underlying ratings to know if the number is a fluke or not, and for future performance, I'll take the player with good ratings who put up negative WAR over the player with bad ratings who put up 4 WAR any day of the week.

I just looked at MLB on FanGraphs for the last 25 years, and an online league I'm in for a 15-year period. R-squared for total team FanGraphs WAR vs. team wins for MLB was 0.97. R-squared for total OOTP WAR vs. team wins for the league was 0.93. Pretty good! Ignore at your peril.

turtle4499 · 01-28-2025, 10:50 PM

Quote:

Originally Posted by jaa36

Valid critiques! I really wish that OOTP would make their methods of calculating their various WARs, park factors and other stats less opaque, including what the linear weights are for offensive events, which would in turn feed into wOBA and offensive WAR. And for pitchers, it would be great if the game had a "home-grown" WAR calculation that incorporated the effects of BABIP. One potential way to do this (not perfect by any means) would be to base it on "pitcher wOBA allowed" vs league average (corrected with appropriate park factors), then apply the dynamic runs-per-win calculation. For the linear weights, you'd probably want to use a catch-all weight for all non-HR hits, as (to my knowledge) an OOTP pitcher has no control over what type of hit he allows, if it's not a home run.

Linear weights do not exist for a pitcher. Pitchers have too much control over the arrangement of base-out states. Pitchers who walk a lot of batters see a lot more runner-on, no-out situations than those who do not, and that alters the linear weight result. SIERA is the actual stat you would want, but given the inability to implement easy stats correctly, I am not holding my breath about this one being implemented correctly.

But really, just making the game factor strikeouts into HR and BABIP prevention, as they do in real life, would solve most of the issues with using FIP. Then all they need to do is actually apply park factors correctly, which isn't hard, and it is pretty shocking that this has been done incorrectly. I am not talking about it being wrong in the sense that it overweights the HR park factor. It is just not applied correctly at all.

Applying park factors to innings regardless of the park the event took place in makes zero sense, and only kind of works for hitters due to them having roughly the same number of PA at home as away. To be clear, it doesn't actually work out correctly even then.

(runs_at_home * home_park_factor) + runs_away != (runs_at_home + runs_away) * ((home_park_factor + 1) / 2). The second is what OOTP is doing, despite the two expressions not being equivalent whenever the home and away totals differ.
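
A quick numeric check of that inequality, as a minimal sketch. It assumes the season-level factor is the usual half-home, half-neutral blend, (home_park_factor + 1) / 2, that road parks average out to 1.0, and a hypothetical home factor of 1.25. None of these numbers come from the game, and the function names are mine.

```python
def game_level(runs_home, runs_away, home_pf, away_pf=1.0):
    # Adjust each bucket of runs by the park it was actually allowed in.
    return runs_home * home_pf + runs_away * away_pf


def season_level(runs_home, runs_away, home_pf):
    # One blended factor applied to everything, as if half the runs came at home.
    return (runs_home + runs_away) * (home_pf + 1.0) / 2.0


HOME_PF = 1.25  # hypothetical hitter-friendly park

# Even split: the two methods agree.
print(game_level(40, 40, HOME_PF), season_level(40, 40, HOME_PF))

# Uneven split (fewer runs charged at home than away): they diverge.
print(game_level(30, 70, HOME_PF), season_level(30, 70, HOME_PF))
```

The two only match when the home and away totals happen to be equal, which is roughly true for everyday hitters and not at all guaranteed for a starting pitcher.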

uruguru · 01-28-2025, 10:53 PM

Quote:

Originally Posted by jaa36

I just looked at MLB on FanGraphs for the last 25 years, and an online league I'm in for a 15-year period. R-squared for total team FanGraphs WAR vs. team wins for MLB was 0.97. R-squared for total OOTP WAR vs. team wins for the league was 0.93. Pretty good! Ignore at your peril.

Did you see what you did there? You took a bunch of individual player WARs, which bbref admits have error bars up to 20%, and then summed them all together -- which cancels out a lot of the errors when viewed in aggregate (much like a player's career WAR total).

At the individual player level, WAR is still a blurry measure of contribution.

turtle4499 · 01-28-2025, 11:04 PM

Quote:

Originally Posted by jaa36

I just looked at MLB on FanGraphs for the last 25 years, and an online league I'm in for a 15-year period. R-squared for total team FanGraphs WAR vs. team wins for MLB was 0.97. R-squared for total OOTP WAR vs. team wins for the league was 0.93. Pretty good! Ignore at your peril.

Team-level comparisons don't actually address this and can even cover it up. WAR is always normalized at the league level, so as long as systematic errors only shift value around within a team, you would not observe them. In the case of systematic issues with starting pitchers, that value gets shifted to relievers. The issue then won't really show up in team totals, except that value is incorrectly attributed at the player level. Further, because it's based on park factors, if most teams in your league are around 1 you won't even see the bulk of the issues show up.

That being said the other comments made about WAR are mostly not addressing the same thing as me. WAR models a specific thing and when using it you need to understand what it models. That isn't a knock on WAR that is a reality of the math that underpins it.
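
Here is a toy illustration of the masking effect, under a deliberately strong assumption: the error is perfectly zero-sum inside each team (value taken from starters lands on that team's relievers). The league, the WAR values, and the 0.5 WAR shift are all invented for the sketch.

```python
import random

random.seed(0)  # fixed seed so the toy league is reproducible

# Hypothetical league: 30 teams, each with 5 starters and 7 relievers.
# The "true" WAR values are made up purely for illustration.
league = [
    {
        "sp": [random.uniform(0.0, 5.0) for _ in range(5)],
        "rp": [random.uniform(-0.5, 2.0) for _ in range(7)],
    }
    for _ in range(30)
]


def misattribute(team, shift=0.5):
    """Move `shift` WAR off every starter and spread it over the relievers.

    The team total is unchanged; only the split between players moves.
    """
    moved = shift * len(team["sp"])
    return {
        "sp": [w - shift for w in team["sp"]],
        "rp": [w + moved / len(team["rp"]) for w in team["rp"]],
    }


def total(team):
    return sum(team["sp"]) + sum(team["rp"])


reported = [misattribute(team) for team in league]

# Largest change in any team's total WAR (floating-point noise only).
max_team_error = max(abs(total(t) - total(r)) for t, r in zip(league, reported))

# Smallest error on any individual player's WAR.
min_player_error = min(
    abs(true - shown)
    for t, r in zip(league, reported)
    for role in ("sp", "rp")
    for true, shown in zip(t[role], r[role])
)

print(f"max team-level error:   {max_team_error:.2e}")
print(f"min player-level error: {min_player_error:.3f}")
```

Every player is off by at least a third of a win while every team total is untouched, so any team-WAR-to-wins correlation would be identical with or without the error.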

turtle4499 · 01-28-2025, 11:12 PM

Quote:

Originally Posted by uruguru

Did you see what you did there? You took a bunch of individual player WARs, which bbref admits have error bars up to 20%, and then summed them all together -- which cancels out a lot of the errors when viewed in aggregate (much like a player's career WAR total).

At the individual player level, WAR is still a blurry measure of contribution.

That is the consequence of the central limit theorem and the law of large numbers. It also only applies to random errors not systematic errors. There is systematic canceling here that is likely responsible for why the r^2 remains so high at the team level. So no it is not equivalent at all, and you really need to be careful when you make assumptions about which types of errors are contributing to the problem since random vs non random errors have different effects in aggregate.

uruguru · 01-28-2025, 11:58 PM

Quote:

Originally Posted by turtle4499

That is the consequence of the central limit theorem and the law of large numbers. It also only applies to random errors not systematic errors. There is systematic canceling here that is likely responsible for why the r^2 remains so high at the team level. So no it is not equivalent at all, and you really need to be careful when you make assumptions about which types of errors are contributing to the problem since random vs non random errors have different effects in aggregate.

I didn't specify the type of error, but I thought it was obvious I was speaking about the general margin of error for the WAR statistic. This error is created not only by randomness within the sport, but by systemic shortcomings in the modeling for the stat (approximations, guesstimates about data, assumptions about cause, noise in park factors, etc).

I think it's fair to assume that the errors in individual player WARs are mostly independent. This suggests that summing their WARs increases the total error only by the square root of the number of players (N) added together, even as the total WAR increases by the full factor of N. So yeah, much like what you would see with the law of large numbers. Either way, you still get an averaging of the fluctuations in error for individual WARs.

But speaking precisely, player WARs are not identically distributed (for lots of reasons) so the law of large numbers technically does not apply. But the broader concept still applies enough to use as a reference point.
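
A minimal sketch of that scaling, with a shared systematic bias included for comparison. The 0.5 and 0.1 WAR figures are made-up placeholders, not measured error bars, and the independence of the random part is the assumption stated above.

```python
import math


def aggregate_error(n_players, sigma, bias):
    """Error on a sum of n_players WAR values.

    sigma: per-player random error (independent between players)
    bias:  per-player systematic error (same sign for every player)
    """
    random_part = sigma * math.sqrt(n_players)  # independent errors add in quadrature
    systematic_part = abs(bias) * n_players     # a shared bias adds linearly
    return random_part, systematic_part


for n in (1, 25, 100):
    rnd, sys_ = aggregate_error(n, sigma=0.5, bias=0.1)
    # Per-player view: the random share shrinks with n, the bias does not.
    print(n, round(rnd / n, 3), round(sys_ / n, 3))
```

The random component averages down as more players are summed; a bias that points the same way for every player does not, unless something else in the calculation offsets it.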

Matt Arnold · 01-29-2025, 06:29 AM

As a baseball fan, if anyone can come up with a better metric for modelling players, I'm all ears.

As for the in-game, we try to follow as much like actual rWAR or fWAR as we can, although yes, we do take some simplifying assumptions. We do the dynamic runs per win, we do adjust for leverage, split starter/rp, etc...

Some of them have started counting popups in the strikeout bucket for FIP; we don't necessarily track popups the same way in players' historical records.
We do rope the park factors into a single factor number for some of the run calculations, and we also use the set park factors; we don't re-calculate park factors for each team based on actual results for the season. We also do assume that your league park factors average out - if you set everyone in your league to a 1.2 HR park factor, you'll probably see some weird adjusted numbers, because if one park is a 1.2 we simplify and assume the average of the other parks you play in will balance it out.

There's other adjustments that impact things, especially if you look at sub-splits for WAR, yes, or players who have sample sizes outside of average. A lot of that is that despite how much we love stats, we also don't store everything in history, because the average user does not want the entire BBRef database stored on their machine for every saved game they play. Or wait for half an hour for us to recalculate every WAR value on the fly because something got tweaked.

jcard · 01-29-2025, 08:11 AM

Quote:

Originally Posted by Matt Arnold

As a baseball fan, if anyone can come up with a better metric for modelling players, I'm all ears.

As for the in-game, we try to follow as much like actual rWAR or fWAR as we can, although yes, we do take some simplifying assumptions. We do the dynamic runs per win, we do adjust for leverage, split starter/rp, etc...

Some of them have started counting popups in the strikeout bucket for FIP; we don't necessarily track popups the same way in players' historical records.
We do rope the park factors into a single factor number for some of the run calculations, and we also use the set park factors; we don't re-calculate park factors for each team based on actual results for the season. We also do assume that your league park factors average out - if you set everyone in your league to a 1.2 HR park factor, you'll probably see some weird adjusted numbers, because if one park is a 1.2 we simplify and assume the average of the other parks you play in will balance it out.

There's other adjustments that impact things, especially if you look at sub-splits for WAR, yes, or players who have sample sizes outside of average. A lot of that is that despite how much we love stats, we also don't store everything in history, because the average user does not want the entire BBRef database stored on their machine for every saved game they play. Or wait for half an hour for us to recalculate every WAR value on the fly because something got tweaked.

Somewhat related question: is there a reason why in OOTP wRC+ is not park adjusted, whereas OPS+ is?

Matt Arnold · 01-29-2025, 08:27 AM

Quote:

Originally Posted by jcard

Somewhat related question: is there a reason why in OOTP wRC+ is not park adjusted, whereas OPS+ is?

wRC+ is park adjusted

jcard · 01-29-2025, 09:11 AM

Quote:

Originally Posted by Matt Arnold

wRC+ is park adjusted

First, I appreciate the quick reply. This must have just changed with v25, then (I bought it in March but never used it). Through v24, if you summed through a season and compared batters’ individual wRC+ and OPS+, it was obviously the case that [wRC+ - OPS+] was highest for players in good hitting parks and lowest (= most negative) for those in poor hitting parks. The wRC+ numbers simply reflected the raw batting line, while the OPS+ stats punished / compensated based on park context.

turtle4499 · 01-29-2025, 09:50 AM

Quote:

Originally Posted by Matt Arnold

We do the dynamic runs per win, we do adjust for leverage, split starter/rp, etc...

We do rope the park factors into a single factor number for some of the run calculations, and we also use the set park factors; we don't re-calculate park factors for each team based on actual results for the season. We also do assume that your league park factors average out - if you set everyone in your league to a 1.2 HR park factor, you'll probably see some weird adjusted numbers, because if one park is a 1.2 we simplify and assume the average of the other parks you play in will balance it out.

Can you please provide clarity on exactly what is being multiplied out for dynamic runs per win and the starter/RP split? Can you also provide clarity on the formula being used to turn park factors into run adjustments? I am not really sure why FIP's run adjustment takes into account the park's doubles factor.

I don't think anyone is actually asking you to calculate park factors based on seasonal data. I am really just asking for the park factors to be applied to the actual games correctly. Applying them broadly cannot possibly work out the correct numbers.

Simple example from my reddit post: I had a starting pitcher at Coors with 50 innings at home and 100 on the road (because he was shelled at home), and the adjustment is being applied as if he pitched half his innings at home. There is no reason to assume starting pitchers do this at all, especially in a hitter-friendly park in an era where pitchers aren't just pitching 9 innings.
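
Putting rough numbers on that pitcher, as a sketch only: the 1.25 home factor is a placeholder rather than the actual in-game Coors value, road parks are assumed to average 1.0, and `effective_park_factor` is a name I made up for the illustration.

```python
def effective_park_factor(ip_home, ip_away, home_pf, away_pf=1.0):
    # Weight each park factor by the innings actually pitched there.
    return (ip_home * home_pf + ip_away * away_pf) / (ip_home + ip_away)


# The pitcher from the example: 50 IP at home, 100 IP on the road.
actual = effective_park_factor(50, 100, home_pf=1.25)

# What a half-home, half-away assumption implies for the same 150 IP.
assumed = effective_park_factor(75, 75, home_pf=1.25)

print(round(actual, 3), round(assumed, 3))
```

With only a third of his innings at home, the park context he actually pitched in is noticeably milder than the half-and-half assumption gives him credit for.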

Quote:

Originally Posted by Matt Arnold

A lot of that is that despite how much we love stats, we also don't store everything in history, because the average user does not want the entire BBRef database stored on their machine for every saved game they play.

I am really struggling to understand what data you think is needed that isn't already being stored. If we are able to look up home/away splits and vs. team splits, there is clearly enough data being stored to apply the park adjustments to the actual parks the games occurred in.

turtle4499 · 01-29-2025, 10:04 AM

Quote:

Originally Posted by jcard

Through v24, if you summed through a season and compared batters’ individual wRC+ and OPS+, it was obviously the case that [wRC+ - OPS+] was highest for players in good hitting parks and lowest (= most negative) for those in poor hitting parks.

Just for clarity for anyone else poking around here: I checked this on 25 and that is still the case. I am guessing this is a side effect of the run modifier itself, though I have no idea if it is similar to the normal issues of OPS+ vs wRC+ or if this is some specific issue with OOTP.
