Measuring player performance with statistics has become an everyday practice, yet it has never been questioned more than now. Discussions about sample size and the impact of the reduced number of rounds played have raised concerns about the numbers we use to compare players, pick MVPs, and crown the best players of the year. HLTV Rating 2.0 has been the gold standard since its introduction in June 2017, but the transition to CS2 has exposed the need for change. With this requirement for evolution in mind, HLTV recently announced the release of HLTV Rating 2.1, with small adjustments to account for some of these effects, and also announced the future release of the completely new HLTV Rating 3.0. The Leetify rating, based on round-win probability models, has been proposed as an alternative. However, it is far from being widely accepted and comes with its own downsides.
In this piece, we will delve into some of the characteristics of both of these approaches and make the case for a new one, based on analyzing the performance of players across different dimensions and merging them all into a total score. This piece is going to get a bit technical, but I highly suggest that you follow along. Understanding every detail is not essential, and getting the gist of what makes each option good at what it sets out to do is more than enough!
State of the art: HLTV Rating 2.0 vs Leetify Rating
HLTV Rating 2.0: CS2 killed the rating star
HLTV Rating 2.0 has been the standard for comparing players since the beginning of professional Counter-Strike. However, with the adoption of MR12 and the changes introduced in CS2, the general feeling in the community is that it has become outdated. Here we will use HLTV Rating 2.0 as an example of how ratings can go sideways whenever the game suffers a noticeable change in the meta.
The complete formulation of HLTV Rating 2.0 was never made public, so we do not know for sure which actions it included or how much value was assigned to each one. However, a member of the community tried reverse-engineering the rating and his results were fairly accurate; you can check them in his blog post. From that approximation, we can be fairly confident that the formulation was a combination of KPR, DPR, ADR, KAST, and Impact, with some small adjustments from more granular data.
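As an illustration of what such a formulation might look like (not the actual, non-public coefficients), here is a minimal sketch of a linear-combination rating; the weights below are placeholders I picked for the example.

```python
# Illustrative sketch of a linear-combination rating in the spirit of HLTV 2.0.
# The weights are made-up placeholders, NOT the actual (non-public) coefficients.
def linear_rating(kpr, dpr, adr, kast, impact,
                  w_kpr=0.36, w_dpr=-0.53, w_adr=0.0032,
                  w_kast=0.0073, w_impact=0.24, intercept=0.16):
    """Combine per-round stats into a single score centered roughly at 1."""
    return (w_kpr * kpr + w_dpr * dpr + w_adr * adr
            + w_kast * kast + w_impact * impact + intercept)

# Example: a decent statline lands slightly above 1
print(round(linear_rating(kpr=0.72, dpr=0.66, adr=78, kast=71, impact=1.05), 2))
```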
The rating is supposed to be centered at 1, with a good performance considered to be whenever a player has a rating above 1.1, and a bad performance when a player has a rating below 0.95. The lowest value a given player can get is 0, while the upper limit is not restricted. This is one of my complaints about the rating: humans are used to things being centered and equally distributed on both sides of the mean value (height, IQ, hand size…), which is known as a Gaussian distribution.
In contrast, with the specification used in both HLTV Rating 2.0 and 2.1, player values can drift further from the mean on the way up than on the way down, which in statistics is known as having a heavier tail towards the upper part of the distribution. If you are interested, this probably behaves more like a gamma distribution. It may not seem like a huge issue, but it is directly related to the general feeling that the community has about the rating. If you are interested in distributions, you can check this visualizer, where you can play with the parameters and see how they modify the shapes.
MR12 means that the sample size is smaller, which translates into more extreme rating values. Why? Because more rounds give players more time to regress to their “real” mean level, reducing the effect of a couple of good rounds on the final rating. In theory, we could argue that extremely low and extremely high outlier performances would balance each other out, but this relies on the assumption that the ratings are symmetrical. Since HLTV Rating 2.0 and 2.1 cannot go lower than 0, extremely high values cannot be balanced. In the next figure you have a simple representation of what I suspect is happening: with CS2 the upper tail is longer and thicker, while the lower tail is getting thicker but is cut off at 0.
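A quick simulation (a sketch assuming numpy, with made-up noise parameters) illustrates the intuition: averaging fewer rounds produces noisier, more extreme match ratings, and flooring at 0 cuts off the low end while leaving the high end free to grow.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_match_ratings(rounds_per_match, n_matches=100_000):
    """Average noisy per-round contributions into a match rating, floored at 0."""
    per_round = rng.normal(loc=1.0, scale=2.0, size=(n_matches, rounds_per_match))
    ratings = per_round.mean(axis=1)
    return np.clip(ratings, 0, None)  # the rating cannot go below 0

for rounds in (30, 24):  # roughly MR15- vs MR12-length matches
    r = simulate_match_ratings(rounds)
    print(rounds, "rounds -> std:", round(r.std(), 3),
          "| share >= 1.5:", round((r >= 1.5).mean(), 4),
          "| share clipped at 0:", round((r == 0).mean(), 4))
```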
Another critique of HLTV Rating 2.0 has been that the loadings assigned to each action are predefined by a human. This creates a vicious circle: the HLTV Rating determines who the best players are, and they even get rewarded for it with specific clauses in their contracts if they appear in the Top 20 of the year. However, the person choosing the loadings is defining what is considered important, and that directly impacts the values obtained. Specifically, HLTV Rating 2.0 seems to reward aggressive players more than Leetify does. Finally, one last critique is that HLTV Rating 2.0 does not take economy into account, and is therefore unable to distinguish between clutch players and eco-cobras.
It may seem that I am being too harsh on the HLTV Rating, but don’t get me wrong: it is extremely good at what it is intended to do, but everything comes with a price. If you need a rating that can be extracted confidently for each match played, you cannot use extremely granular data because the sample size will be tiny. And since you cannot use advanced stats, you are sensitive to changes in playstyle over time that you cannot measure on a match basis but that will affect your rating in the long run.
Leetify Rating: a good performance metric for amateur players
Leetify Rating covers some of the blind spots in HLTV Rating using a completely different approach. Whereas HLTV uses several game stats to generate a rating for each player, Leetify is a “win rate impact-based and economy-adjusted player rating system”, according to their own launch blog post. With this approach, they aim to address two problems that they felt were affecting HLTV Rating:
- Players getting the same credit for every kill, ignoring that not all kills have the same impact on the outcome of a given round.
- Excessive focus on who gets the kill, even if they only dealt 1 point of damage.
Luckily for us, they have been fairly open about how their rating works, so here is a simplified explanation:
- Depending on the economy, each team is assigned an initial win probability for each round.
- Round win probability for each team is updated after each key event happens.
- Players are awarded or penalized depending on their contribution to the change in win probability, using a manually defined weighting system that distributes the change in round-win probability among the individuals who contributed to each kill. If a player saves in a given round, he increases his team’s win probability for the upcoming round, and is consequently rewarded with a small amount of rating.
- The final rating is a combination of the per-round contributions across all rounds played (a toy sketch of this attribution idea follows the list).
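To make the mechanics more concrete, here is a toy sketch of the attribution idea. The probabilities and the killer/assister split below are invented placeholders, not Leetify's actual model or weights.

```python
# Toy sketch of win-probability attribution, loosely following the steps above.
# The probabilities and the 70/30 killer/assister split are made-up placeholders.
def attribute_event(prob_before, prob_after, contributors):
    """Split the change in round-win probability among contributing players."""
    delta = prob_after - prob_before
    shares = {"killer": 0.7, "assister": 0.3}  # hypothetical weighting scheme
    return {player: delta * shares[role] for role, player in contributors.items()}

# The CT side opens the round at 48% after the economy-based initial estimate,
# then an entry kill with a flash assist pushes the estimate to 63%.
credit = attribute_event(0.48, 0.63, {"killer": "player_A", "assister": "player_B"})
print(credit)  # roughly {'player_A': 0.105, 'player_B': 0.045}
```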
This system solves some of the issues that we mentioned earlier regarding HLTV Rating: it is Gaussian-distributed and centered at zero. Furthermore, each time a player is awarded winning probability, one or several players on the enemy team are penalised by the same amount, which makes the distribution symmetrical.
So, what is the Leetify rating good at? According to Leetify, it is a performance metric great at measuring impact in a single match or a narrowly scoped context (e.g. a given tournament). They do not recommend it as a tool for measuring improvement, since your performance may be heavily affected by the skill level of your opponents, but I have to disagree here. In my opinion, the Leetify rating shines most when it is used by a wide spectrum of players on a daily basis as a tool for measuring progress. The key is not to rely on a single observation: just like when you are losing weight, you don’t panic if one morning your weight is slightly higher. You should focus on the average across 1-2 weeks, and in that case, the Leetify rating is the single best measure out there for upcoming players who want to improve. Moreover, Leetify provides a wide range of stats that young players can use to track different aspects of their game, giving them even wider context.
Nevertheless, just like with most things in life, being a great performance tool comes at a price. One of my biggest complaints is that the dataset used to train their win-prediction model directly influences the probabilities that are later rewarded. Why is that a problem? It is not specified, but my guess is that they use professional matches plus a much larger amount of Faceit/matchmaking matches to train the model. This is a problem because, for example, the 5v4 conversion rate in pugs is much smaller than in professional games. This implies that the rating will punish aggressive players to the benefit of passive players, who take advantage of opponents overpeeking and going for riskier plays because they are actually just playing pugs with their friends and not “tryharding”. Moreover, this is exacerbated by the rating obtained for saving.
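A hypothetical illustration of why the training data matters: if the baseline conversion rates come from different datasets, the exact same opening kill is worth a different amount of win probability. The numbers below are invented for the sake of the example, not real conversion rates.

```python
# Invented, illustrative baselines: how often a team converts a 5v5 / 5v4
# into a round win in two hypothetical datasets (NOT real conversion rates).
baselines = {
    "pug-heavy dataset": {"5v5": 0.50, "5v4": 0.66},
    "pro-only dataset":  {"5v5": 0.50, "5v4": 0.75},
}

for dataset, p in baselines.items():
    reward = p["5v4"] - p["5v5"]  # win-probability credit for the opening kill
    print(f"{dataset}: opening kill rewarded with {reward:+.2f} win probability")
```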
In addition, following their own logic that not all kills are equally important, a 15% increase in winning probability is not always equally important either. Let’s say we have a situation where the winning probability for team A is 80% because they are in a 2v2 post-plant on B site of Inferno. One T player holds banana, fights the last two remaining CTs, kills one and is traded. This increases the probability by, let’s say, 15%, up to 95%. In a similar situation, let’s say team B is playing CT on Inferno in a 3v4 with a winning probability of 20%. One CT goes aggressive with a teammate’s flash, gets a kill, and increases the probability by 15%, up to 35%. In theory, both plays are equal in terms of contribution to winning probability, but we know that they are certainly not. Going from 80% to 95% is securing the round, while going from 20% to 35% may very well be useless because the Ts now have information on two of the remaining CTs and the mid-call is straightforward. One counter-argument could be that the model would take this into account, but I find it extremely hard to believe that any model can capture that type of information. Moreover, I could also argue that if both scenarios had gone wrong, the T player would have lost far more winning probability than the CT, so for the CT the risk of going for a play in an already lost round is very slim, while the T would be highly punished just for trying to hold the retake.
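One possible way to make this intuition tangible, and this is my own framing rather than anything either rating uses, is to look at those same +15-point jumps on the log-odds scale, where moves close to certainty weigh much more:

```python
import math

def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))

# The two hypothetical situations from the paragraph above.
for before, after in [(0.80, 0.95), (0.20, 0.35)]:
    print(f"{before:.0%} -> {after:.0%}: "
          f"+{after - before:.0%} in probability, "
          f"+{logit(after) - logit(before):.2f} in log-odds")
```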
Following this idea, not all rounds are created equal either! Given two rounds that are identical in terms of economy, the first full buy vs full buy of the first half does not carry the same pressure and importance as the last round of the game, where one team either wins or the other forces overtime. In this system, both rounds would be treated equally. This is not a big concern, but it is another flaw of the system, which may reward players who overperform when the enemy team is having bad momentum and then choke whenever the stakes are high.
Finally, my last complaint is that good/bad values are not straightforward. The distribution may have the right shape to be intuitive, but the values certainly are not. In fact, Leetify has updated the definition of what is considered great-good-average-subpar-poor a couple of times. In addition, they have changed the probabilities assigned by the model in the past, and I find it naive to think that whatever loadings are used today will be equally optimal in five years’ time. This implies that the system could suffer if you want to make comparisons across a longer timespan.
Just like in the previous case, all of this criticism does not imply that I believe the Leetify rating is a bad system. The amount of effort that the team at Leetify has put into creating a rating that is as informative as possible, focused on individual performance, is obvious. If you are a young player looking to improve, Leetify provides access to something extremely close to a personal data analysis team, which is just ridiculously good. If you are interested in knowing more about the differences and similarities between the HLTV Rating and Leetify, you should check this X thread by NER0cs.
A different approach: focusing on what makes players special
What you optimize for determines what you can measure
What everyone is missing: what makes a given player good in his role, what are the differences between two great players with the same role but completely different playstyles, why can two given players have the same rating but give such different eye-test sensations… Simply put: what makes a player special and different from all the other players?
What is the downside: focusing on a player’s abilities implies that we are not interested in what a player does on a given day, but in what he can deliver most days. This means that a system designed to study players will never be a good indicator of performance in a specific match, because the dimensions we need to take into account carry too much measurement error when the sample size is small.
Luckily for us, we do not have to worry about an everyday system to measure players; we already have two great and different options in the HLTV Rating and Leetify.
10 dimensions to classify players
Our approach is to create different dimensions, each one focusing on different situations, covering a wide spectrum of actions that determine the outcome of the game. Each dimension consists of a linear combination of several specific statistics. Performance is measured using percentiles, which implies that whatever is considered good is determined in comparison with all the players in the dataset. In particular, we have created specific metrics for the following dimensions:
- 3 economy dimensions: performance for each player is measured separately for three types of rounds: full buys, anti-ecos, and half-buys.
- 2 situational dimensions: pistol rounds and spotlight rounds.
- 5 specialist dimensions: opening potential, multikill potential, clutch factor, AWP impact, and support abilities.
Different values for different roles: the loadings for each dimension can be customized to obtain different profiles for different types of players.
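As a minimal sketch of what role-specific loadings could look like in practice (the stats, weights, and role names below are placeholders, not our final specification):

```python
import numpy as np

# Placeholder per-round stats for a handful of players (rows) across
# three example inputs to one dimension: opening kills, opening deaths, ADR.
stats = np.array([
    [0.18, 0.12, 85.0],
    [0.07, 0.05, 68.0],
    [0.14, 0.16, 79.0],
    [0.05, 0.04, 61.0],
])

# Hypothetical loadings for two different profiles built on the same dimension.
loadings = {
    "entry":  np.array([0.6, -0.3, 0.1]),
    "anchor": np.array([0.1, -0.5, 0.4]),
}

def dimension_percentiles(stats, weights):
    """Z-score each stat, combine with the loadings, and rank as percentiles."""
    z = (stats - stats.mean(axis=0)) / stats.std(axis=0)
    score = z @ weights
    ranks = score.argsort().argsort()       # 0 = worst, n-1 = best
    return 100 * ranks / (len(score) - 1)   # stretch to the 0-100 scale

for role, w in loadings.items():
    print(role, dimension_percentiles(stats, w))
```

Swapping the loadings reorders the same group of players, which is exactly the point: the underlying statistics stay the same, but the profile you are rewarding changes.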
Comparable across time: percentiles are always bounded between 0 and 100, while the underlying scores are allowed to vary over time. By changing the time frame from which observations are drawn, we can let the raw scores shift as the metagame evolves, while the percentiles always stay on the same 0-100 scale.
Re-defining scouting: so, what is the biggest application of this design? In a competitive scene where teams are racing each other to sign the best upcoming talent, being able to keep track of all the players in the professional scene is a massive advantage. With this dimensional approach, the amount of information you have for every player included lets you filter for exactly the type of player you are looking for. You are no longer filtering by role, you are filtering by playstyle, and you can easily select the top 10 players that fit your team best. After that, the coaching staff can look into their demos and select the candidates where the eye test best matches the numbers test. Stats will never outperform the expertise of elite-level coaching staff, but since time is a limited resource, having tools that help you narrow down candidates is important.
Percentiles: an intuitive way of measuring heterogeneity and comparing players
Introduction to percentiles
Using percentiles to measure individual characteristics is daily practice in human-related fields. Tests such as IQ or the Big Five personality traits rely on percentiles to pinpoint where, in the wide spectrum of different people that exist, each individual lies. For each test, individuals are compared with all of the other individuals who took the test, and therefore what really matters is not your score, but your score in comparison with the scores of all the other people included.
Percentiles are flexible, requiring no distributional assumptions and remaining reliable for both Gaussian and non-Gaussian distributions. Moreover, the interpretation of percentiles is fairly straightforward: being in the 90th percentile for KPR means that you are better at KPR than 90% of all the players included. With percentiles, the median is a good indicator of central tendency, and the lower and upper percentiles (e.g. 10th and 90th) are a robust estimate of spread.
However, there is one specific scenario where percentiles can be tricky: the extreme upper and lower percentiles get rare exponentially, not linearly (the short sketch after the list makes this concrete):
- A person in the 50th percentile is 1 in 2.
- A person in the 90th percentile is 1 in 10.
- A person in the 95th percentile is 1 in 20.
- A person in the 99th percentile is 1 in 100!!
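For reference, a percentile rank and its 1-in-N rarity are straightforward to compute; here is a minimal sketch with a toy sample of KPR values:

```python
kpr_values = [0.58, 0.63, 0.66, 0.70, 0.72, 0.75, 0.81, 0.88]  # toy sample
player_kpr = 0.81

# Share of the sample at or below the player's value, expressed as a percentile.
pct = 100 * sum(v <= player_kpr for v in kpr_values) / len(kpr_values)
print(f"Percentile: {pct:.0f}")                   # at or above 88% of the sample
print(f"Roughly 1 in {1 / (1 - pct / 100):.0f}")  # rarity grows fast near the top
```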
Variables with a Gaussian-like distribution
In most cases, we are going to be dealing with statistics that have a Gaussian-like distribution, where percentiles are fairly straightforward to obtain. This is the case, for example, for KPR, DPR, ADR, or headshot percentage. In the following figure, we can see how the distribution of values changes as we keep processing the data. First, we have raw numbers (kills, deaths, total damage, and total headshot bullets). Second, we correct the raw numbers per measurement unit (per round for kills, deaths, and damage, giving KPR, DPR, and ADR, and per total bullets hit to get the percentage of hits that are headshots). Lastly, we estimate the distribution of values and assign the percentile where each player lies for each statistic.
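Here is a small sketch of that processing chain with toy numbers, assuming pandas is available:

```python
import pandas as pd

# Toy raw numbers per player: kills, deaths, total damage, rounds played.
raw = pd.DataFrame({
    "player": ["a", "b", "c", "d"],
    "kills": [310, 256, 401, 288],
    "deaths": [270, 301, 260, 295],
    "damage": [31500, 27800, 39800, 30100],
    "rounds": [420, 415, 430, 410],
}).set_index("player")

# Step 1: raw counts -> per-round rates.
rates = pd.DataFrame({
    "KPR": raw["kills"] / raw["rounds"],
    "DPR": raw["deaths"] / raw["rounds"],
    "ADR": raw["damage"] / raw["rounds"],
})

# Step 2: rates -> percentile of each player within the dataset (0-100).
# (For DPR a real implementation would invert the rank, since lower is better.)
percentiles = rates.rank(pct=True) * 100
print(percentiles.round(1))
```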
Variables with a bimodal distribution
For some variables, we will observe that the distribution is a combination of two groups of players. For example, for trade kills there is a group with a very low number of trade kills (lurkers, aggressive players) and then a group with standard numbers of trade kills (the core of the team playing together). The same happens with entry kills and entry deaths, where there is a group of players with a very small number of events (passive players, lurkers, site anchors) and then the core of the team creating space or looking for advantages. Finally, another good example is multi-kill scores: some players hardly ever get a multi-kill (IGLs, supports), while the rest of the team is responsible for most of the fragging.
In this case, the approach is very similar, but when extracting the percentiles we need to take into account that these distributions are a mixture of two different populations.
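One possible way to handle this, sketched below with scikit-learn and simulated data (not necessarily the exact method we use), is to fit a two-component mixture, assign each player to a group, and extract percentiles within that group:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Toy bimodal stat: a small group with very few trade kills per round
# and a larger group with standard numbers.
trade_kpr = np.concatenate([
    rng.normal(0.02, 0.01, size=40),   # lurkers / aggressive players
    rng.normal(0.09, 0.02, size=120),  # the core of the team trading together
]).clip(min=0).reshape(-1, 1)

# Fit a two-component mixture and assign each player to a group.
gmm = GaussianMixture(n_components=2, random_state=0).fit(trade_kpr)
group = gmm.predict(trade_kpr)

# Extract percentiles within each group instead of across the whole sample.
values = trade_kpr.ravel()
percentiles = np.empty_like(values)
for g in np.unique(group):
    mask = group == g
    percentiles[mask] = 100 * values[mask].argsort().argsort() / (mask.sum() - 1)

print("group sizes:", np.bincount(group))
```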
That’s All Folks!
In this post we went all-in on ratings, with the goal of helping CS2 players understand the differences between the main options available, and why we think there is a place in the community for a new system complementary to the previous ratings. If you have reached this point, congratulations! I tried to explain things as simply as possible, but some mathematical terminology was required. If you are interested in seeing our new approach in action, you should check our complementary post where we compare players’ performance since the beginning of the year, focusing only on full buy vs full buy rounds. If you enjoyed this piece and are interested in statistics, I highly suggest that you listen to the podcast between Thorin and William M. Briggs regarding the misuse of statistics.