Heading into its 19th year, the Chinese Basketball Association (CBA) made headlines this summer by luring a high-profile American high schooler away from the conventional collegiate route to play professionally in China. Emmanuel Mudiay, originally committed to Southern Methodist University (SMU), surprised many by joining the Guangdong Southern Tigers instead. He is not the first highly rated recruit to spurn the American college system on his path to professional basketball--Brandon Jennings pioneered this route in 2008 by signing with the Italian club Lottomatica Roma. Mudiay is, however, the first to choose a team in China.
This recent surge in popularity for the CBA is interesting as it opens opportunities to grow the sport and develop talent outside of the traditional avenues. Still, the CBA has strict guidelines on the quantity and usage of foreign players on each team.
Teams are allowed up to 2 foreign-born players on the roster, and those players cannot play more than 6 combined quarters in a given game. Additionally, teams that finished the prior season with a bottom-5 record are allowed a 3rd import player, who cannot be North American. This 3rd foreign player is not subject to any minutes restrictions.
Clearly, the CBA must hold these foreign players in high esteem if it artificially limits their availability and impact. With this in mind, I wanted to investigate whether this constraint is necessary and whether there truly is a talent discrepancy between domestic and imported players.
Without knowing in advance which features set the two types of players apart, my approach was to build a predictive model that classifies players based on their physical attributes, on-court positions, and performance. The goal was simply to determine whether a decision boundary could be found at all. Using CBA player data from realgm.com, I first tried a traditional method for binary classification--logistic regression. The initial data set was expansive and had many collinear variables, which would present problems for logistic regression, so preprocessing was a must. Using PCA for dimensionality reduction, I projected the data onto 30 components that explain ~61% of the variance. With this PCA-projected feature set, I trained a logistic regression binary classifier that scored very well: 94.9% accuracy, 95.8% precision, 97.8% recall, 96.8% F1-score.
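As a rough illustration of that PCA-then-logistic-regression setup, here is a minimal scikit-learn sketch. It is not my actual code: the data is a synthetic stand-in generated with `make_classification` (deliberately including redundant, collinear columns), and the scaling step, class balance, split size, and every parameter other than the 30 components are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player table (NOT the realgm.com data).
# 40 of the 80 columns are linear combinations of the informative ones,
# mimicking the collinearity problem described above.
X, y = make_classification(
    n_samples=1000, n_features=80, n_informative=15, n_redundant=40,
    weights=[0.25, 0.75], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize, project onto 30 principal components, then fit the classifier.
# Scaling before PCA is an assumption of this sketch.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=30),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

# Fraction of total variance retained by the 30 components.
explained = clf.named_steps["pca"].explained_variance_ratio_.sum()

y_pred = clf.predict(X_test)
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}

print(f"variance explained by 30 components: {explained:.1%}")
for name, value in scores.items():
    print(f"{name:>9}: {value:.1%}")
```

Wrapping the steps in a pipeline means the scaler and PCA are fit on the training split only, so the test scores are not contaminated by information from the held-out players.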
This demonstrated that a decision boundary did indeed exist. Unfortunately, the PCA transformation muddled the interpretability of the features.
My next approach was to train a random forest classifier. Using this technique, I could retain the original feature set and leverage the natural feature selection abilities of random forests. When evaluated on a test data set, this binary classifier scored similarly well: 94.9% accuracy, 96.4% precision, 97.1% recall, 96.8% F1-score.
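A minimal sketch of this second approach, again on synthetic stand-in data rather than the real player table. The number of trees, the split, and the class balance are assumptions for illustration; the point is only that the forest is fit on the raw, untransformed columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the realgm.com data), with redundant columns left in.
X, y = make_classification(
    n_samples=1000, n_features=40, n_informative=8, n_redundant=20,
    weights=[0.25, 0.75], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# No PCA step: the forest is trained directly on the original features.
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)

print(f"accuracy:  {accuracy:.1%}")
print(f"precision: {precision:.1%}")
print(f"recall:    {recall:.1%}")
print(f"f1:        {f1:.1%}")
```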
Here are the feature importances from an example trained forest:
The following consistently score high for feature importance:
- "usgpct" (Usage %)
- "hob" (Hands on buckets)
- "per" (Player Efficiency Rating)
- "high_game" (Most points scored in a game)
- "pts" (Total points scored)
You immediately notice that some of these are calculated metrics rather than observed box score statistics. The high importance of usage % and "hob" suggests that CBA teams are indeed relying heavily on foreign players to facilitate and create offense. Points scored also weighs heavily in this classification.
It is also interesting to note that there is hardly any signal in player positions, which indicates that CBA teams are not biased towards any particular position when selecting imports.
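To show how a ranking like the one above can be read off a trained forest, here is a small self-contained sketch. The data is entirely synthetic: "usgpct" and "per" are generated so that imports score higher, while "height", "pos_g", and "pos_c" are hypothetical columns carrying no signal. The effect sizes are invented for the example and say nothing about the real CBA data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic labels: roughly 20% of players are flagged as imports (assumed ratio).
is_import = rng.random(n) < 0.2

# Two columns that depend on the label, three that do not (all values invented).
usgpct = rng.normal(18, 4, n) + 8 * is_import
per = rng.normal(13, 4, n) + 7 * is_import
height = rng.normal(200, 8, n)
pos_g = rng.integers(0, 2, n)  # hypothetical one-hot position flag (guard)
pos_c = rng.integers(0, 2, n)  # hypothetical one-hot position flag (center)

feature_names = ["usgpct", "per", "height", "pos_g", "pos_c"]
X = np.column_stack([usgpct, per, height, pos_g, pos_c])

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X, is_import)

# Impurity-based importances sum to 1; sort them from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:>7}  {importance:.3f}")
```

One caveat with this kind of ranking: impurity-based importances tend to favor continuous or high-cardinality columns over binary flags, so a low score for position dummies is suggestive rather than conclusive. `sklearn.inspection.permutation_importance` is a common cross-check.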
Further stratification by nationality could be pursued in the future. Unfortunately, there are currently not enough foreign players from a variety of countries to support a multi-class classification, as non-domestic players come overwhelmingly from the United States.
For my rough scratch work/code: