%%time
csvs = loading.get_csvs()

leaderboard = pd.read_csv(csvs['public-leaderboard'], parse_dates=['SubmissionDate'])
leaderboard.head()

leaderboard['TeamId'].nunique(), len(leaderboard)

%%time
dis = get_leaderboard_distribution(leaderboard)
dis.head()

dis['Score'].describe(percentiles=[.05, .1, .25, .5, .75, .95])
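The helper `get_leaderboard_distribution` is not defined in this notebook. The sketch below is one guess at what it might compute, not the original implementation. It assumes that lower scores are better, that each team counts once with its best score, and that the result carries the `Score` and `Cumulative share (%)` columns used in the plot further down. The function name `sketch_leaderboard_distribution` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def sketch_leaderboard_distribution(leaderboard: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for get_leaderboard_distribution (an assumption)."""
    # Best public score per team, assuming lower is better.
    best = (
        leaderboard.groupby('TeamId')['Score']
        .min()
        .sort_values()
        .reset_index()
    )
    # Percentage of teams whose best score is at or below the score in each row.
    best['Cumulative share (%)'] = np.arange(1, len(best) + 1) / len(best) * 100
    return best

# Toy data for illustration only, not taken from the real leaderboard.
demo = pd.DataFrame({
    'TeamId': [1, 1, 2, 3, 3, 4],
    'Score': [1.50, 1.20, 1.40, 2.00, 0.98, 1.60],
})
demo_dis = sketch_leaderboard_distribution(demo)
print(demo_dis)
```

Taking one row per team keeps teams with many submissions from dominating the distribution.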

Public scores:

| Segment | Score |
| --- | --- |
| top 50% | 1.44 |
| top 5% | 0.98 |

The best private leaderboard score is 1.23, so public scores this far below it (0.98 for the top 5%) suggest overfitting to the public leaderboard or leakage.
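One way to quantify this is the share of public scores that beat the best private score. The sketch below assumes lower is better. The helper `share_beating` and the toy scores are hypothetical; only the 1.23 threshold comes from the note above.

```python
import pandas as pd

BEST_PRIVATE_SCORE = 1.23  # best private leaderboard score quoted above

def share_beating(scores: pd.Series, threshold: float) -> float:
    """Percentage of scores strictly below the threshold (lower is better)."""
    return float((scores < threshold).mean() * 100)

# Toy scores for illustration only; on the real data one would pass dis['Score'].
toy_scores = pd.Series([0.9, 1.0, 1.3, 1.5])
toy_share = share_beating(toy_scores, BEST_PRIVATE_SCORE)
print(f"{toy_share:.1f}% of toy public scores beat the best private score")
```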

Line plot of the above

px.line(dis, x='Score', y='Cumulative share (%)', title='Cumulative distribution of public leaderboard scores')

Looking at the temporal trend of the scores to get an idea of jumps

leaderboard.plot(kind='scatter', x='SubmissionDate', y='Score', title='Trend of the public score over time')

Finding:

Three clusters appear over time, around 1.243, 1.118 (from 2019-10-25 onwards), and 0.979 (from 2019-11-20 onwards).
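The clusters can also be checked numerically rather than read off the scatter plot. The sketch below rounds scores into levels and reports how often each level occurs and when it first appears. It assumes the `Score` and `SubmissionDate` columns used above. The helper `first_seen_per_level`, its parameters, and the toy data are hypothetical.

```python
import pandas as pd

def first_seen_per_level(leaderboard: pd.DataFrame,
                         decimals: int = 3,
                         min_count: int = 2) -> pd.DataFrame:
    """Count submissions per rounded score level and find each level's first date."""
    # Round so that near-identical scores fall into the same level.
    levels = leaderboard.assign(Level=leaderboard['Score'].round(decimals))
    summary = levels.groupby('Level')['SubmissionDate'].agg(
        Submissions='count', FirstSeen='min'
    )
    # Keep only levels that recur, ordered by when they first appeared.
    return summary[summary['Submissions'] >= min_count].sort_values('FirstSeen')

# Toy data for illustration only, not the real leaderboard.
toy = pd.DataFrame({
    'Score': [2.0004, 2.0001, 1.7, 1.5, 1.5002, 1.4998, 1.0, 1.0003],
    'SubmissionDate': pd.to_datetime([
        '2020-01-01', '2020-01-05', '2020-01-10', '2020-02-01',
        '2020-02-03', '2020-02-07', '2020-03-01', '2020-03-04',
    ]),
})
toy_levels = first_seen_per_level(toy)
print(toy_levels)
```

A level shared by many teams that starts on a specific date is consistent with a shared public kernel or a leak becoming known, though the data alone cannot confirm the cause.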

Leaderboard