Visualising the public leaderboard.
%%time
csvs = loading.get_csvs()
leaderboard = pd.read_csv(csvs['public-leaderboard'], parse_dates=['SubmissionDate'])
leaderboard.head()
leaderboard['TeamId'].nunique(), len(leaderboard)

get_leaderboard_distribution

get_leaderboard_distribution(df:DataFrame)
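The body of `get_leaderboard_distribution` is not shown on this page. As a rough sketch only, the function below shows one way a frame with `Score` and `Cumulative share (%)` columns could be derived. It assumes that each team is represented by its best score and that lower scores are better; the name `leaderboard_distribution_sketch` and the `lower_is_better` parameter are hypothetical, and the real implementation may differ.

```python
import numpy as np
import pandas as pd


def leaderboard_distribution_sketch(df: pd.DataFrame, lower_is_better: bool = True) -> pd.DataFrame:
    """Hypothetical stand-in, not the actual get_leaderboard_distribution."""
    # Assumption: one row per team, using that team's best score.
    grouped = df.groupby('TeamId')['Score']
    best = grouped.min() if lower_is_better else grouped.max()

    # Order teams from best to worst score.
    dis = best.sort_values(ascending=lower_is_better).reset_index()

    # Share of teams that score at least as well as the current row.
    dis['Cumulative share (%)'] = 100 * np.arange(1, len(dis) + 1) / len(dis)
    return dis


# Toy data, not taken from the real leaderboard.
demo = pd.DataFrame({
    'TeamId': [1, 1, 2, 3, 4],
    'Score': [1.5, 1.2, 1.0, 1.4, 2.0],
})
sketch = leaderboard_distribution_sketch(demo)
print(sketch)
```

Under these assumptions, the row where the cumulative share reaches 50% gives the score needed to be in the top half of teams.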

%%time
dis = get_leaderboard_distribution(leaderboard)
dis.head()
dis['Score'].describe(percentiles=[.05, .1, .25, .5, .75, .95])

Public scores:

Segment    Score
top 50%    1.44
top 5%     0.98

With the best private leaderboard score at 1.23, public scores well below that (such as the top 5% score of 0.98) suggest that some overfitting or leakage is behind them.

Line plot of the distribution above:

import plotly.express as px  # needed for px.line if not already imported earlier in the notebook
px.line(dis, x='Score', y='Cumulative share (%)', title='Cumulative distribution of public leaderboard scores')

Plotting the scores over time to spot sudden jumps:

leaderboard.plot(kind='scatter', x='SubmissionDate', y='Score', title='Trend of the public score over time')

Finding:

  • Three clusters of scores appear over time: around 1.243, around 1.118 (from 2019-10-25 onwards) and around 0.979 (from 2019-11-20 onwards)
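
The clusters were read off the scatter plot. As a minimal sketch of how they could be checked numerically, the helper below counts submissions within a tolerance of each level and reports the earliest matching submission date. The function name `first_seen_near`, the tolerance `tol=0.005`, and the demo rows are all assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd


def first_seen_near(df: pd.DataFrame, levels, tol: float = 0.005) -> pd.DataFrame:
    """Hypothetical helper: for each score level, count nearby submissions
    and find the earliest submission date among them."""
    rows = []
    for level in levels:
        # Submissions whose score lies within +/- tol of the level.
        near = df[(df['Score'] - level).abs() <= tol]
        rows.append({
            'Level': level,
            'Submissions': len(near),
            'First seen': near['SubmissionDate'].min() if len(near) else pd.NaT,
        })
    return pd.DataFrame(rows)


# Toy rows for illustration only, not real leaderboard data.
demo = pd.DataFrame({
    'SubmissionDate': pd.to_datetime(
        ['2019-10-01', '2019-10-25', '2019-10-27', '2019-11-20']),
    'Score': [1.243, 1.118, 1.119, 0.979],
})
summary = first_seen_near(demo, levels=[1.243, 1.118, 0.979])
print(summary)
```

On the real data this would be called as `first_seen_near(leaderboard, levels=[1.243, 1.118, 0.979])`; the tolerance likely needs tuning to how tight the clusters are.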