# Predicting votes
> Let's see how well the votes of politicians in polls can be predicted.

**The strategy**:
- first: use only a politician id and a poll id as features
- second: add text features based on the poll title and/or description

**TL;DR**
- using only politician id and poll id we reach 88% validation accuracy with a random split => an individual's vote is highly associated with the votes of others in the same poll

**TODO**:
- combine poll title and description for feature generation
- try transformer based features
- visualise the polls and politicians with the most incorrect predictions

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
from fastai.tabular.all import (
    Categorify,
    CategoryBlock,
    Normalize,
    TabularPandas,
    tabular_learner,
)

import bundestag.logging as logging
import bundestag.paths as paths
import bundestag.poll_clustering as pc
import bundestag.vote_prediction as vp

logger = logging.logger
logger.setLevel("DEBUG")

_paths = paths.get_paths("../data")
_paths

## Setup

Loading preprocessed dataframes (see `03_abgeordnetenwatch.ipynb`). First, let's load the votes.

In [None]:
file = _paths.preprocessed_abgeordnetenwatch / "df_all_votes_111.parquet"
file

In [None]:
df_all_votes = pd.read_parquet(file)
df_all_votes.head()

Loading further info on politicians

In [None]:
file = _paths.preprocessed_abgeordnetenwatch / "df_mandates_111.parquet"
file

In [None]:
df_mandates = pd.read_parquet(file)

In [None]:
df_mandates["fraction_names"].apply(
    lambda x: 0 if not isinstance(x, list) else len(x)
).value_counts()

Loading data on polls (description, title and so on)

In [None]:
file = _paths.preprocessed_abgeordnetenwatch / "df_polls_111.parquet"
file

In [None]:
df_polls = pd.read_parquet(file)
df_polls.head(3).T

## Modelling using only poll and politician ids as features

### Split into train and validation

Creating train / valid split

In [None]:
# random split alternative (needs `RandomSplitter` from fastai.tabular.all):
# splits = RandomSplitter(valid_pct=0.2)(df_all_votes)
splits = vp.poll_splitter(df_all_votes, valid_pct=0.2)
splits
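
To make the difference from a random split concrete, here is a minimal sketch of a poll-based split. It is an assumption about the idea behind `vp.poll_splitter`, not its actual implementation; the function name `sketch_poll_split` and the toy dataframe are hypothetical. All votes of a poll end up on the same side of the split, so validation polls are never seen during training.

```python
import numpy as np
import pandas as pd


def sketch_poll_split(df, poll_col="poll_id", valid_pct=0.2, seed=0):
    """Hypothetical helper: split row positions so that no poll is shared."""
    rng = np.random.default_rng(seed)
    poll_ids = df[poll_col].unique()
    # hold out a fraction of the polls (not of the rows)
    n_valid = max(1, int(round(valid_pct * len(poll_ids))))
    valid_polls = rng.choice(poll_ids, size=n_valid, replace=False)
    is_valid = df[poll_col].isin(valid_polls).to_numpy()
    positions = np.arange(len(df))
    # two lists of integer row positions: (train, valid)
    return positions[~is_valid].tolist(), positions[is_valid].tolist()


# toy data: 5 polls with 2 votes each
toy = pd.DataFrame(
    {
        "poll_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "vote": ["yes", "no"] * 5,
    }
)
train_idx, valid_idx = sketch_poll_split(toy, valid_pct=0.2)
```

A random split would instead sample rows, so most polls would appear in both the training and the validation set.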

Setting the target variable and counting its value frequencies

In [None]:
y_col = "vote"
print(f"target values: {df_all_votes[y_col].value_counts()}")
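
The class frequencies also give a simple reference point for the accuracies reported later: the accuracy of always predicting the most frequent vote. The sketch below uses a hypothetical toy series in place of `df_all_votes[y_col]`; the vote labels are assumed for illustration only.

```python
import pandas as pd

# toy stand-in for df_all_votes["vote"]; the labels are assumptions
toy_votes = pd.Series(["yes", "yes", "yes", "no", "no", "abstain"], name="vote")

# relative frequency of each class
class_shares = toy_votes.value_counts(normalize=True)

# always predicting the most frequent class yields its share as accuracy
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()
print(f"always predicting '{majority_class}' gives {baseline_accuracy:.0%} accuracy")
```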

### Training

Final data preprocessing for training

In [None]:
to = TabularPandas(
    df_all_votes,
    cat_names=["politician name", "poll_id"],
    y_names=[y_col],
    procs=[Categorify],
    y_block=CategoryBlock,
    splits=splits,
)

dls = to.dataloaders(bs=512)

Finding the learning rate for training

In [None]:
learn = tabular_learner(dls)
lrs = learn.lr_find()
lrs

Training the artificial neural net

In [None]:
learn.fit_one_cycle(5, lrs.valley)

### Inspecting predictions

In [None]:
df_mandates["party_original"] = df_mandates["party"].copy()
df_mandates["party"] = df_mandates["party"].apply(lambda x: x[-1])

In [None]:
vp.plot_predictions(learn, df_all_votes, df_mandates, df_polls, splits)

Validation accuracy:
- random split: 88%
- poll-based split: ~50%; the politician embedding alone is insufficient to reasonably predict the vote
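
One plausible explanation for the gap, stated here as an assumption rather than a verified finding: with a poll-based split every `poll_id` in the validation set is unseen during training, so its category cannot carry any learned information and the model has to rely on the politician alone. The pandas sketch below illustrates the mechanism with hypothetical toy ids; it does not use fastai itself.

```python
import pandas as pd

# hypothetical poll ids on each side of a split
train_poll_ids = pd.Series([101, 101, 102, 103])
valid_poll_ids = pd.Series([102, 104, 105])

# the category vocabulary is built from the training polls only
vocab = pd.Index(train_poll_ids.unique())

# encode validation ids against that vocabulary; unseen ids get code -1
valid_codes = pd.Categorical(valid_poll_ids, categories=vocab).codes
n_unseen = int((valid_codes == -1).sum())
print(valid_codes.tolist(), n_unseen)
```

With a truly poll-based split, all validation ids would be unseen, not just some of them as in this toy example.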

### Inspecting resulting embeddings

In [None]:
embeddings = vp.get_embeddings(learn)
embeddings

In [None]:
proponents = vp.get_poll_proponents(df_all_votes, df_mandates)
proponents.head()

In [None]:
vp.plot_poll_embeddings(df_all_votes, df_polls, embeddings, df_mandates)

In [None]:
vp.plot_politician_embeddings(df_all_votes, df_mandates, embeddings)

Embedding scatter plots after PCA:
- poll-based split => mandates form two groups
- random split => polls and mandates each form 2-3 groups
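
For reference, projecting an embedding matrix to two dimensions with PCA can be sketched as follows. This is a minimal NumPy version on random toy data, not the code inside `vp.plot_poll_embeddings` or `vp.plot_politician_embeddings`; the helper name `pca_2d` is hypothetical.

```python
import numpy as np


def pca_2d(embedding_matrix):
    """Hypothetical helper: project rows onto the first two principal components."""
    centered = embedding_matrix - embedding_matrix.mean(axis=0)
    # rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# toy stand-in for a learned embedding matrix: 50 items, 8 dimensions
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 8))
coords = pca_2d(toy_embeddings)
```

Each row of `coords` is one point of the scatter plot; groups in that plot correspond to items with similar embedding vectors.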

## Modelling using `poll_title`-based features

### LDA topic weights as features

In [None]:
source_col = "poll_title"
nlp_col = f"{source_col}_nlp_processed"
num_topics = 25

st = pc.SpacyTransformer()

# load data and prepare text for modelling
df_polls_lda = df_polls.assign(
    **{nlp_col: lambda x: pc.clean_text(x, col=source_col, nlp=st.nlp)}
)

# modelling
st.fit(df_polls_lda[nlp_col].values, mode="lda", num_topics=num_topics)

# creating text features using fitted model
df_polls_lda, nlp_feature_cols = df_polls_lda.pipe(
    st.transform, col=nlp_col, return_new_cols=True
)

# inspecting
display(df_polls_lda.head())
pc.pca_plot_lda_topics(df_polls_lda, st, source_col, nlp_feature_cols)
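
What "LDA topic weights as features" means can be illustrated independently of `pc.SpacyTransformer`. The sketch below swaps in scikit-learn's `CountVectorizer` and `LatentDirichletAllocation` on hypothetical toy titles; it is not the library used by this project's pipeline, only a stand-in to show the shape of the resulting features.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical, already cleaned poll titles
toy_titles = [
    "budget law tax reform",
    "tax relief budget families",
    "military mission extension abroad",
    "mission extension armed forces",
    "climate protection energy law",
    "energy transition climate targets",
]

# bag-of-words counts per title
counts = CountVectorizer().fit_transform(toy_titles)

# fit LDA and get one weight per topic for every title
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_weights = lda.fit_transform(counts)
print(topic_weights.round(2))
```

Each row holds the topic weights of one title and sums to one; these columns play the role of `nlp_feature_cols` above.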

In [None]:
df_all_votes.head()

In [None]:
df_input = df_all_votes.join(
    df_polls_lda[["poll_id"] + nlp_feature_cols].set_index("poll_id"),
    on="poll_id",
)
df_input.head()
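
Since the join is a left join on `poll_id`, votes whose poll has no topic features would end up with missing values in the feature columns. A quick check for this can be sketched on toy data; the frames and column names `topic_0` and `topic_1` below are hypothetical.

```python
import pandas as pd

toy_votes = pd.DataFrame(
    {"poll_id": [1, 1, 2, 3], "vote": ["yes", "no", "yes", "no"]}
)
# poll 3 has no topic features
toy_features = pd.DataFrame(
    {"poll_id": [1, 2], "topic_0": [0.7, 0.2], "topic_1": [0.3, 0.8]}
)

# same pattern as above: left join of the votes onto the poll-level features
joined = toy_votes.join(toy_features.set_index("poll_id"), on="poll_id")

# rows where any feature is missing after the join
missing = joined[["topic_0", "topic_1"]].isna().any(axis=1)
print(f"{missing.sum()} of {len(joined)} votes lack topic features")
```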

In [None]:
splits = vp.poll_splitter(df_input, valid_pct=0.2)
splits

In [None]:
to = TabularPandas(
    df_input,
    cat_names=["politician name"],  # "poll_id" is deliberately left out
    cont_names=nlp_feature_cols,  # using the new features
    y_names=[y_col],
    procs=[Categorify, Normalize],
    y_block=CategoryBlock,
    splits=splits,
)

dls = to.dataloaders(bs=512)

In [None]:
learn = tabular_learner(dls)
lrs = learn.lr_find()
lrs

In [None]:
learn.fit_one_cycle(
    5,
    lrs.valley,  # alternative: a fixed learning rate of 2e-2
)

In [None]:
vp.plot_predictions(learn, df_all_votes, df_mandates, df_polls, splits)

Accuracy with the poll-based split:
- politician name + poll_id + 10 LDA topics based on poll title: does not improve the accuracy
- politician name + <s>poll_id</s> + 5 LDA topics based on poll title: ~49%
- politician name + <s>poll_id</s> + 10 LDA topics based on poll title: ~57%
- politician name + <s>poll_id</s> + 25 LDA topics based on poll title: ~45%

## Modelling using `poll_description`-based features

### LDA topic weights as features

In [None]:
source_col = "poll_description"
nlp_col = f"{source_col}_nlp_processed"
num_topics = 25

st = pc.SpacyTransformer()

# load data and prepare text for modelling
df_polls_lda = df_polls.assign(
    **{nlp_col: lambda x: pc.clean_text(x, col=source_col, nlp=st.nlp)}
)

# modelling
st.fit(df_polls_lda[nlp_col].values, mode="lda", num_topics=num_topics)

# creating text features using fitted model
df_polls_lda, nlp_feature_cols = df_polls_lda.pipe(
    st.transform, col=nlp_col, return_new_cols=True
)

# inspecting
display(df_polls_lda.head())
pc.pca_plot_lda_topics(df_polls_lda, st, source_col, nlp_feature_cols)

In [None]:
df_input = df_all_votes.join(
    df_polls_lda[["poll_id"] + nlp_feature_cols].set_index("poll_id"),
    on="poll_id",
)
df_input.head()

In [None]:
splits = vp.poll_splitter(df_input, valid_pct=0.2)
splits

In [None]:
to = TabularPandas(
    df_input,
    cat_names=["politician name"],  # "poll_id" is deliberately left out
    cont_names=nlp_feature_cols,  # using the new features
    y_names=[y_col],
    procs=[Categorify, Normalize],
    y_block=CategoryBlock,
    splits=splits,
)

dls = to.dataloaders(bs=512)

In [None]:
learn = tabular_learner(dls)
lrs = learn.lr_find()
lrs

In [None]:
learn.fit_one_cycle(
    5,
    lrs.valley,  # alternative: a fixed learning rate of 2e-2
)

In [None]:
vp.plot_predictions(learn, df_all_votes, df_mandates, df_polls, splits)

Accuracy with the poll-based split:
- politician name + <s>poll_id</s> + 5 LDA topics based on poll description: ~51%
- politician name + <s>poll_id</s> + 10 LDA topics based on poll description: ~53%
- politician name + <s>poll_id</s> + 20 LDA topics based on poll description: ~56%
- politician name + <s>poll_id</s> + 25 LDA topics based on poll description: ~59%
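
To compare the title-based and description-based runs side by side, the accuracies noted in this notebook can be collected in a small table. The values are the approximate numbers from the notes (without `poll_id` as a feature), so small differences should not be over-interpreted; the column names are chosen here for illustration.

```python
import pandas as pd

# approximate validation accuracies (in percent) as noted in this notebook
results = pd.DataFrame(
    [
        ("poll_title", 5, 49),
        ("poll_title", 10, 57),
        ("poll_title", 25, 45),
        ("poll_description", 5, 51),
        ("poll_description", 10, 53),
        ("poll_description", 20, 56),
        ("poll_description", 25, 59),
    ],
    columns=["source_col", "num_topics", "approx_accuracy_pct"],
)

# one row per number of topics, one column per text source
table = results.pivot(
    index="num_topics", columns="source_col", values="approx_accuracy_pct"
)

# best setting per text source
best = results.loc[results.groupby("source_col")["approx_accuracy_pct"].idxmax()]
print(table)
print(best)
```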