from eu_parliament import download
Currently this repo uses PyPDF2 to parse the PDFs published under "Result of roll call votes". The notebook already contains a dazzling Plotly visualisation!
Why?
After watching Nico Semsrott's video on Silent Minutes, Missing Votes, Shocking Colleagues, in which he enthusiastically presented a typical roll call vote PDF, I got demotivated enough to try and ask my PC for help.
State of this
Parsing the PDFs somehow worked, although it was quite a pain. Some very basic insights can probably already be generated across PDF files, but the stability of the parsing was only tested on two PDFs before boredom won out. PyPDF2, while convenient, seems to have issues with special characters and drops entire words (such as names of MEPs). Investigating better practices would be the next step.
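One way to get a feel for how many names are lost is to compare the extracted text against a list of names you expect to find. The snippet below is only a rough sketch and is not part of this repo: `normalise` and `find_missing_names` are hypothetical helpers, and the sample text is made up. It uses only the standard library and ignores accents and case, so that a name is reported only when it is really absent rather than merely spelled with different diacritics.

```python
from __future__ import annotations

import unicodedata


def normalise(text: str) -> str:
    """Strip combining accents and case so that 'Jörg' and 'JORG' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def find_missing_names(extracted_text: str, expected_names: list[str]) -> list[str]:
    """Return the expected names that do not occur anywhere in the extracted text."""
    haystack = normalise(extracted_text)
    return [name for name in expected_names if normalise(name) not in haystack]


# Made-up page text standing in for whatever the PDF parser returned.
page_text = "+ Semsrott, SONNEBORN"
# "Jörg" is a placeholder for a name with special characters, not taken from a real PDF.
print(find_missing_names(page_text, ["Semsrott", "Sonneborn", "Jörg"]))
```

Running such a check over every page would at least show whether the dropped words cluster around names with special characters.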
To dos
- Fix the incorrect parsing of MEP names / them being dropped completely.
- Plot who tends to vote with whom (does Martin Sonneborn still vote randomly?)
- Automate the download of PDFs from the website
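For the "who votes with whom" item, a possible starting point is a pairwise agreement rate: for every pair of MEPs, the share of votes in which both took part and chose the same position. The sketch below assumes the parsed results can be brought into a nested dictionary of the form `{mep: {vote_id: position}}`; that layout, the function name `agreement_rates`, and the sample data are assumptions for illustration, not what the notebook currently produces.

```python
from itertools import combinations


def agreement_rates(votes):
    """Map each pair of MEPs to the share of their shared votes with the same position.

    votes: {mep_name: {vote_id: position}}; pairs with no shared vote are left out.
    """
    rates = {}
    for mep_a, mep_b in combinations(sorted(votes), 2):
        shared = votes[mep_a].keys() & votes[mep_b].keys()
        if not shared:
            continue
        same = sum(votes[mep_a][v] == votes[mep_b][v] for v in shared)
        rates[(mep_a, mep_b)] = same / len(shared)
    return rates


# Made-up data: "+" for, "-" against, "0" abstention.
sample = {
    "MEP A": {1: "+", 2: "-", 3: "0"},
    "MEP B": {1: "+", 2: "+", 3: "0"},
    "MEP C": {2: "-"},
}
for pair, rate in agreement_rates(sample).items():
    print(pair, round(rate, 2))
```

The resulting dictionary could then be reshaped into a matrix and handed to a Plotly heatmap.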
Collecting the links
%%time
rcv_pdfs, vot_pdfs = download.identify_links_for_pdfs(download.URL)
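The cell above returns two lists of links. How `download.identify_links_for_pdfs` does this is not shown here, so the following is only a guess at the general idea, not the repo's implementation: collect every `<a href>` that ends in `.pdf` and sort the links by whether the file name contains `RCV` or `VOT`. The class `PdfLinkCollector`, the function `split_pdf_links`, the naming convention, and the `example.org` URLs are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of all links on a page that point to a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def split_pdf_links(html, base_url):
    """Return (rcv_links, vot_links), assuming file names contain 'RCV' or 'VOT'."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)

    def file_name(url):
        return url.rsplit("/", 1)[-1].upper()

    rcv = [url for url in collector.links if "RCV" in file_name(url)]
    vot = [url for url in collector.links if "VOT" in file_name(url)]
    return rcv, vot


# Made-up HTML snippet; the real page layout may differ.
page = '<a href="/docs/PV-RCV_EN.pdf">roll call</a> <a href="/docs/PV-VOT_EN.pdf">votes</a>'
print(split_pdf_links(page, "https://example.org/plenary"))
```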
Downloading the roll call PDFs
%%time
download.collect_multiple_files(rcv_pdfs)
Downloading the other PDFs
%%time
download.collect_multiple_files(vot_pdfs)