Using ABBYY FineReader to extract tabular data from U.S.

Members of Congress are required to submit regular reports detailing their personal wealth. However, despite the existence of electronic filing systems, some legislators still submit via paper, which is then scanned and uploaded as images or PDFs into an online database (Senate / House).

The Senate's electronic filing system came into effect a couple of years ago; Senator Bernie Sanders is one example of a Senator who has moved from paper to the electronic filing system.

Extracting data from scanned images is one of the most common and most difficult data wrangling tasks, such that OpenSecrets (aka The Center for Responsive Politics) pitched a civic hackathon challenge to build a solution for efficiently parsing Congressmembers' personal financial disclosures.

My writeup here is meant as a quick overview of the effectiveness of using ABBYY FineReader for Mac in producing usable, perhaps even delimited, data from the scanned disclosure forms. Note that I'm not attempting to solve the problem of how to clean up the imperfect OCR results and insert them into a database, or how to automate it as a batch process. Just extracting text, even semi-accurately, from a single scanned form is a hard challenge on its own.

For a better overview of PDFs and structured data, including the different kinds of PDFs and the many challenges and approaches to extracting structured data from them, check out Jacob Fenton's and Jeremy Singer-Vine's NICAR16 presentation on Parsing Prickly PDFs.

Also, Robert Gebeloff of the New York Times has put together a list of the various other commercial products and their use-cases in this NICAR presentation (.docx).

If all you care about is the actual personal finances of Congressmembers, OpenSecrets has you covered.

For the purposes of brevity, this writeup focuses on the Senate financial disclosures; the OCR challenge for both chambers of Congress is fundamentally the same.

My initial takeaway: FineReader is remarkably good for this task. In a later walkthrough I'll explain how to apply this in semi-automated fashion across all the forms (or any other set of scanned papers).

What the submitted financial disclosure forms look like

The Senate's financial disclosure database can be found here:

If you want to visit the direct links I provide, you'll need to visit the Senate site with your browser and manually agree to the site's terms of use. This will start a browser session that allows you to access the direct links.

An electronically-submitted personal finance report

Here's what an annual report on personal finances for 2014 looks like when it's electronically submitted, courtesy of Senator Marco Rubio:

For your convenience, I've mirrored the HTML for Sen. Rubio's financial report, which you can visit here without going through the Senate site.

As you can see, the HTML is straightforward to parse as machine-readable data. So let's dispel once and for all with this fiction that Senator Rubio doesn't know what he's doing.

A personal finance report submitted as paper

And here's what that same report looks like when it's submitted on paper, courtesy of Senator Dianne Feinstein:

OpenSecrets has a copy of the PDF that you can view without visiting the Senate site.

The OCR challenge

Here's what one of the scanned pages looks like:

It's important to note that even though Senator Rubio's electronic form is easy to read programmatically, there's still the challenge of creating a data schema that you can import his financial data into. That same challenge exists for Senator Feinstein's paper form, except with the additional and far more difficult task of just extracting the data.
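To make the point about the electronic filings concrete, here is a minimal sketch of how an HTML table could be read into a simple schema using only Python's standard library. It is an illustration under stated assumptions, not a parser for the real Senate pages: the `Asset` fields, the three-column layout, and the sample rows are all invented for the example, and an actual report has more sections and columns.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


@dataclass
class Asset:
    # Hypothetical schema: these field names are assumptions for illustration.
    name: str
    asset_type: str
    value_range: str


def parse_assets(html):
    """Return one Asset per three-cell body row, skipping the header row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    body = parser.rows[1:]  # assumes the first row is the header
    return [Asset(*row) for row in body if len(row) == 3]


# Invented sample markup, not copied from any real disclosure.
SAMPLE_HTML = """
<table>
  <tr><th>Asset</th><th>Asset Type</th><th>Value</th></tr>
  <tr><td>Example Bank Checking</td><td>Bank Deposit</td><td>$1,001 - $15,000</td></tr>
  <tr><td>Example Index Fund</td><td>Mutual Fund</td><td>$15,001 - $50,000</td></tr>
</table>
"""

assets = parse_assets(SAMPLE_HTML)
for asset in assets:
    print(asset)
```

Even in this toy version, the schema decision (what counts as a field, how to store a value range) is separate from the extraction step. For a scanned form, the same schema work remains, but the rows first have to be recovered by OCR.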