** Do not send me a generic, automated response; I will automatically decline it. ** The best response is to send me a couple of relevant projects where you have used Scrapy.
Develop a Python script using Scrapy 2.1 to crawl and scrape https://www.gao.gov/reports-testimonies/month-in-review/
for fiscal years beginning in 2015, as listed in the Annual Index on the right side of the page.
Each year link will bring up an Annual Index of Reports page that has a long list of reports. Each report has several PDFs. Download each of those PDFs, associating them with the report. For example, the first report in FY2019 is https://www.gao.gov/products/GAO-19-539, which has 3 PDFs. GAO-19-539 should be the unique key for the report. Click through that link and you will see a Fast Facts tab and a Highlights tab; capture that text (the images are not needed). Always capture the unique link associated with each report so we can easily get back to the original website URL. Some reports have podcasts, which we don't need to download, but capture the URL to the podcast if one exists.
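As a rough illustration of the key, here is a minimal sketch of pulling the report number out of a product URL. The function name is hypothetical, and the pattern (two-digit year, a number, and an optional letter suffix) is an assumption based on the example URLs in this posting, not a confirmed rule for every GAO report.

```python
import re
from typing import Optional

# Assumption: report numbers look like GAO-<2-digit year>-<digits>, optionally
# followed by letters, and appear right after "/products/" in the URL.
REPORT_KEY_RE = re.compile(r"/products/(GAO-\d{2}-\d+[A-Z]*)", re.IGNORECASE)


def report_key_from_url(url: str) -> Optional[str]:
    """Return the report number (e.g. 'GAO-19-539') or None if the URL has none."""
    match = REPORT_KEY_RE.search(url)
    return match.group(1).upper() if match else None
```

Every PDF, tab text, and podcast URL collected for a report could then be stored under this key.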
Many of the reports, like https://www.gao.gov/products/GAO-19-629, have a third tab, Recommendations. This is key information to capture: the Recommendation, Agency Affected, Status, and Comments.
Similarly, repeat this same process for each of the months listed on the page. Be sure to note any duplicates (links in months that are duplicated in the years). The report number, like GAO-19-539, should be your key.
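One possible way to note duplicates is to group every listing by report number and record where each one was seen. This is only a sketch: the function names and the source labels (such as "Fiscal Year 2019" or "July 2019") are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def merge_listings(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (report_number, source_label) pairs by report number.

    Each report number maps to the list of index pages it appeared on,
    in first-seen order and without repeats.
    """
    sources: Dict[str, List[str]] = {}
    for report_number, source in entries:
        seen = sources.setdefault(report_number, [])
        if source not in seen:
            seen.append(source)
    return sources


def duplicated_keys(sources: Dict[str, List[str]]) -> List[str]:
    """Return the report numbers that were listed on more than one index page."""
    return sorted(key for key, seen in sources.items() if len(seen) > 1)
```

With this shape, a report is scraped once per report number, and the list of sources shows which month and year pages pointed to it.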
For outputs, I'd expect a CSV file that is the index, with columns something like: sequence number, date/time of the run, the fiscal year followed, the report number, report title, name of highlights PDF, name of print PDF, name of accessible PDF, text from each of the tabs, and an indicator that there are recommendations. Be sure that the columns for the PDF files are the same in every row. If a report doesn't have a highlights PDF, then that field would be blank. For example:
1, 5-7-2020 10:00, "Fiscal Year 2019", "GAO-19-629", "Animal Use......", "GAO-19-629_Highlights.pdf", "GAO-19-629_Print.pdf",, , Y
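A minimal sketch of writing the index with a fixed column set, so that a missing PDF simply leaves its field blank. The column names and the function name are assumptions chosen for illustration; only the list of fields comes from the description above.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Assumed column names; the order follows the field list in the brief.
INDEX_COLUMNS = [
    "sequence", "run_datetime", "fiscal_year", "report_number", "report_title",
    "highlights_pdf", "print_pdf", "accessible_pdf",
    "fast_facts_text", "highlights_text", "has_recommendations",
]


def write_index(rows: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write index rows with identical columns; missing fields become blank."""
    writer = csv.DictWriter(
        out, fieldnames=INDEX_COLUMNS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    for sequence, row in enumerate(rows, start=1):
        # The sequence number is assigned here so it stays unique per row.
        writer.writerow({**row, "sequence": sequence})
```

Using `restval=""` keeps every row the same width even when a report has no highlights or accessible PDF.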
For reports that have recommendations (indicated by a Y in the index file above), there should be a second CSV file with these columns: sequence number (key to the index file above), report number, recommendation number, priority flag, recommendation, agency affected, status, comments
Some recommendations are "priority recommendations". You can see an example here: https://www.gao.gov/products/GAO-18-12. The priority flag I mention is set to Y if it's a priority recommendation. Using this example, I'd see something like this in the file:
4 (link to index file), GAO-18-12, 1, N, "The Assistant Sec...", "Department of Lab....", Open, "OSHA stated that..."
4 (link to index file), GAO-18-12, 2, N, "The Assistant Sec...", "Department of Lab....", Open, "OSHA stated that..."
4 (link to index file), GAO-18-12, 3, N, "The Assistant Sec...", "Department of Lab....", Open, "In June 2019..."
4 (link to index file), GAO-18-12, 4, Y, "The Assistant Sec...", "Department of Lab....", Open, "In Febru..."
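The rows above could be produced along these lines. This is a sketch under assumptions: the function name and the dictionary keys (`priority`, `recommendation`, `agency`, `status`, `comments`) are hypothetical, and how the priority marker is detected on the page is left to the scraper.

```python
from typing import Dict, List, Sequence, Union

Row = List[Union[int, str]]


def recommendation_rows(
    index_sequence: int, report_number: str, recommendations: Sequence[Dict]
) -> List[Row]:
    """Build one CSV row per recommendation, numbered from 1 within the report."""
    rows: List[Row] = []
    for number, rec in enumerate(recommendations, start=1):
        rows.append([
            index_sequence,                       # key back to the index file
            report_number,
            number,                               # recommendation number
            "Y" if rec.get("priority") else "N",  # priority flag
            rec.get("recommendation", ""),
            rec.get("agency", ""),
            rec.get("status", ""),
            rec.get("comments", ""),
        ])
    return rows
```

Each returned list can be passed directly to `csv.writer(...).writerow`.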
The pdf files should be stored in a single subdirectory like "downloads". There would be many pdfs in that one directory.
The above is the main unit of work. Then, I need the script to be able to run again to detect any changes and download them. For example, a new month or year is added. Be sure not to duplicate data/downloads. Basically, run it again to get anything new that was added without re-downloading everything.
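One simple way to support re-runs is to read the report numbers already in the index file and to skip PDFs that already exist in the downloads directory. The sketch below assumes the index has a `report_number` column and that PDFs are saved under their original file names; both function names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Set


def load_seen_keys(index_path: Path) -> Set[str]:
    """Return report numbers already recorded in the index CSV (empty on first run)."""
    if not index_path.exists():
        return set()
    with index_path.open(newline="", encoding="utf-8") as handle:
        return {
            row["report_number"]
            for row in csv.DictReader(handle)
            if row.get("report_number")
        }


def missing_pdfs(pdf_names: Iterable[str], download_dir: Path) -> List[str]:
    """Return the PDF names that are not yet present in the downloads directory."""
    return [
        name for name in pdf_names
        if name and not (download_dir / name).exists()
    ]
```

On a second run, only reports whose number is not in `load_seen_keys(...)` would be scraped, and only the files returned by `missing_pdfs(...)` would be downloaded.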
I'm going to leave the budget open. But I do not expect this to be very expensive. I would like to see initial results in two days, a test run with just a couple of records, for me to evaluate and comment on. I'll likely award within 2-3 days.