1001 Freelance Projects -- Scrapy webscrape spider

Project title:
Scrapy webscrape spider
Posted by:
External project from PeoplePerHour
Started:
08-May-2020 15:39 GMT
Description:
Do not send me a generic, automated response; I will automatically decline it. The best response is to send me a couple of relevant projects where you have used Scrapy.

Develop a Python script using Scrapy 2.1 to crawl and scrape https://www.gao.gov/reports-testimonies/month-in-review/ for fiscal years beginning in 2015 in the Annual Index to the right of the page. Each year link brings up an Annual Index of Reports page that has a long list of reports. Each report has several PDFs. Download each of those PDFs, associating them with their report. For example, the first report in FY2019 is https://www.gao.gov/products/GAO-19-539. It has 3 PDFs. GAO-19-539 should be the unique key for the report.

Click through that link and you see a Fast Facts tab and a Highlights tab; capture that text (the images are not needed). Always capture the unique link for each report so we can easily get back to the original URL. Some reports have podcasts; we don't need to download them, but capture the podcast URL if one exists. Many of the reports, like https://www.gao.gov/products/GAO-19-629, have a third tab, Recommendations. This is key information to capture: the Recommendation, Agency Affected, Status, and Comments.

Follow the same process for each of the months listed on the page. Be sure to note any duplicates (links in months that are duplicated in the years). The report numbers, like GAO-19-539, should be your key.

For outputs, I'd expect a CSV file that is the index, with something like: sequence number, date and time of the run, fiscal year followed, report number, report title, name of the highlights PDF, name of the print PDF, name of the accessible PDF, text from each of the tabs, and an indicator that there are recommendations. Be sure that the columns for the PDF files are always the same. If a report doesn't have a highlights PDF, that field is blank.
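
The keying and duplicate rules above could be sketched roughly as follows. This is only an illustration, not a required design: the regular expression is an assumption about how report numbers appear at the end of product URLs, and the function names `report_key` and `note_duplicates` are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: a report number such as "GAO-19-539" is the last path
# segment of a product URL. Optional trailing letters are allowed.
REPORT_KEY_RE = re.compile(r"^GAO-\d{2}-\d+[A-Z]*$", re.IGNORECASE)


def report_key(url):
    """Return the report number (e.g. 'GAO-19-539') from a URL, or None."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    if REPORT_KEY_RE.match(last_segment):
        return last_segment.upper()
    return None


def note_duplicates(links):
    """Split links into first-seen reports and duplicates, by report number.

    Returns (seen, duplicates): seen maps report number to the first URL
    found; duplicates lists (report number, URL) for every repeat, such as
    a monthly link that also appears in an annual index.
    """
    seen, duplicates = {}, []
    for url in links:
        key = report_key(url)
        if key is None:
            continue  # not a report page
        if key in seen:
            duplicates.append((key, url))
        else:
            seen[key] = url
    return seen, duplicates
```

In a spider, the same key could also be used to name the stored PDFs, so a report and its files stay associated.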
An example index row:

1,5-7-2020 10:00, "Fiscal Year 2019", "GAO-19-629", "Animal Use......", "GAO-19-539_Highlights.pdf", "GAO-19-539_Print.pdf",, , Y

For reports that have recommendations (indicated by a Y in the index file above), there is a second CSV file with these fields: sequence number (key to the index file above), report number, recommendation number, priority flag, recommendation, agency affected, status, comments.

Some recommendations are "priority recommendations". You can see an example here: https://www.gao.gov/products/GAO-18-12. The priority flag is set to Y if it's a priority recommendation. Using this example, I'd see something like this in the file:

4 (link to index file), GAO-18-12, 1, N, "The Assistant Sec...", "Department of Lab....", Open, "OSHA stated that..."
4 (link to index file), GAO-18-12, 2, N, "The Assistant Sec...", "Department of Lab....", Open, "OSHA stated that..."
4 (link to index file), GAO-18-12, 3, N, "The Assistant Sec...", "Department of Lab....", Open, "In June 2019..."
4 (link to index file), GAO-18-12, 4, Y, "The Assistant Sec...", "Department of Lab....", Open, "In Febru..."

The PDF files should be stored in a single subdirectory like "downloads". There will be many PDFs in that one directory.

The above is the main unit of work. Beyond that, I need the script to be able to run again to detect any changes and download them, for example when a new month or year is added. Be sure not to duplicate data or downloads: a rerun should pick up anything new that was added without re-downloading everything.

I'm going to leave the budget open, but I do not expect this to be very expensive. I would like to see initial results in two days: a test run with just a couple of records for me to evaluate and comment on. I'll likely award within 2-3 days.
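
One possible way to handle the rerun requirement is to treat the index CSV itself as the record of what has already been collected. The sketch below uses only the Python standard library. The column names and the function `append_new_reports` are hypothetical (the posting describes the fields but does not name them), and the tab-text columns are left out for brevity.

```python
import csv
import io

# Hypothetical column names; the PDF columns are always present so every
# row has the same shape, with blanks where a report lacks that PDF.
INDEX_COLUMNS = [
    "sequence", "run_datetime", "fiscal_year", "report_number", "title",
    "highlights_pdf", "print_pdf", "accessible_pdf", "has_recommendations",
]


def append_new_reports(index_file, reports, run_datetime):
    """Append only reports whose number is not yet in the index.

    index_file: text file object open for reading and writing.
    reports: list of dicts that each contain at least "report_number".
    Returns the rows actually written.
    """
    index_file.seek(0)
    existing = list(csv.DictReader(index_file))
    known = {row["report_number"] for row in existing}
    next_seq = max((int(row["sequence"]) for row in existing), default=0) + 1

    index_file.seek(0, io.SEEK_END)
    writer = csv.DictWriter(index_file, fieldnames=INDEX_COLUMNS)
    if index_file.tell() == 0:
        writer.writeheader()  # brand-new index file

    written = []
    for report in reports:
        if report["report_number"] in known:
            continue  # indexed on an earlier run; nothing to download again
        row = {col: report.get(col, "") for col in INDEX_COLUMNS}
        row["sequence"] = next_seq
        row["run_datetime"] = run_datetime
        writer.writerow(row)
        known.add(report["report_number"])
        written.append(row)
        next_seq += 1
    return written
```

Only the rows returned by a call would then need their PDFs fetched into the "downloads" directory. With a real file, opening it with `open(path, "a+", newline="")` is one way to get a handle that can be both read and appended to.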
Project ID:
2981619
Project category:

Project budget:
