1001 Freelance Projects
Latest Projects from
Freelance Marketplaces
Project title:
Scrapy webscrape spider
Posted by:
External project from PeoplePerHour
Started:
08-May-2020 15:39 GMT
Description:
** Do not send me a generic, automated response, or I will automatically decline it. ** The best response is to send me a couple of relevant projects where you have used Scrapy.

Develop a Python script using Scrapy 2.1 to crawl and scrape https://www.gao.gov/reports-testimonies/month-in-review/ for the fiscal years beginning in 2015 that are listed in the Annual Index on the right of the page.

Each year link brings up an Annual Index of Reports page with a long list of reports. Each report has several pdfs. Download each of those pdfs and associate them with their report. For example, the first report in FY2019 is https://www.gao.gov/products/GAO-19-539
It has 3 pdfs. The report number, GAO-19-539, should be the unique key for the report.
Click through that link and you will see a Fast Facts tab and a Highlights tab; capture that text (the images are not needed). Always capture the unique links associated with each report so we can easily get back to the original website URL. Some reports have podcasts; these don't need to be downloaded, but capture the URL to the podcast if one exists.
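The report number can be pulled straight from a product URL. The following is a minimal sketch, not part of the original posting: the regular expression and the function name are assumptions, and the optional letter suffix is a guess that some report numbers end in letters.

```python
import re
from typing import Optional

# Assumed pattern: "GAO-", a two-digit fiscal year, a serial number, and an
# optional letter suffix. Adjust it if the site uses other report-number forms.
REPORT_KEY_RE = re.compile(r"GAO-\d{2}-\d+[A-Z]*", re.IGNORECASE)


def report_key_from_url(url: str) -> Optional[str]:
    """Return the report number embedded in a URL, or None if there is none."""
    match = REPORT_KEY_RE.search(url)
    return match.group(0).upper() if match else None


print(report_key_from_url("https://www.gao.gov/products/GAO-19-539"))  # GAO-19-539
```

Using the same function for product pages, pdf links, and podcast links keeps everything tied to one key per report.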

Many of the reports, like https://www.gao.gov/products/GAO-19-629, have a third tab, Recommendations. This is key information to capture: the Recommendation, Agency Affected, Status, and Comments.

Similarly, repeat this process for each of the months listed on the page. Be sure to note any duplicates (links in the month pages that also appear in the year pages). The report number, like GAO-19-539, should be your key.
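One possible way to note duplicates is to group every listing by report number and flag the numbers that appear on more than one page. This is only a sketch; the function name and the page labels in the sample data are hypothetical.

```python
from collections import defaultdict


def find_duplicates(listings):
    """Map each report number to the pages listing it, keeping only the
    report numbers that appear on more than one page."""
    pages_by_key = defaultdict(list)
    for source_page, report_key in listings:
        if source_page not in pages_by_key[report_key]:
            pages_by_key[report_key].append(source_page)
    return {key: pages for key, pages in pages_by_key.items() if len(pages) > 1}


# Sample data: the month page label is made up for illustration.
listings = [
    ("Fiscal Year 2019", "GAO-19-539"),
    ("Fiscal Year 2019", "GAO-19-629"),
    ("hypothetical month page", "GAO-19-539"),
]
print(find_duplicates(listings))
```

The report is still scraped only once; the duplicate map just records which year and month pages linked to it.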

For outputs, I'd expect a csv file that is the index. Something like:
sequence number, date and time of the run, the fiscal year followed, the report number, report title, name of highlights pdf, name of print pdf, name of accessible pdf, text from each of the tabs, and an indicator that there are recommendations. Be sure that the columns for the pdf files are always the same. If a report doesn't have a highlights pdf, then that field would be blank.
1,5-7-2020 10:00, "Fiscal Year 2019", "GAO-19-629", "Animal Use......", "GAO-19-539_Highlights.pdf", "GAO-19-539_Print.pdf",, , Y
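A fixed column list is one way to keep the pdf columns in the same position for every row. The sketch below uses Python's csv module; the column names are assumptions based on the description above, and "text from each of the tabs" is split into two hypothetical text columns.

```python
import csv
import io

# Assumed column names; the posting describes the fields but does not name them.
INDEX_COLUMNS = [
    "sequence_number", "run_datetime", "fiscal_year", "report_number",
    "report_title", "highlights_pdf", "print_pdf", "accessible_pdf",
    "fast_facts_text", "highlights_text", "has_recommendations",
]


def write_index(rows, out):
    """Write index rows; any column missing from a row is written as blank."""
    writer = csv.DictWriter(out, fieldnames=INDEX_COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_index(
    [{
        "sequence_number": 1,
        "run_datetime": "5-7-2020 10:00",
        "fiscal_year": "Fiscal Year 2019",
        "report_number": "GAO-19-629",
        "report_title": "Animal Use......",
        "has_recommendations": "Y",
        # No pdf names supplied here, so those columns stay blank.
    }],
    buffer,
)
print(buffer.getvalue())
```

Because missing keys fall back to an empty string, a report without a highlights pdf still produces a row with the same number of columns.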

For reports that have a recommendation (indicated by a Y in the above index file), there is a second csv file of this nature:
sequence number (key to the index file above), report number, recommendation number, priority flag, recommendation, agency affected, status, comments

Some recommendations are "priority recommendations". You can see an example here: https://www.gao.gov/products/GAO-18-12
The priority flag I mention is set to Y if it's a priority recommendation. Using this example, I'd see something like this in the file:
4 (link to index file), GAO-18-12, 1, N, "The Assistant Sec...", "Department of Lab....", Open, "OSHA stated that..."
4 (link to index file), GAO-18-12, 2, N, "The Assistant Sec...", "Department of Lab....", Open, "OSHA stated that..."
4 (link to index file), GAO-18-12, 3, N, "The Assistant Sec...", "Department of Lab....", Open, "In June 2019..."
4 (link to index file), GAO-18-12, 4, Y, "The Assistant Sec...", "Department of Lab....", Open, "In Febru..."
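The rows above could be built from scraped recommendations roughly as follows. This is a sketch under stated assumptions: the input dictionary keys ("text", "agency", "status", "comments", "priority") and the output column names are hypothetical, and the sample values are the truncated strings from the example.

```python
def recommendation_rows(sequence_number, report_number, recommendations):
    """Build one output row per recommendation, numbered from 1, with the
    index-file sequence number repeated as the link back to the index."""
    rows = []
    for number, rec in enumerate(recommendations, start=1):
        rows.append({
            "sequence_number": sequence_number,
            "report_number": report_number,
            "recommendation_number": number,
            "priority_flag": "Y" if rec.get("priority") else "N",
            "recommendation": rec["text"],
            "agency_affected": rec["agency"],
            "status": rec["status"],
            "comments": rec["comments"],
        })
    return rows


sample = [
    {"text": "The Assistant Sec...", "agency": "Department of Lab....",
     "status": "Open", "comments": "OSHA stated that..."},
    {"text": "The Assistant Sec...", "agency": "Department of Lab....",
     "status": "Open", "comments": "In Febru...", "priority": True},
]
for row in recommendation_rows(4, "GAO-18-12", sample):
    print(row["recommendation_number"], row["priority_flag"], row["comments"])
```

The resulting dictionaries can be written with csv.DictWriter in the same way as the index file.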

The pdf files should be stored in a single subdirectory like "downloads". There would be many pdfs in that one directory.

The above is the main unit of work. After that, I need the script to be able to run again to detect any changes and download them, for example when a new month or year is added. Be sure not to duplicate data or downloads. Basically, running it again should get anything new that was added without re-downloading everything.
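A re-run could decide what to skip by reading the existing index file and checking the downloads directory. The sketch below covers only newly added reports, not edits to reports already captured; the file name index.csv, the report_number column name, and the function names are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def load_seen_reports(index_path):
    """Return the report numbers already recorded in the index csv, if any."""
    path = Path(index_path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["report_number"] for row in csv.DictReader(handle)}


def new_reports(candidate_keys, seen):
    """Keep only report numbers that are not in the index yet, in order."""
    return [key for key in candidate_keys if key not in seen]


def needs_download(pdf_name, downloads_dir="downloads"):
    """A pdf is fetched only if it is not already in the downloads directory."""
    return not (Path(downloads_dir) / pdf_name).exists()


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "index.csv"
    index_file.write_text("report_number\nGAO-19-539\n", encoding="utf-8")
    seen = load_seen_reports(index_file)
    print(new_reports(["GAO-19-539", "GAO-19-629"], seen))
    print(needs_download("GAO-19-629_Highlights.pdf", tmp))
```

On the first run the index does not exist, so every report counts as new. Later runs append only the reports and pdfs that are missing.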

I'm going to leave the budget open. But I do not expect this to be very expensive. I would like to see initial results in two days, a test run with just a couple of records, for me to evaluate and comment on. I'll likely award within 2-3 days.
Project ID:
2981619