The tool would have a basic UI for a user to input the following:
-The pages to scrape. These could come from one of several sources:
---A manually entered list (such as URLs copied and pasted into a form).
---A list of URLs retrieved from a source URL. For example, the user would enter URL X, and the tool would download all links from page X and use that list of links as the pages to scrape for content.
-----To prevent scraping every URL, users would be able to provide regex rules for including/excluding URLs. For example, a rule might include only URLs that contain the /news/ directory.
-The content to scrape/collect, specified by CSS selectors or XPath rules (see the sketch below).
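For illustration, here is a minimal sketch of the link-collection, regex-filtering, and extraction steps in Python, assuming the requests and BeautifulSoup libraries. The URL, regex rules, and selector are placeholder examples, not final values, and note that BeautifulSoup handles CSS selectors only; XPath support would need a library such as lxml instead:

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Placeholder inputs -- in the real tool these would come from the UI.
SOURCE_URL = "https://example.com/"        # hypothetical source page
INCLUDE_RULES = [r"/news/"]                # keep URLs matching any of these
EXCLUDE_RULES = [r"\.pdf$"]                # drop URLs matching any of these
CONTENT_SELECTOR = "h1.article-title"      # hypothetical CSS selector

def collect_links(source_url):
    """Download the source page and return every absolute link on it."""
    html = requests.get(source_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(source_url, a["href"]) for a in soup.find_all("a", href=True)]

def filter_links(links):
    """Apply the include/exclude regex rules to the collected links."""
    kept = []
    for url in links:
        if INCLUDE_RULES and not any(re.search(p, url) for p in INCLUDE_RULES):
            continue
        if any(re.search(p, url) for p in EXCLUDE_RULES):
            continue
        kept.append(url)
    return kept

def scrape_page(url):
    """Fetch one page and extract the content matched by the CSS selector."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(CONTENT_SELECTOR)]

if __name__ == "__main__":
    for page in filter_links(collect_links(SOURCE_URL)):
        print(page, scrape_page(page))
```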
-The scraper would collect the data and store it in a Google Sheet (preferred) or a database (if necessary).
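One plausible way to write rows to a Google Sheet is the gspread library with a service-account credential. The credential file name, sheet name, and row layout below are assumptions for illustration only:

```python
import gspread

# Hypothetical credential file and sheet name -- replace with real ones.
gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("Scraper Output").sheet1

def store_rows(rows):
    """Append scraped rows (lists of cell values) to the sheet in one call."""
    worksheet.append_rows(rows)

store_rows([["https://example.com/news/1", "Example headline", "2024-01-01"]])
```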
-Basic cleanup functions would be available to clean the data: one to extract URLs from a block of text, and one to remove all HTML and special characters from the text.
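A rough sketch of the two cleanup functions, using only the Python standard library; the URL regex is a simple approximation, not an exhaustive matcher:

```python
import html
import re

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(text):
    """Cleanup one: pull every URL out of a block of text."""
    return URL_PATTERN.findall(text)

def strip_html(text):
    """Cleanup two: remove HTML tags, decode entities, and drop
    special characters, leaving plain readable text."""
    no_tags = re.sub(r"<[^>]+>", " ", text)          # remove tags
    decoded = html.unescape(no_tags)                 # &amp; -> & etc.
    plain = re.sub(r"[^\w\s.,:;!?-]", "", decoded)   # drop special chars
    return re.sub(r"\s+", " ", plain).strip()        # collapse whitespace

print(extract_urls('See <a href="https://example.com/news/1">this</a>'))
print(strip_html("<p>Breaking &amp; exclusive!</p>"))
```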
-A script would then run to create an RSS feed based on the data. The feed would feature the 10 most recent items and be available at a public URL.
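A minimal sketch of the feed step using only the standard library, assuming the stored data can be read back as (title, link, date) tuples; the feed title, link, and description are placeholder text:

```python
import xml.etree.ElementTree as ET
from email.utils import format_datetime
from datetime import datetime, timezone

def build_rss(items, path="feed.xml"):
    """Write an RSS 2.0 file containing the 10 most recent items.

    `items` is assumed to be (title, link, published_datetime) tuples.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Scraped News"        # placeholder
    ET.SubElement(channel, "link").text = "https://example.com"  # placeholder
    ET.SubElement(channel, "description").text = "Latest scraped items"

    newest_first = sorted(items, key=lambda it: it[2], reverse=True)
    for title, link, published in newest_first[:10]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "pubDate").text = format_datetime(published)

    ET.ElementTree(rss).write(path, encoding="utf-8", xml_declaration=True)

build_rss([("Example headline", "https://example.com/news/1",
            datetime(2024, 1, 1, tzinfo=timezone.utc))])
```

The resulting feed.xml could then be served from a public bucket or a small web endpoint on whichever cloud host is chosen.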
-Recommend the best affordable cloud service to host the script, such as Amazon (AWS) or Google Cloud; we will sign up for it and add you as a user. Set up the service as needed and configure the scripts to run automatically on a schedule.
-At the end of the project, we would have full access to the code and would maintain the tool ourselves, although availability for potential future jobs to update the tool would be beneficial.