A visual, cloud-based web crawler built using Django 1.7, MongoDB 2.6.5, and Scrapy 0.24.4
- Focra Demo (http://focra-mingsheng36.rhcloud.com/)
##Features
- Visually create your own XPath template
- Toggle CSS and JavaScript on and off
- Pagination Crawl
- Chain Crawler (created from sublinks of the initial crawler)
- Pause/Resume Crawl
- Show Hierarchy of Crawlers (how they are chained)
- View Data in Pages
- User Accounts
- Export Data to Excel / CSV / JSON URL
- Improve on Algorithms (Aggregation + Alignment)
- Schedule Crawl Frequency
- Crawl JavaScript Pages (get data from XHR requests)
- Append Data (add the latest data to existing results)
- Modify Field Names, Column Position and Template of the Crawler
- Change Database Architecture (the current design uses one collection per crawler and does not scale)
- Monitor Performance of Crawler
- Django Push Events (currently using polling)
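
As a rough illustration of what an XPath template amounts to, the sketch below maps field names to XPath expressions and applies them to a page to produce rows of data. It is not Focra's actual template format: `PAGE`, `TEMPLATE`, and `apply_template` are hypothetical names, and it uses Python's standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) instead of Scrapy's selectors so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real pages are HTML and usually
# need a more tolerant parser than ElementTree.
PAGE = """
<html><body>
  <div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="item"><span class="name">Gadget</span><span class="price">19.50</span></div>
</body></html>
"""

# Hypothetical template: one XPath selects each row, the others select
# fields relative to that row.
TEMPLATE = {
    "row": ".//div[@class='item']",
    "fields": {
        "name": "./span[@class='name']",
        "price": "./span[@class='price']",
    },
}


def apply_template(page, template):
    """Return one dict per matched row, keyed by field name."""
    root = ET.fromstring(page.strip())
    rows = []
    for node in root.findall(template["row"]):
        row = {}
        for field, xpath in template["fields"].items():
            match = node.find(xpath)
            # Missing fields are recorded as None rather than raising.
            row[field] = match.text if match is not None else None
        rows.append(row)
    return rows


rows = apply_template(PAGE, TEMPLATE)
print(rows)
```

Keeping the row selector separate from the field selectors means a field that is missing from one row does not shift the values of the other rows.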
###Things to Note
- Internet Explorer is not supported.
- Requests are sent with a download delay of 2 seconds to avoid overloading the servers being crawled.
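
In Scrapy, a delay like this is normally set with the `DOWNLOAD_DELAY` setting. The fragment below is a minimal sketch that assumes the value lives in the Scrapy project's `settings.py`; where Focra actually sets it is not stated here.

```python
# settings.py (assumed location, not confirmed by this README)
# Wait about 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2
```

Note that Scrapy randomizes this delay by default (`RANDOMIZE_DOWNLOAD_DELAY`), so the actual wait varies around the configured value.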