Scraper Spider
Based on a set of custom keywords, and built using the stack below (a rough sketch of how a few of these pieces fit together appears after the list):
- Python
- Newspaper3k Python library (article extraction and natural language processing)
- Jinja Python templating engine
- Postgres database engine
- Psycopg2 Postgres database adapter
- NLTK (Natural Language Toolkit): Bird, Steven, Edward Loper, and Ewan Klein (2009), Natural Language Processing with Python, O'Reilly Media Inc.
- Docker
- Redis in-memory data structure store
- RedisJSON
- Task task runner (a better GNU Make)
- Yake (Yet Another Keyword Extractor)
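A rough sketch of how a few of these pieces fit together (illustrative only, not the actual pipeline code; `MY_KEYWORDS` and the relevance check are placeholders):

```python
# Illustrative sketch only: MY_KEYWORDS and the relevance check are placeholders.
import nltk
import yake
from newspaper import Article

nltk.download("punkt", quiet=True)  # newspaper3k's .nlp() needs the punkt tokenizer

MY_KEYWORDS = {"scraping", "nlp", "redis"}  # stand-in for the custom keyword set

def extract(url: str) -> dict:
    """Download an article, summarize it, and pull keywords out of it."""
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()  # populates article.summary and article.keywords

    # YAKE returns (keyword, score) pairs; lower scores are more relevant.
    extractor = yake.KeywordExtractor(lan="en", n=2, top=10)
    keywords = [kw for kw, _score in extractor.extract_keywords(article.text)]

    return {
        "url": url,
        "title": article.title,
        "summary": article.summary,
        "keywords": keywords,
        # crude relevance check against the custom keyword set
        "relevant": any(kw.lower() in MY_KEYWORDS for kw in keywords),
    }
```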
Websites and APIs are scraped for relevant content every few hours, and the results are listed here. False positives are not displayed, but they are retained in the database to train a
Naive Bayesian text classification model. Summaries are generated with natural language processing. Redis/RedisJSON provides a handy (and wicked fast!) caching layer so
I don't abuse the APIs and websites I rely on.
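As an illustration of the caching idea only (the real code isn't shown here), a RedisJSON-backed cache might look roughly like this; the TTL and the `fetch_from_api` callable are placeholders:

```python
# Illustrative sketch only: assumes a local Redis with the RedisJSON module
# loaded; fetch_from_api is a hypothetical callable, and the TTL is made up.
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL = 3 * 60 * 60  # seconds; roughly matches the "every few hours" cadence

def cached_fetch(source_key: str, fetch_from_api) -> dict:
    """Return a cached response if present, otherwise fetch it and cache it."""
    cached = r.json().get(source_key)
    if cached is not None:
        return cached  # cache hit: the API/website is not touched at all

    payload = fetch_from_api()           # cache miss: hit the real source once
    r.json().set(source_key, "$", payload)
    r.expire(source_key, CACHE_TTL)      # let the entry age out between runs
    return payload
```

On a cache miss the upstream source is hit once and the JSON response is stored whole, so repeat runs within the TTL never touch the API again.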
- hover over the Title column for a content summary, with the content keywords listed at the bottom (useful for seeing why the content was picked up); a rough sketch of how a row is templated follows this list
- click the Google icon to run a Google search on the content keywords and find additional relevant info or do more research
- click the Title link to read the content, and the Original Source link to visit the original source's top-level website
- additional relevant links, such as videos or other relevant sites from comments/links, are sometimes listed under the Title
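For illustration, a row might be templated with Jinja along these lines; the template, item shape, and field names are assumptions, not the actual template used here:

```python
# Illustrative sketch only: the template, item shape, and field names are
# assumptions, not the real Jinja template used by the site.
from urllib.parse import quote_plus

from jinja2 import Template

ROW_TEMPLATE = Template(
    """<tr>
  <td title="{{ item.summary }}&#10;&#10;Keywords: {{ item.keywords | join(', ') }}">
    <a href="{{ item.url }}">{{ item.title }}</a>
    <a href="{{ item.source_url }}">Original Source</a>
    <a href="https://www.google.com/search?q={{ item.query }}">Google</a>
  </td>
</tr>"""
)

item = {
    "title": "Example article",
    "url": "https://example.com/some-article",
    "source_url": "https://example.com",
    "summary": "Short NLP-generated summary shown on hover.",
    "keywords": ["scraping", "nlp"],
    "query": quote_plus("scraping nlp"),  # pre-encoded keyword search string
}

print(ROW_TEMPLATE.render(item=item))
```

The `title` attribute is what produces the hover tooltip, and the Google link simply URL-encodes the content keywords into a search query.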