Blog

2022-07-31 SSL and VPS

Relocated to a shiny new VPS, and got my SSL/HTTPS ducks in a row. I was on an S3-based static site, and SSL gets tricky there. I'm in no mood to wrestle with CloudFront.

2022-06-02 Newsy API Integration

Big kudos to Eric at Newsy, who helped me out with a new API integration. The most responsive customer service I have ever experienced. If you have a domain you aren't using, go have a look.

2022-05-03 CSS Layouts

OK, I got bored with the tired old typical tables/rows/columns layout. With some ideas from other sites, I reworked the CSS cosmetics/layout. Plus, some of the article titles and associated URLs were long enough to really mess things up. The changes should make things easier to browse/scroll. I think I'm really just avoiding writing the Bayes "bag of words" algorithm that I really need. Too many false positives. I'm sure the Ulladulla Docker Junior AFL Club on the east coast of Australia is a fine sports league, but they really shouldn't be on my IT-based web page.

2022-01-28 The Follies of Simple Text Matching

This scraper portal, at its heart, is driven by a set of 40 or 50 keywords that I have chosen. These keywords are (surprise) usually technical and associated with Information Technology concepts, particular software or hardware, IT companies, and so forth. A simple text-matching search produces some interesting false positives. After noticing some consistent patterns, I even started using an "anti-keyword" list. If I find a page with a title or keywords that match one of my keywords, I also look for anti-keywords and identify the false positive that way. However, this just doesn't always work well enough.

Some examples:
One of my keywords is Raspberry Pi (I built a Pi Docker Swarm cluster, which part of this project is based on). The false positives: you'd be impressed by the number of recipes on the interwebs that mention raspberries.

Another keyword is AWS, which besides the obvious can also be the Animal Welfare Society of South Africa. There is also a company called AWS Ocean Energy, specializing in marine energy systems.

Another is Docker. There is a popular Australian rules football club called the Fremantle Dockers in Fremantle, Australia. Australian rules football is its own sport, distinct from rugby.

Another example is S3, the AWS Simple Storage Service. There are many online television series reviews and fan/info pages that mention S3 (season 3).

And yet another is Ubuntu (the operating system), which apparently also roughly means "humanity" in the Zulu language.

And a final example is Redshift, which besides being a cool AWS (Postgres!) database for analytics/OLAP is a term also used in astronomy and physics (which is where the AWS product name came from).
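The keyword and anti-keyword check described above might look roughly like this in Python. This is only a sketch: the keyword and anti-keyword lists, the function names, and the whole-word matching rule are all made up for illustration and are not the portal's actual code.

```python
import re

# Hypothetical lists for illustration; the real portal uses 40 or 50 keywords.
# Each keyword maps to the anti-keywords that mark a likely false positive.
KEYWORDS = {
    "raspberry pi": ["recipe", "jam", "dessert"],
    "aws": ["animal", "welfare", "ocean"],
    "docker": ["fremantle", "afl", "football"],
    "s3": ["season", "episode"],
    "redshift": ["galaxy", "astronomy"],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_keywords(title):
    """Return the keywords found in the title that no anti-keyword vetoes."""
    words = tokenize(title)
    # Pad with spaces so multi-word keywords only match on word boundaries.
    joined = " " + " ".join(words) + " "
    hits = []
    for keyword, anti_keywords in KEYWORDS.items():
        if f" {keyword} " not in joined:
            continue  # keyword not present at all
        if any(anti in words for anti in anti_keywords):
            continue  # keyword present, but an anti-keyword flags it
        hits.append(keyword)
    return hits


print(match_keywords("Backing up to AWS S3 with rclone"))
print(match_keywords("Ulladulla Docker Junior AFL Club"))
print(match_keywords("The Expanse S3 episode 4 review"))
```

The first title keeps both of its keywords, while the other two are vetoed by "afl" and "episode". The weakness is visible right away: every new kind of false positive needs another hand-picked anti-keyword.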

Apparently one way to solve this is the use of a Naive Bayesian "bag of words" algorithm, dealing with things like stop words and lemmatization. I have the algorithmic beginnings of this, and collect false positives in the hope that I can train the algorithm. But I'm still trying to grok the concepts. Any pointers or expertise are greatly appreciated.
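For anyone else trying to grok it, here is a minimal sketch of a naive Bayes "bag of words" classifier using only the Python standard library. The class name, the stop-word list, the labels, and the six training titles are all hypothetical, and lemmatization is left out entirely, so treat it as an illustration of the idea rather than the algorithm this portal runs.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "is"}


def bag_of_words(text):
    """Count the non-stop-word tokens in the text, ignoring word order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        bag = bag_of_words(text)
        self.word_counts.setdefault(label, Counter()).update(bag)
        self.doc_counts[label] += 1
        self.vocab.update(bag)

    def classify(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word, n in bag.items():
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                p = (counts[word] + 1) / (total_words + len(self.vocab))
                score += n * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training titles: "it" for wanted pages, "other" for false positives.
clf = NaiveBayes()
clf.train("Deploying a Docker Swarm cluster on Raspberry Pi", "it")
clf.train("Docker containers and Kubernetes networking explained", "it")
clf.train("Migrating analytics queries to AWS Redshift", "it")
clf.train("Fremantle Dockers win the football derby", "other")
clf.train("Ulladulla Docker Junior AFL Club season fixtures", "other")
clf.train("Raspberry jam recipe for summer", "other")

print(clf.classify("Docker Junior AFL Club training schedule"))
print(clf.classify("Running Docker containers on a Kubernetes cluster"))
```

Unlike the anti-keyword list, nothing here says that "afl" is bad. The classifier picks that up from the collected false positives, which is why saving them as training data is worthwhile.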