Monday, March 14, 2005

Ruby script to pre-process the Javablogs daily feed


If you read Javablogs, you may be aware that there are many blogs written in languages you don't understand. I have long been searching for a way to filter those out.

Since Javablogs puts the site name in square brackets at the end of each item title, it is conceptually easy to filter out items by pattern matching on the title. The problem is that few RSS readers support such an operation. NetNewsWire 2.0, the reader I use, supports smart lists. Like smart playlists in iTunes, smart lists display news items based on rules that you set up. However, smart lists cover only half of what I want: the pattern-matching part. Since I cannot specify what action to take on matched items, I have to do the filtering manually, by marking everything in my "Junk" smart list as read. To add insult to injury, smart lists don't always work, since NNW2 is still in beta. This has bothered me for the past few months.
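A minimal sketch of the title matching described above. This is not the actual script: the helper names (site_name, junk?) and the exact title layout ("Post title [Site Name]") are assumptions made for illustration.

```ruby
# Assumed title layout: "Some post title [Site Name]".
# Captures the text inside the last pair of square brackets at the end of the title.
SITE_SUFFIX = /\[([^\[\]]+)\]\s*\z/

# Hypothetical helper: returns the site name, or nil if the title has no "[...]" suffix.
def site_name(title)
  match = SITE_SUFFIX.match(title)
  match && match[1].strip
end

# Hypothetical helper: true if the item comes from a site you do not want to read.
def junk?(title, unwanted_sites)
  unwanted_sites.include?(site_name(title))
end
```

For example, junk?("Hola mundo [Blog ES]", ["Blog ES"]) would be true, while a title without a bracketed suffix is never treated as junk.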

Then it suddenly dawned on me last night: why not write a script to pre-process the feed? Considering that I have been using a script to grab Javablogs all along, it is surprising I did not think of this earlier.

While writing the script tonight, more ideas came along. The resulting script does the following:

  • ignore: news items whose title contains any of the ignore texts you specify are removed from the RSS content.
  • highlight: "(*)" is appended to the item title so that you can easily set up rules for those items in your RSS reader.
  • alert: very often, people republish their blog, which results in the entire site archive showing up in the feed. I consider it an anomaly if the number of entries for a site exceeds 10.
  • log: a processing log is generated as a news item.
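The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: it works on already-parsed items (a plain Struct with a title) rather than on real RSS XML, and the names Item, site_of, and preprocess are hypothetical. The "(-)" marker on alerted items follows the update noted below.

```ruby
# Hypothetical stand-in for a parsed RSS item; the real script works on the feed itself.
Item = Struct.new(:title)

# From the post: more than 10 entries for one site is treated as an anomaly.
ALERT_THRESHOLD = 10

# Assumed title layout: "Post title [Site Name]". Returns nil if there is no suffix.
def site_of(title)
  title[/\[([^\[\]]+)\]\s*\z/, 1]
end

def preprocess(items, ignore_texts, highlight_texts)
  log = []

  # 1. Ignore: drop items whose title contains any ignore text.
  kept = items.reject do |item|
    hit = ignore_texts.find { |text| item.title.include?(text) }
    log << "ignored: #{item.title}" if hit
    hit
  end

  # Count the remaining entries per site for the alert rule.
  counts = Hash.new(0)
  kept.each { |item| counts[site_of(item.title)] += 1 }

  kept.each do |item|
    site = site_of(item.title) # read the site before any marker is appended
    # 2. Highlight: mark matching titles with "(*)".
    item.title += " (*)" if highlight_texts.any? { |text| item.title.include?(text) }
    # 3. Alert: mark items from sites that exceed the threshold with "(-)".
    item.title += " (-)" if site && counts[site] > ALERT_THRESHOLD
  end

  counts.each do |site, n|
    log << "alert: #{site} has #{n} entries" if site && n > ALERT_THRESHOLD
  end

  # 4. Log: the processing log becomes one more news item.
  kept << Item.new("Pre-process log: " + log.join("; "))
  kept
end
```

Whether ignored items should still count toward the alert threshold is a design choice the post does not settle; this sketch counts only the items that survive filtering.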

I am happy now.

UPDATE: The script now appends "(-)" to the titles of alerted items and handles empty lines in the configuration file.
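As an illustration of the empty-line handling, here is a small sketch. The post does not describe the configuration file layout, so the one-pattern-per-line format and the name parse_patterns are purely assumptions.

```ruby
# Hypothetical config format: one pattern per line.
# Surrounding whitespace is trimmed and empty lines are skipped.
def parse_patterns(text)
  text.each_line.map(&:strip).reject(&:empty?)
end
```

Skipping blank lines matters because an empty pattern is contained in every title, so it would otherwise match every item.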
