With my personal search engine Cloudgrove, I’ve come across an interesting problem with FeedBurner. Quite a few major sites use FeedBurner to manage their RSS feeds, but every so often it changes the URLs of the posts in a feed, usually by adding a “.” to the end. I’m currently keying off the post URL as a unique identifier, so whenever FeedBurner changes the URL I get duplicates.
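A short-term patch for the trailing “.” case could look something like the sketch below. It is only an illustration, not Cloudgrove’s actual code: `post_key`, `is_new_post`, and the in-memory `seen` set are hypothetical names, and it assumes that no legitimate post URL ends in a dot.

```python
def post_key(url: str) -> str:
    """Return a dedup key for a post URL (hypothetical helper).

    Assumption: a trailing "." is never a meaningful part of a post URL,
    so "http://example.com/post." and "http://example.com/post" are the same post.
    """
    return url.strip().rstrip(".")


# Hypothetical in-memory store of keys already indexed.
seen = set()


def is_new_post(url: str) -> bool:
    """Record the URL's key and report whether it had been seen before."""
    key = post_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this in place, the dotted and undotted forms of a URL collapse to one key, so the second one to arrive is treated as a duplicate.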
This is an easy enough problem to solve in the short run, but it concerns me a bit. What makes a post unique? Many sites append params to the URL to record how a user came to the site, such as ?source=rss. However, almost all sites use params in the URL in some form or another, and some of those identify the post itself, so removing all params is immediately out as a way to determine unique posts. I’ve seen in the Google documentation for some of their products that they remove session ID params and other such transient pieces. I’m curious how they go about this. Are they just removing params with common names for sessions and cache busting?
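One guess at how the “common names” approach might work is a denylist of param names that get dropped before the URL is used as a key. The sketch below uses Python’s standard `urllib.parse`; the names in `TRANSIENT_PARAMS` are assumptions chosen for illustration (only `source` comes from the example above), and this is not a description of what Google actually does.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed denylist of transient param names; a real list would need tuning.
TRANSIENT_PARAMS = {"source", "sid", "sessionid", "phpsessid", "jsessionid"}


def strip_transient_params(url: str) -> str:
    """Drop params whose names are on the denylist and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in TRANSIENT_PARAMS
    ]
    # urlencode may re-escape values, which is fine for a dedup key.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

For example, `strip_transient_params("http://example.com/post?id=42&source=rss")` keeps `id=42` and drops `source=rss`. The weakness is the one raised above: a site that uses `source` to mean something real would have distinct posts merged.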
I realize that Google goes much more in depth in comparing the similarity of pages. They have been fighting a long, hard battle against link farms, and rooting out identical pages is a key part of this. My intention with Cloudgrove, though, is to focus more on quality content sources and bypass the spam issues.
After doing some digging and going back through the logs, it appears that the issue with the FeedBurner feeds was limited to Tue, 11 Jul 2006. However, it also appears that since then they’ve changed the format of the URLs in their feeds. They attempted to make a clean cut between the old format and the new one, but there was some leakage, which also resulted in duplicates.