How to Make a Top Blog List

14 December 2008

My Top 100 Blogs for Developers is, without a doubt, the single most popular post I ever created. On this page I want to share with you the (somewhat complicated) process I use in making such a list. I am sure many people in the world would be interested in other lists, like a Top 100 Blogs for Secretaries, a Top 100 Blogs for Managers, a Top 50 Blogs for Dog Lovers, or a Top 200 Blogs for Tree Huggers. However, I will not be the one to make those lists. But if someone else wants to, then I invite them to follow the steps described below.

Warning: if you want to make a Top Blog list in the way I describe here, you must have some skills in working with spreadsheet formulas! And some stamina is useful as well, because it’s a lot of work…

Research

Making a Top X Blogs list begins with finding the right candidate blogs. Here are a few suggestions…

1) You can start by posting a call for votes, and ask your readers to submit blogs (possibly their own). But that probably won’t be enough, as most readers prefer to let you do all the work yourself…
2) That means that you will also have to do some searching to find the URLs of popular blogs. You can use Digg, Delicious, Technorati and other social networks to find the most prominent blogs in your area of interest. Maybe there are some other lists already out there, so make sure that you also use Google to search for top blogs. (Note: I suggest that you use a spreadsheet to keep track of the blogs and the calculations that follow in the next section.)
3) You can save yourself a lot of time if you restrict your top list to blogs with a certain Google PageRank or higher. For my own list I only consider blogs with a PageRank of 4 or higher. (It turns out that blogs with a lower PageRank don’t make it into the final list anyway, even when I give them the chance.)
4) Once you have a good number of blogs, some submitted by readers and most discovered through your own strenuous research, you might want to check each candidate’s references to other blogs. Blog authors usually read blogs by other authors in the same area of interest. So if you check their blog rolls, the Twitterers they follow, and/or the links they have in their blog carnivals, you will usually discover some new candidates that you missed with steps 1) and 2). (Note: this can be a lot of work, depending on the number of candidates you already had. So you should restrict yourself to a good sample of them.)
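
If you keep your candidates in a script rather than a spreadsheet, the PageRank cut-off suggested above could look something like this. It is only a rough Python sketch: the function name, the data layout and the sample URLs are my own illustrative assumptions, not data from my list.

```python
def filter_candidates(candidates, min_pagerank=4):
    """Keep unique blogs whose PageRank is at least min_pagerank.

    candidates: list of (url, pagerank) pairs; pagerank is None when unknown.
    """
    seen = set()
    kept = []
    for url, pagerank in candidates:
        # Simplification: treat trailing-slash and case variants as the same blog.
        key = url.rstrip("/").lower()
        if key in seen:
            continue
        seen.add(key)
        if pagerank is not None and pagerank >= min_pagerank:
            kept.append((url, pagerank))
    return kept


# Hypothetical sample data.
sample = [
    ("http://example.com/blog-a/", 5),
    ("http://example.com/blog-a", 5),      # duplicate submitted by a reader
    ("http://example.com/blog-b/", 3),     # below the cut-off
    ("http://example.com/blog-c/", None),  # PageRank unknown
]
print(filter_candidates(sample))  # [('http://example.com/blog-a/', 5)]
```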

For my Top 100 Blogs for Developers I ended up with between 200 and 300 blogs. Important: doing statistics and calculations with that many blogs is a lot of work, so make sure you limit the size of your top list to something you can manage time-wise!

Statistics

When you’ve found enough candidates for your top list, it is time to start collecting their statistics. Most of the statistics mentioned in this section are controversial, for one reason or another. But they are all we have. So the best we can do is not to rely on any one of them too heavily. That’s why I use multiple statistics. When you take the average of multiple statistics, the deficiencies tend to cancel each other out. And no blog is punished too heavily for doing badly in one specific category.

The first statistical category you need is Google’s PageRank. This number (from 0 to 10, where 10 is the best) is an indication of the relative importance of a site or page, according to Google. You can easily find this number by installing the Google Toolbar in your browser. (Note: these PageRank values are republished about once every three months.)
The second statistical category is formed by the traffic rankings according to Alexa. You can find a blog’s rank by typing its URL on Alexa.com. (Important: this is the only statistic where a low number means a good score.) Alexa has traffic rankings for many blogs and sites, but there’s one catch…

For some platforms Alexa only tracks the traffic of the entire platform, and not that of the individual blogs hosted on it. For blogs on Blogger.com and TypePad.com this is not a problem, because Alexa maintains a separate ranking for each of them. If you check the ranking for Evolving Web (http://ourfounder.typepad.com/) you see that Alexa tracks its ranking separately. But for MSDN.com, Alexa has just one ranking: that of the MSDN site as a whole. If you check the ranking for J.D. Meier’s blog (http://blogs.msdn.com/jmeier/) you will see it has an extremely good ranking. But it would be unfair to use that ranking for each individual blog hosted on MSDN.

Likewise, there may be a corporate site that draws a lot of traffic, with a minor blog hosted on the parent site that forms only a small part of the corporate site. In such a case Alexa would give you the ranking of the parent (corporate) site, and it would not be fair to use that number for the blog. If you check the ranking for ThoughtBlogs (http://blogs.thoughtworks.com/) you see the Alexa ranking is for the corporate site ThoughtWorks.com, while the blogs are only a smaller part of that site.

One last issue is that Alexa does not always have traffic ranks available. In some cases they simply have no data. All things considered, it means that the Alexa rank for some blogs must be set to unknown. This is important to take into account when doing your calculations (see next section). In my experience, about 9% of the blogs I checked out have no rating on Alexa.
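
The rule of thumb for Alexa (treat the rank as unknown when it belongs to a shared parent site, or when there is no data) can be sketched in Python as follows. The set of shared hosts is my own assumption, taken from the two examples above, and the rank numbers in the usage lines are made up.

```python
from urllib.parse import urlparse

# Hosts where Alexa reports one rank for the whole site (assumed list,
# based on the examples in the text; extend it for your own candidates).
SHARED_HOSTS = {"blogs.msdn.com", "blogs.thoughtworks.com"}


def usable_alexa_rank(blog_url, reported_rank):
    """Return the Alexa rank to record, or None when it must be 'unknown'."""
    host = urlparse(blog_url).netloc.lower()
    if reported_rank is None or host in SHARED_HOSTS:
        return None
    return reported_rank


# Made-up rank values, for illustration only.
print(usable_alexa_rank("http://blogs.msdn.com/jmeier/", 20))       # None
print(usable_alexa_rank("http://ourfounder.typepad.com/", 250000))  # 250000
```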
The third statistical category you will want to check out is the Technorati Authority for each blog. This number reflects the number of other blogs linking to the blog you’re investigating. It differs from Google’s PageRank because a) only referring blogs are considered; b) each other blog is counted only once, even if it links 1000 times; and c) it doesn’t distinguish between minor and major blogs among the referrers. You can find the Technorati Authority for a blog by typing its URL in the Technorati search box.

Similar to Alexa’s ranking, the Technorati Authority numbers are not always available. Blog authors have to submit their blogs to Technorati, or else Technorati will not maintain their authority numbers. This means that the Technorati Authority for some blogs must be set to unknown. In my experience, about 12% of the blogs I checked out have no rating on Technorati.
The fourth statistical category you can consider is the number of links found with search engines. Google makes this easy by allowing you to type link:<url> in Google’s search box. This will give you a figure that indicates the total number of pages linking to a blog. Of course, this is just a simple alternative for Google’s own (much more advanced) PageRank algorithm. However, PageRank has only 11 possible values (0 to 10), so the number of hyperlinks in search results helps to better differentiate between blogs that have the same PageRank value.
The fifth statistical category that I find very useful is the number of comments on a blog. It is a measure of interactivity, showing us how many people actually spend time being involved in discussions. You may want to calculate this statistic by adding the total number of comments of the last five or ten posts per blog, depending on how much time you have on your hands. You may also want to skip the last post on each blog, as it would be a bit unfair when the last post was posted only a day ago and therefore did not have much chance to collect its comments.
Note that some blog authors don’t allow comments on their blogs. Like before, in those cases the resulting statistics will be unknown. In my experience, only about 4% of the blogs I checked out have no comments.
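
Counting comments this way is simple enough to write down in Python. This is just a sketch under my own assumptions: the function name and parameters are illustrative, and the comment counts are invented.

```python
def comment_score(comment_counts, posts_to_count=5, skip_latest=True):
    """Sum the comments on a blog's recent posts.

    comment_counts: comments per post, newest first, or None when the
    blog does not allow comments (the statistic is then unknown).
    """
    if comment_counts is None:
        return None
    start = 1 if skip_latest else 0
    return sum(comment_counts[start:start + posts_to_count])


# Invented numbers: the newest post (0 comments so far) is skipped.
print(comment_score([0, 12, 7, 3, 9, 4, 20]))  # 35
print(comment_score(None))                     # None
```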
The sixth statistical category that I use is RSSMicro’s FeedRank, as a measure of the number of RSS feed subscribers. This statistic has been introduced only recently and still has to prove its worth. However, I know of no other (somewhat reliable) independent means of measuring the number of feed subscribers.

When you check FeedRank it is important to consider this: many blogs offer their feeds in multiple formats (Atom and RSS). You should check the URL of each feed format because sometimes the different formats turn out to have different FeedRank values. (In those cases I simply take the highest number.)

Like before, some blogs have no FeedRank, and the resulting statistics will be unknown. In my experience, this applies to about 9% of the blogs I checked out.
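
Taking the highest FeedRank across a blog’s feed formats, and falling back to unknown when none is available, might be sketched like this in Python. The function name, the feed URLs and the values are hypothetical.

```python
def best_feedrank(ranks_by_feed_url):
    """Return the highest FeedRank among a blog's feeds, or None if none has one."""
    known = [rank for rank in ranks_by_feed_url.values() if rank is not None]
    return max(known) if known else None


# Hypothetical feed URLs and values.
feeds = {
    "http://example.com/feed/atom": 3,
    "http://example.com/feed/rss": 5,
}
print(best_feedrank(feeds))                                   # 5
print(best_feedrank({"http://example.com/feed/rss": None}))  # None
```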
The seventh and final statistical category that I use is Twitter Grader, as a measure of a blogger’s success in micro-blogging. I only use the rank on Twitter Grader if there’s a one-to-one relationship between the Twitter account and the blog, and they refer to each other.

One last comment about finding these statistics: for each category you should aim to collect all data on the same day. All statistics are regularly updated, and you don’t want that to happen right in the middle of your statistical analysis!

Calculations

When you’ve collected the statistics for each blog in your spreadsheet (each blog on a new row, and each statistical category in its own column) it is time to do the calculations. I will show you my methods, and the motivation behind them:

First of all, for each statistical category, I create a new column in the spreadsheet that calculates the rank of each blog in that category. I do this to make sure that a) there’s a #1 blog in each statistical category, and b) all other blogs are numbered from #1 to #X, where X is the total number of blogs for which I have a statistic available. In Excel I use the following format for the formulas (using Technorati Authority here as an example):
=IF(_authority_<>""; RANK(_authority_;_columnofallauthorities_; 0); "")

For each blog, this formula calculates if the blog is the #1 blog in this statistical category, or the #2, or #3, etc. (Blogs will end up with the same rank, if they have the same authority number, but that’s ok.) A similar formula must be constructed for PageRank, Alexa Rank, Google hits, Comments and FeedRank. Basically, it means that we’re doing away with the different scales and types of the six statistical categories. We simply end up with six columns of rankings, where the best blog scores #1, and the others follow behind it.

Important: the Alexa statistic is the only one where a low number means a good score! In that case you must change the formula so that the lowest, and not the highest, number is ranked as #1:

=IF(_alexarank_<>""; RANK(_alexarank_;_rangeofallalexaranks_; 1); "")

One last thing to point out is that we must deal with statistics that are unknown (or empty). In this formula unknown values will simply lead to unknown rankings (empty cells).
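
For readers who prefer a script over a spreadsheet, here is a small Python sketch of what the two ranking formulas do. It is my own approximation of the spreadsheet RANK behaviour (ties share a rank, unknown values stay unknown); the function name and the sample numbers are assumptions.

```python
def rank(values, ascending=False):
    """Rank values like the spreadsheet RANK function.

    ascending=False: highest value is #1 (e.g. Technorati Authority).
    ascending=True:  lowest value is #1 (e.g. Alexa rank).
    None means 'unknown' and yields an unknown (None) rank.
    """
    known = [v for v in values if v is not None]
    ranks = []
    for v in values:
        if v is None:
            ranks.append(None)
        elif ascending:
            ranks.append(1 + sum(1 for other in known if other < v))
        else:
            ranks.append(1 + sum(1 for other in known if other > v))
    return ranks


# Made-up statistics for a handful of blogs.
authority = [250, None, 900, 250, 40]
print(rank(authority))              # [2, None, 1, 2, 4]

alexa = [120000, 45000, None, 800000]
print(rank(alexa, ascending=True))  # [2, 1, None, 3]
```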
The second step I take is to give each blog a number of points that depends on its rank in step 1. For example: suppose I have 115 statistics for Technorati Authority in my list of blogs, out of 125 blogs (for 10 blogs this statistic is not available). Then the #1 blog in the Technorati Authority category will earn 115 points. The #2 will earn 114 points, and so forth. The last one earns 1 point. And the 10 blogs without a statistic get an empty result. You can achieve that with a formula that follows this format:
=IF(_authorityrank_<>""; COUNT(_rangeofallauthorityranks_)+1 - _authorityrank_; "")

This formula takes the results of step 1 as its input (authorityrank). Again, it knows when a statistic was unknown, and it gives an empty result in those cases.
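
The same points calculation, written as a Python sketch (the function name and the sample ranks are illustrative assumptions):

```python
def points(ranks):
    """Convert ranks to points: the #1 blog earns as many points as there
    are known ranks, and an unknown rank (None) stays unknown."""
    known = sum(1 for r in ranks if r is not None)
    return [None if r is None else known + 1 - r for r in ranks]


# Four blogs have a rank, one does not.
print(points([2, None, 1, 2, 4]))  # [3, None, 4, 3, 1]
```

Note that when two blogs tie for a rank, they earn the same number of points and the next lower point value is simply skipped.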
In each statistical category there are blogs for which we don’t have a measurement. That means that the points in step 2 end up having different scales. If 115 blogs had a Technorati Authority then the best one will have earned 115 points. But when only 87 blogs had an Alexa Rank, then the best one will have earned 87 points in that category. This means we need to normalize the results, to make them comparable. You can do that using a formula like this:
=IF(_authoritypoints_<>""; 100/COUNT(_rangeofallauthoritypoints_) * _authoritypoints_; "")

This formula takes the number of points from step 2 as its input (authoritypoints). It then re-scales the points so that the best blog in each category scores 100 and the others score proportionally less. This enables us to prepare the final ratings in the last steps.
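
As a Python sketch, the normalization comes down to a single multiplication per blog (again, the function name and the numbers are only for illustration):

```python
def normalize(points):
    """Re-scale points so that the best blog in a category scores 100.

    Unknown points (None) stay unknown.
    """
    known = sum(1 for p in points if p is not None)
    return [None if p is None else 100 / known * p for p in points]


# Four blogs with points, one unknown: each point is worth 100/4 = 25.
print(normalize([3, None, 4, 3, 1]))  # [75.0, None, 100.0, 75.0, 25.0]
```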
Before you continue with the last step, you might want to allow the different statistical categories to have different weights. Personally I think that Google PageRank, Alexa Rank, Technorati Authority and Google hits are the most important ones. The others are interesting, but either less reliable (FeedRank) or a little too volatile (Comments). Therefore, for my own top 100 list, I decided to double the weights of the first four statistical categories, which resulted in the following:
20% Google PageRank
20% Technorati Authority
20% Alexa Rank
20% Google hits
10% RSSMicro FeedRank
10% Comments

It means that, in my final calculations, the measure of RSS feed subscribers accounts for 10% of the final ranking, while Technorati’s authority numbers influence 20% of my results.
Now that you’ve come to the end of the calculations, I hope you understand that it’s important to calculate the (weighted) average score over all statistical categories, for each blog in your list. You should not calculate the sum, as this would punish blogs that have no comments, no Technorati Authority, or no Alexa ranking. I’m sure that’s not what you want. When you take the average score you allow blogs to have missing ratings, and you simply determine how well they’ve scored in the categories that they do participate in.

After you’ve calculated the average scores in step 5, it will be easy for you to order the results according to those scores, and you will have made yourself your own Top Blogs list!
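
To make the last two steps concrete, here is a Python sketch of a weighted average that skips unknown categories, followed by the final sort. Dividing by the total weight of the known categories only is my reading of "take the average, not the sum"; the category keys, blog names and scores are made up.

```python
# Weights from the list above (percentages).
WEIGHTS = {
    "pagerank": 20, "technorati": 20, "alexa": 20,
    "google_hits": 20, "feedrank": 10, "comments": 10,
}


def final_score(scores, weights=WEIGHTS):
    """Weighted average over the categories for which a score is known."""
    known = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in known)
    if total_weight == 0:
        return None  # no statistics at all for this blog
    return sum(weights[name] * s for name, s in known.items()) / total_weight


# Made-up normalized scores (0 to 100) for two hypothetical blogs.
blogs = {
    "Blog A": {"pagerank": 100.0, "technorati": 80.0, "alexa": None,
               "google_hits": 90.0, "feedrank": 50.0, "comments": None},
    "Blog B": {"pagerank": 75.0, "technorati": 100.0, "alexa": 60.0,
               "google_hits": 70.0, "feedrank": None, "comments": 40.0},
}

# Best score first; this assumes every blog has at least one known score.
ranking = sorted(blogs, key=lambda name: final_score(blogs[name]), reverse=True)
for position, name in enumerate(ranking, start=1):
    print(position, name, round(final_score(blogs[name]), 1))
```

Blog A is not punished for its two missing categories: its score is averaged over the 70% of the weight for which it has data.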

If you have created a list that you want to share with others, or if you think the research/calculation process can be further improved, please feel free to share this with us in the comments section.