It is a growing belief in the SEO world that getting links from highly relevant pages is no longer just valuable, but necessary in order to rank. I am by no means the only SEO to doubt the veracity of these claims (here are Michael Martinez and Julie Joyce on the issue in 2007), but despite their reasoned arguments, the myth persists.
So, I thought it was time to put some data to the test using the same Wikipedia Link Modeling that we had used in the past to test theories on Link Depth, Link Proximity and other link diagnostics.
Last year, after SEOMoz’s groundbreaking work on the relationship between Latent Dirichlet Allocation and Google rankings, we brought on Andrew Cron, a Ph.D. statistics candidate at Duke University, to build our own in-house LDA model. While we now use this in nearly every content writing endeavor, it has also been useful for testing theories about content relevance.
The Plan
The strategy is actually quite simple.
- Get the backlinks of around 50 unique Wikipedia articles and then determine the LDA score of the title of those Wikipedia pages to the content of the backlinking pages.
- Compare a single piece of content to 1000 randomly selected words to determine the random distribution of English language topical relationships.
- Observe if Wikipedia backlinking pages generally out-perform random content in terms of relevancy.
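The comparison in the last step can be sketched in a few lines of Python. This is only a rough illustration, not the model we actually ran: it assumes each page has already been reduced to a single numeric LDA score, and the function name `share_above_random`, the 95th percentile cutoff and the toy numbers are all hypothetical choices made for the example.

```python
from statistics import quantiles


def share_above_random(backlink_scores, random_scores, percentile=95):
    """Return (share, cutoff): the fraction of backlink scores that beat the
    given percentile of the random-content scores, and that cutoff value.

    Assumption: scores are plain numbers where higher means more relevant.
    """
    # quantiles(..., n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cutoff = quantiles(random_scores, n=100, method="inclusive")[percentile - 1]
    above = sum(1 for score in backlink_scores if score > cutoff)
    return above / len(backlink_scores), cutoff


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not the study's data.
    random_baseline = list(range(101))
    backlinks = [10, 50, 96, 99]
    share, cutoff = share_above_random(backlinks, random_baseline)
    print(f"cutoff={cutoff}, share above random={share:.0%}")
```

A percentile cutoff is just one reasonable way to decide what "out-performs random" means; a different threshold or a formal statistical test would work in the same spot.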
The Results
The results were actually quite unimpressive. From what we can see, the overwhelming majority of pages that link to Wikipedia articles are no more topically relevant to the article they cite than random content is.
The graph above shows the distribution of LDA scores from random content vs. Wikipedia backlinks. As you can see, there is a great drop off in the random scores above an LDA score of 80. A better representation is that of the differentials below.
About 15% of links are relevant enough that they cannot be explained by randomness. This seems to stand in stark contrast to any expectation that all links one acquires should be topically relevant to the subject matter of the page. In fact, we actually see a cluster of pages that are distinctly different from the pages to which they link. While some of this can be explained by very thin content (a surefire way to get a low LDA score is to have almost no content on the page), we find another phenomenon occurs.
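For readers who want to reproduce the idea of a differential on their own data, here is a minimal sketch. It assumes the scores fall on a roughly 0 to 100 scale and uses buckets of 10 points; the function names `binned_share` and `differential` and the bucket width are hypothetical, chosen only for this example.

```python
from collections import Counter


def binned_share(scores, bin_width=10):
    """Share of scores falling in each bucket, keyed by the bucket's lower edge."""
    counts = Counter(int(score // bin_width) * bin_width for score in scores)
    total = len(scores)
    return {bucket: count / total for bucket, count in counts.items()}


def differential(backlink_scores, random_scores, bin_width=10):
    """Per-bucket difference: backlink share minus random-content share.

    Positive values mark score ranges where backlinking pages are
    over-represented relative to random content; negative values mark
    ranges where they are under-represented.
    """
    backlink = binned_share(backlink_scores, bin_width)
    random_ = binned_share(random_scores, bin_width)
    buckets = sorted(set(backlink) | set(random_))
    return {b: backlink.get(b, 0.0) - random_.get(b, 0.0) for b in buckets}
```

Buckets with a clearly positive differential at the high end correspond to links that randomness does not account for, while a positive bump at the low end would correspond to the cluster of distinctly unrelated pages.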
Why People Link
We actually find that this reinforces two reasons why people link out on the web.
- Citation Links: These are links where the webmaster is citing content they have included on their page. You would expect high LDA scores because the writer is merely giving credit to the original source of that content (quoted or paraphrased).
- Descriptive Links: These are links where the webmaster is choosing to link to content rather than write about it. Because the link is offered in lieu of writing out the content, you can expect lower than average LDA scores. The link is there explicitly so the related content does not have to be. It is an alternative to relevant content.
Takeaways
Does this mean you should avoid getting links from related sites? Absolutely not. However, it does mean that you should not give up a link solely because the content is not textually similar to the content on your page. If the link is good for the user, it is good for Google.
I’ve always gone with the mantra “there’s no such thing as a bad incoming link”
Certain things just give it extra weight (a related site, good anchor text, etc.), but would I turn down a free incoming link purely because it was unrelated and used just my URL? Of course I wouldn’t.
I have to agree with Jim. Is this really a misconception in the industry? I’ve always operated under the assumption that any link is a good link and will at worst pass no value. There are certainly varying degrees of “goodness” that should inform your link building strategy as far as priorities go, but a less than relevant link is still better than no link at all.
It is more a misconception of SEO customers than SEO practitioners.
Intuitively, this seems wrong. As you point out, people will link to Wikipedia typically for citation or to provide extra information without putting it inline (probably the main reasons for linking anywhere). In both of these cases there is relevance to the content of the page containing the link, for exactly those reasons.
However there may not be much discernible relevance using text comparison algorithms. I believe that’s the case here.
To what extent search engines rely on surface relevance (text matches) and/or something deeper (the relevance that we as humans would recognise) is another question. (It’s also worth considering there may be significant text matching a link or two back, a search engine may determine relevance through longer chains).
Also I’m not altogether convinced that because the LDA model correlated well with Google on other tests that it necessarily correlates well with this test.
Taking Google out of the picture, if a Web publisher wishes to make their material maximally available to people that are interested in the topic of the material, links from relevant pages will be more valuable than random links, because people follow their noses from page to page. For the same reason, this also serves the reader better. I would hope a good search engine would take this into consideration.
Hi Danny, thanks for your comments. I have a couple of thoughts but, I admit, much of this is conjecture.
First, I don’t agree that there is necessarily relevance (or statistically discernible relevance) in the second link type. For example, if someone were to write a long page about Michael Jordan, they would probably spend most of their time writing about his basketball career and only in passing mention that he was an entrepreneur and owns several businesses. The author then may choose to link the word entrepreneur to the Wikipedia page so that readers can learn what is meant by the term. One could actually argue that the irrelevance of the page to the word increases the need to link to a resource to describe it. I can’t be certain that is the phenomenon described here, but I think it is worth considering.
There are many statistical methods for determining relevancy, but we use LDA here in particular because we can use Google’s very own pLDA code to do the computation. What really matters here is that, using a common, well understood topic relevancy model that Google likely uses, only approximately 15% of Wikipedia linking pages were more relevant to the page than could be explained by random English language content. It is important to note that by random English language content, I do not mean just randomly throwing together a bag of words, but rather randomly selecting a word and then building a single article based on the pages ranking for that term in Google.
Finally, I am not certain that the textual relevancy of the linking page should be considered an important ranking factor. The narrowness of the relevancy of all inbound links might also indicate the narrowness of that content to the potential reader. If my article on entrepreneurship only receives links from other articles on entrepreneurship, is it truly relevant to the general audience of those searching for entrepreneurship? The diversity of link sources might be a stronger indicator of content balanced to the needs of your average searcher, while a narrow, curated source might be more appropriate to refined, specific queries.
Regardless, thanks for the good thoughts. Nice to think out loud with smart people 🙂
I think the only myths and misconceptions are the ones being spread by this article and the comments. You start off your article by saying that it’s a misconception that “getting links from highly relevant pages is no longer just valuable, but necessary in order to rank.”
I’m sorry but I have a huge problem with telling people that link relevance isn’t essential. Usually your customers will have commercial websites and will be competing with other commercial websites in their sector. While I agree that it’s healthy to have some irrelevant links, you won’t be competitive if all your links are of low or no relevance. Therefore I firmly think it’s correct to say that getting highly relevant links is necessary.
Let’s consider this scenario:
A client wants to improve his organic traffic in the UK motor sector. He gets most of his links from China, from pages about non-relevant topics. Do you think this client is going to achieve his goals?
Geographical relevance is just as essential.
Topical relevance must be obtained in any competitive environment. Do all links have to be highly relevant? Of course not. But saying link relevance isn’t essential is misleading.
I would tell people to pay attention to the quality of their links. Spammy links can and will hurt you. Yes, it can be better to have no link at all than to have one that is from a bad neighbourhood.
Russ – good piece of research. Maybe it would help to test it against 2 or 3 other authority sites as well?
As a link building company, we periodically have to defend some of our links because they might not be on a totally relevant page — Even a sponsorship on a nice site or a piece of content we placed on a very solid blog.
Even though rankings and traffic are going up, there is this paranoia about links from non-relevant pages. Google can be very good at promoting the “fear” without the systems in place to back it up.
Good article Russ. This is something I’ve believed and tried to persuade others for some time now.
I would argue that unless you are writing highly technical or micro-focussed niche content, the nature of the web and the way people browse and engage with social networks to find and link to content they like or find relevant or amusing means that a more natural link profile will (and should) contain more non-relevant links than relevant ones.
Playing Devil’s Advocate, you could argue that sites with only highly relevant inbound links could be used as a measure by the search engines to penalise sites that are manipulating the rankings 😀
Hi !
Interesting article, but I’m not sure I completely understood your experiment.
You proved that there’s randomness in the topics of the pages that link to some Wikipedia articles, not that these irrelevant or off topic links are what’s causing these articles to rank well despite the “myth”…
Am I missing something ?
What they found was that only a small fraction of the links to those Wikipedia articles were “relevant”.
If you’re all about article publishing, great news: you don’t need relevant links.
Try that with a product though…
Unfortunately I think that you’ve got it all wrong. Wikipedia is one of the worst sites that you could have chosen for this test for so many reasons. First of all it has an extremely high amount of authority, which skews the results. Another reason is that its power comes mainly from its strong internal linking, lack of outbound links, and strict editing guidelines (aka, Google favors them).
I have to side with Miguel here. Though I think it’s a great study and well worth researching, I think that using Wikipedia as the subject was a bad choice. Wikipedia is known to rank for a huge number of keyphrases because it is such an authoritative site, as well as for the other reasons Miguel mentioned. I’d be interested to see this same study on sites that don’t have above average authority but rank well for a competitive keyphrase.
I think this ends up helping site owners who are willing to game links. Allowing sites to benefit from more links regardless of relevance kinda makes it too valuable not to just get every link imaginable.
One thing is for darn sure, those who are trying their hand at SEO themselves are at a complete loss. There’s simply so much conflicting info.
As soon as one clears it up (such as what this post may be attempting to do), one reads the very next post that comes across that says “what you previously read is wrong”.
Good information. If you get a backlink from a lower PR site, how much benefit does your site get?
This article is about as correct as saying McDonald’s is health food.
Just await the coming Penguin algo being incorporated into the core algorithm, boy are we going to see heads roll then.
Does anyone realize how many sites use bought links? It’s a good portion of the net. Then how many use comment spam or article spins to game rankings? Then how many use sitewide footer and nav links from parent companies?
In my backlink audits I’ve seen more than enough to understand that once Google nails down Penguin, SERPs will change drastically.
Your goal as a website on Google’s SERPs is to provide content for the user. If your site has backlinks from phone articles and you sell chalkboards, and those kinda backlinks make up most of your profile, you’re going to get the axe. Same goes for all those directories and advertisement backlinks.
Sooner the better.