Tuesday, 22 October 2013

TF-IDF Killed The Copywriting Spam

Note: If you want to play with TF-IDF, download this spreadsheet. The first tab is a simple TF-IDF calculator. Enter the occurrences of a word, the words in each document, total documents and the number of documents containing the phrase. It does the rest. The second tab demonstrates how TF-IDF falls as the number of documents containing a phrase goes up.

Term Frequency-Inverse Document Frequency (TF-IDF) proves that, in the modern SEO game, quality trumps quantity.
TF-IDF is a text-mining algorithm …
Wait! Don’t run! I’m not here to teach you the math behind TF-IDF. Truth is, I barely understand it myself. But Term Frequency, Inverse Document Frequency (TF-IDF – a great phrase for the next SEO cocktail party you attend) contains some crucial lessons for us copywriters.
Here’s a very brief description of TF-IDF and how it works (a little fancy math involved):

TF = Term Frequency

We all know this one:
If our key phrase is “flibbergibbet,” and it occurs 4 times in a document that’s 400 words in length, then the TF for “flibbergibbet” is:
4 / 400 = 1%
Some folks call this keyword density. But we’re past that now.
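If you'd rather see that as code, here's a minimal sketch of the same arithmetic. The function name term_frequency is made up for this example; it isn't from any library.

```python
def term_frequency(occurrences: int, total_words: int) -> float:
    """Share of a document's words taken up by the key phrase."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return occurrences / total_words


# "flibbergibbet" occurs 4 times in a 400-word document.
tf = term_frequency(4, 400)
print(f"{tf:.2%}")  # 1.00%
```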

Inverse Document Frequency

Inverse Document Frequency (IDF) is the inverse of the number of documents in which a phrase occurs. That’s a terrible description – I know that because the mathematicians I know all punched me in the arm after I said it. But it’ll work for our purposes.
In case you want to know:
IDF = log(total documents/number of documents with phrase)
So, if “flibbergibbet” appears in 250 out of 1000 documents, the IDF is:
log(1000/250) ≈ 0.6 (that's a base-10 log)
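Here's a rough sketch of that formula in Python. It assumes a base-10 log, since that's the base that turns 1000/250 into roughly 0.6. The function name is hypothetical, and real search systems often use other bases or smoothed variants.

```python
import math


def inverse_document_frequency(total_documents: int, documents_with_phrase: int) -> float:
    """IDF = log10(total documents / documents containing the phrase)."""
    if not 0 < documents_with_phrase <= total_documents:
        raise ValueError("documents_with_phrase must be between 1 and total_documents")
    return math.log10(total_documents / documents_with_phrase)


# "flibbergibbet" appears in 250 out of 1000 documents.
idf = inverse_document_frequency(1000, 250)
print(round(idf, 2))  # 0.6
```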

TF-IDF

TF-IDF is the Term Frequency times the Inverse Document Frequency, or TF*IDF.
Here’s the thing about IDF that you must understand: As the number of documents containing a phrase goes up, the TF-IDF score goes down. Have a look at this graph — document frequency goes up as you move to the right:
[Graph: TF-IDF goes down as document occurrences go up]
Yikes. So, the more pages that mention a phrase, the less important that phrase appears on any one specific page.
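To watch the score fall without opening a spreadsheet, here's a small sketch that reuses the numbers from the examples above (4 occurrences, 400 words, 1000 documents). The function name tf_idf and the list of document counts are assumptions picked for illustration, and the log is base 10 to match the IDF example.

```python
import math


def tf_idf(occurrences, total_words, total_documents, documents_with_phrase):
    """Term Frequency times Inverse Document Frequency (base-10 log)."""
    tf = occurrences / total_words
    idf = math.log10(total_documents / documents_with_phrase)
    return tf * idf


# Hold the page constant and let more and more documents contain the phrase.
for docs_with_phrase in (10, 100, 250, 500, 1000):
    score = tf_idf(4, 400, 1000, docs_with_phrase)
    print(f"{docs_with_phrase:>4} docs: {score:.4f}")
# Prints:
#   10 docs: 0.0200
#  100 docs: 0.0100
#  250 docs: 0.0060
#  500 docs: 0.0030
# 1000 docs: 0.0000
```

The page itself never changes. Only the number of other documents using the phrase does, and that alone drags the score to zero.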

What It All Means

We don’t know for certain if the search engines use TF-IDF to determine the importance of a word on a page. But it’s likely they use it or something very like it.
Say you want your website to rank well for our favorite word. You include the word at least 3 times on every single page of your site. That actually reduces the TF-IDF score of each page for “flibbergibbet.”
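Here's a quick sketch of that scenario, treating the site itself as the whole document collection. Every number in it is an assumption for illustration: 150 pages, 400 words per page, 3 mentions per page. The function page_score is a made-up name, not anything a search engine publishes.

```python
import math


def page_score(occurrences, words_on_page, total_pages, pages_with_phrase):
    """TF-IDF for one page, using the site's own pages as the collection."""
    tf = occurrences / words_on_page
    idf = math.log10(total_pages / pages_with_phrase)
    return tf * idf


# Assumed numbers: 150 pages, 400 words each, 3 mentions where the phrase is used.
everywhere = page_score(3, 400, total_pages=150, pages_with_phrase=150)
focused = page_score(3, 400, total_pages=150, pages_with_phrase=5)

print(f"phrase on every page: {everywhere:.4f}")
print(f"phrase on 5 pages:    {focused:.4f}")
```

With the phrase on every page, the IDF is log(150/150), which is zero, so each page's score is zero too. Keep the phrase to a handful of pages and those pages stand out.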
Of course, there are many, many other ranking factors. Thousands. If your site is 150 pages of fantastic content, and it:
  1. Has a unique, fully descriptive title tag for every page
  2. Has a unique structure for every page
  3. Doesn’t spin or duplicate content
  4. Uses fully descriptive ALT attributes, etc. etc.
… then TF-IDF probably doesn’t hurt you at all. A visiting search engine can use other signals to determine page relevance.
But content farmers, beware. If you crank out 999 pages of total crap, using your key phrase 5-10 times per page, all you’ve done is make it harder for a search engine to figure out which page is most important for that phrase.
If I were a search engine (and I’m not), I’d take that as a signal of a poorly organized site.
Wouldn’t you?

The Lesson

In the past, site owners created page after page expounding on a specific key phrase, repeating it time after time in articles that were barely different, poorly written and poorly structured. That’s still a standard “SEO copywriting” tactic. I use quotes because it’s not SEO copywriting at all.
TF-IDF explains why that tactic has lost its power. It also shows why the cliché “If you want to rank, write good stuff” really is the right strategy. TF-IDF means more isn’t necessarily better. So, write good stuff!