canonical url

15Jun06

Q: What is a canonical url? Do you have to use such a weird word, anyway?
A: Sorry that it’s a strange word; that’s what we call it around Google. Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages. For example, most people would consider these the same urls:

  • www.example.com
  • example.com/
  • www.example.com/index.html
  • example.com/home.asp

But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.

Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that url consistently across your entire site. For example, don’t make half of your links go to http://example.com/ and the other half go to http://www.example.com/ . Instead, pick the url you prefer and always use that format for your internal links.

Q: Is there anything else I can do?
A: Yes. Suppose you want your default url to be http://www.example.com/ . You can configure your webserver so that if someone requests http://example.com/, it does a 301 (permanent) redirect to http://www.example.com/ . That helps Google know which url you prefer to be canonical. Adding a 301 redirect can be an especially good idea if your site changes often (e.g. dynamic content, a blog, etc.).
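
As a rough illustration of the redirect rule described above, here is a minimal Python sketch. It is not tied to any particular webserver; the names PREFERRED_HOST, ALIASES and canonical_redirect are made up for this example, and a real site would normally express the same rule in its server configuration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this sketch: the host you picked as canonical,
# and the other host names that serve the same site.
PREFERRED_HOST = "www.example.com"
ALIASES = {"example.com"}

def canonical_redirect(url):
    """Return (301, target_url) if the request should be redirected, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALIASES:
        # Keep scheme, path and query; swap only the host (any port is dropped).
        target = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, target
    return None

print(canonical_redirect("http://example.com/"))
print(canonical_redirect("http://www.example.com/"))
```

Requests that already use the preferred host come back as None, so only the non-canonical variants get the permanent redirect.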

o Directory submissions.
These sites offer lists of niche directories to submit your link to:
http://www.directoryarchives.com
http://www.isedb.com

o News feeds.
PR Web and others offer free and paid services to give you a chance to secure links in news articles about your company.
http://www.urlwire.com/
http://www.prweb.com/

o Become an authority site and get sites to link to you for free by creating content people want to link to.
Example, http://www.searchenginewatch.com

o Leave testimonials.
Example, see the footer area (very bottom of the page) of: http://searchenginelowdown.com/

o Donate to charities and non-profit organizations.
Many sites allow you to donate to their cause and they will give you a link on their site.

o Volunteer web development services.
Many non-profit organizations need a web presence but cannot afford professional web design services. Offer them for free in exchange for a credit at the bottom of their pages.

o Make tools and content that people will want to bookmark.
Example, the link popularity check tool at:
http://www.marketleap.com/publinkpop/default.htm

o Give things away for free.

o Ask for links from suppliers and friends in your industry.

o Write articles and submit them to similar websites.

A new patent application from Microsoft describes some ways to identify some of the spam pages that show up in search engine results. The research that led to the application started off by looking at something else completely, but a chance discovery turned up some interesting results.

The initial research began with something Microsoft calls Pageturner. Pageturner is a project that looks at how often web pages update, and how frequently they might need to be crawled. It also looks at identifying duplicate and near duplicate content on web pages.

The Microsoft researchers on that project found themselves being drawn to some very different research after looking at some of their results, especially from some pages located in Germany, which changed too quickly. Here are a couple of papers that describe some of the results of the original research:

On the Evolution of Clusters of Near Duplicate Web Pages (pdf)

A Large-Scale Study of the Evolution of Web Pages (pdf)

A presentation from 2003, PageTurner: A large-scale study of the evolution of web pages (powerpoint), provides a little more insight into aspects of Pageturner, and some differences based upon different top level domains. There’s a lot of interesting information from this study. Here are some conclusions noted in the presentation:

  • Pages don’t change much from week to week
  • Pages have predictable change rate
  • Markup-only changes often due to
    • Session IDs
    • Banner ads
  • Large changes due to
    • Log files, weblogs, and crafty porn

That last one - crafty porn - was the key to some further research tackling web spam. Microsoft came out with a patent application this last week that includes some of the research initiated during the Pageturner project, and the followup research that it inspired. The title of the document is Content evaluation, and it was filed on September 30, 2004, and published on March 30, 2006.

Here’s the abstract from the patent application:

Evaluating content is described, including generating a data set using an attribute associated with the content, evaluating the data set using a statistical distribution to identify a class of statistical outliers, and analyzing a web page to determine whether it is part of the class of statistical outliers. A system includes a memory configured to store data, and a processor configured to generate a data set using an attribute associated with the content, evaluate the data set using a statistical distribution to identify a class of statistical outliers, and analyze a web page to determine whether it is part of the class of statistical outliers. Another technique includes crawling a set of web pages, evaluating the set of web pages to compute a statistical distribution, flagging an outlier page in the statistical distribution as web spam, and creating an index of the web pages and the outlier page for answering a query.

The named inventors are Marc Najork, Dennis Fetterly, and Mark Manasse (also here).

The three have also worked together on the following papers, which look at web spam more closely.

Detecting Phrase-Level Duplication on the World Wide Web (pdf)

While conducting research for the Pageturner project, looking at more than a few hundred million webpages, our researchers noticed a very large number of machine generated spam pages from a handful of servers in Germany. Those pages were generated by assembling “grammatically wellformed German sentences drawn from a large collection of sentences.” As a researcher, when you find something interesting like this, it’s difficult not to act upon it. The three Microsoft team members started looking for more:

This discovery motivated us to develop techniques for finding other instances of such “slice and dice” generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.

One clue that set off the discovery was that a number of these German pages were likely to change much more quickly than pages elsewhere. Delving deeper, they found a million pages from 116,654 hosts all sharing the same IP address, and operated by the same organization.

The paper describes how to locate phrases that are duplicated across pages, possibly taken from other sites, and joined together in grammatically correct sentences. It discusses a way of identifying those pages by a process called shingling, and it describes some other characteristics of automatically generated pages that are intended to lure people to sites from search engine searches.
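
To give a feel for what shingling means, here is a small Python sketch. It is a simplified illustration and not the algorithm from the paper: it treats every run of w consecutive words as a “shingle” and compares two texts by the share of shingles they have in common. The function names and the default w=4 are assumptions made for this example.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (runs of consecutive words) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts yield a single shingle holding all of their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(text_a, text_b, w=4):
    """Share of shingles two texts have in common (Jaccard similarity)."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

page_one = "cheap flights to berlin book your cheap flights today"
page_two = "cheap flights to berlin and hotels book your cheap flights today"
print(resemblance(page_one, page_two))
```

Pages stitched together from the same limited pool of phrases would share many shingles with each other, which is the kind of signal the paper goes looking for at a much larger scale.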

Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages (pdf)

Spam web pages that are machine generated tend to differ in a number of ways from most other web pages, and can possibly be identified through statistical analysis. This paper looks at some ways of finding those pages. The types of things that the paper notes as predictive of automatically generated spam pages include:

  • Pages with long “host” names and a large number of “characters, dots, dashes and digits” in them tend to be spam pages. A “host” name is the section of a URL before a domain name. On many sites, this is often a “www,” but some sites use a subdomain name (or different host name) in its place.
  • Host name resolutions may help point to spam pages. These are pages that all point to the same IP address, and share the same domain name, but have different host names. Example: “http://some-host-name.example.com” could be one of 20,000 addresses that all point to the same IP address.
  • Linkage properties: looking at the number of links embedded on a page compared to the number of links pointing to those pages. Are they similar to what is seen on other pages on other sites?
  • Content properties: A large number of automatically generated pages contain the exact same number of words, though individual words will differ from page to page. (Amongst the pages they were looking at, they found 944 such hosts serving 323,454 pages which all had “no variance in word count.”)
  • Content evolution properties: Spam pages tend to change every time they are downloaded, which stands out from much more slowly changing pages on other sites.
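
A couple of the signals in the list above are easy to picture in code. The Python sketch below is only an illustration under my own assumptions, not the researchers’ implementation: host_features counts the characters, dots, dashes and digits in a host name, and hosts_with_constant_word_count looks for hosts whose pages all have exactly the same word count. Both function names and the min_pages cutoff are made up for this example.

```python
from collections import defaultdict
from statistics import pvariance
from urllib.parse import urlsplit

def host_features(url):
    """Count the host name properties the paper lists as possible spam signals."""
    host = urlsplit(url).hostname or ""
    return {
        "length": len(host),
        "dots": host.count("."),
        "dashes": host.count("-"),
        "digits": sum(ch.isdigit() for ch in host),
    }

def hosts_with_constant_word_count(pages, min_pages=3):
    """pages is an iterable of (url, text); return hosts with zero word-count variance."""
    counts = defaultdict(list)
    for url, text in pages:
        counts[urlsplit(url).hostname or ""].append(len(text.split()))
    return sorted(
        host for host, c in counts.items()
        if len(c) >= min_pages and pvariance(c) == 0
    )

print(host_features("http://some-host-name.example.com/"))
```

On their own these counts prove nothing; the idea in the paper is to look at how they are distributed across a very large crawl and treat the statistical outliers as suspicious.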

Our researchers aren’t strangers to trying to locate spam pages on the web. A patent with Marc Najork named as the inventor, when he worked at Hewlett-Packard, also looks interesting:

System and method for identifying cloaked web servers

Abstract:

A search engine receives from a client a representation of a first object that was returned by a web server to the client in response to a request from the client. The search engine receives from the web server a second object in response to an identical request from the search engine, and compares the representation of the first object to a representation of the second object. The web server is determined to be cloaked if the representation of the first object does not match the representation of the second object. Typically, the client receives a URL embedded in a response to a search request submitted to the search engine. A toolbar operating in conjunction with the web browser on the client processes the URL. The processing includes: directing the web browser to obtain an object corresponding to the URL from a web server addressed by the URL; converting the object to a feature vector; and delivering the feature vector and the URL back to the search engine.

I recall a few forum posts from Ammon Johns (here’s one) which mentioned that a toolbar from a search engine could easily be used to help compare what a search engine sees when it spiders a site to what a human visitor sees when visiting the same page. It’s likely that if Hewlett-Packard has devised a means of doing this, the major search engines are capable of a similar ability.

If you prefer the movie version, instead of reading through a lot of this text, Dennis Fetterly gave a presentation at Purdue University on December 8, 2004, about using some of these processes to identify spam:

Using Statistical Analysis to Locate Spam Web Pages

The presentation is 36 minutes long, and includes a lot of examples, and a nice question and answer session.

In May, the WWW2006 conference includes a presentation of papers, including some on search spam. One of those will be Detecting Spam Web Pages through Content Analysis, in which our three collaborators are joined by Alexandros Ntoulas of the UCLA Computer Science Dept. Should be an interesting presentation.

Browsing forums you will find this question asked many times by newbies. Most old timers, and some people that just like to repeat the answer they were told, will tell you 20 or so a month, after letting your site simmer for 6 months.

So at what point do you blow off all the bullshit that you read on the forums and start doing your own thing, the way you see it should be done?

The answer is now, as anything you read about SEO (including this) is just speculation. None of us have a direct line to any of the search engines where we can ask and get a 100% answer to our questions.

So how fast should you build links?

Let me show you a formula that I came up with some years ago, or rather a method that you can use to get a VERY solid number on the amount of links to build per month that will not get your domain put in a filter, and be seen as natural by the search engines.

Example:

Let’s say I’m building a webmaster blog, and one of my top keywords is ‘SEO Blog’.

I hop on over to Google and type in SEO Blog, then take the top 10 domains and store them into a text file.

mattcutts.com
beanstalk-inc.com
toprank.blogspot.com (Removed)
abakus-internet-marketing.de
news.stepforth.com (Removed)
cre8pc.com
hawaiistreets.com
seobook.com
blogtopsites.com

It’s important to note that some domains, like blogspot, are not good for this example and I do not include them in my formula as they are over linked. The same goes for about.com, etc., which cover so many topics that they have mad ranking and too many links to so many pages that it skews the calculation IMO.

So with that said I’m going to remove some of the domains so that I get a cleaner result (toprank.blogspot.com, and news.stepforth.com have been removed from my list.).

At this point you can use any Whois tool that you like to lookup the domain information. What we will be looking for is the created date of the domain name.

My results:

mattcutts.com: 21-06-2003
beanstalk-inc.com: 08-30-2004
abakus-internet-marketing.de: REMOVED NO CREATE DATE
cre8pc.com: 08-16-1997
hawaiistreets.com: 07-28-2005
seobook.com: 09-23-2003
blogtopsites.com: 03-30-2004

At this point I’m left with 6 domains, which IMO is fine for this example, but you may want to use more domains to get a stronger reading of how fast to build links.

At this point you will want to use the search engines (not recommended) or your favorite link checking script to get a calculation on the number of back links that each domain has. I personally use SEO Elite!

mattcutts.com: 397 links
beanstalk-inc.com: 279 links
cre8pc.com: 271 links
hawaiistreets.com: 17 links
seobook.com: 1242 links
blogtopsites.com: 1242 links

Once you have each domain’s number of backlinks, divide that number by the number of months the domain has been around.

Example:

mattcutts.com created on 21-06-2003 and has 397 links. At the time of this post his domain has been around 36 months, so we take 397/36 which equals 11 rounded.

Now we know mattcutts.com has built around 11 links per month for the last 36 months, but we are not going to just use his domain for our estimation as he could have gotten more links from being the cool kid on the block for working at Google Inc. LOL

After processing all the domains and coming up with the number of links per month they have gotten since they were created, we get the following:

mattcutts.com: 11
beanstalk-inc.com: 13
cre8pc.com: 3
hawaiistreets.com: 2
seobook.com: 39
blogtopsites.com: 48

At this point we need to add up the links per month for all the domains, then divide by the number of domains to get the average.

The links per month add up to 116, which divided by 6 domains equals about 19 links per month.
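
If you would rather not do the arithmetic by hand, it can be sketched in a few lines of Python. This just reuses the figures listed in this post; the function and variable names are made up for this example.

```python
def links_per_month(total_links, age_in_months):
    """Backlinks divided by the age of the domain in months, rounded."""
    return round(total_links / age_in_months)

# The worked example from this post: 397 links over 36 months.
print(links_per_month(397, 36))

# Links per month for each domain, as listed in this post.
per_month = {
    "mattcutts.com": 11,
    "beanstalk-inc.com": 13,
    "cre8pc.com": 3,
    "hawaiistreets.com": 2,
    "seobook.com": 39,
    "blogtopsites.com": 48,
}

total = sum(per_month.values())
average = round(total / len(per_month))
print(total, average)
```

Swap in the domains and numbers for your own keyword to get the average for your industry.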

Now that we are done with all that fun processing and data collection, we know we can build an SEO blog or a website related to the topic at a rate of around 19 links per month without Google taking notice.

The number WILL change if you check other industries, but on average it’s around 20. If you are really picky and want to find out what you really need, I recommend using the above.

I have been surprised many times when using the above formula in some industries where I would think 20-50 links a month was normal and come to find out the average was as little as 6-11 a month, and in some places as high as 100 a month.

It’s best to check if you really want to know how hard you’re going to have to work to get top ranking!

In my experience you should be able to build the above number of links for the 6 months the domain is simmering, and you will be ranking very well in Google by the time it gets around, as Google will see your site as growing in popularity and will promote your website rather than demoting it like other sites that under or over build links.

I think it’s also good to note that you may see some domains like seobook.com that have thousands of links. If you follow the above formula, you should be able to rank with them at the top of the SERP with a lower number of links once your site gets out of the SandBox, as you will be getting links within a tolerance level that Google likes, and because of that Google will promote your site to the top of the SERP. A good example of this is hawaiistreets.com, which is under building its links. If they raised the number of links per month from 2 to 19 I bet they would move to the top 5 sites if not the top 3 or number 1.

How did I come up with this?

Some time back I was reading an article on programming and inventory tracking for big supermarkets like Wal-Mart, and one of the things they did to track employee theft was to track the number of items coming into and out of the store to be able to find locations in the store where employees were taking items.

From that I came up with the idea that Google used their search results like inventory, and they track it just like Wal-Mart would for events that are out of the normal patterns they are expecting from the data base and sub results for keywords or industry related topics.

At this point I have not developed a program to do the above because so many sites like about.com, etc. take up the results, and I feel it’s better when you do it by hand because of that. Plus I just don’t feel like doing it at this time, but if you are a developer and you wish to make such a program please have the respect to give me credit in the about section of the program or under the form on the site with a link to my blog.

vary the link text

Importance of varying your link text
Some webmasters learn how powerful link anchor text is and quickly rush off to build thousands of links using the exact same link text. This can raise a flag in search engines because the link pattern will appear unnatural. This is because naturally occurring link patterns include phrases such as “click here” or your “company name”, etc. It looks unnatural for the vast majority of your links to have the exact same anchor text.
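
One quick way to see how varied your own link text is: tally the anchor text of the links you know about. The Python sketch below is only an illustration; the function names and the 0.8 threshold are my own assumptions, not a figure any search engine has published.

```python
from collections import Counter

def anchor_text_share(anchors):
    """Return each distinct anchor text with its share of all links, largest first."""
    counts = Counter(a.strip().lower() for a in anchors)
    total = sum(counts.values())
    return {text: n / total for text, n in counts.most_common()}

def looks_uniform(anchors, threshold=0.8):
    """True if a single anchor text accounts for at least `threshold` of the links."""
    shares = anchor_text_share(anchors)
    return bool(shares) and max(shares.values()) >= threshold

links = ["SEO Blog"] * 9 + ["click here"]
print(anchor_text_share(links))
print(looks_uniform(links))
```

A profile where one phrase dominates is the pattern described above; a natural profile mixes in things like “click here”, the company name and the bare url.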

The Hilltop algorithm was proposed by two researchers working under the auspices of the University of Toronto. One of those researchers is now at Google and Google also acquired the patent early in 2003. The algorithm overcomes problems with broad search terms which return large sets of documents that have to be ranked. It is hard for a search engine to analyze the quality of these results based on on-page factors alone, especially where the results are heavily optimized for search engines or are copies of other high ranking pages.

The Hilltop algorithm is based on a similar assumption to Google PageRank. That is, the quality and quantity of inbound-links is an indication of how the page should be ranked. The key difference is that only ‘expert sources’ derived from the query terms are used when judging inbound-links. In other words, Google’s original PageRank algorithm gives a global rank of the quality of a web page. Hilltop determines the quality based on the relevance to the query term.

An expert or authority page covers a certain topic based on the query text and has links to many non-affiliated pages on the same topic area. An example would be an Open Directory Project (DMOZ) or Yahoo! directory page. Search results are only considered if they have more than one inbound-link from these expert pages and those inbound-links contain anchor text that matches the target page and the query terms. If Hilltop can’t find more than one expert page it returns no results. The algorithm is oriented towards quality not quantity of results.
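
The filtering step described above can be pictured with a short Python sketch. This is a loose illustration of the “more than one expert link” rule only, not the published algorithm: the function name, the input format and the simple substring matching of query terms against anchor text are all assumptions made for this example.

```python
from collections import defaultdict

def hilltop_candidates(expert_links, query_terms):
    """expert_links: iterable of (expert_url, target_url, anchor_text).

    Return targets linked from at least two distinct expert pages whose
    anchor text contains every query term.
    """
    terms = [t.lower() for t in query_terms]
    experts_for = defaultdict(set)
    for expert, target, anchor in expert_links:
        anchor_lower = anchor.lower()
        if all(term in anchor_lower for term in terms):
            experts_for[target].add(expert)
    return sorted(t for t, experts in experts_for.items() if len(experts) >= 2)

links = [
    ("http://directory-a.example/seo", "http://site-one.example/", "SEO blog by Site One"),
    ("http://directory-b.example/seo", "http://site-one.example/", "Site One SEO blog"),
    ("http://directory-a.example/seo", "http://site-two.example/", "SEO blog"),
]
print(hilltop_candidates(links, ["seo", "blog"]))
```

Here site-two.example drops out because only one expert page links to it, which matches the rule that a single expert link is not enough.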

To score highly under Hilltop, sites have to acquire inbound-links from expert sites. This can be done by writing articles or providing other useful information. These can be placed on the website and possibly submitted to authority sites with a link request.

Sites are judged to be affiliated if they are on the same Internet network (using IP address information) or the same domain. This should help to exclude link farms.
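
As a rough picture of the affiliation test, here is a Python sketch under two loudly stated assumptions: “same Internet network” is taken to mean the first three octets of the IPv4 address match, and “same domain” is taken to mean the last two labels of the host name match (which is too naive for names like example.co.uk). The function names are made up for this example.

```python
def same_network(ip_a, ip_b):
    """Assumption: two IPv4 addresses share a network if their first three octets match."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]

def base_domain(host):
    """Naive assumption: the domain is the last two labels of the host name."""
    return ".".join(host.lower().split(".")[-2:])

def affiliated(host_a, ip_a, host_b, ip_b):
    """Treat two hosts as affiliated if they share a network or a domain."""
    return same_network(ip_a, ip_b) or base_domain(host_a) == base_domain(host_b)

print(affiliated("a.example.com", "192.0.2.10", "b.example.com", "198.51.100.7"))
print(affiliated("a.example.com", "192.0.2.10", "other.example.org", "198.51.100.7"))
```

Under a rule like this, a link farm that spreads thousands of host names over one domain or one block of addresses would count as a single source rather than as many independent experts.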

The exact implementation of Hilltop is described in Bharat and Mihaila’s paper. It is believed by some experts to form a significant part of Google’s Florida algorithm changes that penalized many heavily optimized and commercial web pages. An obvious problem is the compilation of the ‘expert pages’ and the processing necessary to match with results pages. It seems likely that the Hilltop algorithm is only run on certain popular keywords with a precompiled set of expert pages.
