Apr 29 2008

Google crawls the ‘deep web’

Published by ingrid under Google Quakes

Webmasters are in for quite an adventure as Google takes another step in its perpetual mission to index information. Just a few days ago, Google (via the Webmaster Central Blog) announced that it would begin to fill out HTML forms and crawl through the results. Here’s a quick excerpt from the official blog entry: “In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <form> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
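To make the excerpt a little more concrete, here is a minimal sketch of what “generating URLs that correspond to a possible query” could look like for a simple GET form. This is only an illustration, not Google’s actual method: the function name `candidate_urls`, the `example.com` address, the field names, and the sample values are all made up for the example.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, fields, limit=5):
    """Build at most `limit` GET URLs from combinations of field values."""
    names = list(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        if len(urls) >= limit:
            break  # only "a small number of queries" per form
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls


# Hypothetical form with one text box and one select menu.
form_fields = {
    "q": ["seo", "crawler"],        # assumed: words taken from the site's own text
    "category": ["news", "tools"],  # assumed: option values listed in the form's HTML
}

for url in candidate_urls("https://example.com/search", form_fields, limit=3):
    print(url)
```

Note the `limit` parameter: a crawler that tried every combination on every form would quickly swamp a site, which is presumably why the announcement stresses a small number of queries.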

Google has been working on this technology since it purchased Transformic three years ago. They have been attempting to solve two issues:

(1) determining which web forms are worth crawling into, and

(2) filling in the values in the forms they decide to crawl in order to retrieve the data behind them.

Radio buttons and check boxes were easy, so the real problem was dealing with free-text inputs. They needed a technology to comprehend the semantics of that particular input box to deduce potential valid inputs. Google has since addressed these problems (obviously), and is now ready to crawl the deep, invisible web.

In comes the paranoia

A number of webmasters have expressed concern. If Google can now index dynamic content returned through form input, what’s next? Will it be able to pry into private intranets, unlinked pages, and the rest of the online information that as of now is hidden (rather comfortably) from the search engine giant’s sight? Google clarifies that it will still respect robots.txt files, but is this reassurance enough?

Whether the skeptics are right is still debatable, but they certainly raise valid issues. With this new system, for instance, Googlebot may find content that looks duplicated but really isn’t. Google penalizes duplicate content, so this is a major concern. Some webmasters also feel that Google, by implementing this new crawling technique, is doing what spammers do: inserting random or faulty text into a website’s form fields. Many webmasters may end up blocking Googlebot altogether, which would defeat Google’s mission of indexing all the information online.

On the other side of the fence, this new crawling technique may be just the right formula. Some webmasters see it as a great new development for searchers, as it will expose the deep web and give them more relevant results. Their advice to other webmasters is this: knowing full well that Googlebot can now explore behind your HTML forms, safeguard what you need to keep out of sight. You can no longer rely on simply putting content you don’t want indexed behind an HTML form. Take measures to block it. Keep an eye on your logs, too.
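One way to act on that advice is to disallow your form result pages in robots.txt and then check that the rules do what you expect. The sketch below uses Python’s standard `urllib.robotparser` module. The rules, the paths (`/search`, `/private/`), and the `example.com` address are hypothetical, so adapt them to your own site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers away from form results
# and from a private directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, agent="Googlebot"):
    """Return True if the rules above let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(is_allowed("https://example.com/search?q=seo"))  # form result page
print(is_allowed("https://example.com/about.html"))    # ordinary page
```

Keep in mind that robots.txt only keeps out crawlers that choose to honor it. Anything truly private should sit behind authentication rather than just a Disallow line.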

Deep web and SEO

Now you might be asking: will Google’s ability to crawl through HTML forms have any effect on search engine optimization? From a purely SEO standpoint, this new development may appear insignificant at first. But look again and it just might be noteworthy. Keep in mind that the deep web is estimated to contain more than 500 times the information already visible on the web. If Google has indexed 1.4 billion web pages, then there are over 700 billion deep-web documents waiting for indexing. If just 1/8 of this turns out to be highly focused and relevant, then we stand to compete with more than 87 billion new content pages. This certainly seems like a brand new challenge.
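The back-of-the-envelope figures above are easy to reproduce. The inputs below are the estimates quoted in this post (1.4 billion indexed pages, a 500x multiplier, a 1/8 relevant share), not measured data, and the variable names are just for illustration.

```python
indexed_pages = 1_400_000_000  # assumed size of Google's index, from the post
deep_web_multiplier = 500      # "more than 500 times", treated as a lower bound
relevant_fraction = 1 / 8      # share assumed to be focused and relevant

deep_web_pages = indexed_pages * deep_web_multiplier
relevant_pages = deep_web_pages * relevant_fraction

print(f"{deep_web_pages:,}")     # 700,000,000,000
print(f"{relevant_pages:,.0f}")  # 87,500,000,000
```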

Robert Bibb, CEO

Blackwood Productions
