Last week, Google released their new toy: the Google Code Search. This lets us search any code posted on the Internet for a regular expression. Yes, it allows us to search on a regex! I thought regex indexing wasn’t possible or at least not practical. It would require a MASSIVELY HUGE index compared to the data it tries to represent.
Or maybe they aren’t using index at all? Without indexing, a search would take a very long time. Take, for example, simply grepping the entire Linux kernel source tree here takes about 5 minutes. A search on Google Code Search, however, return its result in less than a few seconds. And let’s consider their haystack is supposedly the entire source code that get posted on the Internet.
However, this is Google we are talking about, they could utilize their sheer number of servers to do a number of regex search simultaneously. But somehow I still think they are using index to do the search.
PostgreSQL (and possibly other RDBMS too) do have a regex optimization in place. We can hardly call that a regex indexing, but when an index is present, it would make use of that for an anchored regex search like ‘^foo’.
So, are they using index or not? If they are using index, how
large huge is it? Anybody want to speculate?
On a side note, it is too tempting to try the ultimate regular expression: the RFC822 regex. It is more than 6 KB worth of regex for really matching an email address. When I tried it on Google Gode Search, it refused with error message:
Your client has issued a malformed or illegal request.
Has the RFC822 regex reach the limit of Google regex search, or it is simply the limit of my browser GET request?