Last week, Google [released their new toy](http://googleblog.blogspot.com/2006/10/more-developer-love-with-google-code.html): the [Google Code Search](http://www.google.com/codesearch/advanced_code_search). This lets us search any code posted on the Internet for a regular expression. Yes, it allows us to search on a regex! I thought regex indexing wasn’t possible or at least not practical. It would require a MASSIVELY HUGE index compared to the data it tries to represent.
Or maybe they aren’t using index at all? Without indexing, a search would take a very long time. Take, for example, simply [grepping](http://en.wikipedia.org/wiki/Grep) the entire Linux kernel source tree here takes about 5 minutes. A search on Google Code Search, however, return its result in less than a few seconds. And let’s consider their [haystack](http://en.wikipedia.org/wiki/Haystack) is supposedly the entire source code that get posted on the Internet.
However, this is Google we are talking about, they could utilize their sheer number of servers to do a number of regex search simultaneously. But somehow I still think they are using index to do the search.
[PostgreSQL](http://www.postgresql.org) (and possibly other RDBMS too) do have a regex optimization in place. We can hardly call that a regex indexing, but when an index is present, it would make use of that for an anchored regex search like ‘^foo’.
So, are they using index or not? If they are using index, how
large huge is it? Anybody want to speculate?
On a side note, it is too tempting to try the ultimate regular expression: the [RFC822 regex](http://www.regularexpressions.info/email.html). It is more than 6 KB worth of regex for *really* matching an email address. When I tried it on Google Gode Search, it refused with error message:
> **Bad Request**
> Your client has issued a malformed or illegal request.
Has the RFC822 regex reach the limit of Google regex search, or it is simply the limit of my browser GET request?