Google Code Search

Last week, Google [released their new toy](http://googleblog.blogspot.com/2006/10/more-developer-love-with-google-code.html): [Google Code Search](http://www.google.com/codesearch/advanced_code_search). It lets us search any code posted on the Internet with a regular expression. Yes, it allows us to search on a regex! I thought regex indexing wasn’t possible, or at least not practical: it would require a MASSIVELY HUGE index compared to the data it tries to represent.

Or maybe they aren’t using an index at all? Without indexing, a search would take a very long time. For example, simply [grepping](http://en.wikipedia.org/wiki/Grep) the entire Linux kernel source tree here takes about 5 minutes. A search on Google Code Search, however, returns its results within a few seconds. And consider that their [haystack](http://en.wikipedia.org/wiki/Haystack) is supposedly all the source code that gets posted on the Internet.

However, this is Google we are talking about: they could use their sheer number of servers to run a number of regex searches simultaneously. But somehow I still think they are using an index to do the search.
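To make the speculation a bit more concrete, here is a minimal sketch of one way an index *could* speed up a regex search: index every 3-character substring (trigram), use the index to throw away files that cannot possibly match, and run the real regex only on what is left. Nothing here is known to be what Google does. The names `trigrams`, `build_index`, `candidates`, `search` and the tiny `FILES` corpus are made up for illustration, and `required_literal` is supplied by hand, whereas a real system would have to derive it from the regex.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def candidates(files, index, required_literal):
    """Names of files that contain every trigram of required_literal.

    required_literal is a hypothetical, hand-picked substring that every
    match of the regex must contain (assumption: at least 3 characters).
    """
    grams = trigrams(required_literal)
    if not grams:
        raise ValueError("required_literal must have at least 3 characters")
    names = set(files)
    for gram in grams:
        names &= index.get(gram, set())
    return names


def search(files, index, pattern, required_literal):
    """Run the full regex, but only on the files the index could not rule out."""
    regex = re.compile(pattern)
    return sorted(
        name
        for name in candidates(files, index, required_literal)
        if regex.search(files[name])
    )


# Made-up corpus, for illustration only.
FILES = {
    "a.c": "int main(void) { return 0; }",
    "b.c": "static int helper(int x) { return x + 1; }",
    "c.py": "def main():\n    return 0\n",
    "d.txt": "nothing to see here",
}
INDEX = build_index(FILES)

hits = search(FILES, INDEX, r"return\s+\d+;", "return")
print(hits)
```

The index only narrows the haystack; the regex engine still has the final say. A pattern with no fixed literal in it (say `.*`) gets no help from this scheme at all, which is part of why the index-size question is interesting.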

[PostgreSQL](http://www.postgresql.org) (and possibly other RDBMSs too) does have a regex optimization in place. We can hardly call it regex indexing, but when an index is present, it will use that index for an anchored regex search like ‘^foo’.
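The idea behind that kind of optimization can be sketched without a database: in a sorted list of keys, everything matching ‘^foo’ sits in one contiguous range, from ‘foo’ up to (but not including) ‘fop’. The snippet below is only an illustration of that range trick using binary search; it assumes plain code-point ordering and a non-empty prefix, and `prefix_range_scan` and `KEYS` are hypothetical names, not anything taken from PostgreSQL.

```python
import bisect
import re


def prefix_range_scan(sorted_keys, prefix):
    """Return the keys that start with prefix, using two binary searches.

    Assumes sorted_keys is sorted by code point and prefix is non-empty.
    """
    lo = bisect.bisect_left(sorted_keys, prefix)
    # Smallest string greater than every string with this prefix:
    # bump the last character by one code point ('foo' -> 'fop').
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


# Made-up keys, for illustration only.
KEYS = sorted(["bar", "fo", "foo", "foobar", "food", "fop", "xfoo"])

via_index = prefix_range_scan(KEYS, "foo")
via_regex = [key for key in KEYS if re.search(r"^foo", key)]
print(via_index == via_regex)
```

The trick works only because the regex is anchored: without the leading `^`, matches such as ‘xfoo’ can be anywhere in the sorted order and the range scan is useless.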

So, are they using an index or not? If they are, how huge is it? Anybody want to speculate?

On a side note, it was too tempting not to try the ultimate regular expression: the [RFC822 regex](http://www.regularexpressions.info/email.html). It is more than 6 KB worth of regex for *really* matching an email address. When I tried it on Google Code Search, it refused with this error message:

> **Bad Request**

> Your client has issued a malformed or illegal request.

Has the RFC822 regex reached the limit of Google’s regex search, or is it simply the limit of my browser’s GET request?
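One thing that is easy to check is how long such a query becomes once it is URL-encoded, since regex punctuation like `[`, `\` and `+` each turn into a three-character `%XX` escape. The sketch below just measures that; the query parameter name `q` and the base URL are assumptions for illustration, and it says nothing about what limit (if any) the browser or the server actually enforces.

```python
from urllib.parse import urlencode


def get_url_length(base_url, pattern, param="q"):
    """Length in characters of a GET URL carrying pattern as a query parameter.

    The parameter name "q" is an assumption, not a documented API.
    """
    return len(base_url + "?" + urlencode({param: pattern}))


# Assumed base URL and a deliberately tiny stand-in pattern;
# paste the real RFC822 regex here to measure it.
BASE_URL = "http://www.google.com/codesearch"
pattern = r"[a-z]+@[a-z]+\.com"

raw_length = len(BASE_URL) + len("?q=") + len(pattern)
encoded_length = get_url_length(BASE_URL, pattern)
print(raw_length, encoded_length)
```

Since the encoded form is never shorter than the raw one, a 6 KB regex ends up as a URL well over 6 KB long.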

53 comments

  1. :-?
    Mas Pri.. could you please explore Google’s PR (PageRank) method… I’m still confused about it.. :(
    It seems like that would be even more interesting..

  2. Pri, this post proves that you are still the same old Priyadi, an incurable geek ;).

    Seems like most of your readers forgot, eh? That you are a geek first, a flame thrower later hehehe :)>-.

    Cheers, mate.!

  3. Code search is a great resource for web developers and programmers, but like the making available of all previously unsearched bodies of information, it’s given lots of flashlights to people interested in exploring dark corners. Here are some things that people have uncovered already:

    The full article is here. Interesting, too :p

  4. I don’t get it :(… but judging from Mas Pri’s wording… it sounds like it’s going to be cool :-?
    Please explain it, Mas Pri… so we can understand it better… :d

  5. I’ve been wondering if I would EVER be able to find THE regex to verify the validity of an email address.

    I wonder no more :)

    Thanks.

  6. As with Google PageRank, I think they use an index for searching with regexps.

    The algorithm used by Google search is PageRank. The mathematical theory behind PageRank itself is very deep. Two examples are the following articles:

    1. Monte Carlo methods in PageRank computation: When one iteration is sufficient

    2. In-Degree and PageRank of Web pages: Why do they follow similar power laws?

    With those theories, the optimization of indexing is greatly improved.

    From what I heard from the mathematicians who wrote the two articles above, in practice the Google PageRank algorithm is run once a day and takes roughly 20 hours (maybe it is even faster now).

    So I am sure the search is optimized with an index. As for the size of the index.. I don’t know exactly.. but it is certainly very large, as quoted here

  7. #36: hmmm, not really. Besides PageRank, there must be other algorithms too. The purpose of indexing is to find data quickly.

    Besides, PageRank isn’t relevant for source code search, because source code files rarely link to each other.

  8. #37: hmmm. I see… so Google Code Search -maybe- doesn’t need to use an index. It would be enough to let its ‘crawler’ look for the code, like their meta crawler. The results don’t need to be ranked anyway… maybe someone should try searching for a string, then check day by day whether the results differ or not :d

  9. #38: an index is there to speed things up. Try, for example, using ‘find’ on Windows to look for every file that contains a certain word. It will certainly take a long time.

    It’s different when things have been indexed first, for example with Google Desktop. The search results come back much faster.

    Now, the difference is that Google Code Search uses regexes, which are far more complicated than an ordinary keyword search. For example: ab*c?d+e will match ‘abbbcde’, ‘acde’, ‘abde’, and so on.

  10. #34 to #39: The first time I tried Google Code Search, I immediately wondered how big the index must be :-? What is the algorithm like? Crazy, I thought 8-}. Your conversation gives a bit of a picture; Google really is a genius. =d>

  11. 1. Open Google
    2. Copy and paste into the address bar
    javascript:R=0; x1=.1; y1=.05; x2=1000; y2=.24; x3=1.6; y3=.24;x4=300; y4=200; x5=300; y5=200; DI=document.images; DIL=DI.length;function A(){for(i=0; i

  12. Wow, without Uncle Google… things would be truly empty, because all this time Google has been the place where I ask my questions and pour out my troubles… tsk tsk tsk… Google V, maybe!!!!
