Google Code Search

Last week, Google [released their new toy](http://googleblog.blogspot.com/2006/10/more-developer-love-with-google-code.html): [Google Code Search](http://www.google.com/codesearch/advanced_code_search). It lets us search code posted on the Internet using a regular expression. Yes, it allows us to search with a regex! I thought regex indexing wasn’t possible, or at least not practical. It would require a MASSIVELY HUGE index compared to the data it tries to represent.

Or maybe they aren’t using an index at all? Without indexing, a search would take a very long time. For example, simply [grepping](http://en.wikipedia.org/wiki/Grep) the entire Linux kernel source tree here takes about 5 minutes. A search on Google Code Search, however, returns its results within a few seconds. And consider that their [haystack](http://en.wikipedia.org/wiki/Haystack) is supposedly all the source code that has been posted on the Internet.

However, this is Google we are talking about. They could use their sheer number of servers to run many regex searches in parallel. But somehow I still think they are using an index to do the search.
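To make the "index first, regex second" idea concrete, here is a minimal sketch of one technique that could work: a trigram index used as a cheap pre-filter before the real regex runs. This is only an assumption for illustration, not a claim about what Google does. The file names, the tiny corpus, and the `required_literal` argument (a string every match must contain, supplied by hand here) are all hypothetical.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index


def search(files, index, pattern, required_literal):
    """Run the regex only on files that contain every trigram of required_literal.

    required_literal is supplied by the caller in this sketch; a real engine
    would have to derive such literals from the regex itself.
    """
    candidates = set(files)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return sorted(name for name in candidates if regex.search(files[name]))


# Hypothetical three-file corpus, made up for illustration.
files = {
    "a.c": "int main(void) { return 0; }",
    "b.py": "def main():\n    return 0",
    "c.txt": "nothing to see here",
}
index = build_index(files)
hits = search(files, index, r"main\(\w*\)", "main(")
```

The index never answers the regex by itself. It only shrinks the haystack, so the expensive regex scan touches a few candidate files instead of everything. A regex with no literal part at all (say `.*`) gets no help from this kind of index.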

[PostgreSQL](http://www.postgresql.org) (and possibly other RDBMSs too) does have a regex optimization in place. We can hardly call it regex indexing, but when an index is present, it will use that index for an anchored regex search like ‘^foo’.
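Why an anchored regex is index-friendly can be shown with a small sketch. It is not PostgreSQL's implementation: a sorted Python list stands in for an ordered index, and the key list is made up. The point is that ‘^foo’ is really a prefix search, and all keys sharing a prefix sit next to each other in sorted order.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, using binary search on a sorted list."""
    position = bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    while position < len(sorted_keys) and sorted_keys[position].startswith(prefix):
        matches.append(sorted_keys[position])
        position += 1
    return matches


# Hypothetical index keys, for illustration only.
keys = sorted(["bar", "foo", "foobar", "food", "fop", "zoo"])
result = prefix_scan(keys, "foo")
```

The scan jumps straight to the first candidate and stops at the first key that no longer starts with the prefix. An unanchored pattern such as ‘foo’ (anywhere in the string) gives no such starting point, which is why this trick alone does not amount to regex indexing.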

So, are they using an index or not? If they are, how ~~large~~ huge is it? Anybody want to speculate?

On a side note, it is too tempting to try the ultimate regular expression: the [RFC822 regex](http://www.regularexpressions.info/email.html). It is more than 6 KB worth of regex for *really* matching an email address. When I tried it on Google Code Search, it refused with this error message:

> **Bad Request**

> Your client has issued a malformed or illegal request.

Has the RFC822 regex reached the limit of Google's regex search, or is it simply the limit of my browser's GET request?
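I can't tell which limit was hit, but a rough sketch shows how much a 6 KB regex grows once it is percent-encoded into a GET URL. The pattern below is a hypothetical stand-in made of regex-like punctuation, not the real RFC822 regex, and the `q` parameter name is taken from the search URLs the site produces.

```python
from urllib.parse import urlencode

# Hypothetical stand-in: roughly 6 KB of regex-like text, NOT the real RFC822 regex.
fragment = r'(?:[^()<>@,;:\\".\[\] ]+)|'
pattern = fragment * (6 * 1024 // len(fragment) + 1)

# Build the GET URL the browser would have to send.
url = "http://www.google.com/codesearch?" + urlencode({"q": pattern})

raw_size = len(pattern)   # size of the regex itself
url_size = len(url)       # size after percent-encoding
print(raw_size, url_size)
```

Most regex punctuation turns into three characters each (`%5B` for `[`, and so on), so the URL ends up considerably longer than the regex itself. Whether that is what triggered the error is still a guess.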

53 comments

dendi says:

10 October 2006 at 13:38

I don't get it..

frodo says:

10 October 2006 at 13:39

Gooooooogle is #1.

I am #1

frodo says:

10 October 2006 at 13:41

Aw…
Someone beat me to it…

:(

iang says:

10 October 2006 at 13:57

Google really is cool :D

Fadil says:

10 October 2006 at 13:59

Please translate this into language that even non-computer people can understand…

arif wijayanto says:

10 October 2006 at 14:03

Ugh, I don't understand coding. What matters is making the top 10 :d

mufti says:

10 October 2006 at 14:14

Top 10!! For the first time. :d
Yahoo is losing to Google there.

luthfie says:

10 October 2006 at 14:16

Am I fifth??

dwilicious says:

10 October 2006 at 14:17

Same here, I don't get it either :-D

Luthfi says:

10 October 2006 at 14:19

Whether I get it or not doesn't matter
what matters is being number 7

Saya says:

10 October 2006 at 14:22

Try searching for this: .*@.*\..*>

nOde says:

10 October 2006 at 14:30

:-?

nOde says:

10 October 2006 at 14:32

:-?
Mas Pri.. could you please explore Google's PR method… I'm still confused about it.. :(
I think that would be even more interesting..

rendy says:

10 October 2006 at 14:40

*nods along*

zhanzhe says:

10 October 2006 at 14:56

Not bad, still in the top 20 :D

avianto says:

10 October 2006 at 15:09

Pri, this post proves that you are still the same old Priyadi, an incurable geek ;).

Seems like most of your readers forgot, eh? That you are a geek first, a flame thrower later hehehe :)>-.

Cheers, mate!

sandynata says:

10 October 2006 at 15:19

'Scuse me… just passing through

Darojatun Wijaya says:

10 October 2006 at 15:23

Hehehe, it can even be used to find crack code, impressive

oón says:

10 October 2006 at 15:26

Oh, what's this?
:)>-

Koen says:

10 October 2006 at 15:47

Regex? I waaaaaant it. Thanks for the info.

aribowo says:

10 October 2006 at 16:17

Wow, it looks like Google keeps releasing new products to dominate the Internet world. How will Microsoft respond, I wonder?

ndra says:

10 October 2006 at 16:34

Code search is a great resource for web developers and programmers, but like the making available of all previously unsearched bodies of information, it’s given lots of flashlights to people interested in exploring dark corners. Here are some things that people have uncovered already:

The full article is here. Pretty interesting :p

mela says:

10 October 2006 at 16:43

I don't get it :(… but judging from Mas Pri's wording… this is going to be cool :-?
Please translate it, Mas Pri… so we can understand it better… :d

abe says:

10 October 2006 at 17:37

Google, You Rawwkk.. :-?

shafwan says:

10 October 2006 at 17:42

Well, it's Google after all, of course they go after the unusual stuff.

harry says:

10 October 2006 at 17:59

I’ve been wondering if I would EVER be able to find THE regex to verify the validity of an email address.

I wonder no more :)

Thanks.

Ronsen says:

10 October 2006 at 20:32

Path to dark side ? :d

Adhi Y. Pradipto says:

10 October 2006 at 20:53

Whoa.. hang on.. let me think first.. 8-|

agam says:

10 October 2006 at 21:22

Wow, Om Pri is writing in English again :(
I'll just check in for now…
I'll read it offline later. :d

amudi says:

10 October 2006 at 21:34

all hail google…

Andi says:

10 October 2006 at 23:14

Ouch, I'd better translate this first… #-o

Eris Ristemena says:

11 October 2006 at 01:11

You might be interested in looking at these,
http://ilia.ws/archives/133-Google-Code-Search-Hackers-best-friend.html#extended
http://blog.ngoprek.web.id/2006/10/07/how-many-fucks-do-you-have-in-your-source-code/

or in Indonesian here,
http://blog.ngoprek.web.id/2006/10/07/memburu-sisi-lain-programmer-dengan-google-code-search/

Jauhari says:

11 October 2006 at 10:25

Con and FUSED :d

irwan says:

11 October 2006 at 10:58

As with Google PageRank, I think they use an index for searching with regexps.

The algorithm used by Google search is PageRank. The mathematical theory behind PageRank itself is very deep. The following two articles are examples:

1. Monte Carlo methods in PageRank computation: When one iteration is sufficient

2. In-Degree and PageRank of Web pages: Why do they follow similar power laws?

With those theories, the optimization of indexing is greatly improved.

From what I heard from the mathematicians who wrote the two articles above, in practice the Google PageRank algorithm is run once a day and takes roughly 20 hours (it may be faster now).

So I am sure the search is optimized with an index. As for the size of the index.. I don't know exactly.. but it must be very large, as quoted here

Priyadi says:

11 October 2006 at 11:26

#34: hmmm, PageRank is an algorithm for ranking, not an indexing algorithm. An example of an indexing algorithm would be a B-tree.

irwan says:

11 October 2006 at 11:44

#35. Sir, doesn't ranking also mean assigning an index? :)>-

Priyadi says:

11 October 2006 at 12:07

#36: hmmm, not really. Besides PageRank, there must be other algorithms too. The purpose of indexing is to find data quickly.

Besides, PageRank is not relevant for source code search, because source files rarely link to one another.

irwan says:

11 October 2006 at 12:49

#37: hmmm. I see, sir… so Google Code Search -maybe- doesn't need to use an index. It is enough to let its ‘crawler’ look for the code, like their meta crawler. The results don't need to be ranked anyway… maybe one should try searching for a string and then check day by day whether the results differ or not :d

Priyadi says:

11 October 2006 at 13:26

#38: an index is for speeding things up. For example, try using ‘find’ on Windows to look for every file that contains a certain word. It is bound to take a long time.

It is different if the files have been indexed first, for example with Google Desktop. The search results come back much faster.

Now, the difference is that Google Code Search uses regexes, which are far more complex than an ordinary keyword search. For example: ab*c?d+e will match ‘abbbcde’, ‘acde’, ‘abde’, and so on.

Amal says:

11 October 2006 at 13:41

Nice!

http://www.google.com/codesearch?q=%5Cs%2Bpriyadi%5Cs%2B

mufti says:

11 October 2006 at 14:06

#34 to #39: The first time I tried Google Code Search, I immediately wondered how big the index must be :-? What does the algorithm look like? Crazy, I thought 8-}. Your conversation gives a bit of a picture. Google is quite the genius. =d>

hari says:

11 October 2006 at 14:24

Oh, I hadn't read this earlier
Oh, late again

sugeng says:

11 October 2006 at 14:30

1. Open Google
2. Copy and paste into the address bar
javascript:R=0; x1=.1; y1=.05; x2=1000; y2=.24; x3=1.6; y3=.24;x4=300; y4=200; x5=300; y5=200; DI=document.images; DIL=DI.length;function A(){for(i=0; i

sugeng says:

11 October 2006 at 14:34

Argh, the script got cut off :((

Aryo Sanjaya says:

11 October 2006 at 16:59

Om, are trackbacks turned off on purpose? :d

MaIDeN says:

12 October 2006 at 06:55

:d Wondering :-?
Same here :-“

untung says:

12 October 2006 at 09:41

Agreed.. :)>- Google is the besss…… :d

anima says:

12 October 2006 at 14:18

It's even easier to steal code now. Google is taking over the world..

wandy says:

12 October 2006 at 15:28

Wow, without Om GoOgle… life would be really empty, because all this time Google has been the place I go to ask and to pour my heart out… tsk tsk tsk… Google V, maybe!!!!

starchie says:

13 October 2006 at 10:29

#40, nice try :)>-

Didi says:

13 October 2006 at 12:52

I start my browsing not from address bar, but from Google.
I love Google.
Thanks for your great articles

maul says:

5 November 2006 at 21:02

How did Google do that? I'm just curious how Google can read/search a word from a file. What I'm even more curious about is that they read .pdf tooo???! How? Can anyone help me with the code they use? maul_oke84@yahoo.com

GunTank says:

8 July 2007 at 22:39

I like regex, because it's easy…. I'm allowed to show off a little, right….
