Last week, Google [released their new toy](http://googleblog.blogspot.com/2006/10/more-developer-love-with-google-code.html): the [Google Code Search](http://www.google.com/codesearch/advanced_code_search). This lets us search any code posted on the Internet for a regular expression. Yes, it allows us to search on a regex! I thought regex indexing wasn’t possible or at least not practical. It would require a MASSIVELY HUGE index compared to the data it tries to represent.
Or maybe they aren’t using an index at all? Without indexing, a search would take a very long time. For example, simply [grepping](http://en.wikipedia.org/wiki/Grep) the entire Linux kernel source tree here takes about 5 minutes. A search on Google Code Search, however, returns its results within a few seconds. And consider that their [haystack](http://en.wikipedia.org/wiki/Haystack) is supposedly all the source code posted on the Internet.
However, this is Google we are talking about. They could use their sheer number of servers to run many regex searches simultaneously. But somehow I still think they are using an index to do the search.
[PostgreSQL](http://www.postgresql.org) (and possibly other RDBMSs too) does have a regex optimization in place. We can hardly call it regex indexing, but when an index is present, it will make use of it for an anchored regex search like ‘^foo’.
So, are they using an index or not? If they are, how large is it? Anybody want to speculate?
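As one piece of speculation, and only that, here is a sketch of how an index could help a regex search: index every three-character substring (trigram) of each file, use the trigrams of a literal string that any match must contain to shortlist files, then run the real regex only on the shortlist. I have no knowledge that Google does this; the function names are hypothetical, and the required literal is supplied by hand rather than derived from the regex.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index


def search(docs, index, pattern, required_literal):
    """Shortlist documents via the index, then confirm with the real regex.

    required_literal is a substring every match must contain; here the caller
    supplies it (an assumption that keeps the sketch short).
    """
    candidates = set(docs)
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    rx = re.compile(pattern)
    return sorted(name for name in candidates if rx.search(docs[name]))


docs = {
    "a.c": "int main(void) { return 0; }",
    "b.py": "def main():\n    return 42",
    "c.txt": "nothing to see here",
}
index = build_index(docs)
print(search(docs, index, r"return\s+\d+", "return"))
```

Even in this toy form the index holds one entry per distinct trigram per file, which hints at why the real thing would have to be very large.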
On a side note, it is too tempting to try the ultimate regular expression: the [RFC822 regex](http://www.regularexpressions.info/email.html). It is more than 6 KB worth of regex for *really* matching an email address. When I tried it on Google Code Search, it refused with this error message:
> **Bad Request**
> Your client has issued a malformed or illegal request.
Has the RFC822 regex reached the limit of Google’s regex search, or is it simply the limit of my browser’s GET request?
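A quick way to get a feel for the GET request side of this is to measure how long the search URL becomes once the regex is percent-encoded. This is a rough sketch: the URL shape is taken from a Code Search link posted in the comments, the 2048-character limit is purely an assumed placeholder (I don't know the real limit of the browser or of Google), and the long pattern is a 6 KB stand-in, not the actual RFC822 regex.

```python
from urllib.parse import quote

# URL shape as seen in a Code Search link in the comments.
BASE_URL = "http://www.google.com/codesearch?q="

# Assumed placeholder; the real browser/server limits are unknown to me.
ASSUMED_LIMIT = 2048


def search_url(regex):
    """Build the GET URL with the regex percent-encoded into the query."""
    return BASE_URL + quote(regex, safe="")


def fits(regex, max_url_length=ASSUMED_LIMIT):
    """True if the full URL stays within the assumed length limit."""
    return len(search_url(regex)) <= max_url_length


short = r"\s+priyadi\s+"
stand_in = r"[\w.-]" * 1024   # 6144 characters; NOT the real RFC822 regex

print(search_url(short))
print(len(stand_in), len(search_url(stand_in)), fits(stand_in))
```

Note that percent-encoding inflates the pattern, since every bracket and backslash becomes three characters, so a 6 KB regex turns into a noticeably longer URL.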
I don't get it..
Gooooooogle is #1.
I am #1
Aww…
Someone beat me to it…
:(
Google really is cool :D
Please translate this into language that even non-computer people can understand…
Ugh, I don't get coding. What matters is making the top 10 :d
Top 10!! For the first time. :d
Yahoo loses to Google on this one.
Am I fifth??
Same here, I don't get it either :-D
Whether I get it or not doesn't matter,
what matters is being number 7
Try searching for this: .*@.*\..*>
:-?
:-?
Mas Pri.. could you please explore Google's PR method… I'm still confused about it.. :(
I think that would be even more interesting..
*nods*
Not bad, still in the top 20 :D
Pri, this post proves that you are still the same old Priyadi, an incurable geek ;).
Seems like most of your readers forgot, eh? That you are a geek first, a flame thrower later hehehe :)>-.
Cheers, mate!
‘Scuse me… just passing through
hehehe, it can even be used to find crack code, awesome
oh, what's this?
:)>-
Regex? I waaaaant it. Thanks for the info.
Wow, it looks like Google keeps releasing new products to dominate the Internet. I wonder how Microsoft will respond?
The full article is here. Quite interesting :p
I don't get it :(… but judging from Mas Pri's wording… it sounds like it'll be cool :-?
Please translate it, Mas Pri… so we can understand it better… :d
Google, You Rawwkk.. :-?
Well, it's Google after all, of course they go looking for the unusual stuff.
I’ve been wondering if I would EVER be able to find THE regex to verify the validity of an email address.
I wonder no more :)
Thanks.
Path to dark side ? :d
Whoa.. hang on.. let me think first.. 8-|
Aw, Om Pri is writing in English again :(
I'll just check in for now…
I'll read it offline later. :d
all hail google…
Oh dear, I'd better translate this first… #-o
You might be interested in looking at this,
http://ilia.ws/archives/133-Google-Code-Search-Hackers-best-friend.html#extended
http://blog.ngoprek.web.id/2006/10/07/how-many-fucks-do-you-have-in-your-source-code/
or in Bahasa Indonesia here,
http://blog.ngoprek.web.id/2006/10/07/memburu-sisi-lain-programmer-dengan-google-code-search/
Con and FUSED :d
As with Google PageRank, I think they use an index for searching with regexps.
The algorithm used by Google search is PageRank. The mathematical theory behind PageRank itself runs very deep. Two example articles:
1. Monte Carlo methods in PageRank computation: When one iteration is sufficient
2. In-Degree and PageRank of Web pages: Why do they follow similar power laws?
With theories like these, the optimization of indexing has been greatly improved.
From what I heard from the mathematicians who wrote the two articles above, in practice the Google PageRank algorithm is run once a day and takes roughly 20 hours (maybe it's faster by now).
So I'm sure the search is optimized with an index. As for the size of the index.. I don't know exactly.. but it must be very large, as quoted here
#34: hmmm, PageRank is an algorithm for ranking, not an indexing algorithm. An example of an indexing structure is a B-tree.
#35. But sir, by ranking aren't we also assigning an index? :)>-
#36: hmmm, not really. Besides PageRank, there has to be another algorithm as well. The purpose of indexing is to find data quickly.
Besides, PageRank isn't relevant for source code search either, because source code files rarely link to one another.
#37: hmmm. I see, sir… so Google Code Search -maybe- doesn't need an index. It's enough to let their ‘crawler’ go looking for the code, like their meta crawler. After all, the results don't need to be ranked … maybe we should try searching for some string and then check day by day whether the results differ or not :d
#38: an index is there to speed things up. For example, try using ‘find’ on Windows to look for any files containing a certain word. It will definitely take a long time.
It's different if everything has been indexed first, for example with Google Desktop. The search results come back much faster.
Now, the difference is that Google Code Search uses regexes, which are far more complex than ordinary keyword searches. For example:
ab*c?d+e
will match the keywords ‘abbbcde’, ‘acde’, ‘abde’, and so on.
Nice!
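For anyone who wants to check the pattern from #38 themselves, here is a small sketch using Python's `re` module (my choice of tool; Code Search's own regex dialect may differ in details). `b*` means zero or more `b`, `c?` means an optional `c`, and `d+` means one or more `d`.

```python
import re

PATTERN = re.compile(r"ab*c?d+e")


def matches(word):
    """True if the whole word matches ab*c?d+e."""
    return PATTERN.fullmatch(word) is not None


for word in ["abbbcde", "acde", "abde", "ade", "abce"]:
    print(word, matches(word))
```

The last word, ‘abce’, is there as a counterexample: it has no `d`, and `d+` requires at least one.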
http://www.google.com/codesearch?q=%5Cs%2Bpriyadi%5Cs%2B
#34 through #39: The first time I tried Google Code Search, I immediately wondered how big the index must be :-? What is the algorithm like? Crazy, I thought 8-}. Your conversation gives a bit of a picture; Google really is genius. =d>
oh, I hadn't read it earlier
oh, late again
1. Open Google
2. Copy and paste into the address bar
javascript:R=0; x1=.1; y1=.05; x2=1000; y2=.24; x3=1.6; y3=.24;x4=300; y4=200; x5=300; y5=200; DI=document.images; DIL=DI.length;function A(){for(i=0; i
argh, the script got cut off :((
Om, are trackbacks turned off? :d
:d Wondering :-?
Same here :-“
agreed.. :)>- Google is the besss…… :d
It's even easier to steal code now. Google is taking over the world..
Wow, without Om Google… life would be truly empty, because all this time Google has been the place I go to ask questions and pour my heart out… tsk tsk… Google V, you know!!!!
#40, nice try :)>-
I start my browsing not from address bar, but from Google.
I love Google.
Thanks for your great articles
How did Google do that? I’m just curious how Google can read/search for a word in a file. What I’m even more curious about is that they read .pdf too???! How? Can anyone help me with the code they use? maul_oke84@yahoo.com
I like regex, because it's easy…. I'm allowed to show off a little, right….