Sébastien

Sebastien is responsible for the team that works on the Exalead Web Search Engine. Working on Exalead.com, they fine-tune indexes and develop innovative features to make the search experience faster, more engaging, and more relevant. His team also provides Exalead customers with solutions that leverage large volumes of Web content (Web pages, images, videos, search feeds...). Along with fellow Exaleader Stephane Donze, Sebastien graduated from France’s top engineering schools, Ecole Polytechnique and Télécom Paris. He joined Exalead in 2001. His expertise is particularly oriented towards the structure of the web, crawling techniques and distributed software architectures.

November 10th, 2009

Transforming a demo into a full-scale production-ready application

Jean Marc brought you a delightful post about Chromatik last week, with a lot of beautiful images. I will now describe in more detail how it was built. As with the DVD you perhaps watched last night, I am afraid there will be fewer big special effects in this blog post than in Jean Marc’s post, but I hope to give you an insightful view of what happened behind the scenes.

Chromatik was an elaborate demo, the result of a long effort on both the back-end and the front-end. It indexed one million images. For each image, a unique color signature was built and indexed. Our intuitive user interface exploits this index to help you filter and select images by choosing a combination of colors, luminosity or text.

A large number of people tried and liked the Chromatik demo so much that we received several requests to integrate it into the official Exalead search site. And because the demo ran smoothly and relatively bug-free, our friends thought it would be a piece of cake. Of course, it was a bit more work than we initially expected. So where were the challenges?

1) The front-end side

A lot of questions needed to be answered:

  • How will I adapt the GUI of my application to integrate the new features?
  • Are all these new features necessary?
  • What is the feedback we’ve received on the different features?
  • What is the added value of these features?

The answers to these questions determine how much space on the GUI we devote to surfacing these features.

2) The back-end side

Let’s begin with a little theory:

Theorem of the factor 10 effect:
No matter how good a developer you are, if non-trivial code has been designed and tested with only N elements, it won’t work without modifications when applied to 10 * N elements.

Demonstration: Rather simple: if you don’t believe it, try it yourself…

In this case we wanted a factor of 1000, so we knew it would need some adjustments. The advantage of knowing this theorem is that you can anticipate potential problems. And the experience we have accumulated from similar situations at Exalead helps us predict most of the bottlenecks.

Example 1: Chromatik needed 300MB of RAM, which is quite good for 1M images. But if you multiply this number by 2000, you get 600GB of RAM, which is quite large, even if the final index is distributed over multiple machines.
We therefore decided to reduce the richness of the colors while maintaining usability, migrate from version 4.6 to version 5.0 of Exalead CloudView, and use a more compressed encoding. In the end, the index now costs only 9GB.
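
For readers who like to check the arithmetic, here is a small back-of-the-envelope sketch. It uses only the figures quoted above; the decimal units (1MB = 10^6 bytes) and the reading that the 9GB figure covers the full two-billion-image index are assumptions on my part.

```python
# Back-of-the-envelope memory scaling, using the figures from the post.
# Assumptions: decimal units, and 9 GB is the cost of the full-scale index.
MB = 10**6
GB = 10**9

demo_images = 1_000_000      # images in the Chromatik demo
demo_ram_bytes = 300 * MB    # RAM used by the demo
scale_factor = 2000          # growth in the number of images

target_images = demo_images * scale_factor
naive_ram_bytes = demo_ram_bytes * scale_factor  # if nothing were changed
final_ram_bytes = 9 * GB                         # after the optimizations

bytes_per_image_demo = demo_ram_bytes / demo_images
bytes_per_image_final = final_ram_bytes / target_images

print(f"target images:        {target_images:,}")
print(f"naive RAM:            {naive_ram_bytes / GB:.0f} GB")
print(f"demo bytes per image: {bytes_per_image_demo:.1f}")
print(f"final bytes per image:{bytes_per_image_final:.1f}")
```

Under these assumptions, the per-image budget drops from about 300 bytes to about 4.5 bytes, which gives an idea of how aggressive the color reduction and encoding had to be.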

Example 2: When you want to analyze two billion images, you need robust code, meaning code that can handle all sorts of images, even those that do not comply with a valid RFC. That is not so easy when even the most widely used library in the world for basic image manipulation can crash on some images, as we reported.
As a result, this run exposed bugs in our code that we hadn’t seen before and that we had to fix.

Example 3: The demo was initially a single machine application. We needed to use the distributed system framework included in the CloudView technology to be able to run the whole process of extracting, crawling, and indexing in only a few weeks. This framework really helped us transform the single machine demo to a fully load-balanced and monitored application. This use case is a little different than our standard www.exalead.com chain, so we discovered and tweaked a few cumbersome points in the code.

The purpose of this integration was to offer a new service to the users of the exalead.com search engine and improve the robustness of the Chromatik technology. We now better understand the impact of different tweaks on color indexing.

Transforming a demo into a real product is not as easy as it seems. I hope this post helps you understand why a lot of companies only show you demos but never real live applications.

At Exalead, we don’t sell demos to our customers; we sell tested and robust solutions. We make sure we work hard to test and uncover all the issues so our customers’ implementations go smoothly.

March 17th, 2008

Guide for Webmasters: Part 1, Making the Most of Your Content

Interested in improving the visibility of your site on our engine? Hopefully this series of posts will help.

First up: answers to the two most frequently posed webmaster questions:

1) Why doesn’t my site appear (or why does it only partially appear) when I do a site search (i.e., typing “site:mysitename.com” in the search box)?

All or part of your site may be inaccessible to our robots. Try the following to improve your performance:

2) Why doesn’t my site appear for a given keyword?

  • First, check to see that the keyword is in our index for your site. Enter the keyword in the search field, along with “site:mysitename.com” to limit the search for that keyword to just your site (replacing “mysitename.com” with your domain name, of course). If it is not indexed, follow the steps for question 1 above.
  • Refine the keywords in your site so they are as specific as possible. It could be that the keyword you are checking is too general, and sites that are larger, more relevant and/or more popular are ranking ahead of your site for that keyword.
  • Verify that the content of your site corresponds well to the keyword. It’s not enough for a keyword to simply appear; it must be integrally related to the rest of the site content.

You’ll find further info on keyword relevancy in “Search Engine Optimization (SEO): More Old-School Than You Think.”

And be careful out there! Stick to keeping your content fresh and relevant for your target audience. Resorting to tricks like hidden text, duplicate content, spam link exchanges or other such tactics to improve your ranking could get you banned from our index (for more info, see “The Road to Better Site Indexing – Episode 2”).

You’ll also find general webmaster tips in our site’s help pages.

August 28th, 2007

The Road to Better Site Indexing: Episode 3, Sitemaps (based on a true story)

In our prior episodes:
The crawler known as “Bot” travels across the web, moving from page to page and site to site by following links he discovers along the way. But Bot isn’t the type to let himself be led about aimlessly. He tries to imitate his hero Humphrey Bogart, who never shied away from a tangled web yet always managed to stay on the right track.

But being a perfectionist, Bot wasn’t entirely satisfied with his own method. Was he overlooking a significant thread? Leaving an important page unturned? He had a hunch he could do better.

Leaving important content in the dustbin of unindexed pages was just the sort of slip-up that really peeved Bot’s equally perfectionist client Betty, a.k.a. “The Webmaster.” Betty had specifically called on Bot to crawl her entire site, and Bot had missed several pages.

To get their relationship back on the right track, Bot had an idea: he would ask Betty to tell him flat out everything she wanted him to know about her site. And being a guy always in the know, Bot knew just what tool Betty could use to set the record straight: a sitemap.
He proposed; she accepted.

Now Betty can rest easy knowing all the content she wants to share with the world will be indexed. And just what is this handy tool known as a sitemap?
It’s actually not much more than a laundry list of links. Constructing one is a snap. You simply create a text file listing the URLs you want indexed, along with any key facts you want Bot to know (like how often a file is updated). You can place it anywhere you’d like, for example at the root of your web site (http://www.example.com/sitemap.xml), and give Bot its location in your robots.txt file.
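
To illustrate how a crawler might pick up that location, here is a minimal sketch that scans the text of a robots.txt file for “Sitemap:” lines, the directive described by the Sitemaps protocol. The function name and the sample robots.txt content are made up for the example; this is not a description of how Bot is actually implemented.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the sitemap locations declared in a robots.txt body.

    Hypothetical helper: looks for lines of the form 'Sitemap: <url>'.
    """
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Sample robots.txt content (invented for this example).
sample = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""

found = sitemap_urls(sample)
print(found)
```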

Sitemaps can be written in XML (the preferred method), or communicated via syndication feeds or simple text files. A sitemap in XML looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
  </url>
</urlset>
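
If you would rather generate such a file than write it by hand, the following is one possible sketch using Python’s standard xml.etree.ElementTree module. The build_sitemap function and the entries list are hypothetical names chosen for this example; the tag names and the namespace come from the sample above. Note that the library escapes the “&” characters in URLs for you.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string from a list of dicts (hypothetical helper).

    Each dict needs a 'loc' key; 'lastmod', 'changefreq' and 'priority'
    are optional.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # encoding="unicode" returns a str; '&' in URLs is escaped as '&amp;'.
    return ET.tostring(urlset, encoding="unicode")


entries = [
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": 0.8},
    {"loc": "http://www.example.com/catalog?item=12&desc=vacation_hawaii",
     "changefreq": "weekly"},
]

sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```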

You can visit http://www.sitemaps.org/ for all the details. It’s the official site of the Sitemaps protocol, which was first proposed by Google, then fleshed out through discussions with MSN, Yahoo and Ask. It’s now the standard adopted by Google, Yahoo, Ask, and, as of July 2007, Exalead.
But bad guys, consider yourselves forewarned: Bot knows that not every webmaster is as straight up as Betty. He stays a step ahead of all nefarious sitemap tricks, checking out every list of links spun his way and skipping right over bum lists.

Sébastien

July 11th, 2007

The Road to Better Site Indexing – Episode 2

Next in our series: “The Ballot Box Stuffers,” a.k.a. link farms.

What’s a link farm? Let’s look at the randomly chosen site http://www.rc-car.ravemart.com. At first glance, this remote control car ecommerce site seems to be a typical small biz site plying its wares in the typical way.

Now take a closer look by clicking on some of the text links at the bottom of the page. You’ll see this site has decided to aid the search engines by providing links to thousands of its friends’ sites. Click on the generically titled “Link” and you’ll find links to the company’s “Partners” under categories such as Debt Consolidation, Vitamins, Legal Services and Sweepstakes. Or click on “Sponsored Links” for fast access to Cheap Viagra, Casino And Poker News, Psychic Readings, or Online Dating Services.

Similarly, visit www.all-carpets.com where under “Resources” you’ll not only find links for Home Flooring and Kitchen Design (not too far off base), but also Cash Back Credit Cards, South African Zulu Culture, and Offshore Banking (uh-oh, we’re in left field now…).

Even larger companies and organizations may participate in these types of link programs, where all members link to all other members, often regardless of the relevance of the content, in the somewhat desperate hope of augmenting their popularity and hence appearing higher in search results.

If you’re a web searcher trying to find quality results, don’t worry, we remain vigilant ;-) . If you’re a site owner, avoid participating in link farms. You may find the strategy backfires as your site is demoted or even dropped from some search engine databases. Instead, seek out quality reciprocal links with sites with which you share a genuine relationship, and you’ll build the kind of true popularity both visitors and search engines appreciate.

Next episode: The Integration of RSS Feeds.

Sebastien, Web Team Head Chef

July 11th, 2007

The Road to Better Site Indexing – Introduction and Episode 1

The question that usually follows “How
can I make my site appear at the top of search engine results?” is “Why don’t
search engines index all my pages?”

First, you should know that pages
accessible only through JavaScript or through form submissions are not
reachable by search engines and therefore cannot be indexed. And there is
no means for a search engine to know whether it’s missing some pages in a site,
whether the missing page count is 10 or 10,000 (outside of site maps, which I
will discuss in a future post).

Next, let’s refresh ourselves on the
fundamental methods search engines use to find the pages they index: 1) They
follow a submission made by a human being (0.0001% of cases), or 2) They follow
a link from another page. Therefore, if there is a link to a given page, the
probability that it will be indexed is high. Conversely, a personal site with
no external links to it has little chance of being indexed by a search engine.
So more links are always better, right?
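
To make the “follow a link” step concrete, here is a minimal sketch of link discovery built on Python’s standard html.parser module. It is only an illustration and not Exalead’s crawler: the LinkExtractor class, the extract_links function and the sample page are invented for the example. It collects plain anchor links and ignores anything that would require running JavaScript or submitting a form, which is why such pages stay invisible.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of plain <a href="..."> links (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # forms, onclick handlers, etc. are not followed
        for name, value in attrs:
            if name == "href" and value and not value.lower().startswith("javascript:"):
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Sample page (invented): one ordinary link, one JavaScript-only link,
# one click handler and one form.
page = """
<a href="/about.html">About</a>
<a href="javascript:openPage(7)">Page 7</a>
<span onclick="go('/hidden.html')">Hidden</span>
<form action="/search"><input name="q"></form>
"""

links = extract_links(page, "http://www.example.com/index.html")
print(links)
```

Only the ordinary link is discovered; the other three pages would need a link from somewhere else, or a sitemap entry, to be found.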

Not necessarily. It should be understood that from the point of view of
a search engine, a risk arises not from a dearth of links, but from too many.
Why? Because search engines seek to provide the most relevant results for
visitors, returning pages with the content most likely to match visitors’ needs
and expectations. A site that arrived at the top of the results solely because
there were tens of thousands of links to it would not pass this test. In fact,
an overabundance of external links may indicate a “spamming” campaign aimed at
search engines and be an indicator of poor site quality.

Here are two cases of what we’ll call legitimate ‘overabundance,’ an overabundance of links due to valid, non-spamming factors that can be properly managed by search engines.


 

Case 1: User Sessions

When you visit
an e-commerce site, unique “session” information will often be assigned to your
computer. This information uniquely identifies your particular connection and
visit. It may include, for example, a unique ID for your computer and a code
for your browser version or geographic location.

This session information tracks your movements,
preferences and selections as you navigate a site. This is not for nefarious
ends, but is rather used to perform practical tasks like maintaining items in
your shopping cart, showing prices in your local currency or displaying a list
of products you’ve viewed. This session information is most often added to the end
of the URL (web address) for every page you visit.

For instance, say you are visiting Amazon.com and you
navigate to a Stanley wrench set. The URL displayed in your browser is

http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7

Only the first part of the URL,

http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/

is needed to locate the product information for this wrench set. The rest,

“ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7”

is session information for your particular visit.

A search engine may come across thousands of links
like the longer address, each of which may appear different because unique session
information is appended, and because each may show different user-dependent
content such as navigation history, promotions, or recommended products. But
any search engine worth its salt can discern the repetitive addresses from the
essential URL, and will know this is not a case of spamming.
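
As a rough illustration of that idea, the sketch below reduces the Amazon address from the example to its essential URL using Python’s standard urllib.parse module. The rule it applies (drop the query string and everything from the first path segment starting with “ref=”) is an assumption made for this one example; a real search engine has to learn such patterns for each site, and this is not a description of Exalead’s method.

```python
from urllib.parse import urlsplit, urlunsplit


def essential_url(url: str) -> str:
    """Strip session-like suffixes from a URL (hypothetical rule for this example)."""
    parts = urlsplit(url)
    kept = []
    for segment in parts.path.split("/"):
        if segment.startswith("ref="):
            break  # everything from here on is treated as session information
        kept.append(segment)
    path = "/".join(kept)
    if not path.endswith("/"):
        path += "/"
    # Drop the query string and fragment entirely.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


visited = [
    "http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/"
    "ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7",
    # Same product reached in another (made-up) session.
    "http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/"
    "ref=sr_1_2/111-0000000-0000000?ie=UTF8&s=hi&qid=1181650700&sr=1-2",
]

unique_pages = {essential_url(u) for u in visited}
print(unique_pages)
```

Both addresses collapse to the same essential URL, so the page is counted once rather than thousands of times.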


Case 2: Calendar Menus

Some sites let you navigate through their
content by clicking on a calendar. For example, you may be able to peruse news
articles or events on a site by choosing a date or date range.

Such menus generate links like:
http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=09&day=12

A competent search engine will know which
of these types of links returns valid content and which does not, and what
baseline URL should be included in a search index. In other words, having a
zillion external links for events on dates from 1950 to 2060 for a site with ten
events will definitely not boost that site’s ranking ;-) .
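
Here is a small sketch of how the baseline URL for such a calendar link could be derived, again with Python’s standard urllib.parse module. The set of date parameter names (year, month, day) is an assumption taken from the sample link above, and the function name is invented; other sites will use other parameter names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters only select a date and never new content.
DATE_PARAMS = {"year", "month", "day"}


def baseline_url(url: str) -> str:
    """Remove date-selection parameters from a calendar URL (illustrative)."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in DATE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


calendar_link = ("http://www.ecvd.eu/index.php?option=com_events&task=view_month"
                 "&Itemid=32&year=2011&month=09&day=12")
other_date = ("http://www.ecvd.eu/index.php?option=com_events&task=view_month"
              "&Itemid=32&year=1950&month=01&day=01")

print(baseline_url(calendar_link))
```

Links for every date between 1950 and 2060 reduce to the same baseline URL, which is the only one worth keeping in the index.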

Now you may say these two cases look like
easy ones for a search engine to manage, and you’d be right. The real
difficulties arise from the following three cases, because (scoop!) there are
unscrupulous people out there ready to do anything to improve their search
engine ranking. You’ve most likely encountered their handiwork when using a
search engine other than Exalead.

You run your search and click on a page you
think is relevant, only to encounter an endless list of meaningless links or
keywords, a pastiche of content “borrowed” from other more relevant sites, or
an endless loop of promising links that ultimately go nowhere.

These types of pages are generated by the
folks at the top of our list of ballot-box stuffers, those trying to improve
their search engine rank through:

* Link farms and keyword stuffing,

* Content scraping, including the abuse of
RSS Feeds, and

* Creating content labyrinths.

We’ll be covering these tactics in upcoming
episodes. In the meantime, you can see why search engines may need to limit the
number of pages they index for a site. This ‘quota’ is determined based on the
site’s reputation, the duplication of its content, and a thousand other
parameters, all factored in an attempt to keep the game honest so web searchers
get the most relevant search results possible.

 

Sebastien, Head Chef, Web Team

July 10th, 2007

New Things Algorithmic – Episode 2 and Epilogue

Next in our series, “What’s New with
Exalead’s Ranking System,” the second major improvement in the GREMLINS
release:

Ingredient
Number 2: Ahhh…You Understand Me.

GREMLINS seeks to better understand who you
are and what you’re looking for, even if you’re a bit challenged in the
research skills department. Say for example you live in San Francisco
and launch a search for “dog therapy.” You may actually be looking for:

1) Therapeutic options for
treating your Dalmatian’s separation anxiety problem,

2) Information about
animal-assisted therapy for the aged or infirm,

3) A California manufacturer of skateboards and
skateboard accessories, named “Dog Therapy,” or

4) Recordings by a Berlin garage band named “Dog Therapy” you came across during your trip to Germany last year.

There was never a problem for a search like
1. Sites about therapeutic treatments and treatment providers for troubled
canines will top your results. Exalead has always privileged an exact match of
keywords in the order they are entered, even if they weren’t enclosed in
quotation marks.

If 2 was really what you were after,
GREMLINS has been refined to better understand that “therapy dogs” may be
closer to your intentions than “dog therapy” per se, and it will offer up links
like “Therapy dogs,” “Animal assisted therapy,” and “Therapy Dog Training”
under “Related Terms” in the “Narrow Your Search” panel.

For quests 3 and 4, the skateboard
manufacturer will be privileged in your search results over the Berlin garage
band as GREMLINS factors your location and language into its calculation of the
relevancy of results. This will save you a lot of time in 9 out of 10 of your
web queries.
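
To give a feel for what “factoring in location and language” can mean, here is a deliberately simplistic toy scorer. It is emphatically not the GREMLINS algorithm: the Page class, the score function and every weight in it are invented for illustration. It only shows how an exact phrase match and a match on the searcher’s language and country could each nudge a result upward.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    text: str
    language: str  # e.g. "en", "de"
    country: str   # e.g. "US", "DE"


def score(page: Page, query: str, user_language: str, user_country: str) -> float:
    """Toy relevancy score; all weights are arbitrary assumptions."""
    words = query.lower().split()
    text = page.text.lower()
    total = 0.0
    total += sum(1.0 for word in words if word in text)  # each query word found
    if " ".join(words) in text:
        total += 2.0  # bonus: words appear together, in the order entered
    if page.language == user_language:
        total += 1.0  # bonus: page is in the searcher's language
    if page.country == user_country:
        total += 1.0  # bonus: page is from the searcher's country
    return total


pages = [
    Page("Dog Therapy (band)", "Dog Therapy, Garagenband aus Berlin", "de", "DE"),
    Page("Dog Therapy Skateboards", "Dog Therapy skateboards and accessories", "en", "US"),
]

# A searcher in San Francisco, browsing in English.
ranked = sorted(pages, key=lambda p: score(p, "dog therapy", "en", "US"), reverse=True)
print([p.title for p in ranked])
```

With these made-up weights, the Californian skateboard maker comes out ahead of the Berlin band for a searcher in San Francisco, and the order flips for a German-speaking searcher in Germany.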

But if you really wanted to find that
garage band and didn’t think to add “Germany”
or “Berlin” or “Band” to your request, all is still not lost. Check the “Languages” filter
under the “Narrow Your Search” panel. If there’s an option for German, it might
just be your band lurking there in the shadows.

EPILOGUE

Let’s try another search. Say you launch a
request on “Martin Luther King” because you:

1) Want to listen to one of his
famous speeches,

2) Are working on a report on
kingfishers for school,

3) Want to browse forums and blogs
dedicated to Martin Luther King, or

4) Need to find photos of Martin
Luther King in landscape orientation.

Curious as to what GREMLINS makes of these?
Have more questions about how search requests are analyzed? Grab a cup of
coffee and stay tuned for further details here on Exalead’s approach to content
relevancy.

Sebastien, Head Chef, Web Team

June 29th, 2007

New Things Algorithmic – Episode 1

Asking a search engine company how it determines content relevance (and hence page ranking) is a bit like asking Coca-Cola for the recipe for its famous bubbly. The odds of getting a direct reply are pretty much nil, and what’s more, just as Coca-Cola has continually tweaked its concoction over the years, so search engines are continually refining the algorithms they use to deliver the most relevant search results possible.
But, just between you and me, come what may, I’m going to lift the veil on the latest major release of the Exalead search engine: GREMLINS, which pulls together all our latest improvements in relevancy analysis.

Ingredient Number 1: A Breath of Fresh Air in Our Index
You launch a search request for “Beryl”:

  • Because you are a fan of 3D effects and you’re looking for the official site of the free Beryl software.
  • Because all your web 2.0 geek friends are up on Beryl and only by catching up on Wikipedia can you save face.
  • Because your name is Beryl, you’re an actress, and you want to be sure casting agents can find your personal page using Exalead.
  • Because you want to buy your girlfriend a pair of yellow Beryl gemstone earrings.

Whatever the purpose for your search, Exalead was designed to guide you toward the most relevant results for you. Toward this end, our all-new algorithms provide a better analysis of the pages indexed, notably our evaluation of inbound links. We’ve improved our semantic interpretation of these links as well as our analysis of the links’ evolution over time. The results of these improvements are very noticeable, especially when you search on a word associated with rather specific topics, and in particular when one or more of those topics is associated with current news and events (you don’t know Beryl??).

Next Episode:
Ingredient Number 2: Ahhh…You Understand Me.
Sebastien, Web Development Director

June 4th, 2007

Search Engine Optimization (SEO): More Old-School Than You Think

At least once a day someone asks me “How can I get my site to the top of search engine results?” They’re hoping I have a secret formula ready to whisper in their ear, and invariably I disappoint them.

In spite of the buzz surrounding SEO, achieving the best position possible for your site is really old-fashioned Marketing 101. You have to start by asking not “what do I have to do to reach the top of the results?” but rather “what is it I’m selling, and to whom??” If you are a store in Perth, Australia specializing in work boots for the local construction trade, then good luck trying to position your site on the keyword “shoes” (or for that matter even trying to buy an ad for that word!!). You’ll be squaring off against 200 million competitors.

And even if you succeeded in scoring well for “shoes”, would you be reaching your true potential customers? No. Better to concentrate instead on keywords and phrases like “work boots Perth” or “construction shoes Australia,” depending on where you’re willing to ship and how much competition you face as you broaden your geographic scope. This is the first phase of a good site optimization plan: identifying the keywords that most closely match the products or services you offer, and the clientele you’re trying to reach.

So what do you do with these keywords once you’ve identified them? You head on to Phase 2, which consists of simply working them into your site content in a logical fashion, adding them to URLs, page titles, content headings and page copy. You should then ask your business partners and friends to link to your site using these keywords in the link description.

It really operates the same way as placing a good yellow pages ad: know whom you’re targeting, where they’ll be looking for you, and in as few words as possible (preferably in your trade name!), let them know exactly what you’re selling.

I’ll return to the topic of search engine optimization in more detail in my next post, but for now, I’ll leave you with an insider tip. Take a look at the site http://www.guppies.com/. It ranks at or near the top of all the search engines. Now notice how it’s got that animated guppy on the home page? This proves that search engines love animated guppies. So sprinkle some liberally throughout your site and you’re sure to climb in the ranks.

Or, you can tune into my next post for some less wiggly advice ;-) .

Sebastien, Web Development Director