Author Archives: Sébastien

About Sébastien

Sébastien is responsible for the team that works on the Exalead Web search engine. Working on Exalead.com, they fine-tune indexes and develop innovative features to make the search experience faster, more engaging, and more relevant. His team also provides Exalead customers with solutions that leverage large volumes of Web content (Web pages, images, videos, search feeds...). Along with fellow Exaleader Stephane Donze, Sébastien graduated from France’s top engineering schools, Ecole Polytechnique and Télécom Paris. He joined Exalead in 2001. His expertise centers on the structure of the web, crawling techniques, and distributed software architectures.
  • Transforming a demo into a full-scale production-ready application

    November 10th, 2009 by Sébastien Exalabs, New products & features, Products, Programming, Technology 1

    Jean Marc brought you a delightful post about Chromatik last week, with a lot of beautiful images. I will now describe in more detail how it was built. Like the making-of feature on the DVD you perhaps watched last night, this post has fewer big special effects than Jean Marc’s, but I hope to give you an insightful view of what happened behind the scenes.

    Chromatik was an elaborate demo, the result of a long effort on both the back-end and the front-end. It indexes one million images. For each image, a unique color signature was built and indexed. Our intuitive user interface exploits this index to help you filter and select images by choosing a combination of colors, luminosity, or text.
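
    To give a concrete idea of what such a color signature can look like, here is a minimal sketch (not the actual Chromatik code) that reduces an image to a small, normalized histogram of quantized colors. The Pillow library, the thumbnail size, and the number of buckets are all illustrative assumptions.

    from PIL import Image

    # Minimal color-signature sketch (illustrative only, not Chromatik's real algorithm):
    # shrink the image, quantize each channel into a few buckets, and return a
    # normalized histogram so that signatures of different images can be compared.
    def color_signature(path, buckets_per_channel=4):
        img = Image.open(path).convert("RGB").resize((64, 64))
        step = 256 // buckets_per_channel
        histogram = [0] * buckets_per_channel ** 3
        for r, g, b in img.getdata():
            bucket = ((r // step) * buckets_per_channel + g // step) * buckets_per_channel + b // step
            histogram[bucket] += 1
        pixel_count = 64 * 64
        return [count / pixel_count for count in histogram]

    A color picked in the interface can then be quantized the same way and matched against the buckets stored in the index.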

    Many people tried the Chromatik demo and liked it so much that we received several requests to integrate it into the official Exalead search site. And because the demo ran smoothly and relatively bug-free, our friends thought it would be a piece of cake. Of course, it was a bit more work than we initially expected. So what were the challenges?

    1) The front-end side

    A lot of questions needed to be answered:

    • How will I adapt the GUI of my application to integrate the new features?
    • Are all these new features necessary?
    • What is the feedback we’ve received on the different features?
    • What is the added value of these features?

    The answers to these questions determine how much space in the GUI we devote to surfacing these features.

    2) The back-end side

    Let’s begin with a little theory:

    Theorem of the factor 10 effect:
    No matter how good a developer you are, if non-trivial code has been designed and tested with only N elements, it won’t work without modifications when applied to 10 * N elements.

    Demonstration: Rather simple: if you don’t believe it, try it yourself…

    In this case we wanted a factor of 2000, so we knew the code would need some adjustments. When you know this theorem, the advantage is that you can anticipate potential problems, and the experience we have accumulated from similar situations at Exalead helps us predict most of the bottlenecks.

    Example 1: Chromatik needed 300MB of RAM, which is quite good for 1M images. But if you multiply this number by 2000, you get 600GB of RAM, which is quite large, even if the final index is distributed over multiple machines.
    We therefore decided to reduce the richness of the colors while maintaining usability, to migrate from version 4.6 to version 5.0 of Exalead CloudView, and to use a more compact encoding. In the end, the index costs only 9GB.
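
    A quick back-of-the-envelope check of those figures, using only the numbers quoted above (nothing Exalead-specific here):

    # Back-of-the-envelope check of the figures quoted above.
    demo_images = 1_000_000
    demo_ram_bytes = 300 * 1024 ** 2                      # 300MB for the 1M-image demo
    bytes_per_image = demo_ram_bytes / demo_images        # ~315 bytes of signature data per image

    target_images = 2_000_000_000                         # the two billion images of the full corpus
    naive_ram_gb = bytes_per_image * target_images / 1024 ** 3
    print(round(naive_ram_gb))                            # ~586, i.e. roughly 600GB with the old encoding

    final_ram_gb = 9                                      # what the index costs after the redesign
    print(final_ram_gb * 1024 ** 3 / target_images)       # ~4.8 bytes per image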

    Example 2: When you want to analyze two billion images, you need robust code, which means code able to handle all sorts of images, even those that do not comply with the format’s specification (RFC). It’s not that easy, when even the most widely used library in the world for basic image manipulation can crash on some images, as we reported.
    The result was that this run uncovered some bugs in our code we hadn’t seen before and therefore had to fix.
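
    In practice, robustness mostly means that one bad file must never take down the whole pipeline. Here is a minimal sketch of that defensive style, using the Pillow library purely as a stand-in for whichever decoder is actually used:

    from PIL import Image

    # Defensive decoding sketch: a malformed or hostile file is skipped and reported,
    # never allowed to crash the indexing process (Pillow is just an example library).
    def decode_or_skip(path):
        try:
            with Image.open(path) as img:
                img.load()                 # force full decoding; most corrupt files fail here
                return img.convert("RGB")
        except (OSError, ValueError, Image.DecompressionBombError) as error:
            print(f"skipping {path}: {error}")   # a real pipeline would log and quarantine the file
            return None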

    Example 3: The demo was initially a single-machine application. We needed to use the distributed-system framework included in the CloudView technology to be able to run the whole process of extracting, crawling, and indexing in only a few weeks. This framework really helped us transform the single-machine demo into a fully load-balanced and monitored application. This use case is a little different from our standard www.exalead.com processing chain, so we discovered and tweaked a few cumbersome points in the code.

    The purpose of this integration was to offer a new service to the users of the exalead.com search engine and improve the robustness of the Chromatik technology. We now better understand the impact of different tweaks on color indexing.

    Transforming a demo into a real product is not as easy as it seems. I hope this post helps you understand why a lot of companies only show you demos but never real live applications.

    At Exalead, we don’t sell demos to our customers; we sell tested and robust solutions. We make sure we work hard to test and uncover all the issues so our customers’ implementations go smoothly.

  • Guide for Webmasters: Part 1, Making the Most of Your Content

    March 17th, 2008 by Sébastien Programming, Tips and tricks 3

    Interested in improving the visibility of your site on our engine? Hopefully this series of posts will help.

    First up: answers to the two most frequently posed webmaster questions:

    1) Why doesn’t my site appear (or why does it only partially appear) when I do a site search (i.e., typing “site:mysitename.com” in the search box)?

    All or part of your site may be inaccessible to our robots. Try the following to improve your performance:

    2) Why doesn’t my site appear for a given keyword?

    • First, check to see that the keyword is in our index for your site. Enter the keyword in the search field, along with “site:mysitename.com” to limit the search for that keyword to just your site (replacing “mysitename.com” with your domain name, of course). If it is not indexed, follow the steps for question 1 above.
    • Refine the keywords in your site so they are as specific as possible. It could be that the keyword you are checking is too general, and sites that are larger, more relevant, and/or more popular are ranking ahead of your site for that keyword.
    • Verify that the content of your site corresponds well to the keyword. It’s not enough for a keyword to simply appear; it must be integrally related to the rest of the site content.

    You’ll find further info on keyword relevancy in “Search Engine Optimization (SEO): More Old-School Than You Think.”

    And be careful out there! Stick to keeping your content fresh and relevant for your target audience. Resorting to tricks like hidden text, duplicate content, spam link exchanges or other such tactics to improve your ranking could get you banned from our index (for more info, see “The Road to Better Site Indexing – Episode 2”).

    You’ll also find general webmaster tips in our site’s help pages.

  • The Road to Better Site Indexing: Episode 3, Sitemaps (based on a true story)

    August 28th, 2007 by Sébastien Programming, Tips and tricks 0

    [Photo: Humphrey Bogart]
    In our prior episodes:
    The crawler known as “Bot” travels across the web, moving from page to page and site to site by following links he discovers along the way. But Bot isn’t the type to let himself be led about aimlessly. He tries to imitate his hero Humphrey Bogart, who never shied away from a tangled web yet always managed to stay on the right track.

    But being a perfectionist, Bot wasn’t entirely satisfied with his own method. Was he overlooking a significant thread? Leaving an important page unturned? He had a hunch he could do better.

    Leaving important content in the dustbin of unindexed pages was just the sort of slip-up that really peeved Bot’s equally perfectionist client Betty, a.k.a. “The Webmaster.” Betty had specifically called on Bot to crawl her entire site, and Bot had missed several pages.

    To get their relationship back on the right track, Bot had an idea: he would ask Betty to tell him flat out everything she wanted him to know about her site. And being a guy always in the know, Bot knew just what tool Betty could use to set the record straight: a sitemap.
    He proposed; she accepted.

    Now Betty can rest easy knowing all the content she wants to share with the world will be indexed. And just what is this handy tool known as a sitemap?
    It’s actually not much more than a laundry list of links. Constructing one is a snap. You simply create a text file listing the URLs you want indexed, along with any key facts you want Bot to know (like how often a file is updated), place it anywhere you’d like, for example at the root of your web site (http://www.example.com/sitemap.xml), and give Bot its location in your robots.txt file.

    Sitemaps can be written in XML (the preferred method), or communicated via syndication feeds or simple text files. A sitemap in XML looks something like this:

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2005-01-01</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
      <url>
        <loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
        <changefreq>weekly</changefreq>
      </url>
      <url>
        <loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
      </url>
    </urlset>
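
    If you would rather generate the file than write it by hand, a small script does the job. The sketch below builds the same minimal sitemap with Python’s standard library (the URLs and frequencies are the placeholders from the example above) and recalls the one-line robots.txt entry that tells Bot where to find it:

    import xml.etree.ElementTree as ET

    # Minimal sitemap generator (a sketch; the URLs and frequencies are placeholders).
    entries = [
        {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
         "changefreq": "monthly", "priority": "0.8"},
        {"loc": "http://www.example.com/catalog?item=12&desc=vacation_hawaii",
         "changefreq": "weekly"},
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag, value in entry.items():
            ET.SubElement(url, tag).text = value          # ElementTree escapes the "&" for us

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)

    # Then tell the crawler where the file lives with a single line in robots.txt:
    # Sitemap: http://www.example.com/sitemap.xml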

    You can visit http://www.sitemaps.org/ for all the details. It’s the official site of the Sitemaps protocol, which was first proposed by Google, then fleshed out through discussions with MSN, Yahoo and Ask. It’s now the standard adopted by Google, Yahoo, Ask, and, as of July 2007, Exalead.
    But bad guys, consider yourselves forewarned: Bot knows not every webmaster is as straight up as Betty. He stays a step ahead of all nefarious sitemap tricks, checking out every list of links spun his way and skipping right over bum lists.

    Sébastien

  • The Road to Better Site Indexing – Episode 2

    July 11th, 2007 by Sébastien Programming, Tips and tricks 1

    Next in our series: “The Ballot Box Stuffers,” a.k.a. link farms.

    What’s a link farm? Let’s look at the randomly chosen site http://www.rc-car.ravemart.com. At first glance, this remote control car ecommerce site seems to be a typical small biz site plying its wares in the typical way.

    Now take a closer look by clicking on some of the text links at the bottom of the page. You’ll see this site has decided to aid the search engines by providing links to thousands of its friends’ sites. Click on the generically titled “Link” and you’ll find links to the company’s “Partners” under categories such as Debt Consolidation, Vitamins, Legal Services and Sweepstakes. Or click on “Sponsored Links” for fast access to Cheap Viagra, Casino And Poker News, Psychic Readings, or Online Dating Services.

    Similarly, visit www.all-carpets.com where under “Resources” you’ll not only find links for Home Flooring and Kitchen Design (not too far off base), but also Cash Back Credit Cards, South African Zulu Culture, and Offshore Banking (uh-oh, we’re in left field now…).

    Even larger companies and organizations may participate in these types of link programs, where all members link to all other members, often regardless of the relevance of the content, in the somewhat desperate hope of augmenting their popularity and hence appearing higher in search results.

    If you’re a web searcher trying to find quality results, don’t worry, we remain vigilant ;-). If you’re a site owner, avoid participating in link farms. You may find the strategy backfires as your site is demoted or even dropped from some search engine databases. Instead, seek out quality reciprocal links with sites with which you share a genuine relationship, and you’ll build the kind of true popularity both visitors and search engines appreciate.

    Next episode: The Integration of RSS Feeds.

    Sébastien, Web Team Head Chef

  • The Road to Better Site Indexing – Introduction and Episode 1

    by Sébastien Programming, Tips and tricks 1

    The question that usually follows “How can I make my site appear at the top of search engine results?” is “Why don’t search engines index all my pages?”

    First, you should know that pages accessible only through JavaScript or through form submissions are not reachable by search engines and therefore cannot be indexed. And there is no means for a search engine to know whether it’s missing some pages in a site, whether the missing page count is 10 or 10,000 (outside of sitemaps, which I will discuss in a future post).

    Next, let’s refresh ourselves on the fundamental methods search engines use to find the pages they index: 1) they follow a submission made by a human being (0.0001% of cases), or 2) they follow a link from another page. Therefore, if there is a link to a given page, the probability that it will be indexed is high. Conversely, a personal site with no external links to it has little chance of being indexed by a search engine. So more links are always better, right?

    Not necessarily. It should be understood that from the point of view of a search engine, a risk arises not from a dearth of links, but from too many. Why? Because search engines seek to provide the most relevant results for visitors, returning pages with the content most likely to match visitors’ needs and expectations. A site that arrived at the top of the results solely because there were tens of thousands of links to it would not pass this test. In fact, an overabundance of external links may indicate a “spamming” campaign aimed at search engines and be an indicator of poor site quality.

    Here are two cases of what we’ll call legitimate ‘overabundance,’ an overabundance of links due to valid, non-spamming factors that can be properly managed by search engines.


     

    Case 1: User Sessions

    When you visit an e-commerce site, unique “session” information will often be assigned to your computer. This information uniquely identifies your particular connection and visit. It may include, for example, a unique ID for your computer and a code for your browser version or geographic location.

    This session information tracks your movements, preferences and selections as you navigate a site. This is not for nefarious ends, but is rather used to perform practical tasks like maintaining items in your shopping cart, showing prices in your local currency or displaying a list of products you’ve viewed. This session information is most often added to the end of the URL (web address) for every page you visit.

    For instance, say you are visiting Amazon.com and you navigate to a Stanley wrench set. The URL displayed in your browser is:

    http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7

    Only the first part of the URL,

    http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/,

    is needed to locate the product information for this wrench set. The rest,

    “ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7”,

    is session information for your particular visit.

    A search engine may come across thousands of links like the longer address, each of which may appear different because unique session information is appended, and because each may show different user-dependent content such as navigation history, promotions, or recommended products. But any search engine worth its salt can discern the repetitive addresses from the essential URL, and will know this is not a case of spamming.
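
    To make “discerning the essential URL” a bit more concrete, here is a toy sketch of the idea (not Exalead’s actual logic): drop the query string and cut the path at the first segment that looks like per-visit session data. The list of session-marker prefixes is purely an illustrative assumption.

    from urllib.parse import urlsplit, urlunsplit

    # Toy canonicalization sketch (illustrative only): drop the query string and truncate
    # the path at the first session-looking segment, so the thousands of per-visit
    # variants of the Amazon URL above collapse into one essential address.
    SESSION_PREFIXES = ("ref=", "sid=", "session")        # assumed markers, purely for illustration

    def canonicalize(url):
        scheme, netloc, path, _query, _fragment = urlsplit(url)
        kept = []
        for segment in path.split("/"):
            if segment.lower().startswith(SESSION_PREFIXES):
                break                                     # everything from here on is per-visit data
            kept.append(segment)
        clean_path = "/".join(kept)
        if not clean_path.endswith("/"):
            clean_path += "/"
        return urlunsplit((scheme, netloc, clean_path, "", ""))

    long_url = ("http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/"
                "ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7")
    print(canonicalize(long_url))
    # http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/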


    Case 2: Calendar Menus

    Some sites let you navigate through their content by clicking on a calendar. For example, you may be able to peruse news articles or events on a site by choosing a date or date range.

    Such menus generate links like:
    http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=09&day=12

    A competent search engine will know which of these types of links returns valid content and which does not, and what baseline URL should be included in a search index. In other words, having a zillion external links for events on dates from 1950 to 2060 for a site with ten events will definitely not boost that site’s ranking ;-).
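
    One simple way to picture how an engine protects itself here is to group URLs by their “shape” (host, path, and parameter names, ignoring the date values) and to cap how many variants of each shape it will fetch. The sketch below only illustrates that idea; the cap of 50 is an arbitrary assumption, not a real Exalead setting.

    from collections import defaultdict
    from urllib.parse import urlsplit, parse_qsl

    # Illustration of crawl-budget limiting for calendar-style links (not Exalead's actual rules):
    # URLs that differ only in their parameter values share one "shape", and each shape
    # gets a fixed, arbitrary budget of pages to fetch.
    MAX_URLS_PER_SHAPE = 50

    def url_shape(url):
        parts = urlsplit(url)
        param_names = sorted(name for name, _value in parse_qsl(parts.query))
        return (parts.netloc, parts.path, tuple(param_names))

    def select_for_crawl(urls):
        budget = defaultdict(int)
        selected = []
        for url in urls:
            shape = url_shape(url)
            if budget[shape] < MAX_URLS_PER_SHAPE:
                budget[shape] += 1
                selected.append(url)
        return selected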

    Now you may say these two cases look like easy ones for a search engine to manage, and you’d be right. The real difficulties arise from the following three cases, because (scoop!) there are unscrupulous people out there ready to do anything to improve their search engine ranking. You’ve most likely encountered their handiwork when using a search engine other than Exalead.

    You run your search and click on a page you think is relevant, only to encounter an endless list of meaningless links or keywords, a pastiche of content “borrowed” from other more relevant sites, or an endless loop of promising links that ultimately go nowhere.

    These types of pages are generated by the folks at the top of our list of ballot-box stuffers, those trying to improve their search engine rank through:

    • Link farms and keyword stuffing,
    • Content scraping, including the abuse of RSS Feeds, and
    • Creating content labyrinths.

    We’ll be covering these tactics in upcoming episodes. In the meantime, you can see why search engines may need to limit the number of pages they index for a site. This ‘quota’ is determined based on the site’s reputation, the duplication of its content, and a thousand other parameters, all factored in an attempt to keep the game honest so web searchers get the most relevant search results possible.

     

    Sebastien, Head Chef, Web Team

  • New Things Algorithmic – Episode 2 and Epilogue

    July 10th, 2007 by Sébastien Tips and tricks 1

    Next in our series, “What’s New with Exalead’s Ranking System,” the second major improvement in the GREMLINS release:

    Ingredient Number 2: Ahhh…You Understand Me.

    GREMLINS seeks to better understand who you are and what you’re looking for, even if you’re a bit challenged in the research skills department. Say for example you live in San Francisco and launch a search for “dog therapy.” You may actually be looking for:

    1) Therapeutic options for treating your Dalmatian’s separation anxiety problem,

    2) Information about animal-assisted therapy for the aged or infirm,

    3) A California manufacturer of skateboards and skateboard accessories named “Dog Therapy,” or

    4) Recordings by a Berlin garage band named “Dog Therapy” you came across during your trip to Germany last year.

    There was never a problem for a search like 1. Sites about therapeutic treatments and treatment providers for troubled canines will top your results. Exalead has always privileged an exact match of keywords in the order they are entered, even if they weren’t enclosed in quotation marks.

    If 2 was really what you were after, GREMLINS has been refined to better understand that “therapy dogs” may be closer to your intentions than “dog therapy” per se, and it will offer up links like “Therapy dogs,” “Animal assisted therapy,” and “Therapy Dog Training” under “Related Terms” in the “Narrow Your Search” panel.

    For quests 3 and 4, the skateboard manufacturer will be privileged in your search results over the Berlin garage band as GREMLINS factors your location and language into its calculation of the relevancy of results. This will save you a lot of time in 9 out of 10 of your web queries.

    But, if you really wanted to find that garage band and didn’t think to add “Germany” or “Berlin” or “Band” to your request, all is still not lost. Check the “Languages” filter under the Narrow Your Search panel. If there’s an option for German, it might just be your band lurking there in the shadows.

    EPILOGUE

    Let’s try another search. Say you launch a request on “Martin Luther King” because you:

    1) Want to listen to one of his famous speeches,

    2) Are working on a report on kingfishers for school,

    3) Want to browse forums and blogs dedicated to Martin Luther King, or

    4) Need to find photos of Martin Luther King with a landscape orientation.

    Curious as to what GREMLINS makes of these? Have more questions about how search requests are analyzed? Grab a cup of coffee and stay tuned for further details here on Exalead’s approach to content relevancy.

    Sebastien, Head Chef, Web Team

  • New Things Algorithmic – Episode 1

    June 29th, 2007 by Sébastien Tips and tricks 0

    Asking a search engine company how they determine content relevance (and hence page ranking) is a bit like asking Coca Cola for the recipe for its famous bubbly. The odds are pretty much nil of getting a direct reply, and what’s more, just as Coca Cola has continually tweaked its concoction over the years, so search engines are continually refining the algorithms they use to deliver the most relevant search results possible.
    But, just between you and me, come what may, I’m going to lift the veil on the latest major release of the Exalead search engine: GREMLINS, which pulls together all our latest improvements in relevancy analysis.

    Ingredient Number 1: A Breath of Fresh Air in Our Index
    You launch a search request for “Beryl”:

    • Because you are a fan of 3D effects and you’re looking for the official site of the free Beryl software.
    • Because all your web 2.0 geek friends are up on Beryl and only by catching up on Wikipedia can you save face.
    • Because your name is Beryl, you’re an actress, and you want to be sure casting agents can find your personal page using Exalead.
    • Because you want to buy your girlfriend a pair of yellow Beryl gemstone earrings.

    Whatever the purpose for your search, Exalead was designed to guide you toward the most relevant results for you. Toward this end, our all-new algorithms provide a better analysis of the pages indexed, notably our evaluation of inbound links. We’ve improved our semantic interpretation of these links as well as our analysis of the links’ evolution over time. The results of these improvements are very noticeable, especially when you search on a word associated with rather specific topics, and in particular when one or more of those topics is associated with current news and events (you don’t know Beryl??).

    Next Episode:
    Ingredient Number 2: Ahhh…You Understand Me.
    Sebastien, Web Development Director

  • Search Engine Optimization (SEO): More Old-School Than You Think

    June 4th, 2007 by Sébastien Tips and tricks 10

    At least once a day someone asks me, “How can I get my site to the top of search engine results?” They’re hoping I have a secret formula ready to whisper in their ear, and invariably I disappoint them.

    In spite of the buzz surrounding SEO, achieving the best position possible for your site is really old-fashioned Marketing 101. You have to start by asking not “What do I have to do to reach the top of the results?” but rather “What is it I’m selling, and to whom?” If you are a store in Perth, Australia specializing in work boots for the local construction trade, then good luck trying to position your site on the keyword “shoes” (or, for that matter, even trying to buy an ad for that word!). You’ll be squaring off against 200 million competitors.

    And even if you succeeded in scoring well for “shoes”, would you be reaching your true potential customers? No. Better to concentrate instead on keywords and phrases like “work boots Perth” or “construction shoes Australia,” depending on where you’re willing to ship and how much competition you face as you broaden your geographic scope. This is the first phase of a good site optimization plan: identifying the keywords that most closely match the products or services you offer, and the clientele you’re trying to reach.

    So what do you do with these keywords once you’ve identified them? You head on to Phase 2, which consists of simply working them into your site content in a logical fashion, adding them to URLs, page titles, content headings and page copy. You should then ask your business partners and friends to link to your site using these keywords in the link description.

    It really operates the same way as placing a good yellow pages ad: know whom you’re targeting, where they’ll be looking for you, and in as few words as possible (preferably in your trade name!), let them know exactly what you’re selling.

    I’ll return to the topic of search engine optimization in more detail in my next post, but for now, I’ll leave you with an insider tip. Take a look at the site http://www.guppies.com/. It ranks at or near the top of all the search engines. Now notice how it’s got that animated guppy on the home page? This proves that search engines love animated guppies. So sprinkle some liberally throughout your site and you’re sure to climb in the ranks.

    Or, you can tune into my next post for some less wiggly advice ;-).

    Sebastien, Web Development Director