
Archive for the ‘Tips and tricks’ Category

June 4th, 2009

Microsoft and Google Play Catch Up

With names like Wonder Wheel and Bing xRank, it’s easy to see how someone can get caught up in the marketing fanfare of these recent technology announcements.

Fortunately for our customers, these capabilities (and more) are already available to them.

For example, Miiget is a technology that we’ve had since 2008 and is equivalent to Wonder Wheel in concept.

www.tweepz.com was built using Exalead technology as a small evening project by an Exalead consultant and is similar to xRank.

But to be fair, Microsoft had to do something and Google had to respond (or preempt?). Yet, the battle isn’t just between Microsoft and Google, but between these two and other web businesses as well.

These new offerings raise the user experience bar for every business that depends upon web traffic. As a provider to these other businesses, we’re keenly aware of that. We have been from our beginning. That’s why customers such as Yakaz, Hometrader, 118218, and Skyblog use our technology: to improve user experience and user traffic, in some cases dramatically.

Exalead believes in the concept of the long tail. That is, not everyone is satisfied with services provided by Google and Microsoft. In fact we find many users rely on multiple search sites, each for a different purpose. It’s a best-of-breed idea. This makes sense to us, since the time taken to go to a specific site is negligible compared to the length of the conversation with the site. So why not go to the best?

At least that’s how we see it. So in support of our customers, Exalead will continue to lead search technology performance and innovation. Current work in the area of semantics for increased recall, precision and insight will enable our customers to be a step ahead of their competition and provide better information with less cost and effort.

March 2nd, 2009

Exalead’s Morgan Zimmermann to Discuss Search Opportunities for Online Publishers

Since Exalead started as a web search company in 2000, we’ve gained insights into the kind of search system that users need to locate specific information across a complex set of media and data types on the web. Many online publishers are finding that standard built-in search solutions don’t fit the bill when success with readers and users depends in large part on the ease with which they can navigate through content within the website and beyond.

For instance, with the growth of video and audio as outlets for content on the web, advanced search is necessary to cull this data and make it readily available and integrated with more structured data types. In addition, users are joining the information creation process with social reviews and rankings, and online publishers need search that effectively tracks and analyzes these interactions.

Our customers have also found that highly scalable search architecture is important as the web is becoming increasingly interconnected in interesting ways. From the perspective of online publishers, there’s a great opportunity for “mash-ups” between data from the local site itself and internal databases, and useful contextual data from the web and outside applications.

On Thursday March 5th at 8:00am PT, Morgan Zimmermann, Exalead VP of Business Development, will continue our discussion about the business opportunities that advanced search presents to online publishers looking to do more with their content. Morgan will discuss how online publishers can:

- Regain complete control over their content and transform it into a long-term, organic, profitable business
- Achieve strategic independence from content aggregators and advertisers
- Secure brand positioning across a spectrum of innovative user experiences
- Use ‘mash up’ and Hybrid search to improve profitability

You can register for the webinar entitled “Online Publishers: Is Content the Only Key to Success?” here.

November 27th, 2008

FAST’s Performance Slowdown

Heard something notable at the Butler Group Enterprise Search Strategy Briefing in late November.

A rep from Scotland’s National Health Service talked through a case study of their use of FAST and offered up some … interesting … metrics.

The customer indicated that they were anticipating growing their system from 11 million documents to 18 million documents … but that this growth would require 22 servers. Considering that NHS employs a staff of roughly 150,000, and assuming all these staff run 10 searches a day over a maximum of … say … 16 hours per day, this is 1.5 million searches spread over 57,600 seconds, or roughly 26 queries per second.

This means FAST, for this implementation, needs 22 servers to run roughly 26 queries per second across 18 million docs: a little over 1 query per second per server. Without going into all the technical detail, this isn’t entirely surprising given FAST’s dependence on a slew of different technologies (which adds to the complexity of their deployment) and their need to distribute to more and more servers as the amount of content that needs to be located, searched and indexed grows (which presents a challenge for companies whose data pools are increasing … i.e. all of them).

Just for the sake of comparison, Exalead customers get 20 queries per second across 20 million docs with only 1 server — less cumbersome, more efficient and greener than the 22 servers described by NHS.

Especially in this time of economic downturn and budget belt-tightening, it’s even more crucial that businesses get the most IT bang for their buck.   Make sure you make the right choice for your information access so you can utilize your important data and preserve your corporate resources.

November 3rd, 2008

Map the Web with Gephi

Innovation is a leading priority for Exalead. That is why the company often gives its support to external initiatives such as the project set up by students from U.T.C., who developed Gephi in collaboration with the WebAtlas association. Gephi is open source software, released under the GPL3 license, for manipulating, exploring and visualizing 3D network graphics.

Carte DPI

What is this graphic about?
It represents a semantic analysis of the relationship between terms used on the Web to speak about Intellectual Property Rights in the French language. Each node symbolizes a word or a group of words, and each edge connects two expressions when these are co-cited in more than 120,000 web pages. Each color refers to a “semantic cluster”, which is a group of words that concern the same topic.

How can I get this type of graphic?
Related terms are first extracted from Exalead databases and filtered manually, and the project team receives the result as a GDF file with ordered data. Gephi then processes this file with a specific algorithm to produce the data “spatialization”. Finally, color filters highlight the different semantic clusters.
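To give a feel for the kind of file involved, here is a minimal sketch in Python that writes co-citation data in GDF, the plain-text graph format Gephi can import. The term names and page counts are hypothetical placeholders, not data from the project, and the `build_gdf` helper is our own illustration rather than the team's actual tooling. Only the 120,000-page threshold comes from the description above.

```python
# Hypothetical co-citation counts: number of web pages citing both terms.
# Placeholder values for illustration only, not data from the project.
CO_CITATIONS = {
    ("term_a", "term_b"): 250_000,
    ("term_a", "term_c"): 180_000,
    ("term_b", "term_c"): 40_000,
}
THRESHOLD = 120_000  # an edge is kept only above this count, as in the post


def build_gdf(co_citations, threshold):
    """Return GDF text: one node per term, one weighted edge per retained pair."""
    edges = [(a, b, n) for (a, b), n in sorted(co_citations.items()) if n > threshold]
    nodes = sorted({term for a, b, _ in edges for term in (a, b)})
    lines = ["nodedef>name VARCHAR,label VARCHAR"]
    lines += [f"{term},{term}" for term in nodes]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    lines += [f"{a},{b},{n}" for a, b, n in edges]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_gdf(CO_CITATIONS, THRESHOLD), end="")
```

The resulting text can be saved with a .gdf extension and opened in Gephi, where the layout and color filters are then applied.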

Here is one of the first demonstrations of Gephi with real-time spatialization of several keyword clusters. In this video, the blue color refers to a “genetics” cluster, orange nodes relate to terms about biology and laboratories, green ones concern words speaking about controversy in the domain of GMOs and purple nodes relate to innovation and research development in biotechnology.



Gephi – Dynamic demo from gephi on Vimeo
 
Congratulations to the project team for this great web mapping tool!
Do not hesitate to visit the Gephi website to obtain more information and test this software.
If you are interested in this subject, you should know that the team continues to recruit.

May 19th, 2008

Exalead: Right on target!

Exalead has been part of the server revolution, providing faster and more efficient service over the years.

This is not the first time nor the last time you will hear about our server improvements. In fact, we will be providing regular updates to address the evolution of traffic, the increase in the number of indexed pages and our improvements in service.

Here is a brief summary of the stages that have affected the life of our production center.

To begin, Exalead installed some machines in the offices of our service providers. But considering our growth, it was necessary to give them dedicated homes that did not use our equipment.

March 2005: We installed the first dedicated room with the opening of our Site 1, consisting of more than 10 machines spread across more than 6 racks. Yes, they were big machines! This allowed us to index 1 billion pages.

August 2005: We added around 30 servers to address the traffic, with the capability of indexing more than 2 billion pages.

March 2006: Then things really heated up, and we opened a second site and added more than 50 servers (10 racks) that enabled us to index more than 8 billion pages.

January 2007: As a result of the abundance of services and ideas coming out of our laboratories, we had to add more servers to Site 1.

2007 to Present: Our laboratories continue to work and prepare for an upgrade to enrich our architecture, improve speed, and become more robust and efficient. In the meantime, we had to add 20 machines to Site 1 in August 2007.

Since then, we have been actively working to put these improvements on line, so you can see the evolution, but this is not the calm before the storm…

March 17th, 2008

Guide for Webmasters: Part 1, Making the Most of Your Content

Interested in improving the visibility of your site on our engine? Hopefully this series of posts will help.

First up: answers to the two most frequently posed webmaster questions:

1) Why doesn’t my site appear (or why does it only partially appear) when I do a site search (i.e., typing “site:mysitename.com” in the search box)?

All or part of your site may be inaccessible to our robots. Try the following to improve your performance:

2) Why doesn’t my site appear for a given keyword?

  • First, check to see that the keyword is in our index for your site. Enter the keyword in the search field, along with “site:mysitename.com” to limit the search for that keyword to just your site (replacing “mysitename.com” with your domain name, of course). If it is not indexed, follow the steps for question 1 above.
  • Refine the keywords in your site so they are as specific as possible. It could be that the keyword you are checking is too general, and sites that are larger, more relevant and/or more popular are ranking ahead of your site for that keyword.
  • Verify that the content of your site corresponds well to the keyword. It’s not enough for a keyword to simply appear; it must be integrally related to the rest of the site content.

You’ll find further info on keyword relevancy in “Search Engine Optimization (SEO): More Old-School Than You Think.”

And be careful out there! Stick to keeping your content fresh and relevant for your target audience. Resorting to tricks like hidden text, duplicate content, spam link exchanges or other such tactics to improve your ranking could get you banned from our index (for more info, see “The Road to Better Site Indexing – Episode 2”).

You’ll also find general webmaster tips in our site’s help pages.

February 1st, 2008

Video Search Update, Part 3: Preview & Refine Results

Now that we’ve updated you about new platforms added to the index (Part 2), and told you how you can add your own videos, let’s take a closer look at the structure of the search results.

Enter for example ‘Daft Punk’ in the video search engine:
http://www.exalead.com/video/results?q=daft+punk
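If you would rather build that address from a script than type it, here is a minimal sketch in Python. It assumes nothing beyond what the URL above shows: the `/video/results` path and a single `q` parameter. The `video_search_url` helper is a name of our own choosing.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this post.
BASE_URL = "http://www.exalead.com/video/results"


def video_search_url(query):
    """Build a video search URL; urlencode turns spaces into '+'."""
    return f"{BASE_URL}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(video_search_url("daft punk"))
    # prints http://www.exalead.com/video/results?q=daft+punk
```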

When you click on a video’s thumbnail image, you can preview the video without leaving the search results page. Handy, huh?

You can also refine your results by confining them to a particular source, a specific video duration, or even a specific topical category and descriptive keyword.

Happy video hunting!

Refining Exalead Video Search Results

 

September 13th, 2007

Search secrets: searching like a pro with regular expressions

Well known to computer programmers, regular expressions (“regex” or “regexp” to insiders) are also a secret search weapon of librarians around the globe. A regular expression is simply a text pattern that can be used to find matching text strings. Regular expressions use wildcards and special shorthand notations to describe these patterns. Regular expressions are not available in most search engines, but they are part of Exalead’s Advanced Search options (which is one reason hard-core info-geeks are so fond of Exalead!).

What does a regular expression look like? Let’s look at an example using a period (“.”), the regular expression wildcard that stands for any single character. If you wanted to use this wildcard within a regular expression in the Exalead engine, you would first frame your query with forward-slash marks “/” to indicate it’s a regular expression, then place the period wherever you wanted variations of a single character to appear. Thus, the regular expression “/c.p/” would return matches where the “.” is replaced by any single letter, as in “cop,” “cup” and “cap”.

Now one would be hard pressed to imagine a practical reason for running a search that would return both “cop” and “cup,” but using regular expressions to search for potentially misspelled proper names, product codes or technical terms can be very handy.

Imagine for instance you’re doing some research on Exalead. To make sure you haven’t missed an important document in which Exalead has been misspelled, you might try something like “/ex.lead/” to catch variants such as “exelead” or “exilead”.

You could also try “/exa*lead/”, with the asterisk (“*”) being a regex wildcard that indicates the preceding letter can be repeated 0 or more times. A search on “/exa*lead/” would therefore return variants like “exalead”, “exaalead” and “exaaalead”.
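If you’d like to experiment with these two wildcards offline, here is a minimal sketch using Python’s standard `re` module. It is only an approximation: Exalead’s own regex matching may differ in its details, and the word list and the `matching_words` helper are made up for illustration. Note that the patterns are written without the surrounding slashes, which belong to Exalead’s query syntax.

```python
import re

# Made-up word list standing in for terms found in an index.
WORDS = ["cop", "cup", "cap", "coop",
         "exalead", "exelead", "exilead", "exaalead", "exlead"]


def matching_words(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]


if __name__ == "__main__":
    print(matching_words(r"c.p", WORDS))       # ['cop', 'cup', 'cap']
    print(matching_words(r"ex.lead", WORDS))   # ['exalead', 'exelead', 'exilead']
    print(matching_words(r"exa*lead", WORDS))  # ['exalead', 'exaalead', 'exlead']
```

The last line shows what “0 or more times” means in practice: with zero repetitions of the “a”, the pattern also matches “exlead”.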

If you wanted to exclude documents in which Exalead was correctly spelled, you could simply add “-exalead” to your query, i.e. “/exa*lead/ -exalead”, returning only matches like “exaalead” and “exaaalead”. (The minus sign is an Exalead Advanced Search option that lets you exclude words from the results for any query. Looking for company names containing “Einstein” but no time to wade through a zillion articles on Albert Einstein? Try “einstein -albert”!).
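To see how a regex and an exclusion combine, here is a toy model in Python. It assumes that a document matches when one of its words fits the pattern and is dropped when it contains the excluded word. The three “documents” and the `search` function are hypothetical and only mimic the behaviour described above; they are not how the Exalead engine is implemented.

```python
import re

# Hypothetical mini-corpus: document name -> text.
DOCS = {
    "doc1": "exalead launches video search",
    "doc2": "a review of exaalead search",
    "doc3": "exaaalead versus exalead",
}


def search(docs, pattern, excluded=None):
    """Names of docs with a word fully matching `pattern` and without `excluded`."""
    hits = []
    for name, text in docs.items():
        words = text.lower().split()
        if excluded in words:
            continue  # the minus sign removes the whole document
        if any(re.fullmatch(pattern, w) for w in words):
            hits.append(name)
    return hits


if __name__ == "__main__":
    print(search(DOCS, r"exa*lead"))             # ['doc1', 'doc2', 'doc3']
    print(search(DOCS, r"exa*lead", "exalead"))  # ['doc2']
```

In this toy model, doc3 disappears from the second result even though it contains a misspelling, because it also contains the correctly spelled word.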

Sometimes, you may not be using regular expressions to hunt for misspellings but rather to include legitimate spelling variations, like “color” (American English) and “colour” (British English). Here, you could use a vertical bar (“|”) between alternative characters or words, which is regex ‘shorthand’ for “or”. For example, entering “/gr(a|e)y/ whale” would tell ExaBot to find all matches for either “gray whale” or “grey whale.”
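The vertical bar can be tried out the same way. This sketch again uses Python’s `re` module as a stand-in, so treat it as an illustration of the notation and not as a guarantee of how the Exalead engine evaluates it. The `matches` helper is our own.

```python
import re


def matches(pattern, text):
    """True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    # Alternative characters: "a" or "e" in the third position.
    for phrase in ["gray whale", "grey whale", "groy whale"]:
        print(phrase, matches(r"gr(a|e)y whale", phrase))

    # Alternative words: American or British spelling.
    for word in ["color", "colour", "colr"]:
        print(word, matches(r"(color|colour)", word))
```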

To learn more about regular expressions, take a look at the regex Wikipedia article. Be sure to also look over all of Exalead’s Advanced Search options. Used alone or in combination (as with the “/exa*lead/ -exalead” example), they offer an easy way to inject some high-octane fuel into your next query.

August 29th, 2007

Exalead: A New Addition to the Prediction Research Toolbox?

Formulating predictions, such as the movements of the stock market or the likelihood of a movie’s success, has traditionally been a costly, and unevenly successful, endeavor. Prediction research often involves labor-intensive efforts to understand geographically localized social trends and “on-the-ground” conditions. Now, as reported in Knowledge@Wharton, two Wharton professors, Albert Saiz and Uri Simonsohn, have found a cheaper way to deliver some of the same benefits as this type of resource-intensive research: an Internet search.

Using Exalead as their Internet search tool of choice, they chose to study political corruption as a test case. They found that the Internet search results for this topic on Exalead showed a strong correlation to ‘real world’ facts regarding corruption, namely, the frequency and proximity of the word ‘corruption’ alongside various locality names and socioeconomic indicators matched known ‘real-world’ corruption linkages.

This reliable correlation means social scientists are likely to use Internet search statistics as a proxy for measuring local social trends that are otherwise difficult to assess (such as measurements within relatively closed societies), and certainly astute market researchers will be adding Internet search results analysis to their arsenal in determining the best markets for product launches or the best geographical distribution for campaign election funds.

Of course at Exalead, we’re as interested in innovative ways to use Internet search as we are pleased that these two professors assessed all the major search engines over the course of their research, and selected Exalead as the most reliable (giving high marks on reliability to Ask.com as well). The others, Simonsohn stated, either couldn’t support a single automated search or were simply too unreliable, producing radically different results from week to week. You can download the complete paper from the Social Science Research Network site.

Carole&Co

August 28th, 2007

The Road to Better Site Indexing: Episode 3, Sitemaps (based on a true story)

In our prior episodes:
The crawler known as “Bot” travels across the web, moving from page to page and site to site by following links he discovers along the way. But Bot isn’t the type to let himself be led about aimlessly. He tries to imitate his hero Humphrey Bogart, who never shied away from a tangled web yet always managed to stay on the right track.

But being a perfectionist, Bot wasn’t entirely satisfied with his own method. Was he overlooking a significant thread? Leaving an important page unturned? He had a hunch he could do better.

Leaving important content in the dustbin of unindexed pages was just the sort of slip-up that really peeved Bot’s equally perfectionist client Betty, a.k.a. “The Webmaster.” Betty had specifically called on Bot to crawl her entire site, and Bot had missed several pages.

To get their relationship back on the right track, Bot had an idea: he would ask Betty to tell him flat out everything she wanted him to know about her site. And being a guy always in the know, Bot knew just what tool Betty could use to set the record straight: a sitemap.
He proposed; she accepted.

Now Betty can rest easy knowing all the content she wants to share with the world will be indexed. And just what is this handy tool known as a sitemap?
It’s actually not much more than a laundry list of links. Constructing one is a snap. You simply create a file listing the URLs you want indexed, along with any key facts you want Bot to know (like how often a page is updated). Place it wherever you’d like, for example at the root of your web site (http://www.example.com/sitemap.xml), and give Bot its location in your robots.txt file.

Sitemaps can be written in XML (the preferred method), or communicated via syndication feeds or simple text files. A sitemap in XML looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
</url>
</urlset>
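
If your site has more than a handful of pages, you will probably want to generate this file and not write it by hand. Here is a minimal sketch in Python using the standard `xml.etree.ElementTree` module. The `build_sitemap` helper and the entry dictionaries are our own illustration. The namespace, the tag names and the `Sitemap:` line for robots.txt come from the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
SITEMAP_URL = "http://www.example.com/sitemap.xml"  # example location

# Illustrative entries mirroring the sample above; only "loc" is required.
ENTRIES = [
    {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
     "changefreq": "monthly", "priority": "0.8"},
    {"loc": "http://www.example.com/catalog?item=12&desc=vacation_hawaii",
     "changefreq": "weekly"},
    {"loc": "http://www.example.com/catalog?item=83&desc=vacation_usa"},
]


def build_sitemap(entries):
    """Return the sitemap XML as a string; special characters are escaped."""
    urlset = ET.Element("urlset", xmlns=NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = entry[tag]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print('<?xml version="1.0" encoding="UTF-8"?>')
    print(build_sitemap(ENTRIES))
    # Line to add to robots.txt so crawlers can find the sitemap:
    print(f"Sitemap: {SITEMAP_URL}")
```

Letting a library write the file has one practical benefit: the “&” characters in the catalog URLs are escaped as “&amp;amp;” automatically, which the XML format requires.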

You can visit http://www.sitemaps.org/ for all the details. It’s the official site of the Sitemaps protocol, which was first proposed by Google, then fleshed out through discussions with MSN, Yahoo and Ask. It’s now the standard adopted by Google, Yahoo, Ask, and, as of July 2007, Exalead.
But bad guys, consider yourselves forewarned: Bot knows not every webmaster is as straight up as Betty. He stays a step ahead of all nefarious sitemap tricks, checking out every list of links spun his way and skipping right over bum lists.

Sébastien