Archive for the ‘Programming’ Category

March 17th, 2008

Guide for Webmasters: Part 1, Making the Most of Your Content

Interested in improving the visibility of your site on our engine? Hopefully this series of posts will help.

First up: answers to the two most frequently posed webmaster questions:

1) Why doesn’t my site appear (or why does it only partially appear) when I do a site search (i.e., typing “site:mysitename.com” in the search box)?

All or part of your site may be inaccessible to our robots. Try the following to improve your performance:

2) Why doesn’t my site appear for a given keyword?

  • First, check to see that the keyword is in our index for your site. Enter the keyword in the search field, along with “site:mysitename.com” to limit the search for that keyword to just your site (replacing “mysitename.com” with your domain name, of course). If it is not indexed, follow the steps for question 1 above.
  • Refine the keywords in your site so they are as specific as possible. The keyword you are checking may be too general, so that sites that are larger, more relevant and/or more popular rank ahead of yours for that keyword.
  • Verify that the content of your site corresponds well to the keyword. It’s not enough for a keyword to simply appear; it must be integrally related to the rest of the site content.
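
If you run this check often, the query text is easy to build programmatically. Here is a minimal sketch in Javascript; the function name buildSiteQuery is made up for illustration, and only the “site:” operator comes from this post.

```javascript
// Hypothetical helper: builds the text to type into the search box.
// Only the "site:" operator comes from the post; the rest is illustrative.
function buildSiteQuery(keyword, domain) {
  // Trim stray whitespace, then drop any scheme and trailing slash.
  var cleanDomain = domain.replace(/^\s+|\s+$/g, "")
                          .replace(/^https?:\/\//i, "")
                          .replace(/\/+$/, "");
  var cleanKeyword = keyword.replace(/^\s+|\s+$/g, "");
  // No space between "site:" and the domain.
  var siteFilter = "site:" + cleanDomain;
  // Without a keyword this is the plain site search from question 1.
  return cleanKeyword === "" ? siteFilter : cleanKeyword + " " + siteFilter;
}

var query = buildSiteQuery("wrench set", "http://mysitename.com/");
// query: "wrench set site:mysitename.com"
```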

You’ll find further info on keyword relevancy in “Search Engine Optimization (SEO): More Old-School Than You Think.”

And be careful out there! Stick to keeping your content fresh and relevant for your target audience. Resorting to tricks like hidden text, duplicate content, spam link exchanges or other such tactics to improve your ranking could get you banned from our index (for more info, see “The Road to Better Site Indexing – Episode 2”).

You’ll also find general webmaster tips in our site’s help pages.

January 15th, 2008

Video Search Update, Part 2: New Sites Indexed

In our Video Search Update, Part 1, we told you how we broadened our index to include your direct submissions. We have now enlarged the index once again, adding these popular sites:

zdnet.fr

thatvideosite.com

comedycentral.com

videonetart.com

AskANinja.com

wat.tv

wideo.fr

blip.tv

veoh.com

onowa.com

nasa.gov

video.on.nytimes.com

latelelibre.fr

sports.espn.go.com

feeds.reuters.com

stage6.divx.com

stupidvideos.com

livevideo.com

video.lequipe.fr

archive.org

channels.ourmedia.org

revver.com


Now that the RSS mode has been activated, all that’s needed to add a new source is to locate its corresponding RSS feed. So if you find a good video source, send us the feed URL.

January 11th, 2008

Video Search Update, Part 1: Submit Your Video!

After indexing Dailymotion, YouTube and Metacafe, we decided to enlarge our index by enabling you to submit your videos to our crawler directly.

We currently support the Media RSS format, adopted by the majority of video content distributors: http://en.wikipedia.org/wiki/Media_RSS

All you need to do is send us your feed URL so our crawler can fetch it.

Once your feed is submitted, our crawler will check back regularly to verify that the video is still available.

An example of a video feed:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<title>My site</title>
<link>http://www.mysite.com/rss/mrss.xml</link>
<description>Videos published on my site</description>
<item>
<author>jane56</author>
<title>Interview with Tom</title>
<link>http://www.mysite.com/video/1</link>
<description>Tom responds to my questions about the new product.</description>
<guid isPermaLink="true">http://www.mysite.com/video/1</guid>
<pubDate>Sun, 25 Nov 2007 08:42:00 +0000</pubDate>
<media:content url="http://www.mysite.com/player/1/interview_de_tom.swf"
type="application/x-shockwave-flash" duration="325" />
<media:thumbnail url="http://www.mysite.com/vimages/1.jpg" width="340"
height="250" />
<media:keywords>Tom, interview, new</media:keywords>
<media:rating scheme="urn:simple">nonadult</media:rating>
<media:category>Entertainment</media:category>
</item>
</channel>
</rss>

<guid>:
- The guid tag contains the URL of a page where the video can be found. When a user runs a search and clicks on a result, he/she will be directed to this URL.

<media:thumbnail>:
- The thumbnail tag contains a link to a descriptive image for the video.

<pubDate>:
- This tag is for the publication date of the video.

<media:content>:
- This tag contains a direct link to the video. The type attribute is a standard video MIME type, and the duration attribute gives the length of the video in seconds.

<media:keywords>:
- A list of keywords associated with the video.

<media:category>:
- One or more categories associated with the video.

<media:rating scheme=”urn:simple”>:
- Indicates if the content is ‘adult’ or ‘nonadult’ (suitable for minors) in nature.

This list is not exhaustive. See http://search.yahoo.com/mrss for further specification details.

Contact us if you have any technical questions!

August 28th, 2007

The Road to Better Site Indexing: Episode 3, Sitemaps (based on a true story)

In our prior episodes:
The crawler known as “Bot” travels across the web, moving from page to page and site to site by following links he discovers along the way. But Bot isn’t the type to let himself be led about aimlessly. He tries to imitate his hero Humphrey Bogart, who never shied away from a tangled web yet always managed to stay on the right track.

But being a perfectionist, Bot wasn’t entirely satisfied with his own method. Was he overlooking a significant thread? Leaving an important page unturned? He had a hunch he could do better.

Leaving important content in the dustbin of unindexed pages was just the sort of slip-up that really peeved Bot’s equally perfectionist client Betty, a.k.a. “The Webmaster.” Betty had specifically called on Bot to crawl her entire site, and Bot had missed several pages.

To get their relationship back on the right track, Bot had an idea: he would ask Betty to tell him flat out everything she wanted him to know about her site. And being a guy always in the know, Bot knew just what tool Betty could use to set the record straight: a sitemap.
He proposed; she accepted.

Now Betty can rest easy knowing all the content she wants to share with the world will be indexed. And just what is this handy tool known as a sitemap?
It’s actually not much more than a laundry list of links. Constructing one is a snap. You simply create a text file listing the URLs you want indexed, along with any key facts you want Bot to know (like how often a file is updated), and place it on your web site, for example at the root: http://www.example.com/sitemap.xml. You then give Bot the location in your robots.txt file, with a line such as “Sitemap: http://www.example.com/sitemap.xml”.

Sitemaps can be written in XML (the preferred method), or communicated via syndication feeds or simple text files. A sitemap in XML looks something like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.example.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.example.com/catalog?item=83&amp;desc=vacation_usa</loc>
</url>
</urlset>
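
If your list of URLs already lives in a script, a sitemap like this can be generated rather than typed by hand. The sketch below is one possible way to do it in Javascript, not an official tool: the names escapeXml and buildSitemap and the shape of the entries array are assumptions made for illustration. Note that an ampersand in a URL has to be written as &amp;amp; inside XML.

```javascript
// Escape the characters that are special in XML text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: "entries" is an array of objects with a required
// "loc" and optional "lastmod", "changefreq" and "priority" properties.
function buildSitemap(entries) {
  var lines = [
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
    "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
  ];
  var optional = ["lastmod", "changefreq", "priority"];
  for (var i = 0; i < entries.length; i++) {
    var entry = entries[i];
    lines.push("<url>");
    lines.push("<loc>" + escapeXml(entry.loc) + "</loc>");
    // Only emit the optional tags that were actually provided.
    for (var j = 0; j < optional.length; j++) {
      var name = optional[j];
      if (entry[name] !== undefined) {
        lines.push("<" + name + ">" + escapeXml(entry[name]) + "</" + name + ">");
      }
    }
    lines.push("</url>");
  }
  lines.push("</urlset>");
  return lines.join("\n");
}

var sitemap = buildSitemap([
  { loc: "http://www.example.com/", lastmod: "2005-01-01", changefreq: "monthly", priority: 0.8 },
  { loc: "http://www.example.com/catalog?item=12&desc=vacation_hawaii", changefreq: "weekly" }
]);
```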

You can visit http://www.sitemaps.org/ for all the details. It’s the official site of the Sitemaps protocol, which was first proposed by Google, then fleshed out through discussions with MSN, Yahoo and Ask. It’s now the standard adopted by Google, Yahoo, Ask, and, as of July 2007, Exalead.
But bad guys, consider yourselves forewarned: Bot knows not every webmaster is as straight up as Betty. He stays a step ahead of all nefarious sitemap tricks, checking out every list of links spun his way and skipping right over bum lists.

Sébastien

July 11th, 2007

The Road to Better Site Indexing – Episode 2

Next in our
series: “The Ballot Box Stuffers,” a.k.a. link farms.

What’s a link
farm? Let’s look at the randomly chosen site http://www.rc-car.ravemart.com. At
first glance, this remote control car ecommerce site seems to be a typical
small biz site plying its wares in the typical way.

Now take a closer
look by clicking on some of the text links at the bottom of the page. You’ll
see this site has decided to aid the search engines by providing links to
thousands of its friends’ sites. Click on the generically titled “Link” and
you’ll find links to the company’s “Partners” under categories such as Debt
Consolidation, Vitamins, Legal Services and Sweepstakes. Or click on
“Sponsored Links” for fast access to Cheap Viagra, Casino And Poker
News, Psychic Readings, or Online Dating Services.

Similarly, visit www.all-carpets.com where under “Resources” you’ll not only find links
for Home Flooring and Kitchen Design (not too far off base), but also Cash Back
Credit Cards, South African Zulu Culture, and Offshore Banking (uh-oh, we’re in
left field now…).

Even larger
companies and organizations may participate in these types of link programs,
where all members link to all other members, often regardless of the relevance
of the content, in the somewhat desperate hope of augmenting their popularity
and hence appearing higher in search results.

If you’re a web
searcher trying to find quality results, don’t worry, we remain vigilant ;-).
If you’re a site owner, avoid participating in link farms. You may find the
strategy backfires as your site is demoted or even dropped from some search
engine databases. Instead, seek out quality reciprocal links with sites with
which you share a genuine relationship, and you’ll build the kind of true
popularity both visitors and search engines appreciate.

Next episode: The
Integration of RSS Feeds.

Sebastien, Web
Team Head Chef

July 11th, 2007

The Road to Better Site Indexing – Introduction and Episode 1

The question that usually follows “How
can I make my site appear at the top of search engine results?” is “Why don’t
search engines index all my pages?”

First, you should know that pages
accessible only through JavaScript or through form submissions are not
reachable by search engines and therefore cannot be indexed. And there is
no means for a search engine to know whether it’s missing some pages in a site,
whether the missing page count is 10 or 10,000 (outside of sitemaps, which I
will discuss in a future post).

Next, let’s refresh ourselves on the
fundamental methods search engines use to find the pages they index: 1) They
follow a submission made by a human being (0.0001% of cases), or 2) They follow
a link from another page. Therefore, if there is a link to a given page, the
probability that it will be indexed is high. Conversely, a personal site with
no external links to it has little chance of being indexed by a search engine.
So more links are always better, right?
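
To make method 2 concrete, here is a toy sketch of link following in Javascript. It is an illustration only, not how any real crawler is built: the link graph, the page names and the crawl function are all made up, and a real crawler fetches pages over the network instead of reading them from an object.

```javascript
// Toy link graph: each key is a page, each value lists the pages it links to.
// The page names are made up for this sketch.
var links = {
  "home": ["products", "about"],
  "products": ["wrench"],
  "about": [],
  "wrench": ["home"],
  "orphan": ["home"] // links out, but no other page links to it
};

function has(object, key) {
  return Object.prototype.hasOwnProperty.call(object, key);
}

// Breadth-first traversal: start from one page and follow links
// until no unvisited page is left.
function crawl(graph, startPage) {
  var visited = {};
  var order = [];
  var queue = [startPage];
  visited[startPage] = true;
  while (queue.length > 0) {
    var page = queue.shift();
    order.push(page);
    var outgoing = has(graph, page) ? graph[page] : [];
    for (var i = 0; i < outgoing.length; i++) {
      var target = outgoing[i];
      if (!has(visited, target)) {
        visited[target] = true;
        queue.push(target);
      }
    }
  }
  return order;
}

var found = crawl(links, "home");
// found: ["home", "products", "about", "wrench"]
```

The “orphan” page never shows up in the result: it links to other pages, but nothing links to it, so a crawler that only follows links has no way to discover it.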

Not necessarily. It should be understood that from the point of view of
a search engine, a risk arises not from a dearth of links, but from too many.
Why? Because search engines seek to provide the most relevant results for
visitors, returning pages with the content most likely to match visitors’ needs
and expectations. A site that arrived at the top of the results solely because
there were tens of thousands of links to it would not pass this test. In fact,
an overabundance of external links may indicate a “spamming” campaign aimed at
search engines and be an indicator of poor site quality.

Here are two cases of what we’ll call legitimate ‘overabundance,’ an overabundance of links due to valid, non-spamming factors that can be properly managed by search engines.


 

Case 1: User Sessions

When you visit
an e-commerce site, unique “session” information will often be assigned to your
computer. This information uniquely identifies your particular connection and
visit. It may include, for example, a unique ID for your computer and a code
for your browser version or geographic location.

This session information tracks your movements,
preferences and selections as you navigate a site. This is not for nefarious
ends, but is rather used to perform practical tasks like maintaining items in
your shopping cart, showing prices in your local currency or displaying a list
of products you’ve viewed. This session information is most often added to the end
of the URL (web address) for every page you visit.

For instance, say you are visiting Amazon.com and you
navigate to a Stanley wrench set. The URL displayed in your browser is:

http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7

Only the first part of the URL,

http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/

is needed to locate the product information for this wrench set. The rest,

ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7

is session information for your particular visit.

A search engine may come across thousands of links
like the longer address, each of which may appear different because unique session
information is appended, and because each may show different user-dependent
content such as navigation history, promotions, or recommended products. But
any search engine worth its salt can discern the repetitive addresses from the
essential URL, and will know this is not a case of spamming.
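
As a rough illustration of the idea, the sketch below reduces the Amazon address from this example to its essential part. It is a deliberately naive heuristic, not a description of how any search engine actually works: the function name stripSessionInfo is made up, and the rule that session data starts at the first “ref=” path segment is an assumption that only fits this one example.

```javascript
// Hypothetical normalizer for the example URL in this post.
// Assumption: the query string, plus everything from the first "ref="
// path segment onward, is session information.
function stripSessionInfo(url) {
  // Drop the query string (everything from the first "?").
  var queryStart = url.indexOf("?");
  var withoutQuery = queryStart === -1 ? url : url.substring(0, queryStart);
  // Drop the path from the first "/ref=" segment onward, keeping the slash.
  var refStart = withoutQuery.indexOf("/ref=");
  return refStart === -1 ? withoutQuery : withoutQuery.substring(0, refStart + 1);
}

var product = "http://www.amazon.com/Stanley-92-716-Combination-Wrench-22-Piece/dp/B000JPUCT0/";
var visitA = product + "ref=sr_1_7/002-6118145-0432018?ie=UTF8&s=hi&qid=1181650669&sr=1-7";
var visitB = product + "ref=sr_1_2/111-0000000-0000000?ie=UTF8&s=hi&qid=1181650999&sr=1-2"; // invented second visit

// Two different-looking addresses reduce to the same essential URL.
var sameProduct = (stripSessionInfo(visitA) === stripSessionInfo(visitB)); // true
```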


Case 2: Calendar Menus

Some sites let you navigate through their
content by clicking on a calendar. For example, you may be able to peruse news
articles or events on a site by choosing a date or date range.

Such menus generate links like:
http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=09&day=12

A competent search engine will know which
of these types of links returns valid content and which does not, and what
baseline URL should be included in a search index. In other words, having a
zillion external links for events on dates from 1950 to 2060 for a site with ten
events will definitely not boost that site’s ranking ;-).
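
Here is a small sketch of what finding the baseline URL could look like for the calendar link above. It rests on assumptions: the helper name removeQueryParams is made up, and treating year, month and day as the only date-dependent parameters is a guess based on how this one URL looks.

```javascript
// Hypothetical helper: removes the named query parameters from a URL.
function removeQueryParams(url, namesToRemove) {
  var queryStart = url.indexOf("?");
  if (queryStart === -1) {
    return url; // no query string, nothing to do
  }
  var base = url.substring(0, queryStart);
  var pairs = url.substring(queryStart + 1).split("&");
  var kept = [];
  for (var i = 0; i < pairs.length; i++) {
    var name = pairs[i].split("=")[0];
    var remove = false;
    for (var j = 0; j < namesToRemove.length; j++) {
      if (namesToRemove[j] === name) {
        remove = true;
      }
    }
    if (!remove) {
      kept.push(pairs[i]);
    }
  }
  return kept.length > 0 ? base + "?" + kept.join("&") : base;
}

var calendarUrl = "http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32&year=2011&month=09&day=12";
var baseline = removeQueryParams(calendarUrl, ["year", "month", "day"]);
// baseline: "http://www.ecvd.eu/index.php?option=com_events&task=view_month&Itemid=32"
```

Every date a visitor could click on collapses to this single baseline address, which is why a long run of calendar links counts for very little.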

Now you may say these two cases look like
easy ones for a search engine to manage, and you’d be right. The real
difficulties arise from the following three cases, because (scoop!) there are
unscrupulous people out there ready to do anything to improve their search
engine ranking. You’ve most likely encountered their handiwork when using a
search engine other than Exalead.

You run your search and click on a page you
think is relevant, only to encounter an endless list of meaningless links or
keywords, a pastiche of content “borrowed” from other more relevant sites, or
an endless loop of promising links that ultimately go nowhere.

These types of pages are generated by the
folks at the top of our list of ballot-box stuffers, those trying to improve
their search engine rank through:

* Link farms and keyword stuffing,

* Content scraping, including the abuse of
RSS Feeds, and

* Creating content labyrinths.

We’ll be covering these tactics in upcoming
episodes. In the meantime, you can see why search engines may need to limit the
number of pages they index for a site. This ‘quota’ is determined based on the
site’s reputation, the duplication of its content, and a thousand other
parameters, all factored in an attempt to keep the game honest so web searchers
get the most relevant search results possible.

 

Sebastien, Head Chef, Web Team

June 13th, 2007

Learning Javascript - Part I

Many developers hate Javascript. Nevertheless, they are asked every day to “add a bit of AJAX on the website”.

This post is dedicated to these developers.

Javascript is a rich language: it is object oriented and easy to learn.

The basics
The Mozilla Developer Center is the best documentation source for Javascript. To learn the syntax and the basic data types, read A re-introduction to Javascript.

Objects
In Javascript, an object is basically a hash. The simplest way to create an object is:

var o = {};
o.name = "hello";
o.setName = function(name) {
  this.name = name;
};

In this example, o is an object, its name property is set to hello, and its setName property is a function.
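
To see the “object as hash” idea at work, here is a short sketch; the language property is just an illustrative key.

```javascript
var o = {};
o.name = "hello";
o["language"] = "Javascript"; // bracket syntax, same effect as o.language = ...

// Dot and bracket notation read the same property.
var sameValue = (o.name === o["name"]); // true

// for...in walks the keys, as with any hash.
var keys = [];
for (var key in o) {
  if (o.hasOwnProperty(key)) {
    keys.push(key);
  }
}
// keys: ["name", "language"]

// Properties can be removed again.
delete o.language;
var stillThere = ("language" in o); // false
```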

Classes and inheritance.
Javascript handles the notion of class in an unusual way:

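  • a Class is an object of type Function
  • this object has a magic member called prototype
  • when the function is invoked with the new operator, a new empty object is created and linked to that prototype, so the prototype’s members are shared by every instance rather than copied; the function then runs as the constructor, with this bound to the new object.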

Example:

var MyClass = function() {
  // constructor
  this.name = "default name";
};
MyClass.prototype = {
  //prototype
  setName: function(name) {
    this.name = name;
  }
};

var o = new MyClass();
alert(o.name); // "default name"
o.setName("new name");
alert(o.name); // "new name"

There are multiple ways to write classes. For a full overview, read the excellent Classical inheritance in Javascript.
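
One detail is worth seeing in code: instances reach the members of the prototype through a live link, so all instances share them. The sketch below reuses the MyClass example; getName is an extra method added for illustration.

```javascript
var MyClass = function() {
  this.name = "default name";
};
MyClass.prototype.setName = function(name) {
  this.name = name;
};

var a = new MyClass();
var b = new MyClass();

// Both instances share one setName function through the prototype.
var shared = (a.setName === b.setName); // true
var own = a.hasOwnProperty("setName");  // false: it lives on the prototype

// A method added to the prototype later is visible on existing instances.
MyClass.prototype.getName = function() {
  return this.name;
};
a.setName("new name");
var names = [a.getName(), b.getName()]; // ["new name", "default name"]
```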

Access control
In Javascript, every property of an object is public. The best way to preserve the notion of privacy is through convention. For instance, at Exalead we prefix private members with an underscore.
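
Here is a minimal sketch of the underscore convention. The Counter class is made up for illustration and is not taken from Exalead’s code.

```javascript
var Counter = function() {
  this._count = 0; // leading underscore: private by convention only
};
Counter.prototype = {
  increment: function() {
    this._count = this._count + 1;
    return this; // allows chained calls
  },
  getCount: function() {
    return this._count;
  }
};

var c = new Counter();
c.increment().increment();
var total = c.getCount(); // 2

// Nothing enforces the convention: outside code can still write to the member.
c._count = 100;
var afterTampering = c.getCount(); // 100
```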

Everything Dynamic
In Javascript, everything can be changed at all times. Even methods. This is particularly useful when coding event based User Interfaces:

o.onReceiveSomeEvent = function() {
// do something
};
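
As a small illustration, a handler can be swapped while the program is running. The log array below is only there to record which version was called.

```javascript
var o = {
  log: [],
  onReceiveSomeEvent: function() {
    this.log.push("default handler");
  }
};

o.onReceiveSomeEvent();

// Replace the method at runtime.
o.onReceiveSomeEvent = function() {
  this.log.push("custom handler");
};

o.onReceiveSomeEvent();
// o.log: ["default handler", "custom handler"]
```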

It is even possible to enrich basic data types with your own methods by adding methods to their prototype.

Example:

String.prototype.blank = function() {
  // \s matches any whitespace character
  return /^\s*$/.test(this);
};
"hello".blank(); // false
"     ".blank(); // true

The DOM - Document Object Model

Javascript runs in the browser in an HTML page. A representation of that page - the DOM - is available to Javascript through the global window object (as window.document). Of course, it would be too easy if the DOM were identical between browsers. How, then, do people write cross-browser code? There are at least two approaches:

Approach #1:

if (navigator.appName == "Microsoft Internet Explorer"
    && navigator.appVersion >= "4.0") {
  element.attachEvent("onclick", function() {alert("click")});
}

if (navigator.appName != "Microsoft Internet Explorer"
    && navigator.appName != "Netscape") {
  element.addEventListener("click", function() {alert("click")});
}

This is the most trivial approach and the least elegant and scalable. It makes sharing code a nightmare and will inevitably make you hate Javascript. Don’t use it, ever.

Approach #2:

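if (element.addEventListener) {
  // W3C DOM event model
  element.addEventListener("click", function() {alert("click")}, false);
} else if (element.attachEvent) {
  // Internet Explorer event model; only used when addEventListener is
  // missing, so the handler is never attached twice
  element.attachEvent("onclick", function() {alert("click")});
}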

This is a lot better. This approach uses one of Javascript’s strengths: testing for the existence of a function. It provides maximal compatibility with minimal browser knowledge.
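
The same test can be wrapped in a small reusable function so the check is written only once. This is a sketch: the name addEvent is made up and not taken from any particular library. It returns as soon as one event model is found, so a browser that supports both never registers the handler twice.

```javascript
// Hypothetical helper: attaches a handler using whichever event model exists.
function addEvent(element, type, handler) {
  if (element.addEventListener) {
    // W3C DOM event model
    element.addEventListener(type, handler, false);
    return true;
  }
  if (element.attachEvent) {
    // Internet Explorer event model, which expects an "on" prefix
    element.attachEvent("on" + type, handler);
    return true;
  }
  return false; // neither model is available
}

// In a page, usage could look like this ("myButton" is a made-up id):
// addEvent(document.getElementById("myButton"), "click", function() { alert("click"); });
```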

That’s all for today folks! In my next post, we’ll dissect the Prototype Javascript Framework.

- Damucho, for the WebDev team