Scratchpad

The art of search


21 Nov. 2007

Recently, I've been thinking more and more about the skills that go into an effective search, prompted in particular by my efforts to search for high-quality, academic pages written in Spanish. I was looking, specifically, for sites on the history of information.

In English, this would be a no-brainer: I'd start out with "history of (information|epistemology|knowledge)," based on my assumption that a person discussing the topic I was interested in would actually use the complete phrase "history of X." After 5 seconds of skimming result titles, I'd probably tack on something like -"history of information technology." Tweak, play, rinse, repeat. In Spanish, however, I immediately ran up against several problems:

  1. Word frequency.
    I don't actually have any idea how common particular words or phrases are in Spanish. I can do a direct translation, turning "history of information" into "historia de la informacion" (taking the time to actually use accent marks, which I am not doing right now), but I don't know if a native Spanish speaker would use this particular turn of phrase. I don't know how often they would use it. I don't know how often one word is used over another.
  2. Word pairings.
    Just as certain words are more common than others, so do certain words appear near others with degrees of regularity. Using Spanish, I don't have a clue what words might be used in conjunction with one another.
  3. Synonyms.
    Related to word frequency, I don't know which synonym for a word would be used in the context I am looking at. I don't know what particular words are considered high-brow, low-brow, pedantic, or slangy. If I'm looking at academic pages, I can assume the word chingar (fuck) will not appear, but what is the preferred alternative? "Relaciones sexuales?"
  4. Tone, source.
    In the unlikely event that I actually found a page that seemed to discuss what I was interested in, I found I had no way to evaluate the source using methods I am accustomed to. Normally a person reads something and decides, using multiple, subtle clues, whether the author seems reliable. Have they made absurd statements? Do you agree with their political sensibilities? Do they seem well-spoken, or practically illiterate? Is their language pretentious, plain, educated, insane? Is there a good flow? Does it just "sound right?" Using Spanish, I had no way to actually perform these actions.

Of course, this is important even for people who are ostensibly fluent in the language they are performing their searches in. I noticed earlier today that someone landed on my site using the search "animal planet mouse that fucks till its heart explodes" (without quotes). After getting over my initial reaction (something along the lines of, "dear god, what the fuck?"), I immediately started picking apart the search itself. Why didn't the user put "animal planet" in quotes as a phrase? Why didn't the user replace fucks with something more appropriate to an Animal Planet audience, like copulates, mates, or just good ol' sex? Why didn't the user notice, before going to the effort to click through to my site, that the Google preview of that result shows that my site has nothing to do at all with the topic being searched for?

My second reaction was to marvel at how totally inappropriate all the results were, and to realize that blogging has had a profound effect on the usability of search results. The default setting for most blogs presents 10 posts on one page - 10 posts that often have little, if anything, to do with each other. 10 posts which are, in essence, completely different documents, but which search engines are treating as a single entity. Couple this with the fact that most blog posts are informal, written on the fly, and not carefully planned for things like SEO, and it's a wonder people can find anything.
