Scratchpad

The art of search

21 Nov. 2007

Recently, I've been thinking more and more about the skills that go into an effective search, prompted in particular by my efforts to search for high-quality, academic pages written in Spanish. I was looking, specifically, for sites on the history of information.

In English, this would be a no-brainer: I'd start out with "history of (information|epistemology|knowledge)," based on my assumption that a person discussing the topic I was interested in would actually use the complete phrase "history of X." After 5 seconds of skimming result titles, I'd probably tack on something like -"history of information technology." Tweak, play, rinse, repeat. In Spanish, however, I immediately ran up against several problems:

  1. Word frequency.
    I don't actually have any idea how common particular words or phrases are in Spanish. I can do a direct translation, turning "history of information" into "historia de la informacion" (taking the time to actually use accent marks, which I am not doing right now), but I don't know if a native Spanish speaker would use this particular turn of phrase. I don't know how often they would use it. I don't know how often one word is used over another.
  2. Word pairings.
    Just as certain words are more common than others, certain words appear near others with varying degrees of regularity. In Spanish, I don't have a clue what words might be used in conjunction with one another.
  3. Synonyms.
    Related to word frequency, I don't know which synonym for a word would be used in the context I am looking at. I don't know what particular words are considered high-brow, low-brow, pedantic, or slangy. If I'm looking at academic pages, I can assume the word chingar (fuck) will not appear, but what is the preferred alternative? "Relaciones sexuales?"
  4. Tone, source.
    In the unlikely event that I actually found a page that seemed to discuss what I was interested in, I found I had no way to evaluate the source using methods I am accustomed to. Normally a person reads something and decides, using multiple, subtle clues, whether the author seems reliable. Have they made absurd statements? Do you agree with their political sensibilities? Do they seem well-spoken, or practically illiterate? Is their language pretentious, plain, educated, insane? Is there a good flow? Does it just "sound right?" Using Spanish, I had no way to actually perform these actions.
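One rough way to get a feel for problems 1 and 2 would be to count words and adjacent word pairs in a sample of text written by native speakers. The following is only a minimal sketch: the function names are made up for illustration, the sample sentence is invented, and a real attempt would need a large body of actual Spanish academic writing rather than two sentences.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters, including accented Spanish letters.
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

def frequencies(text):
    # Count single words and adjacent word pairs (a crude stand-in for "word pairings").
    words = tokenize(text)
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts

# Invented sample text, for illustration only.
sample = (
    "La historia de la información es también la historia del conocimiento. "
    "La historia de la ciencia estudia el conocimiento."
)

word_counts, pair_counts = frequencies(sample)
print(word_counts.most_common(3))
print(pair_counts.most_common(3))
```

Counts like these say nothing about tone or register (problems 3 and 4), but they would at least show which phrasings turn up often enough to be worth searching on.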

Of course, this is important even for people who are ostensibly fluent in the language they are performing their searches in. I noticed earlier today that someone landed on my site using the search "animal planet mouse that fucks till its heart explodes" (without quotes). After getting over my initial reaction (something along the lines of, "dear god, what the fuck?"), I immediately started picking apart the search itself. Why didn't the user put "animal planet" in quotes as a phrase? Why didn't the user replace fucks with something more appropriate to an Animal Planet audience, like copulates, mates, or just good ol' sex? Why didn't the user notice, before going to the effort to click through to my site, that the Google preview of that result shows that my site has nothing to do at all with the topic being searched for?

My second reaction was to marvel at how totally inappropriate all the results were, and to realize that blogging has had a profound effect on the usability of search results. The default setting for most blogs presents 10 posts on one page - 10 posts that often have little if anything to do with each other. 10 posts which are, in essence, completely different documents, but which search engines are treating as a single entity. Couple this with the fact that most blog posts are informal, written on the fly, and not carefully planned for things like SEO, and it's a wonder people can find anything.

More on Google Book Spider

5 Sep. 2007

Poking around Google Books a little more, I discovered the following path for snagging the metadata needed to compile the full details of an item:

On results page, click on "about this book." Yeah. That's it. Duh. Of course, you still can't actually search on metadata, but at least it's there...you could automate a search and retrieve everything for certain keywords, then use the metadata to do a secondary "weeding."
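The two-pass idea (search on keywords, then weed on metadata) might look something like the sketch below. It deliberately does not talk to Google Books at all: the records, their field names, and both function names are hypothetical stand-ins for whatever a spider would actually collect from the "about this book" pages.

```python
def keyword_pass(records, keyword):
    """First pass: keep records whose full text mentions the keyword."""
    keyword = keyword.lower()
    return [r for r in records if keyword in r["text"].lower()]

def metadata_pass(records, **wanted):
    """Second pass: keep records whose metadata matches every wanted field."""
    return [
        r for r in records
        if all(r["metadata"].get(field) == value for field, value in wanted.items())
    ]

# Hypothetical records, as if already scraped; the field names are assumptions.
records = [
    {
        "title": "A Novel",
        "text": "... a work of fiction ...",
        "metadata": {"genre": "fiction", "kind": "book"},
    },
    {
        "title": "Essays on Fiction",
        "text": "... the theory of fiction ...",
        "metadata": {"genre": "criticism", "kind": "journal"},
    },
]

hits = keyword_pass(records, "fiction")                      # both records mention the word
weeded = metadata_pass(hits, genre="fiction", kind="book")   # only the novel survives
print([r["title"] for r in weeded])
```

The first pass is as noisy as any full-text search; the second pass is where the human-created metadata earns its keep.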

Or, you could do this:
Google Books, search only "full view" to find complete e-books ->
On results page follow "Find this book in a library" for OCLC results ->
OCLC site, retrieve metadata for object.

Oops, is my face red. Sort of. At the same time, really, if the metadata is there, if each record is already tied to an OCLC record, is it really necessary to prevent users from searching the fields directly? Still, at least there seem to be workarounds.

Google Book Search Blows

29 Aug. 2007

All I wanted to do was write a spider to steal all of the public domain books off of Google. But I can't. You know why?

You can't specifically search the metadata on Google Books.

That's right - no metadata. For you non-library/non-tech geeks out there who have no idea what I'm talking about, that means you can't do complex searches on the "aboutness" of a book. The text of a book is just text, but the metadata includes keywords on who wrote it, where it was written, when it was written, what it is about. Since a bazillion words appear in a book, it is often useful to search strictly on human-created, trusted metadata. It takes out the extra cruft and minimizes false positive results. But without that ability, you can't go "oh, Google Books, show me just the fiction," because the word "fiction" might appear in the title of an academic paper that is about fiction but is itself non-fiction. Google, as a machine, doesn't handle the difference between those two concepts very adeptly. It also doesn't appear to appreciate the difference between a book and a journal, and you can't search for items published in a particular country.

God, I hate Google more and more with each passing day.

Do you even mention?

18 Jul. 2007

I try to keep a pretty close eye on my server logs, so I've been entertained the last couple of days to see a sudden upsurge of visitors using variations of "butt" for the search terms that led them to my site, including the following (theoretically apt) ones:

  • world's biggest natural butt (from Africa)
  • girl with the best butt (from Southern US)
  • white butt (from Brazil)

Okay, back to work. I've been trying to cook up something actually interesting for my 5 readers, but I could hardly pass this one up in the meantime.

Search result sifting using compression algorithm

10 Jul. 2007

Clever way to sift standard search results (e.g. Google) against a known good result, thereby increasing the relevance of the final results.

Wonder if it would be possible to build a front end that does all of this in one step - type in search, paste known good, hit go. It would call the script automatically and do everything in one swoop.

On Digital History Hacks, of course.
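
The note above doesn't say which compression trick is involved. A minimal sketch, assuming it is something like normalized compression distance (two texts that compress well together are probably similar), could look like this. The function names are made up here, zlib is just a convenient compressor from the Python standard library, and the result strings are invented.

```python
import zlib

def compressed_size(text):
    # Size in bytes of the zlib-compressed UTF-8 text.
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(a, b):
    # Normalized compression distance: lower means the texts share more structure.
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def sift(known_good, results):
    # Order results from most to least similar to the known good text.
    return sorted(results, key=lambda r: ncd(known_good, r))

# Invented texts, for illustration only.
known_good = (
    "The history of information traces how knowledge was recorded, "
    "stored, and shared across societies."
)
results = [
    "Cheap flights and hotel deals for your next beach holiday, book now and save.",
    known_good,
    "A survey of how information and knowledge were recorded and shared in early modern Europe.",
]

for text in sift(known_good, results):
    print(round(ncd(known_good, text), 3), text[:50])
```

A one-step front end would only need to fetch the result pages, run them through something like `sift`, and print the reordered list. On snippets this short the distances are noisy; full page text should behave better.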