Web searching for the researcher

Writers of every kind need to do research. Even if you’re writing pure fantasy, you occasionally need to get human physiology right or check cultural allusions. This often means using the search engines, but they make it hard. They give you popular results instead of accurate matches. They give you recent stuff while ignoring older pages. With some work, though, you can force useful results out of them.

I’ll be focusing here on DuckDuckGo. Google annoys me in too many ways to mention, and I get tired of people giving it free advertising by telling me to “google” something. DuckDuckGo has its own problems, but at least it doesn’t second-guess you based on your earlier Web activity. They’re similar in a lot of ways, so a lot of this advice will work with other search engines as well.

How do search engines work?

I don’t actually know how modern search engines work, but it’s obvious that they aren’t simply looking for pages that match your search terms. You’ll often find that the top match in your results doesn’t have all the words you searched for. This is how I think they work. I could be wrong.

Think of Web pages as asteroids floating in space. Each one has a certain location and mass. Now think of your search as a capsule adrift among them. Its location is your search string. (For now, imagine there are only three possible search terms, corresponding to the three dimensions of space.) An asteroid’s mass is the page’s popularity. An obscure page with few links to it is a mere pebble. A world-famous page is as massive as Ceres.

Your capsule will be drawn to asteroids that are both close by and massive. A pebble could be a lot closer to you than Ceres is, but it won’t exert a lot of pull. You’ll be drawn to the big rocks, even if they’re farther away, i.e., less relevant to your search.
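Here is the analogy as a few lines of Python. It is only a toy version of my guess, not a description of any real ranking algorithm. The pull formula (mass divided by squared distance) and all the numbers are assumptions I made up for illustration.

```python
import math

def pull(capsule, rock_position, rock_mass):
    """Gravity-style score: popularity divided by squared distance.

    The tiny constant avoids dividing by zero when a page sits
    exactly on the search. Formula and constant are illustrative.
    """
    distance = math.dist(capsule, rock_position)
    return rock_mass / (distance ** 2 + 1e-9)

# The search sits at the origin of a three-term search space.
capsule = (0.0, 0.0, 0.0)

# Made-up rocks: a close, obscure page and a distant, famous one.
pebble = {"name": "pebble", "position": (1.0, 0.0, 0.0), "mass": 1.0}
ceres = {"name": "Ceres", "position": (10.0, 0.0, 0.0), "mass": 1000.0}

ranked = sorted(
    [pebble, ceres],
    key=lambda rock: pull(capsule, rock["position"], rock["mass"]),
    reverse=True,
)
print([rock["name"] for rock in ranked])
```

With these numbers the famous page comes out on top even though it is ten times farther from the search.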

Now think of these rocks as floating not in three-dimensional space, but in a space with tens of thousands of dimensions, corresponding to possible search terms. Their location in search space isn’t simply a matter of matching your search terms. A different term, which the system implementors deem similar, can change a page’s position. If I search for my name, “McGath,” then pages which don’t have my name but have “McGrath” in them are deemed closer to my search than many pages that do have my name.

These rocks must be made of an unstable element, since their mass (search rank) decays over time. Try searching for general information on a subject when there was big news about it just yesterday. You can’t possibly be interested in information that’s more than a week old, can you?
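The decay can be sketched the same way. This is a guess at the shape of the effect, not a known formula. The one-week half-life and the page masses are numbers I invented for the example.

```python
def decayed_mass(mass, age_days, half_life_days=7.0):
    """Halve a page's mass for every half-life that has passed.

    The 7-day half-life is an arbitrary assumption.
    """
    return mass * 0.5 ** (age_days / half_life_days)

# A solid reference page from ten weeks ago versus a minor news item
# from yesterday. Both starting masses are made up.
reference_page = decayed_mass(100.0, age_days=70)
news_item = decayed_mass(10.0, age_days=1)

print(reference_page, news_item)
```

Under these assumptions the ten-week-old reference page ends up with far less pull than yesterday’s news, even though it started out ten times as massive.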

Search tricks on DuckDuckGo

The results have only a loose connection to what you searched for. SEO, not relevance, is the overlord of search space. You have to boost your thrusters to avoid being dragged down the gravity well of irrelevant results. One way to do this is to know all the options you have. Let’s look at DuckDuckGo’s syntax documentation.

Sometimes narrowing your search to one domain helps. You can use site to do this:

governor site:ny.gov

(You can click on any of the examples to perform the actual search, but the results could be different from what I got.)
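If you’d rather build the search link yourself, here is a minimal sketch. It assumes DuckDuckGo’s ordinary search address, which takes the query in a q parameter. The function name ddg_url is my own invention, not anything DuckDuckGo provides.

```python
from urllib.parse import urlencode

def ddg_url(terms, site=None):
    """Build a DuckDuckGo search link, optionally limited to one site.

    ddg_url is a hypothetical helper, not part of any DuckDuckGo API.
    """
    query = terms if site is None else f"{terms} site:{site}"
    return "https://duckduckgo.com/?" + urlencode({"q": query})

print(ddg_url("governor", site="ny.gov"))
# prints https://duckduckgo.com/?q=governor+site%3Any.gov
```

The space becomes a plus sign and the colon becomes %3A. That is normal URL encoding, and the search engine decodes it back into the original query.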

You might want to exclude a site from your search, using -site. This is useful while looking for third-party reviews. However, it doesn’t always overcome massive gravity.

google -site:google.com

The first match I get with that search is Google.com, even though it should be excluded. Other results include Google.dk, Google.mn, and Google.ch. But you also find the Wikipedia page about Google, Google applications for your iPhone, and something called “Google Frightgeist.”

Site restriction doesn’t work so well with hosting sites that give subdomains to their customers. For instance:

support site:livejournal.com

That gets you not only LiveJournal’s support pages, but posts in personal journals that support causes, talk about their support networks, and so on.
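My guess is that the site restriction matches any host name ending in the domain you give, which would explain why the personal journals come along. If you have a list of result links, you can filter it more strictly yourself. The sketch below shows both kinds of matching. The URLs and the user name are made up, and the suffix rule is my assumption about the behavior, not documented fact.

```python
from urllib.parse import urlsplit

def in_domain(url, domain):
    """Loose match: the host is the domain itself or any subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)

def on_exact_host(url, hosts):
    """Strict match: the host must be one of the names listed."""
    host = (urlsplit(url).hostname or "").lower()
    return host in hosts

# Hypothetical result links, for illustration only.
results = [
    "https://www.livejournal.com/support/",
    "https://someuser.livejournal.com/1234.html",
]

loose = [u for u in results if in_domain(u, "livejournal.com")]
strict = [u for u in results
          if on_exact_host(u, {"livejournal.com", "www.livejournal.com"})]

print(len(loose), len(strict))
```

The loose match keeps both links. The strict one keeps only the link on the main site.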

Continued in Part 2.
