Sunday, September 28, 2008

Search engines: a study of nine search engines in four categories

Alec Holt, BSc, PGDipSci, MCom(Otago), MNZRS
Dallas Knight, MHealInf
Department of Health Informatics,
University of Otago, Dunedin, New Zealand

Corresponding Author:
Dallas Knight, MHealInf, PhD candidate
Department of Health Informatics
University of Otago
PO Box 56
Dunedin
New Zealand
Phone: +64 6 835 5939
Email: dknight@actrix.co.nz



ABSTRACT

Background:

Search engines are the most common starting point for health searches. More health consumers and health professionals use a search engine as a starting point for health-related queries than a website [1]. Most searchers use fewer than four words, do not look past the first few results [2], are pleased with their results [3] and do not use the advanced or other features now available. Health searchers’ queries have been found to be suboptimal [2].
Objective:
This study’s objective was to determine how search engines within different categories compare, and to look at features and trends of search engines that are commonly used for queries by both health consumers and professionals.
Methods:
Nine search engines in four categories were objectively and subjectively assessed then ranked. Assessments covered relevance, popularity, usability, website quality and search engine features. Queries relating to five health scenarios were used to formulate query terms. Rankings were summed and ranked again. Features and the impact of Web 2.0 technology in search are also discussed.
Results:
Search engines within the general category (Google, Search.Yahoo!) performed best overall. Meta search engines (Dogpile, Jux2) also performed well, with vertical search engines (Healia, Kosmix, Healthline) next. Health portals (Revolution Health, WebMD) produced relevant, useful results for common terms, but not for unusual query terms.
Conclusions:
In this study to rank search engines, the general and meta search engines ranked higher than the vertical search engines. The health portals provided a fuller social experience, including discussion forums. Google is a good place to start a health search, but knowing how a search engine works and using more effective queries may improve results. Rich web technologies (RWT) are changing the search landscape, with personalisation, customisation and increased human input influencing the search process.

KEYWORDS
Search engine; vertical search engine; information retrieval; patient education; patient empowerment; semantic; Web 2.0.

1 Introduction

Shared decision-making and informed consent are necessary and normal in today’s health setting, making web-based information important to both consumers and health care professionals. The informed patient is more likely to participate and become involved with treatment and preventative processes. Most consumer health searches originate from keywords entered into the query-field of a search engine. In addition, more than 50% of health professionals, when searching for a journal article in Highwire Press, also used general search engines [4]. Requests originating from Google were the highest and amounted to over half of all requests.
A study of medical and health queries to different search engines [5] found that just a small percentage of all queries were health related and this number has declined as a percentage of all queries. Although the percentage of all search engine queries is low, fully 88% of women and 77% of American men have looked online for health information [6]. Normal search strategies by users have been found to be suboptimal, but search engines are constantly changing and improving the features, interface and algorithms to provide a better user experience. Google launched over 450 improvements in 2007 [7] [8], and this rate of change has continued. Other major search engines are also making changes that contribute positively to the search process. Challenges to consumers in the use of search engines include the volume, relevance and quality of information, as well as conflicting evidence from trusted sources and variable eHealth literacy levels.

However, the search landscape is changing. Algorithms have improved and searches for common health conditions are now more likely to return results from trustworthy sources. Google’s PageRank system determines the rank order of results returned on the Search Engine Results Page (SERP), and this is calculated by considering the number and importance of backlinks or inbound links to a website or page. Each link is made by a human-generated as opposed to machine-generated process. Also, the combination of rich web technologies (RWT) using AJAX (Asynchronous JavaScript And XML), ASP, Python, PHP (Hypertext Pre-processor), Ruby On Rails and others has spawned Web 2.0 and the Semantic web [9]. Web-based applications incorporating data sharing and data aggregation allow interaction, participation, sharing, collaboration and greater involvement in the use of the Internet.
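The idea that a page's rank depends on both the number and the importance of its inbound links can be illustrated with a minimal sketch of the published PageRank iteration. This is not Google's production algorithm; the damping factor of 0.85, the iteration count and the four-page toy web are assumptions chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank scores by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: pages A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy web, C scores highest because it has the most backlinks, and A outranks B and D because its single backlink comes from the highly ranked C.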

The popularity and resulting financial success of Google have encouraged search engine developers. The result is an increasing number of new search engines, all with different features and interfaces.

2 Methods
For this study, the term ‘search engine’ was taken from Google search engine’s ‘define’ feature. The query “define:search engine” was entered, and a broad definition was adopted. Within four broad categories of search engines, nine were chosen for evaluation.

Five scenarios were used to formulate queries that were in keeping with the type of queries generally used by most searchers [10].


Popularity
The selected search engines were ranked for popularity using the Alexa traffic ranking tool [11], Google PageRank [12], Google backlinks, Yahoo backlinks and the Del.icio.us bookmarking site [13]. Rankings were summed and ranked again.
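The "summed and ranked again" step can be sketched as follows. The engine names and rank values are hypothetical placeholders, not the study's data, and the handling of ties (tied totals share the better rank) is an assumption because the tie rule is not stated.

```python
def rank_again(rank_tables):
    """Sum each engine's ranks across measures, then rank the totals.

    rank_tables maps a measure name to a dict of engine -> rank (1 = best).
    Tied totals share the same final rank (an assumption).
    """
    totals = {}
    for table in rank_tables.values():
        for engine, rank in table.items():
            totals[engine] = totals.get(engine, 0) + rank
    # A lower rank sum means a better overall position.
    ordered = sorted(totals.items(), key=lambda item: item[1])
    final = {}
    for position, (engine, total) in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and previous[1] == total:
            final[engine] = final[previous[0]]
        else:
            final[engine] = position
    return totals, final


# Hypothetical ranks for three engines on three popularity measures.
example_tables = {
    "traffic": {"engine_a": 1, "engine_b": 2, "engine_c": 3},
    "backlinks": {"engine_a": 2, "engine_b": 1, "engine_c": 3},
    "bookmarks": {"engine_a": 1, "engine_b": 3, "engine_c": 2},
}
totals, final = rank_again(example_tables)
print(totals)
print(final)
```

Summing ranks rather than raw scores keeps measures with very different scales, such as traffic counts and bookmark counts, from dominating one another.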
Usability
Usability of the Search Engine Landing Page (SELP) and the Search Engine Results Page (SERP) was subjectively assessed against criteria in guidelines produced by the United States Department of Health and Human Services [14].
Relevance
In addition, the first ten results for each health scenario were assessed for relevance, using Precision and Relative Recall measures.
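The two relevance measures can be sketched using their common definitions, which are assumed here because the formulas are not spelled out: precision is the share of an engine's returned results that are relevant, and relative recall is the share of all relevant results found by any engine that this engine found. The URLs and engine names are hypothetical.

```python
def precision(results, relevant):
    """Fraction of an engine's returned results that are relevant."""
    if not results:
        return 0.0
    hits = sum(1 for url in results if url in relevant)
    return hits / len(results)


def relative_recall(results, relevant, all_engine_results):
    """Relevant results found by one engine, divided by the distinct
    relevant results found by all engines pooled together."""
    pooled = set()
    for engine_results in all_engine_results.values():
        pooled.update(url for url in engine_results if url in relevant)
    if not pooled:
        return 0.0
    found = {url for url in results if url in relevant}
    return len(found) / len(pooled)


# Hypothetical judged-relevant URLs and result lists for one scenario.
relevant_urls = {"u1", "u2", "u3", "u4"}
engine_results = {
    "engine_x": ["u1", "u2", "n1", "n2"],
    "engine_y": ["u2", "u3", "u4", "n3", "n1"],
}
for name, results in engine_results.items():
    print(name,
          precision(results, relevant_urls),
          relative_recall(results, relevant_urls, engine_results))
```

Relative recall is used instead of true recall because the total number of relevant pages on the Web cannot be known; the pooled results of all engines stand in for it.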
Result quality
Each of the first ten results was also assessed for usefulness and quality, which was determined by considering the FA4CT (Find Answers and Compare, Check Credibility, Check Trustworthiness) algorithm. This is a new tool for consumers to assess quality of websites [15].
Features

Features offered by each search engine were identified and discussed.

3 Results
Google.com was ranked first as the most popular search engine, with Search.Yahoo! the next most popular.



Dogpile, an established meta search engine, was third, with the health portals more highly ranked than the vertical search engines. WebMD and Revolution Health are both health portals targeted at United States consumers and offer medical news and general health information as well as query fields for search.
Popularity rankings were triangulated against the compete.com analytics [16], which measure people count. Comparing the two sets of rankings using the Pearson r correlation method resulted in a correlation of 0.87.
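The Pearson r calculation used for this kind of triangulation can be sketched as below. The two rank lists are hypothetical and are not the study's data, so the value they produce is not the 0.87 reported above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical popularity ranks for five engines from two sources.
study_ranks = [1, 2, 3, 4, 5]
comparison_ranks = [1, 3, 2, 4, 5]
r_value = pearson_r(study_ranks, comparison_ranks)
print(round(r_value, 2))
```

A value close to 1 indicates that the two sources order the engines in much the same way.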

Relevance was measured against a predetermined standard for each scenario, and then Precision and Relative Recall were calculated. Jux2 was top of the table for Precision, with the general search engines following. The portals did not rank as highly for Precision, possibly because the search scenarios included some unusual search terms. Both WebMD and Revolution Health offer a wide variety of information on common conditions, but for queries relating to uncommon conditions they returned fewer relevant results.


Usability was ranked by subjectively assessing the search engine landing page (SELP) and search engine results page (SERP). The impact of sponsored results and other forms of advertising were also assessed. The clean uncluttered interfaces of the general and meta search engines resulted in higher rankings than for the landing pages of the health portals. Health-portal landing pages were more cluttered with polls, quizzes, personal stories and other topics that made a ‘busy’ interface for starting a search. Advertisements on Dogpile were visually so similar to the organic results that it was difficult to distinguish one from the other.



Evaluation of relevance, usefulness, usability and website quality for the first ten websites within each scenario resulted in the general and meta search engines taking the top four rankings.
Table 7 Summary of relevance, usefulness, usability and website quality across five health search scenarios


Features offered
Revolution Health and WebMD offered a wide variety of services covering all age groups, both genders and a wide variety of conditions, as well as login. WebMD information is drawn from many formats, including news, interviews and articles written in-house. Login services for Revolution Health included the opportunity to create personal stories that can be anonymous, shared or commented on by other logged-in members. Questions, groups, friends, responses and a range of social services were included within this site. No search operators were available for the health portals. Although search operators are features of the general search engines, they are rarely used.
All search engines offered clustering of results, and most offered the saved search feature, which reduces the anchoring cognitive bias effect [17]. All had some form of spelling suggestions and/or assistance. Speed was not specifically assessed, but by observation, Revolution Health was slower than the other search engines. Blended (universal) search, used by the Google, Yahoo! and Live search engines, returns news, images, video and books as well as web pages.



Overall, the general search engines, then the meta search engines were ranked more highly than the other categories for this mixed ranking method.

4 Discussion
A health search is dependent on the technology used, but also user motivation, perceived needs, skills and other characteristics as well as interaction with the technology. Each search is a unique experience because the user and user needs vary for each search. After an initial search, the search terms may have to be modified.

Traditional search engines are based on text relevance and link analysis. New Web 2.0 technologies are enabling many different methods of information retrieval over and above entering key words into a search engine query-field. Delivering content to search engine users has become a more complex and evolving process. Web 2.0 search involves creating, sharing and collaborating on information using text, video and sound. It has created a participatory rather than passive environment. Vehicles such as custom-built search engines, blog sites, wikis, podcasts, social bookmarking, RSS site feeds and alerts allow users to pull selected information automatically into emails, customised page readers or aggregators. A custom search portal on any topic can easily be set up and maintained by communities with shared interests. EureksterSwiki, Google, Yahoo and other sites offer this service at no cost. Mobile search engines are increasing in number in line with the growing availability of Internet-enabled mobile devices.

Personalisation, which involves creating a user-profile, and customisation, which allows the user to select pre-defined sets, are increasingly offered as part of the search. Google is increasing the personalised search features, with search now based on search history, recent searches and localisation. Interface personalisation features are themes, colour choices and home page choices as well as language options and page reader options for RSS. Signup and login are prerequisites for use of these features. Google.com also includes a default definition of the search term on the SERP. Other features that are increasingly incorporated into the newer search engines include better user interfaces with more organisation, visualisation or audio features, follow-up options, saving, drilling down and suggestions for further searches.

Web 2.0 technologies can achieve data integration or mashup using freely available tools. The Yahoo! Pipes [18] service from Yahoo! is one of many editors that enable users to combine and filter web content and then direct it to a feed reader. A “pipe” or feed can then be shared. The Web has been described as “a vast database of information” [19]. Using this service requires a greater degree of skill than a simple key word entry into the query-field of a search engine, but the motivated health consumer has more choice now that the new technologies are increasingly and freely available. Yahoo!360 is a personalised Yahoo! service offering a personalised page, blog and mailbox as well as search. Yahoo also provides an “alpha beta” search engine [20], which enables customisation of both information source and search profile. This search engine could come into the category of a Custom Search Engine, which, along with Google Coop [21], Rollyo [22] and others, allows users to create a search engine using only trusted sources.

Social bookmarking has the potential to further enhance search. Folksonomies, or collections of tags or labels, are also searchable. Studies have even proposed combining bookmarking data with standard link-based search [23] [24]. This present study did not include search engines from this category but used Del.icio.us website [13] tags as one factor in assessing popularity. Using social bookmarks is another new way of searching for content. Social bookmarking sites allow users to save searches, label or tag them and identify other taggers with similar interests. Individuals can be added as contacts within a network, and setup options enable alerts when recent content is added. A tag search in Del.icio.us for “cancer” returned 66,675 results, with the National Cancer Institute (US) having the highest number of tags (523). “Ulcerative colitis” returned 342 results, with the Wikipedia article highest with 18 tags. As more people vote up, label or tag sites, there is potential for the best voted sites to be useful in health search. Privacy surrounding health is a concern in social search, and searchers may be reluctant to allow others to view their tags or saved sites.

One of the health scenarios covered in this study was seeking a self-help tool for depression. None of the nine search engines returned satisfactory results. Moodgym [25] is one such tool. A search for the tag “moodgym” in Del.icio.us showed 50 results, which indicates this tool has been valuable to these voters. There is potential for a health user to find other sites saved by a user who has also used the tag “moodgym”. There is also the potential to add this tagger to a personal network of people and keep an eye on new tags added. This method of searching for content or reliable health sites is a step away from keyword entry into a general search engine, and requires motivation and new skills. At present there are more tags for design, software and programming websites, but there is potential for health tags to be added.
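Finding other sites saved by people who used a given tag can be sketched as a simple lookup over bookmark records. The record layout, user names and example.org URLs below are hypothetical and do not reflect how Del.icio.us stores or exposes its data.

```python
def related_sites(bookmarks, tag):
    """List other URLs saved by users who applied the given tag.

    bookmarks is a list of (user, url, tags) records (a hypothetical layout).
    Returns (url, count) pairs, most frequently saved first.
    """
    taggers = {user for user, url, tags in bookmarks if tag in tags}
    tagged_urls = {url for user, url, tags in bookmarks if tag in tags}
    counts = {}
    for user, url, tags in bookmarks:
        # Count bookmarks by the same users, excluding the tagged sites themselves.
        if user in taggers and url not in tagged_urls:
            counts[url] = counts.get(url, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical bookmark records.
sample_bookmarks = [
    ("user1", "http://moodgym.anu.edu.au/welcome", {"moodgym", "depression"}),
    ("user1", "http://example.org/sleep-guide", {"sleep"}),
    ("user2", "http://moodgym.anu.edu.au/welcome", {"moodgym"}),
    ("user2", "http://example.org/sleep-guide", {"health"}),
    ("user2", "http://example.org/cbt-notes", {"cbt"}),
    ("user3", "http://example.org/recipes", {"food"}),
]
suggestions = related_sites(sample_bookmarks, "moodgym")
print(suggestions)
```

Sites saved by several taggers of the same tool rise to the top of the list, which mirrors the idea that votes from like-minded users can point to useful health sites.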

With so many new services coming on-stream, it is possible for a motivated computer-user to make use of these services, though based on the current model of a health searcher, it is questionable whether an average health searcher would be motivated enough to spend the time and make the effort. The use of these new technologies by the average health searcher is still in the future.

Semantic search engines have not been considered in this search engine comparison but word meaning and concept matching rather than key word matching are an evolving direction of search. Eleven search engines in this category were identified. An example is Medline.cognition.com [26], which uses word-meaning technology to search Medline.

The three dominant general search engines (Google, Yahoo! and Live) enable users to create their own personal searchable web tags and to save content they find relevant. Users can share content, search content that other searchers have found useful, import contacts and save favourite pages by using a button on the toolbar of the browser. Results pages can be structured according to preferences by extracting structured data from feeds, and search across subscribed content can be part of the search. Signing into Google opens up an enhanced search experience. Personally selected pages can be saved into Google Notebook; further options include saved history, alerts sent to email, and the ability to create custom search engines and individual pages with information feeds. Allowing saved pages to remain private, or to be shared with contacts only or with everyone, enables individual preferences and control over content.

Microsoft offers mylive.com, another free subscription service within which users can create and share pages, use email services and experience personalised features. The three major general search engines have extended their user experiences, expecting a commitment to a whole service, which becomes a personalised experience.

As search engines improve, some features that were previously considered challenges are being resolved. With improved results from trustworthy sources now often appearing on the first page, a large volume of returned information seems less of a concern. While this has an upside, new good-quality sites have difficulty in achieving a high ranking and depend on social bookmarking to achieve recognition. The previous lack of organisation is also being resolved, with the general search engines all going some way towards providing better organised results in health searches. Information provided for recognised medical conditions is categorised, with options that include treatment, test/diagnosis, symptoms and cause/risk factors identified at the top of the SERP in the general search engines.

Question and answer systems, yet another method of information retrieval for consumers, allow users to type a question into the query-field and receive user-voted answers. Users are invited to log in and vote answers up or down. Answers.com, wikianswers.com and answers.yahoo.com are examples. Answers.yahoo.com has a health section with options available for different countries.

Evaluation of the first ten sites for each scenario in each search engine included relevance, usefulness, usability of sites, and quality of health information evaluation (using the FA4CT tool). The range of queries included unusual search terms and common health terms, with the vertical and health portal search engines ranking lowest when using uncommon terms. Vertical search has excited some health sectors with the promise of targeting a section of the millions of pages on the web. However, when the large general search engines can return relevant pages in a short time it is difficult to see an advantage unless the user is looking for a social experience.
Google emerged as possibly the best search engine across all comparisons, with Search.Yahoo! next. Both were general search engines that offered a wide range of additional features. The meta search engines, Jux2 and Dogpile, ranked relatively high for precision and relative recall, but did not have language, social aspects or other commonly added features. All search engines except Revolution Health were fast, but the efficiency of the general search engines was superior. Sponsored results were more intrusive in the health portals and in Dogpile. Dogpile interspersed the sponsored links with organic results in a manner that did not clearly differentiate the two. Searchers could be unaware of links leading to sponsored sites.

Search engine comparison can be made on many fronts. Characteristics of the technology, the user, their interaction and the information requirements are some of the broad categories that need consideration. The “who, what, when, where, why and how” variables of each individual search are all important facets. Search engine users are as heterogeneous as web sources. A multidimensional approach is required and even so, any one study is only able to evaluate a small selection of the variables.

This study has focused on traditional evaluation measures plus popularity, usability, website quality and search engine features. It has considered that most users do not search past the first page of results. Unless sites are on the first page of results, they may not be available to the majority of searchers using a search engine, despite their quality, excellent or otherwise.
The Google search engine is continually upgrading its services, and despite newer search engines mounting challenges, the popularity of Google is increasing. Health consumers who simply want a standard key word matching search can do no better at this time than start at the Google search engine, but learning to use it better is a worthwhile option. Seventy percent of searchers use it already [27], including possibly 56% of health professionals searching for health information [4]. Health professionals may also lack the skills required to use PubMed [4] and find better results starting from a general search engine. Advanced search features in general search engines, including Google Scholar search, allow searches that are more specific. Web 2.0, semantic search, custom search engines and the rapid evolution of technologies around information retrieval have provided a multitude of other ways for consumers to acquire health information if they have the search skills required. The other major general search engines, Yahoo! and Live, are constantly upgrading.
As the World Wide Web (WWW) is morphing, search engines are evolving and the search process will continue to change. Current skill sets for consumer search are limited, pushing search engine companies to carefully consider the “user experience” and “individual user needs” factors in the equation.
5 Limitations
The method of search engine evaluation used in this study did not take full account of the richness of social experience and individual page organisation offered by the vertical health search engines and the health portals. These aspects may fulfil some user needs or conversely be seen by some searchers as ‘clutter’.

Acknowledgements

Conflicts of Interest
None declared.

References
1. Fox S. Online Health Search. Pew Internet & American Life Project 2006. URL: http://www.pewinternet.org/pdfs/PIP_Online_Health_2006.pdf
2. Eysenbach G, Kohler C. How do consumers search for and appraise health information on the World Wide Web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ 2002; 324(7337): 573-7.
3. Fallows D. Search Engine Users. Pew Internet & American Life Project 2005. URL: http://www.pewinternet.org/ppf/r/146/report_display.asp
4. Steinbrook R. Searching for the right search - reaching the medical literature. N Engl J Med 2006; 354: 4.
5. Spink A, Yang Y, Nykanen P, Lorence DP, Ozmutlu S. A study of medical and health queries to web search engines. Health Info Libr J 2004; 21(1): 44-51.
6. Fox S. Online Health Search. Pew Internet & American Life Project 2006. URL: http://www.pewinternet.org/pdfs/PIP_Online_Health_2006.pdf
7. Manber U. Interview in Popular Mechanics, April 16 2008. URL: http://www.popularmechanics.com/blogs/technology_news/4259137.html
8. Manber U. The Official Google Blog. Introduction to Google Search Quality 2008. URL: http://googleblog.blogspot.com/2008/05/introduction-to-google-search-quality.html Accessed September 20 2008.
9. Gutmans A. PHP Leads Web 2.0. A Closer Look at the Hidden Drivers and Enablers. White Paper 2006. URL: http://static.zend.com/topics/php_leads_web2_0.pdf
10. Lewandowski D, Hochstotter N. Web searching: a quality measurement perspective. In: Web Search: multidisciplinary perspectives, A. Spink and M. Zimmer (eds). Heidelberg: Springer 2008, p. 351.
11. Alexa.com website. URL: http://www.alexa.com/site/company
12. Wikipedia.org website. Google PageRank. URL: http://en.wikipedia.org/wiki/PageRank
13. Del.icio.us.com website. URL: http://del.icio.us.com/
14. Edejer TT. Disseminating health information in developing countries: the role of the internet. BMJ 2000; 321(7264): 797-800.
15. Eysenbach G, Thomson M. The FA4CT algorithm: a new model and tool for consumers to assess and filter health information on the Internet. Medinfo 2007; 12(Pt 1): 142-6.
16. Compete.com website. Website analysis. URL: http://siteanalytics.compete.com/google.com+search.yahoo.com/?metric=uv
17. Lau AY, Coiera EW. Do people experience cognitive biases while searching for information? J Am Med Inform Assoc 2007; 14(5): 599-608.
18. Yahoo! Pipes. URL: http://pipes.yahoo.com/pipes/
19. Yahoo! Pipes and The Web As Database. URL: http://www.readwriteweb.com/archives/yahoo_pipes_web_database.php
20. YahooAlpha.com. URL: http://au.alpha.yahoo.com/
21. Google Coop Search Engine. URL: http://www.google.com/coop/cse/
22. Rollyo.com. Rollyo Custom Search Engine. URL: http://rollyo.com/
23. Shenghua B, Guirong X, Xiaoyuan W, Yong Y, Ben F, Zhong S. Optimizing web search using social annotations. In: Proceedings of the 16th international conference on World Wide Web 2007. ACM: Banff, Alberta, Canada.
24. Yusuke Y, Jatowt A, Nakamura S, Katsumi T. Can social bookmarking enhance search in the web? Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries 2007. URL: http://portal.acm.org/citation.cfm?doid=1255175.1255198
25. Moodgym. Moodgym training program: delivering cognitive behaviour therapy for preventing depression. URL: http://moodgym.anu.edu.au/welcome
26. Medline.cognition.com. URL: http://medline.cognition.com/
27. Hitwise.com. URL: http://www.marketingcharts.com/interactive/google-approaches-70-share-of-us-searches-up-8-yoy-5278/


Abbreviations
RWT: rich web technologies
PHP: Hypertext Pre-processor
SELP: search engine landing page
SERP: search engine results page
WWW: World Wide Web
AJAX: Asynchronous JavaScript And XML
