Information Retrieval Systems
Design, query, and evaluate information retrieval systems.
STATEMENT OF COMPETENCY
Subject databases and web search engines are two prominent examples of information retrieval systems, which academic librarians engage with daily. As such, understanding the principles by which these systems are designed, matching systems to corresponding information needs, and possessing the search strategies necessary for comprehensive navigation are crucial skills that all information professionals should attain.
Design
Information retrieval (IR) systems are web-based organization and discovery tools, which compile documents and/or records—otherwise known as information entities—and facilitate their searchability and findability via user interfaces. Weedman (2019a) notes that IR systems are inherently built upon a data structure: an arrangement of fields (which establish, and sometimes combine, conceptual boundaries of search terms that can be inputted) and field values (accepted search terms, corresponding to assigned fields, which can supply search results) (p. 123).
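The relationship between fields and field values can be illustrated with a minimal sketch. The records, field names, and search function below are hypothetical examples rather than features of any particular system:

```python
# A minimal, hypothetical data structure: each record maps fields
# (conceptual categories) to field values (accepted terms).
records = [
    {"title": "Cataloging Basics", "subject": "cataloging", "year": 2019},
    {"title": "Search Strategies", "subject": "online searching", "year": 2021},
]

def search(records, field, value):
    """Return records whose given field matches the supplied field value."""
    return [r for r in records if r.get(field) == value]

matches = search(records, "subject", "online searching")
print([r["title"] for r in matches])  # ['Search Strategies']
```

In this sketch, a query only succeeds when it supplies an accepted field value for an established field, mirroring the conceptual boundaries Weedman describes.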
Within IR systems, information entities are typically assigned metadata, which is defined by the National Information Standards Organization (NISO) (2004) as “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (as cited in Weedman, 2019b, p. 10). In addition to offering information regarding a given entity’s authorship, publisher, and place and time of origin, metadata might convey its subject or aboutness via subject headings or descriptors. When the creator of an IR system standardizes a set of descriptors, they have effectively generated a controlled vocabulary, which can be utilized for the conceptual grouping of related entities (p. 10). While navigating a controlled vocabulary requires a developed understanding of online searching, it can efficiently yield relevant, manageable sets of search results. Given the inflexibility of descriptors within a controlled vocabulary, IR systems typically supply thesauri, which include entry vocabulary (pointing the user toward valid descriptors, based on similarly worded terms), as well as broader terms, narrower terms, and related terms, which establish relationships between descriptors in a hierarchical context (Weedman, 2019a, pp. 130-135).
Keyword searches—familiar to virtually all users of Google or library OPACs—are facilitated by natural language: a process by which an entity’s full text, title, and/or abstract are rendered searchable (Weedman, 2019b, p. 10). Depending on the algorithm implemented by the designer, a given IR system might rank keyword search results according to relevance, prominence, or frequency of inputted terms, or some combination thereof (p. 11). While smaller-scale databases may intentionally index documents for findability, larger search engines typically use automated programs known as “web crawlers” or “spiders” to evaluate and index information entities (p. 12).
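Frequency-based ranking can be made concrete with a deliberately simplified sketch; real search engines combine frequency with many additional signals, but the core idea is to score each document by how often the query terms appear:

```python
# Simplified frequency-based ranking: score each document by the number
# of query-term occurrences, then sort in descending order of score.
def rank_by_frequency(documents, query_terms):
    def score(text):
        words = text.lower().split()
        return sum(words.count(term.lower()) for term in query_terms)
    return sorted(documents, key=score, reverse=True)

docs = [
    "metadata describes and locates resources",
    "search engines index the searchable web for search queries",
]
ranked = rank_by_frequency(docs, ["search"])
print(ranked[0])  # the document containing "search" most often ranks first
```

A relevance- or prominence-weighted ranking would replace this scoring function with one that weighs where and how terms occur, rather than merely how often.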
Given the user-facing nature of IR systems, usability presents a frequent concern during the design process. Krug (2014) advocates for “self-evident”, “effortless” usability within user interfaces (p. 19), and points to individualized usability testing of IR system prototypes as an effective, practical method for ensuring ease of navigation, as well as determining the information needs and behaviors of target users (p. 113). Weedman (2019c) describes the value of surveys, interviews, and focus groups for determining user preferences within IR systems, as well as paper-based card sorting exercises, for purposes of understanding user expectations regarding IR system organization and layout (pp. 253-259).
Querying
Before formulating search strategies for use within a given IR system, one must first evaluate the purpose and structure of the system at hand. Take Google, for example; given its breadth of coverage, as well as its automated, algorithm-driven incorporation of searchable content, it does not employ a controlled vocabulary. As such, users are limited to keyword searches, but can apply a set of techniques in order to target searches effectively.
Phrase searching involves placing multiple words within quotation marks, thus searching for chosen words in sequence. Users can perform wildcard searches to account for multiple variations of a given word or root (e.g., inputting “vot*” will allow for results containing “vote”, “votes”, “voter”, “voters”, and “voting”, while inputting “wom?n” will allow for results containing “woman” or “women”). By inputting Boolean operators of “AND”, “OR”, and “NOT” users can establish relationships between search terms, and expand or narrow the scope of their results as needed. Site searching allows the user to seek results emerging from a specific website, or a specific domain category. For example, attaching “site:.edu” to one’s search terms will yield results from verified educational institutions, while attaching “site:.census.gov” will yield results from the U.S. Census Bureau.
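The wildcard behavior described above can be mimicked with Python’s standard fnmatch module, whose patterns use * and ? in just this way (a sketch for illustration; actual wildcard syntax varies by database):

```python
import fnmatch

terms = ["vote", "votes", "voter", "voters", "voting",
         "woman", "women", "wombat"]

# "*" matches any run of characters after the root "vot"
print(fnmatch.filter(terms, "vot*"))   # ['vote', 'votes', 'voter', 'voters', 'voting']

# "?" matches exactly one character, so "wom?n" catches "woman" and "women"
# but not "wombat", which has too many characters for the pattern
print(fnmatch.filter(terms, "wom?n"))  # ['woman', 'women']
```

Boolean operators map just as directly onto set operations: “AND” intersects result sets, “OR” unites them, and “NOT” subtracts one from another.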
All these techniques (except for site searching) are generally applicable in OPACs and other federated search engines, which may use Library of Congress Subject Headings for classification, but are generally designed for use via keyword searching (Mann, 2008, p. 85). They prove useful within many subscription databases as well, alongside subject searching. While Google and other commercial search engines can prove useful for fulfilling immediate information needs, and performing searches in a casual context, library OPACs and subscription databases are generally better suited to locating—and gaining access to—scholarly content.
Large, cross-disciplinary citation and abstracting databases—such as Scopus (Elsevier) or Web of Science (Clarivate)—are not built upon controlled vocabularies, and therefore also rely upon keyword searching techniques. However, one of their key strengths lies in their ability to draw connections between documents based on common citations. Upon locating an article relevant to one’s information needs, consulting “cited by” and “related documents” listings will yield articles with overlapping citations and potential thematic similarity; Mann (2015) notes the value of these features, as they generally prove more efficient and effective than “footnote chasing” (p. 159). Scopus and Web of Science additionally allow users to select a search filter titled “review”, which limits all results to literature review articles: comprehensive overviews of given research topics, which typically contain many useful citations (pp. 162-165).
Specialized subject databases—such as Sociological Abstracts (ProQuest) or Food Science Source (EBSCO)—are typically built upon controlled vocabularies, which are viewable within user-facing thesauri, accessible within their interfaces. Thesauri generally allow users to locate descriptors within a hierarchical listing, or locate them via a search box. Search results within thesauri can often be ranked according to relevance, and an applied entry vocabulary will suggest assigned descriptors, given similarly worded inputs. Upon locating relevant descriptors, users can click them, in turn automatically adding them to a search box within the database. By way of Boolean operators, or additional search boxes, users can combine multiple descriptors, or supplement subject searches with keyword terms. While library OPACs and cross-disciplinary resources like Scopus are appealing for their sheer breadth of coverage, subject-specific databases bearing controlled vocabularies are additionally useful for their ability to ensure conceptually relevant search results.
Evaluation
Evaluation procedures are a crucial component of IR system maintenance. Ceri et al. (2013) note that while notions of “relevance” in IR contexts are largely subjective—depending on the nature of a given information need, as well as a given searcher’s perspective—evaluation measures can improve the structural integrity of IR systems, which in turn contributes to the perceived relevance of search results (p. 4). As such, “user satisfaction” is a primary goal of IR system design; evaluating rates of returning users can contribute to designers’ understanding of how effective their IR systems are in retrieving relevant search results (p. 7).
In more technical procedures, designers often run sets of queries within IR systems in order to evaluate measures of precision and recall. Precision metrics determine the proportion of retrieved search results that are relevant within the boundaries of a given query (Ceri et al., 2013, pp. 7-8). Recall metrics, by contrast, measure the proportion of relevant search results within one query against the totality of relevant documents within the larger system (pp. 7-8). Designers typically evaluate precision and recall in conjunction with one another when making structural adjustments to IR systems; given the subjectivity of relevance, achieving an ideal balance between the two can be difficult. For example, increasing selectivity in response to user queries may serve interests of precision by yielding a higher proportion of relevant search results, while sacrificing interests of recall by omitting additional, relevant search results (p. 8). This tension can be attributed to the difficulty of objectively determining what can be dismissed as “chaff”, in addition to the important consideration of mitigating information glut.
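The two measures can be made concrete with a small sketch, assuming hypothetical sets of retrieved and relevant documents (the document identifiers here are invented for illustration):

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

retrieved = {"doc1", "doc2", "doc3", "doc4"}  # results of one query
relevant = {"doc2", "doc3", "doc5"}           # all relevant docs in the system

print(precision(retrieved, relevant))  # 0.5   (2 of 4 retrieved are relevant)
print(recall(retrieved, relevant))     # 0.667 (2 of 3 relevant were retrieved)
```

The precision-recall tension described above is visible even here: dropping doc1 and doc4 from the result set would raise precision to 1.0 while leaving doc5 unretrieved, capping recall below 1.0.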
COMPETENCY DEVELOPMENT
Although I was vaguely familiar with keyword search strategies, and the purpose of metadata, before beginning this program, INFO 202 (Information Retrieval System Design) presented the beginning of my understanding of IR systems, and the considerations and processes which inform their construction. INFO 244 (Online Searching) remains one of the most valuable courses I’ve taken within this program, and gave me the tools to perform strategic searches within databases, as well as to evaluate the merits of IR systems relative to given information needs. INFO 254 (Information Literacy & Learning) facilitated the application of my knowledge of IR system querying and evaluation within the context of instructional design, while INFO 294 (Professional Experience: Internships) allowed me to expand this skillset to actual instructional content within the Portland State University Library. INFO 256 (Archives & Manuscripts) gave me insight into IR systems commonly used within archival professions, and their relative merits.
While I entered the MLIS with the intention of pursuing archival work, or music-related special librarianship, being introduced to database search strategies alongside instructional design was the impetus for my change of pathways. At this point, I am determined to become an academic librarian in an instructional capacity, and to promote efficient, effective search strategies with the hope of empowering students as they pursue research processes.
SELECTED ARTIFACTS
Through the original databases, pieces of written work, and video content included herein, I seek to demonstrate my understanding of IR system design, querying, and evaluation.
INFO 202 – Database: “Find My Shoes”
This video demonstrates a database I designed within the WebData Pro application, for purposes of locating different types of shoes. The database facilitates searches within four fields, with three field values apiece. Two of the fields are textbox fields; “type” corresponds to values of “athletic”, “sandal”, and “formal”, while “brand” corresponds to “New Balance”, “Birkenstock”, and “Cole Haan”. The remaining two fields are dropdown list fields: “color” corresponds to values of “blue”, “brown”, and “black”, while “size” corresponds to values of “10.5”, “11” and “11.5”. This database demonstrates my improved understanding of IR design, and the fields and field values which supply their foundation.
INFO 202 – Beta Prototype: Candles Database
This database is the result of a group project, in which I worked with four fellow students to build a database of candles, set detailed definitions for the assigned fields, and establish criteria for assigning field values. While I contributed little to the design of the database itself within WebData Pro, I was charged with indexing two candles within the database, creating the rules regarding the “burn time” and “country of manufacture” fields, and editing the statement of purpose. In addition to establishing my ability to work productively in a group setting, this database project reflects my understanding of the importance of clearly defining the fields and their values within a data structure.
INFO 244 – Presentation: “Utilizing SAGE Research Methods”
This presentation—generated via Microsoft PowerPoint and converted to video—aims to introduce users to a subscription database for use in academic research settings: namely, SAGE Research Methods. I outline the database’s scope, content, and authority, draw attention to its controlled vocabulary, and demonstrate its ability to effectively facilitate a subject search, using descriptors contained in its “methods map”.
INFO 244 – Database Review
This review reflects my ability to evaluate information retrieval systems according to multiple criteria, and assesses the scope, content, authority, purpose, usability, frequency of update, intended audience, and search features of two subscription databases: CQ Researcher and HeinOnline.
INFO 256 – Reference Resources Assignment
Within this assignment, I evaluate scope, contents, searchability, authority, and overall quality of six reference resources demonstrating varying levels of prominence within archival professions: UK National Archives, Library of Congress Finding Aids, OCLC WorldCat, ArchiveGrid, Mountain West Digital Archive, and Online Archives of California. For the sake of consistency, I evaluate each database with the use of the same search string: “Spanish Flu” OR (“Influenza Epidemic” AND 1918). This assignment demonstrates my ability to both query and evaluate information retrieval systems.
CONCLUSION
IR systems present the models and interfaces by which digital information is inputted and outputted. They are constantly used by users and information professionals alike: for ready reference, in-depth consultations, and beyond. As such, it is essential that academic librarians demonstrate knowledge of the design principles behind these systems, understand the merits of specific IR systems relative to given types of information needs, and possess the skills and strategies required to navigate them efficiently, effectively, and comprehensively. I look forward to further professional experience in the field of academic librarianship, which will allow me to apply subscription databases and related IR systems within the context of research and instruction services.
REFERENCES
Ceri, S., Bozzon, A., Brambilla, M., Della Valle, E., Fraternali, P., & Quarteroni, S. (2013). Web information retrieval. Springer.
Krug, S. (2014). Don’t make me think: A common sense approach to web usability. New Riders.
Mann, T. (2008). The Peloponnesian War and the future of reference, cataloging, and scholarship in research libraries. Journal of Library Metadata, 8(1), 53-100. https://doi.org/10.1300/J517v08n01_06
Mann, T. (2015). The Oxford guide to library research (4th ed.). Oxford University Press.
Weedman, J. (2019a). Designing for search. In V. M. Tucker (Ed.), Information Retrieval System Design: Principles & Practice (6th ed., pp. 118-139). AcademicPub.
Weedman, J. (2019b). Information retrieval: Designing, querying, and evaluating information systems. In V. M. Tucker (Ed.), Information Retrieval System Design: Principles & Practice (6th ed., pp. 6-20). AcademicPub.
Weedman, J. (2019c). User research. In V. M. Tucker (Ed.), Information Retrieval System Design: Principles & Practice (6th ed., pp. 251-261). AcademicPub.