Cognitive Issues and Approaches in Music Information Retrieval

David Huron
&
Bret Aarden


Introduction

Technological achievements in the past few years have inspired many individuals and corporations to dream of creating world-wide music distribution systems. In attempting to realize such dreams, a large number of issues arise. These issues include technical, legal, commercial, cultural, moral, and musical concerns. Some of the most important issues in music information retrieval are psychological in nature. These relate primarily to understanding human motivations for seeking music in the first place, and crafting psychologically potent ways of accessing the music.

Traditionally, musical information has been retrieved via standard reference information such as the name of the composer or the title of the work. Music librarians and cataloguers have devised detailed reference standards in order to maximize the potential utility of reference indices (e.g. AACR2, IASA rules, MARC format, etc.; see Miliano, 1999; Olson, 1997; University of Rochester, 1999). While this basic information remains important, standard reference tags have surprisingly limited value in most music-related queries. As we will see, different music users have extraordinarily diverse retrieval needs.

Desiderata for Music Information Retrieval

Discussion of design issues should begin with a clear vision of what an ideal music information retrieval system might look like. At least six desiderata might be identified.

  1. Access to all of the world's music: every sound recording, every score and manuscript, every film or video clip, every piece of music-related information.
  2. Access via any indexing method: by culture, by artist, by content, by social and psychological affect, by association, similarity, allusion, etc.
  3. A system that provides appropriate reimbursement to all who have contributed, without favoritism to particular countries, languages, cultures, companies, or individuals.
  4. An open system to which anyone might contribute and from which anyone might benefit.
  5. A system that is self-correcting and provides ways to assess and characterize the reliability and accuracy of the information.
  6. A system that respects and ensures privacy, and is sensitive to different cultural practices and traditions. [1]

Until the arrival of the Internet, such a system would have seemed unimaginable. Even as the huge potential of Internet-based music distribution is becoming apparent, however, the concomitant problems of cataloging and categorizing very large music databases are also coming to the fore.

The foremost goal of conventional reference information has been to uniquely identify some artifact or document. By providing a canonical reference (author, title, publisher, edition, etc.), users could be minimally confident that an artifact in hand truly corresponds to some cited source.

However, librarians and general users have long recognized that conventional reference catalogues (such as alphabetized author lists) do not provide the best points-of-entry for accessing items of interest. Experience with the world-wide web has established that the most common user queries do not involve looking for items whose existence is already known to the user. Rather, the most common query is to discover new items of which the user is ignorant. For these sorts of queries, classic reference information has proved to have a limited value.

In the case of music, many of the ways in which users wish to access information are highly subjective, such as finding music that is "similar" to other music, music that has a certain emotional content, or music in some specified style or texture (Cronin, 1997-98; Huron, 1989a; Orpen & Huron, 1992; Simpson & Huron, 1993). These subjective motivations highlight the need for more sophisticated indexing methods.

In attempting to build such musical indexes, three general questions arise: (1) What is the best taxonomic system by which to classify moods, styles, and other musical characteristics? (2) Given the extraordinary volume of existing musical materials, how do we go about assembling the taxonomic information in an accurate and economical manner? In particular, how might we create automated systems that will reliably characterize sound recordings, scores, etc.? (3) What is the best way for users to interact with the indexed information?

Musical Uses and Taxonomic Systems

Music is used for an extraordinary variety of purposes: the restaurateur seeks music that targets a certain clientele; the aerobics instructor seeks a certain tempo; the film director seeks music conveying a certain mood; an advertiser seeks a tune that is highly memorable; the conductor seeks music written for a particular instrumental combination; the physiotherapist seeks music that will motivate a patient; the truck driver seeks music that will keep him alert; the music teacher seeks music that will challenge the ears of her students; the casual listener seeks music that is novel, interesting, and enjoyable (see Huron, 1989b).

Although there are many other uses for music, music's preeminent functions are social and psychological. Consequently, we can expect that the most useful retrieval indexes will be those that facilitate searching according to such social and psychological functions. Typically, such indexes will focus on stylistics, mood, and similarity information. In each of these domains, a cognitive approach provides opportunities to better understand the salient issues and so better design appropriate access tools.

The form and content of indexes are also likely to be influenced by purely business-related factors. In the first instance, it is not clear what business model is likely to prove profitable. Record companies (the main owners of music) are currently in a dilemma with regard to Internet-based music distribution or sales. Record companies have traditionally maintained a collaborative relationship with retail distribution networks; however, selling music directly over the web would place the record companies in direct competition with their retailers. Consequently, most music distribution via the web has been initiated by small start-up companies that are independent of the majors. Major record companies find themselves squeezed between losing their market to nimble start-ups and disenfranchising their retailers by becoming direct competitors. We can expect the majors to watch carefully the developing web-based services, and to sign licensing agreements or acquire promising companies as the shape of future business patterns becomes clearer. Many start-ups have endeavored to avoid direct competition with either retailers or record companies by organizing themselves as service enterprises (i.e., business-to-business companies).

It is important to recognize that in one or two decades, methods for world-wide music distribution are likely to become entrenched, and future users may well wish that history had taken a different turn. Given the current uncertainty and flux, it behooves us to reflect carefully and act decisively in ensuring that our musical future is the best one possible.

Assembling Taxonomic Information

In assembling taxonomic information, a useful distinction can be made between proprietary (centralized) and non-proprietary (decentralized) databases. A taxonomic database may be said to be proprietary when the indexed materials themselves are under the control of the indexing service (e.g. Beatscape). The word "proprietary" here is not intended to indicate that the service owns the music. Typically the materials are accessed under a licensing agreement. A database may be said to be "non-proprietary" when the indexed materials are publicly accessible and dispersed across the Internet. This distinction between proprietary and non-proprietary databases may not be useful from a business perspective, but it helps clarify the manner in which music indexing is carried out.

For proprietary musical databases, categorization can be centrally managed, and manual indexing methods are often more feasible. However, past (though limited) experience with accessing text documents via the Internet implies that non-proprietary or decentralized indexing is likely to prove more popular and more successful. That is, future music indexes will likely resemble web-wide search engines (like Infoseek or Google) rather than closed proprietary systems (like Encyclopedia Britannica). A decentralized approach is likely to be more friendly to the third and fourth criteria identified above. That is, non-proprietary systems are less likely to favor certain countries, languages, cultures, companies, or individuals, and more likely to embrace musical materials from any quarter.

Manual (human-based) methods for characterizing music are currently used in both proprietary and non-proprietary music discovery services. Uplister employs DJs to catalogue its music, while MusicBuddha (MuBu.com) uses a combination of in-house music editors and external consultants. Other music discovery services, like Gigabeat and MongoMusic.com, rely on information gathered during users' interactions (Tate, 2000).

Perhaps the most extensive web-based method for gathering user-supplied music characterizations is exemplified by Jaboom.com -- a data-collection site run by MoodLogic Inc. Listeners are able to hear music in genres of their choice over the web, and are asked to provide descriptive ratings of the music according to various criteria (such as tempo, voice, and mood). Raters are attracted to the Jaboom site because of the opportunities to explore new music, but Jaboom provides further incentives through free CDs and gift certificates. In this way, thousands of listeners have been recruited to help build musically pertinent taxonomic information.

In the longer term, there are important reasons to consider automating the process of building music catalogues. Over the coming years, we can expect the number of music-related documents on the Internet to expand dramatically. The majority of these documents are likely to be of low commercial value and will fail to attract a popular following. Given the potentially enormous volume of materials, most documents are in danger of not being indexed at all. Since one of the great attractions of the Internet has been the egalitarian status of all documents, it would be regrettable if music indexing focussed exclusively on commercial or popular materials.

Using automated methods will increase the possibility of traversing the entire Internet, and so uncovering materials (even popular materials) that exist in remote or obscure locations. Finally, automated indexing methods might ultimately prove more reliable than human coders, although this speculative prediction will surely not be borne out in the short term.

Musical Web Crawlers

In common text-based indexing (as used in search engines like Alta Vista and LookSmart), web crawlers traverse the Internet and index the content of each document encountered. Web-wide music search engines would similarly require musical web crawlers that traverse the Internet and index all music-related documents encountered (most notably, sound files such as MP3, .wav and RMF files). The information gleaned from music web crawlers will then be used to assemble music-related indexes. Two issues are paramount in the design of music web crawlers. What kind of information can be collected? And what technology will enable the desired information to be deciphered from the musical document?

In the first instance, web crawlers can make use of any catalogue-related information that is tagged in the document itself. Consider, by analogy, the situation for HTML (text) documents. Within each document, it is possible to embed HTML meta-tags that describe the content of the document. Unfortunately, in the construction of HTML documents, document creators often fail to encode the pertinent tags, or are inconsistent in the manner in which the tagged information is encoded. In the world of text-document indexing, the web crawlers scan the entire document and carry out keyword in context (KWIC) indexing.
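
The fallback described above can be sketched in a few lines. The following Python fragment is only an illustration, not a description of any existing crawler: it collects whatever meta-tags an HTML document declares and flags the document for full-text indexing when none are found. The names MetaTagCollector and describe_document, and the needs_full_text_index flag, are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name, content = fields.get("name"), fields.get("content")
        if name and content:
            # Normalize tag names, since creators encode them inconsistently.
            self.meta[name.lower()] = content.strip()


def describe_document(html_text):
    """Return the embedded meta-tags and whether a full-text fallback is needed."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return {"meta": collector.meta, "needs_full_text_index": not collector.meta}
```

A document that declares a keywords tag would be indexed from that tag; a document with no tags at all would be passed on to whatever content-based indexing the crawler supports.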

In the case of music files, the MPEG-V specification defines many meta-tags that aid in the characterization of sound recordings. Once again, the effectiveness of these tags depends on the willingness of document creators to encode them, and the consistency (or lack of consistency) in the type of information that is tagged. In centralized music systems, one can ensure that all of the stored documents meet some pre-defined standard for the use of these meta-tags. However, in decentralized systems we would expect inconsistent use of these tags.

Automated Characterization

In indexing such decentralized musical materials, one must process the music in a manner analogous to keyword in context indexing. That is, one must open up the document (recording) and attempt to characterize the music from the sound function itself. A useful musical web crawler might attempt to decipher information such as sound quality, tempo, style, and mood.

Clearly, each of these functions requires an enabling signal processing technology. For example, when estimating the sound quality of a recording, the algorithm might measure the signal bandwidth, determine the level of background noise, and, for stereo recordings, estimate the quality of stereo imaging. Estimating sound quality might allow users to exclude documentary or archival recordings that might be of interest only to scholars.
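
By way of illustration, two of these measurements can be approximated very crudely. The Python sketch below assumes the recording has already been decoded into lists of floating-point samples; it estimates the background noise level as the ratio between the loudest and quietest frames, and stereo imaging as the correlation between channels. Bandwidth measurement is omitted, and the function names and the 256-sample frame size are assumptions made for the example rather than established practice.

```python
import math


def frame_rms(samples, frame_size):
    """RMS level of each complete, non-overlapping frame."""
    starts = range(0, len(samples) - frame_size + 1, frame_size)
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_size]) / frame_size)
        for i in starts
    ]


def signal_to_noise_db(samples, frame_size=256):
    """Ratio (in dB) of the loudest frame to the quietest frame.

    The quietest frame stands in for the noise floor, which is only
    reasonable if the recording contains at least one near-silent frame.
    """
    levels = frame_rms(samples, frame_size)
    if not levels:
        raise ValueError("recording is shorter than one frame")
    loud, quiet = max(levels), min(levels)
    if quiet == 0.0:
        return math.inf
    return 20.0 * math.log10(loud / quiet)


def stereo_correlation(left, right):
    """Pearson correlation between channels; values near 1.0 are effectively mono."""
    n = min(len(left), len(right))
    if n == 0:
        raise ValueError("empty channel")
    mean_l, mean_r = sum(left[:n]) / n, sum(right[:n]) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left[:n])
    var_r = sum((r - mean_r) ** 2 for r in right[:n])
    if var_l == 0.0 or var_r == 0.0:
        return 0.0
    return cov / math.sqrt(var_l * var_r)
```

A crawler might then exclude recordings whose estimated signal-to-noise ratio falls below some threshold, although any such threshold would need to be set empirically.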

In the case of tempo tracking, a number of pertinent algorithms already exist (e.g. Large, 1995; Scheirer, 1998). These software programs can not only characterize the overall tempo and meter of a piece, but they can also identify dynamic changes in tempo and meter over the course of the recording. Tempo alone can be useful in a number of ways, from assembling dance mixes to characterizing styles. The degree of rubato, for example, might be used to identify interpretive performances that can be deemed more or less "romantic".
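
Once a beat tracker has delivered a list of beat times, summary measures of tempo and rubato are straightforward to compute. The sketch below is not taken from the cited algorithms; it simply assumes that beat onset times (in seconds) are available, and it uses the coefficient of variation of the inter-beat intervals as one possible "rubato index". Both the function name and the choice of index are assumptions.

```python
import statistics


def tempo_and_rubato(beat_times):
    """Return (mean tempo in BPM, rubato index) from beat onset times in seconds.

    The rubato index is the coefficient of variation of the inter-beat
    intervals: 0.0 for a metronomic performance, larger for freer timing.
    """
    if len(beat_times) < 3:
        raise ValueError("need at least three beats")
    intervals = [later - earlier
                 for earlier, later in zip(beat_times, beat_times[1:])]
    mean_interval = statistics.mean(intervals)
    tempo_bpm = 60.0 / mean_interval
    rubato = statistics.pstdev(intervals) / mean_interval
    return tempo_bpm, rubato
```

Two performances at the same mean tempo can thus be distinguished by how much their beat intervals fluctuate.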

In the case of style, work by Perrott and Gjerdingen (1999) has shown that human listeners are able to estimate gross stylistic categories using only very short (less than one second) musical samples. This implies that frequency-domain measures alone might prove sufficient in estimating stylistic categories. It also suggests that a web crawler may need to collect only a brief sound sample in order to successfully characterize the style.

Another potentially useful characterization is that of mood. Listeners regularly report that the principal reason they listen to music is because of "how it makes me feel." Moods cut across boundaries of style and culture. As cross-cultural studies have demonstrated, there are several basic dimensions of emotion that appear to express themselves universally, in the same way that some (though not all) aspects of facial expression and body language can be universally understood (Ekman & Friesen, 1998).

Thayer (1989) has advocated a successful two-dimensional model of mood, where one dimension is arousal and a second dimension is stress. Increased arousal is associated with faster heart rate, increased blood pressure, faster respiration, and increased glucose uptake. Increased stress is associated with increased cortisol levels in the blood. This two-dimensional model might be characterized in terms of the four quadrants shown in Figure 1.

[Place Figure 1 here.]
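Fig. 1. Illustration of Thayer's two-dimensional model of mood. Arousal entails the body's capacity to act or perceive: increased arousal is associated with faster heart rate, increased blood pressure, faster respiration, and increased glucose uptake. Stress relates to negative valence responses of tension or threat. Increased stress is associated with increased cortisol levels in the blood. The four quadrants represent calm-energy (e.g. exuberance, euphoria), calm-tiredness (e.g. contentment, serenity), tense-energy (e.g. fight/flight or freeze response), and tense-tiredness (e.g. depression, crankiness, dysphoria).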

The quadrants might be described as exuberance/ecstatic (high arousal, low stress), frantic (high arousal, high stress), contentment (low arousal, low stress) and ominous/depressive (low arousal, high stress). Musical works or passages appear to be readily characterizable in terms of these two dimensions. A number of auditory properties are known to correlate with the two underlying dimensions, which raises the possibility of automatic characterization of music using Thayer's mood model.
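
If automated estimates of arousal and stress were available, assigning a passage to a quadrant would be a simple matter. The following sketch assumes that both estimates have been scaled to the range 0 to 1 and that 0.5 marks the boundary between "low" and "high"; the scaling, the threshold, and the function name are all assumptions made for the example.

```python
def thayer_quadrant(arousal, stress, threshold=0.5):
    """Map arousal and stress estimates (each scaled to 0..1) to a mood quadrant."""
    high_arousal = arousal >= threshold
    high_stress = stress >= threshold
    if high_arousal and not high_stress:
        return "exuberance"      # high arousal, low stress
    if high_arousal and high_stress:
        return "frantic"         # high arousal, high stress
    if not high_arousal and not high_stress:
        return "contentment"     # low arousal, low stress
    return "ominous/depressive"  # low arousal, high stress
```

The difficult part of the problem, of course, is producing reliable arousal and stress estimates from the audio in the first place.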

Taken together, it is possible that automated estimates of sound quality, tempo, style, mood and other features might provide a basic framework for more comprehensive and customizable forms of music classification.

User Interactions and Taxonomic Interfaces

Indexed information need not be presented or accessed in a single way. For any given taxonomic index, there are a multitude of ways of presenting this information, and of allowing users to negotiate through a "catalogue".

Music users often differ dramatically in their musical knowledge and sophistication. Some users have detailed knowledge of a repertory, are able to characterize styles and features reliably, or have advanced musical training that allows them to notate particular passages. Other users may be unable even to hum or sing a highly familiar song.

Appropriate user interfaces will need to cater to the full spectrum of users, from novice to sophisticated. For the expert, an elementary interface is apt to evoke frustration when the user's knowledge should otherwise permit rapid access to the items of interest. Conversely, complicated or incomprehensible interfaces are likely to leave more novice users confused.

User interfaces can be designed that allow users to access the same information through a variety of approaches. Consider, by way of example, rhythmic classifications. For the musician, meters can be readily characterized as simple or compound, duple, triple, or quadruple. These labels are largely meaningless for non-musicians. However, one might provide a two-dimensional "rhythm space" consisting of a simple/compound dimension and a duple(quadruple)/triple dimension, and allow users to traverse through the space with a cursor. For the benefit of novice users, exemplar rhythms could be produced as the cursor rolls over various rhythm regions.
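
A minimal sketch of such a rhythm space follows. It assumes the cursor position is reported as a pair of coordinates between 0 and 1, with the horizontal axis running from simple to compound and the vertical axis from duple (or quadruple) to triple. The exemplar meters are conventional textbook instances of each category; the function name and coordinate scheme are hypothetical.

```python
# One familiar time signature for each region of the rhythm space.
EXEMPLAR_METERS = {
    ("simple", "duple"): "2/4",
    ("simple", "triple"): "3/4",
    ("compound", "duple"): "6/8",
    ("compound", "triple"): "9/8",
}


def rhythm_region(x, y):
    """Translate a cursor position (x, y in 0..1) into a meter category.

    x < 0.5 selects simple meters, otherwise compound;
    y < 0.5 selects duple (or quadruple) meters, otherwise triple.
    """
    division = "simple" if x < 0.5 else "compound"
    grouping = "duple" if y < 0.5 else "triple"
    return division, grouping, EXEMPLAR_METERS[(division, grouping)]
```

An interface could play a short rhythm in the returned meter as the cursor enters each region, so that the novice never needs to see the labels at all.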

Musical Thumbnails and Automated Music Summarization

One of the most important forms of user feedback for music is to hear a short representative segment of a musical work. Before downloading or streaming an entire work, there is great benefit to hearing a brief illustrative excerpt -- a musical equivalent of the "thumbnails" commonly used in electronic picture galleries. Not all portions of a musical work are equally representative of a given piece, and so the practice of extracting the initial few seconds ("incipit") is not optimum for identifying or recognizing a work. Most well-known musical themes do not appear within the first ten seconds of a recording. Moreover, the optimum musical thumbnail may consist of two or more brief passages edited into (say) a single five-second sound bite.

Figure 2 provides a conceptual illustration of the goal of music summarization. The intention is to create a brief précis consisting of one or more passages that are somehow representative or evocative of the work in question. Typically, this will include the main themes or "hooks" in a piece of music. Suitable passages could be identified manually, but unlike other forms of music characterization, editing a musical thumbnail is highly labor intensive. There are good incentives to try to generate musical summaries automatically.

[Place Figure 2 here.]
Fig. 2. Schematic illustration of music summarization. One or more representative segments of a work are assembled into a musical précis or sound-bite. Summarization allows users to sample a musical work without having to download an entire sound file.

One way of viewing this problem is as a musical equivalent of the problem of text summarization (Mani & Maybury, 1999). One common approach in text summarization is to extract sentences that are highly characteristic of the document itself, but highly uncommon compared with a large corpus of ordinary text. For example, in a text pertaining to mortgage re-financing, the phrase "mortgage re-financing" is apt to appear frequently within the document, but occurs relatively rarely in ordinary English text. Selecting exemplar sentences from the text involves choosing sentences that contain phrases that have a high within-document entropy, but a low between-document entropy.
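
The idea can be illustrated with a simple frequency-based score. Note that the sketch below does not compute entropies; it substitutes a TF-IDF-style weighting, in which a word counts for more the oftener it occurs in the document and the fewer background documents contain it. The function names and the particular smoothing are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def pick_exemplar_sentence(document, background_docs):
    """Return the sentence whose words are common in the document but rare elsewhere."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    doc_counts = Counter(tokenize(document))

    # Document frequency: how many background documents contain each word.
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(background_docs)

    def word_weight(word):
        # Smoothed inverse document frequency, never negative.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        return doc_counts[word] * idf

    def sentence_score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Average rather than sum, so long sentences are not favored.
        return sum(word_weight(w) for w in words) / len(words)

    return max(sentences, key=sentence_score)
```

Given a short document about mortgages and a background of unrelated sentences, the highest-scoring sentence is one that mentions "mortgage", since that word is frequent in the document and absent from the background.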

A similar approach might be used in the case of music summarization. The goal would be to identify passages that are stereotypic within the music, but highly distinctive compared with other musical works.

In designing various music summarization schemes, a critical issue is how one evaluates the success of any given approach. How do we know that some algorithm has selected the best musical passages?

A useful conceptual approach to identifying salient musical passages is provided by prototype theory (Rosch & Lloyd, 1978). According to this theory, some objects within a class of objects are typically regarded as better representatives of the class in general. Consider, for example, the class of all birds. A sparrow is more typically a bird than a penguin is. Psychologists have established methods for determining which objects are construed by people as more prototypic. For example, suppose we read aloud the following list of birds: "Goldfinch, starling, blue jay, cardinal, eagle, pigeon." We then ask our listeners to recall whether certain birds were on the list. For example, was ostrich on the list? Was robin on the list? People are likely to correctly recognize that ostrich was not on the list, but are likely to mistakenly claim that robin was included on the list. That is, there is a tendency to falsely claim the presence of a prototypic object in a recollection task.

Another characteristic of prototypicality can be found in asymmetrical judgments of similarity. For example, the color red is more prototypic than the color pink. If viewers are asked to judge the similarity of these two colors in an ordered task, viewers are more likely to claim that pink is more similar to red than red is to pink. Similarly, a chicken (less prototypic) will be judged more similar to a pigeon (more prototypic) than a pigeon is to a chicken.

These two properties of prototypicality also apply in the case of musical passages. For example, themes are judged as more prototypical of a musical work than non-thematic (or variation) passages. Using this approach, an objective empirical method can be used to provide estimates of the prototypicality of all segments in a musical work.

Specifically, prototypical musical passages should tend to (1) be more easily recalled by listeners from one day to the next, and (2) generate false-positive recognition responses (i.e., listeners should incorrectly claim to have heard the passage before, having been exposed to other excerpts from the same work). Systematically collecting such data would allow us to construct databases that provide a test suite with which to assess the effectiveness of different music summarization algorithms.
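
As a sketch of how such data might be tabulated, the fragment below computes, for each segment, the proportion of correct recognitions and the proportion of false-positive responses. The trial format (segment label, whether the segment was actually presented earlier, whether the listener claimed to have heard it) is a hypothetical one chosen for the example.

```python
from collections import defaultdict


def recognition_rates(trials):
    """Per-segment (hit rate, false-alarm rate) from recognition trials.

    Each trial is a (segment, was_presented, said_heard) tuple. A rate is
    None when no trials of the relevant kind exist for that segment.
    """
    counts = defaultdict(
        lambda: {"hits": 0, "old": 0, "false_alarms": 0, "new": 0})
    for segment, was_presented, said_heard in trials:
        c = counts[segment]
        if was_presented:
            c["old"] += 1
            c["hits"] += int(said_heard)
        else:
            c["new"] += 1
            c["false_alarms"] += int(said_heard)

    rates = {}
    for segment, c in counts.items():
        hit_rate = c["hits"] / c["old"] if c["old"] else None
        false_alarm_rate = c["false_alarms"] / c["new"] if c["new"] else None
        rates[segment] = (hit_rate, false_alarm_rate)
    return rates
```

Segments with both a high hit rate and a high false-alarm rate would be candidates for the most prototypical passages, and a summarization algorithm could be scored by how often it selects them.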

Characterizing Users

An alternative approach to music discovery services is to characterize listeners rather than characterizing musical works. A popular approach has been to have users identify works they like. Musical suggestions can then be made by identifying works that are rated highly by other people who share similar tastes. One of the attractions of this approach is that it lends itself well to neural network and relational database technology. This approach has proved successful in making music, film and book recommendations (e.g. Allmusic.com, Movielens.com, Amazon.com). A significant problem with this approach is that it makes it impossible for users to encounter novel materials that have not been rated by someone else having a similar taste. The approach inherently establishes a threshold: works having limited popularity end up being excluded entirely from the system.
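
The threshold effect is easy to see in even the simplest recommendation scheme. The sketch below uses neither neural networks nor a relational database; it is a plain set-overlap (Jaccard similarity) method, offered only to make the logic concrete, and it does not describe how any of the named services work. The function name and data layout are assumptions.

```python
def recommend(target, likes):
    """Suggest works liked by users whose tastes overlap with the target user's.

    likes maps each user to the set of works that user has rated favorably.
    """
    mine = likes[target]
    scores = {}
    for user, theirs in likes.items():
        if user == target:
            continue
        union = mine | theirs
        # Jaccard similarity: shared likes as a proportion of all likes.
        similarity = len(mine & theirs) / len(union) if union else 0.0
        for work in theirs - mine:
            scores[work] = scores.get(work, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Works endorsed only by dissimilar users (score 0) are never suggested.
    return [work for work, score in ranked if score > 0]
```

A work can appear in the output only if some user with overlapping tastes has already rated it; a recording that nobody has rated never enters the scores table at all.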

It is probably the case that focusing on a reduced repertory of the most popular works will prove to be the most profitable business model and the most convenient for typical users. However, the approach fails to meet desiderata 1 and 3. That is, the approach foils the egalitarian status of Internet documents and inherently facilitates the domination of particular cultures, languages, styles, etc.

In short, characterizing the user rather than characterizing the music raises the unsavory possibility of musical tyranny by the majority. To be fair, this is effectively the current (non-Internet) situation for music distribution. But the Internet provides opportunities for a richer and more pluralistic musical future. It is okay for peripheral works to remain on the periphery, but peripheral works should not, a priori, be excluded from the party. For this reason we regard music characterization as more deserving of attention than user characterization.

Conclusion

A world-wide system of Music Information Retrieval could transform the experience of music in many positive ways. It has the potential to better match musical needs with musical materials, to expand the listening habits of people, to make marginalized music more accessible, and to extend the everyday venues in which music might enhance the quality of life. At the same time, such a system raises onerous technical and psychological challenges.

One of these challenges is to identify taxonomic and cataloging criteria that best serve human needs. Another is to actually build the pertinent indexes. In particular, for decentralized systems, a standardized labeling system is unlikely to prevail. Given the extreme size of the expected materials, automated methods are apt to be the only long-term solution.

In this article, we have provided some suggestions as to how music cognition and perception research might contribute to creating the enabling technologies that will be used in future music web crawlers. Current research has sketched out a path to that technology, and although the potential is evident, the necessity of further investment into psychological research is equally clear. Each step along that path will gradually bring us closer to the desiderata for a comprehensive, flexible, and open music information retrieval system.

Footnotes

[1] An illustration of different cultural practices can be found in Native American music. In many Native American tribes, a system of rights and obligations has existed with respect to the singing of songs. The right to sing some songs was reserved for certain individuals. In some cases, only a single person was permitted to sing a given song. In other cases, the right to sing certain songs was reserved exclusively for members of a particular society or club. Often, songs were exchanged or purchased by individuals, or given as gifts. The right to sing some songs was sometimes passed down through hereditary lines. Public distribution of such music raises ethical issues if it encourages performances by unauthorized singers. Pertinent discussions regarding ownership in traditional cultures may be found in McCann (2001) and UNESCO (1989).

References

Cronin, C. (1997-98). Concepts of melodic similarity in music-copyright infringement suits. Computing in Musicology, Vol. 11, pp. 187-209.
Ekman, P., & Friesen, W.V. (1998). Constants across culture in the face and emotion. In: Jenkins, Oatley, & Stein (eds.), Human Emotions: A Reader. Malden, MA: Blackwell, chapter 7.
Huron, D. (1988). Error categories, detection and reduction in a musical database. Computers and the Humanities, Vol. 22, No. 4, pp. 253-264.
Huron, D. (1989a). Characterizing Musical Textures. Proceedings of the 1989 International Computer Music Conference, San Francisco: Computer Music Association, pp. 131-134.
Huron, D. (1989b). Music in advertising: an analytic paradigm. Musical Quarterly, Vol. 73, No. 4, pp. 557-574.
Huron, D. (2001). Data mining large musical databases. Paper presented at the 2001 AAAS Meeting. American Association for the Advancement of Science, 167th Program, p. A44.
Kalva, H. (2000). Delivering MPEG-4 Based Audio-Visual Services. Dordrecht, Netherlands: Kluwer Academic Publishers.
Large, E. W. (1995). Beat tracking with a nonlinear oscillator. In Working Notes of the IJCAI Workshop on AI and Music, pp. 24-31.
Mani, I. & Maybury, M. T. (eds.) (1999). Advances in Automatic Text Summarization. Cambridge, MA: MIT Press.
McCann, A. (2001). All that is not given is lost: Irish traditional music, copyright, and common property. Ethnomusicology, Vol. 45, No. 1, pp. 89-106.
Miliano, M. (ed.) (1999). The IASA Cataloging Rules: A Manual for the Description of Sound Recordings and Related Audiovisual Media. International Association of Sound and Audiovisual Archives.
Olson, N. (ed.) (1997). Cataloging Internet Resources: A Manual and Practical Guide. Dublin, OH: OCLC, Online Computer Library Center, 2nd edition. http://www.oclc.org/oclc/man/9256cat/toc.htm.
Orpen, K. & Huron, D. (1992). The measurement of similarity in music: A quantitative approach for non-parametric representations. Computers in Music Research, Vol. 4, pp. 1-44.
Perrott, D. & Gjerdingen, R. (1999). Scanning the Dial. Presentation at the 1999 Society for Music Perception and Cognition Conference, Evanston, IL.
Rosch, E. & Lloyd, B. (eds.) (1978). Cognition and Categorization. Hillsdale, NJ: L. Erlbaum Associates.
Scheirer, E. D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, Vol. 103, No. 1, pp. 588-601.
Simpson, J. & Huron, D. (1993). The perception of rhythmic similarity: A test of a modified version of Johnson-Laird's theory. Canadian Acoustics, Vol. 21, No. 3, pp. 89-90.
Tate, R. (2000). Fledgling music sites jockey for position. Upside magazine, Dec. 7.
Thayer, R.E. (1989). The Biopsychology of Mood and Arousal. New York: Oxford University Press.
UNESCO. (1989). Recommendation on the Safeguarding of Traditional Culture and Folklore. Paris, UNESCO. (Also available at http://www.unesco.org/webworld/com/compendium/5414.html)
University of Rochester. (1999). Cataloging guidelines for Internet Resources - the CAT site. http://www.lib.rochester.edu/cat/



This document is available at http://dactyl.som.ohio-state.edu/Huron/Publications/huron.aarden.MIR.html