Wednesday 27 October 2010

DITA session 4

Information retrieval

In this session we looked at Information retrieval and how it works.

We started by looking at different definitions of Information Retrieval (IR) and how there are three different views: the user view, wherein the user has a need for information; the system view, which looks at the IR system and its components; and the sources view, which is concerned with how the information sources are presented.

We then looked at information needs, and in particular searching on the web. We looked at a taxonomy of web searches which included navigational queries, where a user is trying to reach a particular page, for example a home page; transactional queries, where the user is looking for somewhere to carry out a task, e.g. shopping on Amazon; and informational queries, where a user is doing a general search on a subject without expecting one known result.

We also looked at multimedia search needs and a possible taxonomy for this style of searching, be it on the web or in a catalogue. This consists of known-item retrieval, where the user knows exactly what they want, for example ‘I want the song Enter Sandman by Metallica’; fact retrieval, where the user has some knowledge but not all, for example ‘who won the FA Cup in 1999’; subject retrieval, where you have a topic but not many facts to search on, for example ‘headline acts at Glastonbury’; and exploratory retrieval, where the user has very little knowledge of the subject and enters a broad topic like ‘what films do you have’. All of these schemes depend on the user’s knowledge and are very subjective, with searches often blending between the categories.

We then progressed to looking at the indexing of information to allow for accurate IR. For information to be retrievable, the media must be in a specific format, be that HTML, XML or MPEG; this allows the data to be processed correctly. Then you must identify the fields that will be searched, which allows users in, for example, a library catalogue to search by author, title, genre etc.

In text preparation the words must be analysed to make sure the text is searchable; this can be done automatically by computer and also includes handling non a–z characters and the use of numbers. The removal of stop words is an important step, as an index that is too full may not work properly, so high-frequency words such as ‘the’, ‘and’, ‘to’ and ‘be’ are usually taken out. Finally you have to look at stemming, where plurals and other suffixes have to be either included or excluded; for example, if you were to index the word ‘sun’ then it would make sense to also include ‘sunny’ and ‘sunshine’. Synonyms may also be considered, but this becomes tricky as they usually have to come from a controlled vocabulary, so some may be included while others are not.
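The text-preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real indexer: the stop-word list and suffix rules are invented for the example (a real system would use a much fuller list and a proper stemmer such as Porter’s).

```python
import re

# Hypothetical stop-word list and suffix rules, invented for illustration.
STOP_WORDS = {"the", "and", "to", "be", "a", "of", "in", "is"}
SUFFIXES = ("ing", "ly", "s")

def prepare_text(text):
    """Lower-case, tokenise, drop stop words, and strip common suffixes."""
    # Keep only runs of a-z and digits, handling the non a-z characters issue.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # remove high-frequency words
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[:-len(suffix)]  # very crude stemming
                break
        terms.append(token)
    return terms

print(prepare_text("The sunny days of sunshine"))  # → ['sunny', 'day', 'sunshine']
```

Note that this crude suffix stripping does not conflate ‘sunny’ and ‘sunshine’ with ‘sun’, which is exactly the kind of decision a real stemming scheme has to make.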

When looking at index structures we mostly looked at inverted file structures, as these are the most widely used and give the fastest results, which is the main point of IR.
Surrogates are documentary records of a document: for example, a bibliographic record is a surrogate as it holds metadata about the item without holding the item’s content, and the index is the list of surrogates; a library catalogue is an index as it holds the bibliographic records for the books the library holds. The keyword file contains the index terms, and the postings file contains, for each keyword, a list of the documents that contain it.
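A tiny inverted index can be sketched as a mapping from keywords to postings (sets of document ids). The document texts here are invented examples, and for simplicity there is no stop-word removal or stemming.

```python
from collections import defaultdict

# Hypothetical documents, each identified by an id (the surrogate's key).
documents = {
    1: "electronic music synthesis",
    2: "history of electronic instruments",
    3: "classical music theory",
}

def build_index(docs):
    """Map each keyword to the set of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)  # record this document under the keyword
    return postings

index = build_index(documents)
print(sorted(index["electronic"]))  # → [1, 2]
print(sorted(index["music"]))       # → [1, 3]
```

The keys of `postings` play the role of the keyword file and the id sets play the role of the postings file: a query never has to scan the documents themselves, which is where the speed comes from.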

We then went on to look at two different ways of searching: using a Boolean logic model and using natural language. Boolean queries use operators such as AND, NOT and OR to specify whether certain words must be included, excluded or may optionally appear; they sit between your search words and tell the engine what to include. For example, ‘Electronic AND music’ will only bring up results that include both words, ‘Electronic NOT music’ will only bring up results that do not contain the term ‘music’, and ‘Electronic OR music’ will bring up results that contain either. Another query device is the double quote, which means that all the words contained must appear in that order; for example, “Electronic music by Aphex Twin” will only bring up results containing that exact phrase. The results are then usually displayed to the user ranked by relevance.
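Over an inverted index, the Boolean operators map directly onto set operations on postings lists. The postings below are hypothetical document-id sets for the two example terms.

```python
# Hypothetical postings lists: ids of documents containing each term.
electronic = {1, 2, 4}
music = {1, 3, 4}

print(sorted(electronic & music))  # 'Electronic AND music' → [1, 4]
print(sorted(electronic - music))  # 'Electronic NOT music' → [2]
print(sorted(electronic | music))  # 'Electronic OR music'  → [1, 2, 3, 4]
```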
We then went on to look at natural language queries (NLQ) and how they use phrases and prose to define searches; this is most commonly used on the web but can give false results. If an NLQ is unsuccessful, users may add or delete terms, or switch to a Boolean query to narrow the search; this is called query modification, and some sites do part of it automatically by adding a ‘did you mean’ function to their search engine.

We then briefly looked at how you can evaluate users’ searches and the relevance of the documents they retrieve. To do this you first need a judgement of each document’s relevance, which is inherently subjective; this is often done by asking the user how helpful the document was. You can then look at how many of the retrieved documents were relevant and how quickly the system displayed them. We also looked at the relationship between precision and recall, and how there tends to be an inverse relationship between the two.
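Given a set of relevance judgements, the two measures are simple ratios: precision is the fraction of retrieved documents that are relevant, recall the fraction of relevant documents that were retrieved. The judgements below are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved;
    recall = relevant retrieved / all relevant."""
    hits = retrieved & relevant
    return len(hits) / len(retrieved), len(hits) / len(relevant)

# Hypothetical judgement: 8 documents retrieved, of which 1, 2 and 5 were
# judged relevant; the collection holds 6 relevant documents in total.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
relevant = {1, 2, 5, 9, 10, 11}
print(precision_recall(retrieved, relevant))  # → (0.375, 0.5)
```

The inverse relationship shows up here too: retrieving more documents can only raise recall, but the extra non-relevant hits tend to pull precision down.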
