Colin's DITA blog: October 2010

Saturday, 30 October 2010

Coursework post

Evaluating and employing appropriate technologies for the digital representation of information.

There are many ways of digitally representing information ranging from a simple webpage through to a complex database containing many tables of information. The method you use depends on the purpose of the information and the information need of the user.

From the sessions on Web 1.0 I feel that each of the methods, from HTML web pages to complex databases queried with SQL and the process of “Information Retrieval” (IR), is equally useful, but only when applied appropriately.

To display information that will not change a great deal, such as a simple list of contact numbers or addresses, an HTML web page is a very useful form of digital representation. HTML is a useful tool because it is accessible to anyone with Internet access, is relatively simple to write, and is fairly easy to maintain and update. Of course, if the information contains a large amount of data, for example lists of employees, then a form of IR or a database is more suitable. However, for a small business with fewer than, say, 10 employees, HTML would be a much simpler way of storing basic contact details. Another use of an HTML web page is as a public-facing data sheet. For example, it could hold all the contact details, company information and any other text, and with the use of CSS style sheets it can be made to look very interesting.

When you have a large amount of information that you own, and can therefore manipulate, a database is by far the best solution: it allows users to access centrally held information, which could be in many formats, and makes it easy to develop queries that pull specific information from the database. In 1993 the UK Government published the following definition:

“SQL is the industry accepted Interface between applications and relational databases and is increasingly used to access non-relational data. It is therefore an important tool in achieving data integration across different databases.” CCTA 1993 (pg5)

The problem with this method is that SQL is a difficult way of creating queries: it depends on pinpoint accuracy from the user, who must write each query with exactly the correct terminology and spelling, and it relies heavily on a controlled vocabulary, all of which can lead to problems. The plus points are that it allows multi-user interaction and can be tailored to suit individuals’ needs. The Government Centre for Information Systems says of SQL implementations:

“SQL is a suitable language for applications requiring to store and manipulate data that can be represented as tables. Generally, implementations of SQL are targeted towards supporting larger, multi-user applications based on mainframes, mini-computers or large workstations.” CCTA 1993 (pg 15)

The main difference between SQL databases and Information Retrieval (IR) is that databases hold a large amount of data in many tables and can give accurate answers to specific questions, whereas an IR system can hold just as much information, organised into surrogates stored in tables, which can then be searched using non-specific keywords and natural language. Rosenfeld and Morville say that:

“The database model is best applied to subsets or collections of structured, homogenous information within a broader website.” Rosenfeld and Morville 1998 (pg41)

Therefore, when the information you have is not your own or is fairly unstructured, the best method of retrieving it is IR. This is where the user inputs search terms into some kind of search engine, either as a natural language query or using a Boolean system, to search the system’s data for what they are looking for.

For example, a library catalogue would not use an SQL database, because the information searched for would have to be specified exactly. It would use a form of IR, where users can search according to their knowledge base, whether for an exact item, a vague subject, or anything in between, and still retrieve adequate and accurate information at a much higher speed than with a database.

“Many studies indicated that users of information systems aren’t members of a single minded monolithic audience who want the same kinds of information delivered in the same ways. Some want just a little information, while others want detailed assessments of everything there is to know on a topic” Rosenfeld and Morville 1998 (pg102)

Managing data with appropriate information technologies.

For this section I am going to look at search methods in music information retrieval (MIR): how the different kinds of search can be used on the web to find a particular piece of information (song title/artist/composer), and how similar searches could be implemented on more specific music information resources such as a music organisation’s archive.

From what I have learnt about textual search methods, I feel that natural language queries (NLQ) would be the most common search method in MIR for the average (non-scholarly) music enthusiast. If their first NLQ were unsuccessful, they would then use Boolean operators to modify the search, as this allows more specific results to be returned in a type of search where many responses could well come up. An important point is that users have varying levels of knowledge of the subject matter they are searching for, and will therefore use either a known item search (KIS), a fact search or a subject search to find the information they want, which affects their search terms.
“People often first encounter new music from various media (e.g., a movie, TV commercial, radio) without ever knowing the artist name and song title, or they often forget the information over the course of time. This brings up challenges to known-item-searches for music as those users must attempt to describe the sought music other than using the bibliographic information.” Jin Ha Lee 2010 (pg1025)
An example of this is a web search for the lyric ‘dress this city in flames’: Google brings up the correct song, but gives a different site as the main result. To obtain a more accurate search, Boolean operators could be used. In a search for ‘dress this city in flames AND lyrics’, the ‘AND lyrics’ means that every result page must also contain the word lyrics, so all the results should be song lyrics, and from there the user can find the song they are looking for. This type of search is a fact search, as the user knows a fact, and therefore has some knowledge of the subject, and wants to find more information.

NLQ and Boolean operators are, I feel, not only very good methods of fact searching but could also be used very successfully for a subject search. For example, a search for UK Hip-Hop could be refined using UK AND Hip-Hop. A KIS will more often than not be an NLQ, as the user knows what they are looking for and will search for it directly; for example, searching for “Paris in Flames by Thursday” will bring up the exact song. This method of KIS would be used on a site like iTunes, whose search also suggests the most likely answer as you type, speeding up the search; speed, along with accuracy of results, is among the most important aspects of IR.
However, problems can occur when the user only assumes they know something. In systems like iTunes the suggestion of titles can be extremely helpful, but if the user thought the song ‘Paris in Flames’ by Thursday was by Weezer and called ‘City in flames’, the results returned would be inaccurate, and the user would have to fall back on a fact search or subject search, which are not supported on a system like iTunes.
“If someone were looking for music they previously heard, but all of the information they think is relevant to finding the item and is attempting to use in the search is incorrect, the search does seem to be a known-item search yet it is difficult to say the user really knows the object beyond its existence.” Jin Ha Lee 2010 (pg1025)
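The suggest-as-you-type behaviour can be sketched in a few lines of Python. This is only a minimal illustration of prefix matching under my own assumptions: the `suggest` function and the `catalogue` list are hypothetical, and I am not claiming this is how iTunes actually works. It does show why a misremembered title gets no help from this kind of search.

```python
def suggest(prefix, titles, limit=5):
    """Return up to `limit` titles that start with the typed prefix, ignoring case."""
    typed = prefix.strip().lower()
    if not typed:
        return []
    matches = [t for t in titles if t.lower().startswith(typed)]
    return sorted(matches)[:limit]

# Hypothetical catalogue of song titles (an assumption for this sketch).
catalogue = ["Paris in Flames", "Paradise City", "Enter Sandman", "City of Angels"]

# Typing the start of a correctly remembered title narrows the list quickly.
print(suggest("par", catalogue))

# A misremembered title matches nothing, so the known-item search fails.
print(suggest("city in flames", catalogue))
```

Because the match is on the start of the title only, the user who types the wrong title never sees the right song, which is the situation Lee describes.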

On more specific music sites further search methods can be employed, but because of the varying degree of knowledge between the average music searcher and the academic researcher, the more advanced systems will often be aimed at the academic researcher, who might wish to access a more complex collection that holds musical scores or even the music itself. This opens up search systems more complex than text searches, for example being able to search using the intervals or scales that occur in pieces of music to find a collection of related material.
This method is described in a paper by Peter van Kranenburg et al. at Utrecht University, discussing folk song research in 2009:
“The Colonial Music Institute which promotes research in early American music and dance, offers an index for about 75,000 instrumental and vocal pieces from the period 1589-1839 (sic), including social dance tunes and songs. From each melody an incipit is present in the database. There are three way to browse these incipits: a sequence of scale degrees of all notes a sequence of scale degrees of stressed notes, and a sequence of intervals” (pg 27)
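To make the idea of searching by a sequence of intervals more concrete, here is a rough Python sketch. It assumes each incipit is stored as a list of MIDI note numbers, which is my own simplification; the names `intervals`, `find_by_intervals` and the sample tunes are hypothetical and are not taken from the index described in the paper.

```python
def intervals(pitches):
    """Turn a list of MIDI note numbers into the semitone steps between neighbours."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def find_by_intervals(query_pitches, incipits):
    """Return names of incipits whose interval sequence starts with the query's."""
    wanted = intervals(query_pitches)
    return [name for name, pitches in incipits.items()
            if intervals(pitches)[:len(wanted)] == wanted]

# Hypothetical incipits stored as MIDI note numbers (60 = middle C).
incipits = {
    "tune_a": [60, 62, 64, 65, 67],
    "tune_b": [60, 60, 67, 67, 69],
    "tune_c": [65, 67, 69, 70, 72],  # tune_a transposed up a fourth
}

# The query is hummed in a different key, but its intervals (+2, +2) still match.
print(find_by_intervals([62, 64, 66], incipits))
```

Searching on intervals rather than on the notes themselves means a tune is found whatever key the user remembers it in, which is presumably part of the appeal of this kind of index.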

In conclusion, I believe that IR is a very useful tool in music libraries, bibliographies, databases and music web searches. The methods that I feel suit MIR best in these scenarios are NLQ and Boolean operators, which should both be included in any form of music catalogue. As shown in my example of web searching, the user’s perception of their own level of knowledge affects the success of their searches and can cause poor results to be retrieved; any search system that does not allow Boolean operators would therefore be severely hindered, as the user loses the ability to modify and clarify a query to improve the results.
The use of non-text searches of music catalogues is very interesting and really goes beyond the scope of this paper, but if it could be built into a normal search of a catalogue or the web it would lead to much higher accuracy and relevance in the information retrieved.

References

CCTA., 1993. Database language SQL explained. London: HMSO.

van Kranenburg, P., Garbers, J., Volk, A., Wiering, F., Grijp, L. P. and Veltkamp, R. C., 2010. Collaboration Perspectives for Folk Song Research and Music Information Retrieval: The Indispensable Role of Computational Musicology. Journal of Interdisciplinary Music Studies, 4 (1), 17-48. Available from: http://www.musicstudies.org/JIMS2010/Kranenburg_JIMS_10040102.pdf [Accessed 27th October 2010].

Lee, Jin Ha ., 2010. Analysis of User Needs and Information Features in Natural Language Queries Seeking Music Information. Journal of the American Society for Information Science and Technology, 61 (5), Available from: http://0-web.ebscohost.com.wam.city.ac.uk/ehost/detail?vid=4&hid=105&sid=0c6a4d05-6167-42ba-a6bd-9e2c29d95c42%40sessionmgr111&bdata=JnNpdGU9ZWhvc3QtbGl2ZQ== - db=eoah&AN=21523896 [Accessed 27th October 2010].

Rosenfeld, L. and Morville, P., 1998. Information Architecture for the World Wide Web. Sebastopol: O’Reilly & Associates.

http://colinbeard.blogspot.com/ last accessed 28/10/10

Wednesday, 27 October 2010

DITA session 4

Information retrieval

In this session we looked at Information retrieval and how it works.

We started by looking at different definitions of Information Retrieval (IR) and how there are three different views: the user view, in which the user has a need for information; the system view, which looks at the IR system and its components; and the sources view, which concerns the presentation of owned information.

We then looked at information needs, and in particular at searching on the web. We looked at a taxonomy for web searches, which included navigational queries, where a user wants to find, for example, the home page for something; transactional queries, where the user is looking for a place to use a service, e.g. Amazon; and informational queries, where a user is doing a general search on a subject without expecting a known result.

We also looked at multimedia search needs and a possible taxonomy for this style of searching, be it on the web or in a catalogue. This consists of known item retrieval, where the user knows what they want, for example ‘I want the song Enter Sandman by Metallica’; fact retrieval, where the user has some knowledge but not all, for example ‘who won the FA Cup in 1999’; subject retrieval, where you have a topic but not many facts for the search, for example ‘headline acts at Glastonbury’; and exploratory retrieval, where the user has very little knowledge of the subject and enters a broad topic like ‘what films do you have’. All of these categories depend on the user’s knowledge base and are very subjective, with searches often blending between them.

We then progressed to looking at the indexing of information to allow for accurate IR. For information to be retrievable, the media must be in a specific format, be that HTML, XML or MPEG, so that the data can be processed correctly. Then you must identify the fields that will be searched; this allows users of, for example, a library catalogue to search authors, titles, genre etc. Then, in text preparation, the words must be analysed to make sure the text is searchable. This can be done automatically by computer, and includes deciding how to treat non a-z characters and numbers. The removal of stop words is an important step, as a word index that is too full may not work properly, so high-frequency words such as the, and, to and be are usually taken out. Finally you have to look at stemming, where plurals and other suffixes have to be either included or excluded; for example, if you were to index the word sun then it would make sense to include sunny and sunshine. Synonyms may also be considered, but this becomes tricky as they usually have to come from a controlled vocabulary, so some may be included while others are not.
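The text preparation steps can be sketched roughly in Python. The stop word list and the suffix rules below are tiny made-up examples (assumptions of mine, not what we used in class), and the stemming is deliberately crude: it would not, for instance, link sunny back to sun, which is the job of a proper stemming algorithm such as the Porter stemmer.

```python
import re

STOP_WORDS = {"the", "and", "to", "be", "a", "of", "in", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "s")  # crude stemming rules, assumed for this sketch

def stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def prepare(text):
    """Lower-case, keep only runs of letters and digits, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(prepare("The songs and the singing of the 1990s bands"))
# ['song', 'sing', '1990', 'band']
```

Only four index terms survive from a nine-word sentence, which gives some feel for how much smaller the index becomes once stop words are removed.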

When looking at index structures we mostly looked at inverted file structures, as these are the most widely used and give the fastest results, which is the main point of IR.
Surrogates are documentary records of a document; for example, a bibliographic record is a surrogate, as it holds metadata about the item without holding the item’s content. The index is the list of surrogates; for example, a library catalogue is an index, as it holds the bibliographic records for the library’s books. In an inverted file, the keyword file contains the index terms and the postings file lists, for each keyword, the documents that contain it.
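As a rough illustration of the keyword and postings idea, an inverted file can be sketched in Python as a dictionary from each keyword to the list of documents containing it. The three sample documents and the function name `build_inverted_index` are made up for this sketch, and the text is split on spaces only, with none of the preparation steps applied.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the sorted list of document ids that contain it (its postings)."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical documents keyed by id.
documents = {
    1: "electronic music history",
    2: "folk music archive",
    3: "electronic circuits",
}

index = build_inverted_index(documents)
print(index["music"])       # [1, 2]
print(index["electronic"])  # [1, 3]
```

The speed comes from the fact that a search only has to look up the keyword and read off its postings, rather than scanning every document.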

We then went on to look at the two different ways of searching: using a Boolean logic model and using natural language. Boolean searching uses the operators AND, OR and NOT, placed between your search words, to tell the engine whether terms must be included, may be included or must be excluded. For example, ‘Electronic AND music’ will only bring up results that include both words, ‘Electronic NOT music’ will only bring up results that contain electronic but not music, and ‘Electronic OR music’ will bring up results that contain either. Another type of entry is the double quote, which means that all the words enclosed must appear together in that order; for example, “Electronic music by Aphex Twin” will only bring up results that contain that exact phrase. The results are then usually displayed to the user ranked in order of relevance.
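The three Boolean operators map neatly onto set operations, which can be sketched in Python as follows. The `postings` data and the helper `docs` are hypothetical; a real search engine would do a great deal more (phrase matching and ranking, for instance).

```python
# Hypothetical postings: keyword -> set of ids of the documents containing it.
postings = {
    "electronic": {1, 3, 4},
    "music": {1, 2, 4},
}

def docs(term):
    """Look up the documents for a term; an unknown term matches nothing."""
    return postings.get(term.lower(), set())

both = docs("Electronic") & docs("music")     # Electronic AND music
either = docs("Electronic") | docs("music")   # Electronic OR music
without = docs("Electronic") - docs("music")  # Electronic NOT music

print(sorted(both))     # [1, 4]
print(sorted(either))   # [1, 2, 3, 4]
print(sorted(without))  # [3]
```

AND gives the smallest and most specific set, OR the largest, which is why adding AND terms is the usual way to narrow a search.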
We then went on to look at natural language queries (NLQ) and how they use phrases and prose to define searches. This is the most common kind of query on the web, but it can give false results. If an NLQ is unsuccessful, users may add or delete terms, or switch to a Boolean query to narrow the search; this is called query modification, and some sites do it automatically by adding a ‘did you mean’ function to their search engine.

We then briefly looked at how you can evaluate users’ searches and the relevance of the documents they retrieve. To do this you first need a judgement of each document’s relevance, which is entirely subjective and is often obtained by asking the user how helpful the document was. You can then look at how many of the retrieved documents were relevant and how quickly the system displayed them. We also looked at precision (the proportion of retrieved documents that are relevant) and recall (the proportion of relevant documents that are retrieved), and how there tends to be an inverse relationship between the two.
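Precision and recall are easy to work out once you know which documents were retrieved and which were relevant. The Python sketch below uses made-up document ids, and the function name `precision_recall` is my own; it compares a narrow search with a broad one to show the trade-off.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgement: documents 1 to 4 are the relevant ones.
relevant = {1, 2, 3, 4}

narrow = precision_recall([1, 2], relevant)                  # few results, all relevant
broad = precision_recall([1, 2, 3, 4, 5, 6, 7, 8], relevant)  # everything found, plus noise

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```

The narrow search is all signal but misses half of what exists, while the broad search finds everything but half of what it returns is noise.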

Tuesday, 26 October 2010

DITA session 3

Databases

In this session we talked about databases and SQL.
The first point was how, before databases were commonplace, people’s information needs within a company were hindered because each department stored its own information in its own way, so if another department needed that information it would be difficult to get hold of and possibly in an incompatible format. This led to redundancy in information and inconsistency in people’s data.

A database allows data to be stored in a central place and gives users access to it (via a database management system) from many locations; this resolves the incompatibility and redundancy issues.

This is a good way of dealing with data when you own the information as you can structure the database to suit the needs of the users and can fairly easily create search systems.

However, it is not suitable if the data is not your own, as such information is often unstructured and heterogeneous.

We briefly looked at Entity Relationship Modelling, which is a little beyond the scope of this course. Basically, it allows you to describe the content of a database at a design level, before any database is created, by looking at the relationships between the things (entities) that will be in the database.

We then progressed onto SQL and how it is a language that allows communication between a user and the database management system to query what is in the database and allow housekeeping of the information.

A database is a collection of 2 dimensional data tables with rows and columns and the complicated part is what you do with the tables.

We then looked at the relationships between the entities within a database and how there can be three different kinds: one to one, where one entity relates to only one other, e.g. a painting can only be in one gallery; many to one, where many items relate to one entity, e.g. many paintings in one gallery; and many to many, where many entities relate to many others, e.g. many painters with work in different galleries.

An entity is basically any thing, and it can have any number of attributes (infinitely many, in fact), but a database should only collect relevant information; for example, a personnel database would need to know your name, address and phone number but probably not your hair colour or shoe size.

When the data involves more than one kind of entity, problems occur if everything is kept together, so you need a separate table for each entity. Using the art gallery example from class, we would have one table of artworks, holding the artist and date painted, and another table of galleries, holding the address, city and country. Each row in each table has a unique ID so that the two tables can relate to each other. For example, if the Mona Lisa has ID 1 and is in gallery 3 (where 3 is an ID in the galleries table), then the database can select the right row in the galleries table to find the correct gallery.
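The way the two tables relate through an ID can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than the MySQL system from the lab, purely because it runs without a server, and the exact column names (`gallery_id` and so on) and sample rows are my own assumptions rather than the real class database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE galleries (id INTEGER PRIMARY KEY, name TEXT, city TEXT, country TEXT);
    CREATE TABLE artworks (
        id INTEGER PRIMARY KEY, title TEXT, artist TEXT, date INTEGER,
        gallery_id INTEGER REFERENCES galleries(id)  -- points at a row in galleries
    );
    -- Sample rows, assumed for illustration: artwork 1 is held in gallery 3.
    INSERT INTO galleries VALUES (3, 'Louvre', 'Paris', 'France');
    INSERT INTO artworks VALUES (1, 'Mona Lisa', 'Leonardo da Vinci', 1503, 3);
""")

# Follow the gallery id stored with the artwork to the matching gallery row.
row = conn.execute("""
    SELECT artworks.title, galleries.name, galleries.city
    FROM artworks JOIN galleries ON artworks.gallery_id = galleries.id
    WHERE artworks.id = 1
""").fetchone()

print(row)  # ('Mona Lisa', 'Louvre', 'Paris')
```

The gallery's details are stored once, in the galleries table, and every artwork held there just carries the number 3, which is what removes the redundancy.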

We then looked at how you query a database and a few of the keywords that have to be used.
SELECT columns
FROM tables
WHERE something is true

For example again using the gallery model
mysql> select name, country from galleries where country = "uk";

The quotation marks indicate that uk is a text value to match, not the name of a column.

Another example would be
mysql> select title
    -> from artworks
    -> where date > 1800;

You can split the query over many lines, as the query only ends at the semicolon. In this case we have used > to mean greater than, so the query returns the titles of artworks dated after 1800.
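Both example queries can be run against a small stand-in database. The sketch below uses Python's built-in sqlite3 module instead of MySQL, and the gallery names, painting titles and dates are invented placeholders, so it only illustrates the shape of the queries, not the real data from class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE galleries (name TEXT, country TEXT);
    CREATE TABLE artworks (title TEXT, date INTEGER);
    -- Placeholder rows, assumed for illustration only.
    INSERT INTO galleries VALUES ('Gallery A', 'uk'), ('Gallery B', 'france');
    INSERT INTO artworks VALUES ('Painting X', 1750), ('Painting Y', 1889);
""")

# Text values go in quotation marks so they are not read as column names.
uk_galleries = conn.execute(
    "SELECT name, country FROM galleries WHERE country = 'uk'"
).fetchall()

# Numbers are compared directly; this keeps only artworks dated after 1800.
after_1800 = conn.execute(
    "SELECT title FROM artworks WHERE date > 1800"
).fetchall()

print(uk_galleries)  # [('Gallery A', 'uk')]
print(after_1800)    # [('Painting Y',)]
```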

We then went into the lab to work with a real database to try and ascertain information about different aspects of the database.

This brought together the relatively simple theory and showed how complicated it can get when working with a real database. The software is very particular and everything has to be done in a certain way: it is case sensitive and names must be exactly right, so, for example, typing Gallery when the table is called galleries produces an error. It also showed that you have to understand the nature and layout of the tables before you can begin to query the database, because if you don’t know which table an entity is in it is impossible to stumble across the information you need. It is very precise and there is no way to work it out as you go; you have to know the layout well.

To do this you can first type:
show tables; (lists the different tables in the database)
desc authors; (where authors is the name of a table; shows the details of that table, such as its columns)
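The same kind of exploration can be sketched outside MySQL. SQLite, which I use here only because it comes with Python, does not have the show tables and desc commands; as far as I understand, the nearest equivalents are a query on its `sqlite_master` table and `PRAGMA table_info`. The `authors` and `titles` tables below are hypothetical stand-ins for the lab database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
# Hypothetical tables standing in for the lab database.
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE titles (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER)")

# Rough equivalent of MySQL's "show tables;"
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Rough equivalent of "desc authors;": each row describes one column,
# with the column name in position 1 and its declared type in position 2.
columns = [(r[1], r[2]) for r in conn.execute("PRAGMA table_info(authors)")]

print(tables)   # ['authors', 'titles']
print(columns)  # [('id', 'INTEGER'), ('name', 'TEXT')]
```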

This is vitally important as without this information the rest of the exploration is impossible.

We then had to develop queries using all that we had learnt to ascertain different pieces of information from the database.

Sunday, 10 October 2010

DITA session 2

The Internet and the World Wide Web

This session was on how the Internet and the World Wide Web work and interact with each other.

We talked about how the Internet is a collection of WANs (wide area networks).

We looked at what the different domain endings in the DNS (Domain Name System) mean, e.g. .com = commercial, .org = organisation, .ac.uk = UK academic.

We saw how the Internet is a collection of networks and the WWW is the documents they contain.

We discussed how it has become a disruptive technology and changed the way we and industry work; for example, publishing has moved from paper to online, and open source software has advanced.

We discussed how a URL (Uniform Resource Locator) works, from the protocol, HTTP (Hypertext Transfer Protocol), through the domain name to the local path, and how the hierarchy runs from right to left in the domain name (the most general part comes last) and from left to right in the local path.
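The parts of a URL can be pulled apart with Python's standard `urllib.parse` module. As an example I have used the address of my own student web space; the variable names are just for this sketch.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.student.city.ac.uk/~abjd609/DITA")

scheme = parts.scheme                # the protocol part
labels = parts.hostname.split(".")   # the domain name, split at the dots
path_steps = [s for s in parts.path.split("/") if s]  # the local path, split at the slashes

print(scheme)                  # http
print(list(reversed(labels)))  # ['uk', 'ac', 'city', 'student', 'www']
print(path_steps)              # ['~abjd609', 'DITA']
```

Reversing the domain labels puts the most general part (uk) first, whereas the path steps already read from general to specific as written.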

We then looked at hyper text and HTML and looked at how HTML is written using a variety of tags and how a basic web page can be made using it.

Then we took a brief look at CSS style sheets and how they can add to an HTML web page.

In our lab session we then wrote two simple web pages in HTML, one linking to the other, and hosted them in our student web space.

http://www.student.city.ac.uk/~abjd609/DITA