Software Engineering with an Agile Development Framework/Iteration Two/Data modelling

Data is...

Why do some argue is most important thing:

Static structure of system - like a map

Advantages of database approach Program-data independence Minimal data redundancy Improved data consistency Improved data sharing Improved data quality Reduced program maintenance

Difference between data model and database
Modelling process

Data modelling is DESIGN Creative solutions needed Data model is one solution to req’s

Data modelling is IMPORTANT Need a blueprint to build a house

Data modelling is a discipline Professional judgement required

Data modelling through the development process.

Entity relationship model: Detailed graphical representation of entities and associations

Focus is purely data Not what happens to it Depicts the form of the data Not the data values Specific to the needs of the business

Person, place, object, event, concept that is of interest to the organisation single, unambiguous names Student, book, product

Time Space Fuzzy

Association between the instances of one or more entity types name strongly make a sentence to describe the relationship may have attributes – an associative entity

A book is borrowed by a patron A patron may borrow a book

Conceptual Schema A detailed, technology independent specification of the overall structure of a database Physical Schema Specifications for how data from a conceptual schema are stored in a computer’s secondary memory

Let's say we are working on a High School records system. From our initial requirements determination we have a collection of material: class lists, individual transcripts, class results and so on. The tasks in data modelling is to extract the underlying structure of the information. This is the data model.

The core of the data model is the entity. An entity is something about which the organisation holds information. It can be an actual thing or something less tangible like a concept (or student grades). Note that an “entity” here describes the generic level of the things: hence STUDENT, rather than Bob. Bob, Jane and their friends are all instances of STUDENT.

In our simple school we can easily identify three entities: STUDENT, TEACHER and COURSE.



Each entity is described by attributes. These are what we use to describe the instances of each entity. There are strict rules about attributes, they need to be independent and not repeating information. They also need to be atomic (can’t be broken down), and not be dependent on other values.

In the example below, StudentName is unsuitable as it can be broken down into first and surname; StudentAge isn’t necessary as it can be calculated (putting it in by hand is asking for trouble), and StudentGrades is not atomic - it contains multiple values - which will not work (we’ll come back to this one).



We also need some way of uniquely identifying each instance, this becomes the Primary key for the entity. This has to be not change be unique. We could use the name but, despite us all feeling special, our names aren’t. We could combine the name with date of birth and make a concatenated key. Unfortunately, dates are very difficult to deal with (what is the significance of the 12th of September?) and it doesn’t resolve our non-unique problem. So we have to use a new value: a STUDENT_ID.

We can get the same information from these attributes and the data is structured much more powerfully.



The student grades are not information just about the student, they are information about that student's interaction with courses.

The courses a student is enrolled in is more difficult to solve, but the solution is the heart of the power of a data based (and hence database) approach to understanding the business and implementing a solution.



In the example above Bob is taking only Geography (wise choice), so that is easy to deal with. Jane, however is taking French, Geography and History. If we try and store this information in an attribute called Courses it is very difficult to extract this information - while it would be easy to get a list of Jane’s courses, a list of who is taking Geography would be very hard to extract (and involve complicated string manipulation etc).

The real strength of data modelling, is in the identification of relationships. A relationship describes the association between entities. We start just by drawing a line between the entities:



We can see that there are clear relationships between STUDENT and COURSE, and COURSE and TEACHER. We can express these relationships on the picture. Reading in the direction of the arrows, these associations can be expressed as sentences: “a STUDENT takes this COURSE”, “a COURSE is taught by a TEACHER”. (we’ll come back to how many when we look at cardinality).



The STUDENT - TEACHER association, however, is not so clear. The relationship (at least the one we are considering here) is already expressed by the model. We can extract which teacher is teaching which students via the COURSE they are teaching and enrolled in. We don’t need an explicit STUDENT-TEACHER relationship.



Even that this level of consideration, we a being forced to carefully think about the business we are dealing with. We could have described a STUDENT-TEACHER-COURSE model. Why have we chosen not to have the STUDENT-TEACHER relationship rather than not having STUDENT-COURSE?

We also need to express the numbers of instances that might be involved in the relationship. We do this with "crowsfeet", with the crow's foot on the many end.



Here we have “many STUDENTS may take many COURSES”, or conversely “many COURSES may be taken by many STUDENTS”. We express the “many” on our diagram with a “crows foot” (hanging on to the many end).

Early in the development of conceptual model it is OK to have a relationship like this, but we need to do more to make it useful. A double-ended crowsfoot like this is called a "none-specific relationship", we know we have lots of students and lots of courses, but not who is taking what.



One solution to these unconnected lists, is to include attributes for each of the courses someone is taking (below). Unfortunately this doesn’t work either. What happens when someone enrolls in more courses than you have previously thought of, where do the results actually go, and we still don’t have a way of finding out who is enrolled in Geography.



What we need is a new entity to represent this relationship. This is called an associative entity.



We might have identified it earlier as it is really quite strong in terms of ‘thingness’. The relationship between STUDENT and COURSE also has several parameters in its own right: date, result, perhaps internal assessment grades and so on. The fact that the relationship has attributes indicates that it is really an entity.



Again, we come back to the value of this process in understanding the business. Is it just one entity? We can think of several possible names for the entity - are they the same thing.

There are cases where multiple names indicate multiple associations - the Human resources manager authorises payment of salaries (one relationship), but is also on the payroll herself.

This middle entity shows the association between STUDENT and COURSE, ENROLLMENT. When we implement the model, the ENROLLMENT entity (table) does not contain the student’s names, nor the courses - the computer can quite efficiently go and get them. Instead we just use the primary keys from the other entities - they become foreign keys.

There is though, a flaw in our model. Jane Smith seems to be enrolled twice in History. This is not a mistake (except for Jane) - she failed it the first time. In order to allow this to happen we need to represent more information, the year of enrollment. Our model can generate the information required, it will be useful as the development progresses as we work to make it more tightly defined.

The primary focus in functional requirements is the process itself, by developing such a model, we are forced to think carefully about the business we are working with.

Example models
This model describes a job management system for an engineering firm.



A different group made this model for the same system.



A system for a library. When developing this model, most people think of "book", but a book means different things in different situations. It is the physical book that is loaned, yet it is the "title" that people actually want to read (and that of a particular category such as large print). Note the two relationships between item and patron for loan and reserve.



This system describes an environmental planning system for agriculture.