NoSQL

What is big data?

Data that exist in very large volume and many different varieties (data types) and that need to be processed at a very high velocity.

What is NoSQL?

Database that doesn’t use realtional model or SQL language, that ared designed to handle larger data volumes and throughput.

Features of NoSQL

Big data & 3V

Data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed).

Characteristics

NoSQL examples

CAP theorem

CAP theorem indicates there can only be this 2 of 3 at the same time.

BASE (Basically Available, Soft state, Eventual consistency)

Database Administration

4 competencies of a database administrator

Capacity planning

Capacity Planning is the process of predicting when future load levels will saturate the system and determining the most cost-effective way of delaying system saturation as much as possible.

Consider:

Estimating disk space requirements

Treat Database size as the sum of all Table sizes where table size = number of rows * average row width

Data Warehousing

Difference between transactional and informational databases

Characteristic Transactional Informational
Primary Purpose Run the day to day business Support decision making
Type of Data Current data — representing the state of business Historical data — snapshots and predictions
Primary Users Customers, clerks and other employees Managers, analysts
Scope of Usage Narrow, planned, fixed interfaces Broad, ad hoc, complex interfaces
Design Goal Performance and availability Flexible use and data accessibility
Volume Many contestant updates and queries on a few tables or rows Periodic batch updates, complex queryign on multiple or all rows

Fact table

Dimension table

Contains the details of information (in category) in each dimension

Design a dimensional model (star schema)

Database Transactions

Really will you test it?\boxed{\text{Really will you test it?}}

Why Database Transactions?

Required Properties of Transactions (ACID)

Atomicity

A transaction is treated as a single indivisible logical unit of work. All operations in a transaction must be completed; if not, then the transaction is aborted

Consistency

Database constraints that hold before a transaction must also hold after it

Isolation

Data used during execution of a transaction cannot be used by a second transaction until the first one is completed

Durability

When a transaction is complete, the changes made to the database are permanent, even if the system fails

Concurrent Access: Dealing with Multiple users

Concurrency control methods

Normalisation

Function Dependency

Example: A(X, Y, A, B)

Normalisation Process

1st normal form: Remove repeating groups

0NF:

1NF:

2nd normal form: Remove partial dependencies

1NF:

2NF:

3rd normal form: Remove transitive dependencies

2NF:

3NF:

Boyce-Codd Normal Form: Ensure determinants are candidate keys

3NF:

BCRF:

Anormality

Errors and inconsistencies in data relation

Denormalisation

Query optimisation

Query optimization/describe steps?

  1. Break query into query blocks (e.g. nested blocks)
  2. Converting query block into relational algebra expression
  3. Relational algebra equivalences

Equivalence classes

Selections:σc1c2cn(R)σc1((σcn(R)))(Cascade)σc1(σc2(R))σc2(σc1(R))(Commute)Projections:πa1(R)πa1((πan(R)))(Cascade)\begin{aligned} \color{red}{\text{Selections:}} && \sigma_{c_1\land c_2\land \ldots \land c_n}(R)&\equiv\sigma_{c_1}(\ldots(\sigma_{c_n}(R))) &&\text{(Cascade)}\\ && \sigma_{c_1}(\sigma_{c_2}(R))&\equiv\sigma_{c_2}(\sigma_{c_1}(R))&&\text{(Commute)}\\ \color{red}{\text{Projections:}}&&\pi_{a_1}(R)&\equiv\pi_{a_1}(\ldots(\pi_{a_n}(R)))&&\text{(Cascade)} \end{aligned}


R(ST)(RS)T(Assosiative)(RS)(SR)(Commutative)\begin{aligned} R\bowtie(S\bowtie T)&\equiv(R\bowtie S)\bowtie T &&\text{(Assosiative)}\\ (R\bowtie S)&\equiv(S\bowtie R) &&\text{(Commutative)} \end{aligned}

Result size/cost estimation

Formula sheet

Query Processing

File organisations

  1. Heap files : no particular order among records
  1. Sorted Files : pages and records within pages are ordered by some condition
  1. Index File Organisations : special data structure (data ordered by some condition)

Formula sheet

SQL Language

Data Definition Language (DDL)

CREATE TABLE my_table (
  id INT NOT NULL,
  person VARCHAR(200) NOT NULL,
  result CHAR(4) NOT NULL,
  comment TEXT NULL,
  grade ENUM('A', 'B', 'C'),
  dept INT NOT NULL,
  PRIMARY KEY (id),
  KEY fk_dept_idx (dept),
  CONSTRAINT fk_dept 
    FOREIGN KEY (dept)
    REFERENCES departments (id)
    ON DELETE \(RESTRICT|CASCADE|SET NULL|NO ACTION|SET DEFAULT\)
    ON UPDATE \(RESTRICT|CASCADE|SET NULL|NO ACTION|SET DEFAULT\)
);
ALTER TABLE my_table
\(
  ADD foo INT NOT NULL
|
  DROP COLUMN grade
|
  MODIFY COLUMN result CHAR(10)
\)
;
DROP TABLE [IF EXISTS] my_table;
TRUNCATE TABLE my_table;
RENAME TABLE my_table TO table_is_mine;

Data Manipulation Language (DML)

SELECT 
  column1, column2,
  column3 isalias,
  column4 'anything like this',
  COUNT(column5) / SUM(column6),
  CASE column19
    WHEN 1 THEN 1
    ELSE 0
  END AS myCol19,
  CASE 
    WHEN column20 > 10 AND column <= 20 
      THEN 1
    ELSE 0
  END
  IF column21 + column22 = column23 
    THEN 1 
    ELSE 0 
  END IF AS col212223
FROM table1
WHERE 
  column7 = '8' AND
  column8 NOT IN (
    SELECT column9 FROM table2 WHERE column10 <> 11;
  )
INNER JOIN table3 ON 
  table3.column11 = table1.column12 AND
  table3.column13 = table1.column13
INNER JOIN table4 ON 
  table4.column14 = table1.column15
GROUP BY column16
HAVING SUM(column17) > 25
ORDER BY column18 DESC
LIMIT 10 OFFSET 11;
INSERT INTO table5 (col1, col2, col3, col4, col5)
VALUES (1, 2, 3, 4, '5'), (6, 7, 8, 9, 'IO');
DELETE FROM table6
WHERE ...
ORDER BY ...
LIMIT ...;
UPDATE table7
SET col1 = 1, col2 = 2, col3 = 'E'
WHERE ...
ORDER BY ...
LIMIT ...;

Relational Algebra

R cS=σc(R×S)R\bowtie_{\ c}S =\sigma_c(R\times S)

ρ(S1(1sid1,2sname1,3rating1,4age1),Sailor)ρ(S2(1sid2,2sname2,3rating2,4age2),Sailor)πsname1, sname2(S1 age1>age2rating1<rating2S2)\begin{aligned} & \rho(\text{S1}(1\rightarrow\text{sid1},2\rightarrow\text{sname1},3\rightarrow\text{rating1},4\rightarrow\text{age1}),\text{Sailor})\\ & \rho(\text{S2}(1\rightarrow\text{sid2},2\rightarrow\text{sname2},3\rightarrow\text{rating2},4\rightarrow\text{age2}),\text{Sailor})\\ & \pi_\text{sname1, sname2}(\text{S1}\bowtie_{\ \text{age1}\gt\text{age2}\,\land\,\text{rating1}\lt\text{rating2}}\text{S2}) \end{aligned}

πsname,rating(σrating>8(S2))\pi_{\text{sname,rating}}\left(\sigma_{\text{rating}>8}(\text{S2})\right)

ER Modelling

CM Mapping

Supertype/Subtype

Develoments

Some messy stuff from unofficial sample exam

8a

A star schema is a specific relational model for use in a DW. It has a fact table which contains keys (to the dimension tables) and facts (calculated and useful information e.g. total revenue). Linked by the keys are the dimension tables which are the relevant tables with the relevant columns to whatever business process the analyst is analysing. The granularity refers to the how specific particular information should be in the dimensional tables. This is usually with respect to time or location because:

The grain is the level of specificity decided on by the Analyst or Data Scientist.

8b

  1. Decide on the business process/goal: Ticket sales, specifically in order to sell more tickets
  2. Decide on granularity of data:
    • Time: Weekly is when most venues have events so probably the lowest necessary level of granularity.
    • Location: Suburb.
  3. Determine the dimensions to be used: Ticket, Person, Collection, Event, Venue (All)
  4. Determine the facts: Total #of tickets sold. Total revenue from tickets sold. Any other facts that might be useful?

9a

9b

Partitioning generally has to be tolerated as network failures are a) likely and b) out of the DBA’s control. Thus one must choose to keep either consistency or availability.

A RDBMS such as MySQL opts for consistency. A NoSQL DB such as MongoDB opts for Availability and “eventual consistency”.

9c

No, a NoSQL database is not suitable for Tozmaster because consistency is important. Without consistency (by proper application of ACID principles) one ticket may be sold to 2 different people.

9.d

State the BASE of NoSQL in contrast to ACID of relational database