A database that doesn’t use the relational model or SQL language, and that is designed to handle larger data volumes and throughput.
- doesn’t use relational model or SQL language
- runs well on distributed servers
- Scaling out rather than scaling up
- most are open-source
- built for the modern web and natural for a cloud environment
- schema-less (though there may be an “implicit schema”)
- Supports schema on read
- Store semi-structured and unstructured data (JSON, XML, etc.)
- Not ACID (Atomicity, Consistency, Isolation, Durability) compliant!
- ‘eventual consistency’
Data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed).
- Volume – much larger quantity of data than typical for relational databases
- Variety – lots of different data types and formats
- Velocity – data arrives at a very fast rate (e.g. mobile sensors, web click streams)
- Schema on Read, rather than Schema on Write
- Schema on Read – data model determined later, depends on how you want to use it (XML, JSON)
- Capture and store the data, and worry about how you want to use it later
- Data Lake
- A large integrated repository for internal and external data that does not follow a predefined schema
- Capture everything, dive in anywhere, flexible
Document : MongoDB, Raven DB, CouchDB, Terrastore, OrientDB
Like a key-value store, but a “document” goes further than a “value”: a document is structured, so specific elements can be manipulated separately.
- document is “examinable” by the db, so its content can be queried, and parts of it updated
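For instance, a hypothetical order document (field names assumed for illustration); the database can query or update nested elements, such as the qty of a single item, directly:

```json
{
  "order_id": 1001,
  "customer": { "name": "Ada", "city": "Sydney" },
  "items": [
    { "item_id": 7, "desc": "Widget", "qty": 2 },
    { "item_id": 9, "desc": "Gadget", "qty": 1 }
  ]
}
```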
Column-family : Cassandra, BigTable, Apache HBase, DynamoDB
Rows and columns. Data is distributed based on both key values (records) and columns, using “column groups/families”.
- Columns rather than rows are stored together on disk.
- Makes analysis (not OLTP) faster, as less data is fetched.
- This is like automatic vertical partitioning.
- Related columns grouped together into ‘families’.
- Each row can store a different set of columns.
Graph: Neo4J, Infinite Graph, OrientDB, FlockDB, TAO
Maintains information regarding the relationships between data items. Nodes have properties; connections between nodes (relationships) can also have properties.
- A ‘graph’ is a node-and-arc network
- Social graphs (e.g. friendship graphs) are common examples
- Graphs are difficult to program in relational DB
- A graph DB stores entities and their relationships
- Graph queries deduce knowledge from the graph
Key-value : Riak, Redis, Memcached, Berkeley DB, HamsterDB, Amazon DynamoDB, Project Voldemort, Couchbase
A simple pair of a key and an associated collection of values. Key is usually a string. Database has no knowledge of the structure or meaning of the values.
The CAP theorem states that a distributed system can provide at most 2 of these 3 properties at the same time:
- Consistency: Everyone always sees the same data.
- Availability: System stays up when nodes fail.
- Partition Tolerance: System stays up when the network between nodes fails.
- Basically Available : This constraint states that the system does guarantee the availability of the data; there will be a response to any request. But data may be in an inconsistent or changing state.
- Soft state : The state of the system could change over time — even during times without input there may be changes going on due to ‘eventual consistency’.
- Eventual consistency : The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it needs to, sooner or later, but the system continues to receive input and does not check the consistency of every transaction before moving on to the next one.
- Capacity planning
- Estimating disk space and transaction load
- Performance improvement
- Security
- Threats
- Web apps and SQL injection
- Backup and recovery
- Types of failures, responses to these, types of backups
- Other measures to protect data
Capacity Planning is the process of predicting when future load levels will saturate the system and determining the most cost-effective way of delaying system saturation as much as possible.
Consider:
- disk space requirements
- transaction throughput
(at go-live and throughout the life of the system)
Treat database size as the sum of all table sizes, where table size = number of rows × average row width.
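For instance, with assumed figures (illustrative only):

```
Customer:   500,000 rows × 120 bytes/row ≈  60 MB
Order:    2,000,000 rows ×  80 bytes/row ≈ 160 MB
Database ≈ 60 MB + 160 MB = 220 MB (plus index and storage overhead)
```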
- A single repository of organisational data
- Integrates data from multiple sources
- Extracts data from source systems, transforms, loads into the warehouse
- “Single version of truth”
- Makes data available to managers/users
- Supports analysis and decision-making
- Involves a large data store (often several terabytes; sometimes petabytes, exabytes, zettabytes, or even yottabytes of data)
| Characteristic | Transactional | Informational |
| --- | --- | --- |
| Primary purpose | Run the day-to-day business | Support decision making |
| Type of data | Current data, representing the state of the business | Historical data: snapshots and predictions |
| Primary users | Customers, clerks and other employees | Managers, analysts |
| Scope of usage | Narrow, planned, fixed interfaces | Broad, ad hoc, complex interfaces |
| Design goal | Performance and availability | Flexible use and data accessibility |
| Volume | Many constant updates and queries on a few tables or rows | Periodic batch updates, complex querying on multiple or all rows |
- A fact table contains the actual business measures (additive), called facts
- Also contain foreign keys for dimensions
- Granularity, or level of detail, is a key issue
- Finest level of detail for a fact table
- Determined by the finest level of each dimension
A dimension table contains the descriptive details (categories) of each dimension
- Choose a Business Process
- Choose the grain of the fact table = granularity
- Choose the dimensions
- Choose the measured facts (usually numeric, additive quantities)
- Complete the dimension tables
- To deal with problems when executing units of work:
- Failures of various kinds (hardware failures and system crashes)
- Concurrent execution of multiple transactions
- How to deal with these issues?
Atomicity: A transaction is treated as a single, indivisible logical unit of work. All operations in a transaction must be completed; if not, the transaction is aborted.
Consistency: Database constraints that hold before a transaction must also hold after it.
Isolation: Data used during the execution of a transaction cannot be used by a second transaction until the first one is completed.
Durability: When a transaction is complete, the changes made to the database are permanent, even if the system fails.
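A minimal sketch of these properties in MySQL-style SQL (the `accounts` table and values are assumptions for illustration): both updates succeed or fail as one unit.

```sql
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- debit
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- credit
COMMIT;  -- both changes become permanent together (durability)
-- Any failure before COMMIT triggers ROLLBACK, undoing every change (atomicity).
```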
- Concurrent execution of DML by multiple users against a shared database
- Better utilization of disk and CPU
- Reduced response time
- But what could go wrong?
- lost updates
- uncommitted data
- inconsistent retrievals
- To achieve efficient execution of transactions, the DBMS creates a schedule of read and write operations for concurrent transactions
- Several methods of concurrency control
- Locking (the main method; see the sketch after this list)
- Versioning (time stamping)
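A minimal sketch of locking in MySQL-style SQL (the `tickets` table and values are assumptions for illustration): `SELECT ... FOR UPDATE` locks the row, so a concurrent transaction cannot cause a lost update.

```sql
START TRANSACTION;
-- Lock the row until COMMIT; concurrent writers must wait.
SELECT qty_left FROM tickets WHERE ticket_id = 42 FOR UPDATE;
UPDATE tickets SET qty_left = qty_left - 1 WHERE ticket_id = 42;
COMMIT;  -- lock released; other transactions now see the new value
```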
- A functional dependency concerns values of attributes in a relation
- A set of attributes X determines another set of attributes Y if each value of X is associated with only one value of Y
Example: R(X, Y, A, B)

- Determinants (X, Y → Z): the attribute(s) on the left-hand side of the arrow
- Key and non-key attributes: each attribute is either part of the primary key or it is not
- Partial functional dependency (Y → Z): a functional dependency of one or more non-key attributes upon part (but not all) of a candidate (composite) key. E.g. B → A is a partial functional dependency when A is a non-key attribute and B is a proper subset of a candidate (composite) key.
- Transitive dependency (Y → Z): a functional dependency between 2 (or more) non-key attributes. E.g. B → A is a transitive dependency when A and B are both non-key attributes.
0NF:
OrderItem (Order#, Customer#, (Item#, Desc, Qty))
1NF:
OrderItem (Order#, Item#, Desc, Qty)
Order (Order#, Customer#)
1NF:
OrderItem (Order#, Item#, Desc, Qty)
2NF:
OrderItem (Order#, Item#, Qty)
Item (Item#, Desc)
2NF:
Employee (Emp#, Ename, Dept#, Dname)
3NF:
Employee (Emp#, Ename, Dept#)
Department (Dept#, Dname)
3NF:
Review (Stu#, Tchr#, Subj#, ReviewDate)
(Subj# → Tchr#)
BCNF:
Review (Stu#, Subj#, ReviewDate)
Subject (Subj#, Tchr#)
Anomalies: errors and inconsistencies in a relation’s data
- Normalisation
- Normalised relations contain a minimum amount of redundancy and allow users to insert, modify, and delete rows in tables without errors or inconsistencies (anomalies)
- Denormalisation
- The pay-off: query speed.
- The price: extra work on updates to keep redundant data consistent.
- Denormalisation may be used to improve performance of time-critical operations.
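As a sketch, denormalising the Employee/Department example from above means copying Dname back into Employee (identifiers adapted for plain SQL; illustrative only):

```sql
CREATE TABLE employee_denorm (
    emp_no  INT PRIMARY KEY,
    ename   VARCHAR(100),
    dept_no INT,
    dname   VARCHAR(100)  -- redundant copy of department.dname: join-free reads, extra work on every update
);
```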
- Break the query into query blocks (e.g. nested blocks)
- Convert each query block into a relational algebra expression
- Relational algebra equivalences
Selections:
- σc1∧c2∧…∧cn(R) ≡ σc1(…(σcn(R))) (Cascade)
- σc1(σc2(R)) ≡ σc2(σc1(R)) (Commute)
Projections:
- πa1(R) ≡ πa1(…(πan(R))) (Cascade)
Joins:
- R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T (Associative)
- R ⋈ S ≡ S ⋈ R (Commutative)
Formula sheet
- Selection
- Full Scan/Heap Scan
- Index on selection attributes
- Hash-based
- B+tree-based
- Clustered/Unclustered
- Projection
- Joins
- Nested loops
- Simple NLJ
- Page-Oriented NLJ
- Block-Oriented NLJ
- Sort-merge
- Hash
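In MySQL, `EXPLAIN` shows which of these access paths and join methods the optimiser chose for a query; a minimal example (the table and predicate reuse names from the DDL further below, for illustration):

```sql
EXPLAIN SELECT * FROM my_table WHERE person = 'Ada';
-- The "type" and "key" output columns reveal whether a full scan or an index was used.
```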
- Heap files : no particular order among records
- Suitable when typical access is a file scan retrieving all records
- Sorted Files : pages and records within pages are ordered by some condition
- Best for retrieval in some order, or for retrieving a range of records
- Index File Organisations : special data structure (data ordered by some condition)
- Clustered vs. Unclustered
- Primary vs. Secondary
- Single Key vs. Composite
- B+ tree vs Hash Index
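For example, creating secondary indexes in MySQL (on the `my_table` definition from the DDL below; B+ tree is the default index structure in InnoDB):

```sql
CREATE INDEX idx_person ON my_table (person);             -- single-key secondary index
CREATE INDEX idx_dept_result ON my_table (dept, result);  -- composite (multi-key) index
```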
Formula sheet
CREATE TABLE my_table (
id INT NOT NULL,
person VARCHAR(200) NOT NULL,
result CHAR(4) NOT NULL,
comment TEXT NULL,
grade ENUM('A', 'B', 'C'),
dept INT NOT NULL,
PRIMARY KEY (id),
KEY fk_dept_idx (dept),
CONSTRAINT fk_dept
FOREIGN KEY (dept)
REFERENCES departments (id)
ON DELETE {RESTRICT | CASCADE | SET NULL | NO ACTION | SET DEFAULT}
ON UPDATE {RESTRICT | CASCADE | SET NULL | NO ACTION | SET DEFAULT}
);
ALTER TABLE my_table
{ ADD foo INT NOT NULL
| DROP COLUMN grade
| MODIFY COLUMN result CHAR(10) };
DROP TABLE [IF EXISTS] my_table;
TRUNCATE TABLE my_table;
RENAME TABLE my_table TO table_is_mine;
SELECT
column1, column2,
column3 alias3,
column4 'anything like this',
COUNT(column5) / SUM(column6),
CASE column19
WHEN 1 THEN 1
ELSE 0
END AS myCol19,
CASE
WHEN column20 > 10 AND column20 <= 20
THEN 1
ELSE 0
END AS myCol20,
IF(column21 + column22 = column23, 1, 0) AS col212223
FROM table1
INNER JOIN table3 ON
table3.column11 = table1.column12 AND
table3.column13 = table1.column13
INNER JOIN table4 ON
table4.column14 = table1.column15
WHERE
column7 = '8' AND
column8 NOT IN (
SELECT column9 FROM table2 WHERE column10 <> 11
)
GROUP BY column16
HAVING SUM(column17) > 25
ORDER BY column18 DESC
LIMIT 10 OFFSET 11;
INSERT INTO table5 (col1, col2, col3, col4, col5)
VALUES (1, 2, 3, 4, '5'), (6, 7, 8, 9, 'IO');
DELETE FROM table6
WHERE ...
ORDER BY ...
LIMIT ...;
UPDATE table7
SET col1 = 1, col2 = 2, col3 = 'E'
WHERE ...
ORDER BY ...
LIMIT ...;
- Selection (σ) Selects a subset of rows from relation (horizontal).
- Projection (π) Retains only wanted columns from relation (vertical).
- Cross-product (×) Allows us to combine two relations.
- Set-difference (–) Tuples in r1, but not in r2.
- Union (∪) Tuples in r1 and/or in r2.
- Rename (ρ) Rename table and columns in it.
R ⋈c S = σc(R × S)
ρ(S1(1→sid1, 2→sname1, 3→rating1, 4→age1), Sailor)
ρ(S2(1→sid2, 2→sname2, 3→rating2, 4→age2), Sailor)
πsname1, sname2(S1 ⋈age1>age2 ∧ rating1<rating2 S2)
πsname,rating(σrating>8(S2))
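The last algebra query written as SQL, assuming a Sailor(sid, sname, rating, age) table for S2:

```sql
SELECT sname, rating
FROM Sailor
WHERE rating > 8;
```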
CM Mapping
- D/O (Disjointness constraints)
- D: Disjoint
Instance of supertype can be in at most 1 subtype
- O: Overlap
Instance of supertype can be in multiple subtypes at the same time
- =/- (Completeness constraints)
- =: Complete
Instance must be in at least 1 subtype
- -: Incomplete
Instance can be in 0 or more subtypes
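One common way to map a disjoint, complete supertype/subtype hierarchy to tables; a minimal sketch, assuming a hypothetical person supertype with student/staff subtypes:

```sql
CREATE TABLE person (
    id    INT PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    ptype ENUM('student', 'staff') NOT NULL  -- D and =: exactly one subtype per instance
);

CREATE TABLE student (
    id     INT PRIMARY KEY,
    degree VARCHAR(50),
    FOREIGN KEY (id) REFERENCES person (id)
);

CREATE TABLE staff (
    id     INT PRIMARY KEY,
    salary DECIMAL(10, 2),
    FOREIGN KEY (id) REFERENCES person (id)
);
```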
- Database planning
- Systems definition
- Requirements definition and analysis
- Conceptual design
- Logical design
- Physical design
- Application design
- Implementation
- Data conversion and loading
- Testing
- Operational maintenance
- ※ Repeat: the lifecycle is iterative
A star schema is a specific relational model for use in a DW. It has a fact table, which contains keys (to the dimension tables) and facts (calculated and useful information, e.g. total revenue). Linked by the keys are the dimension tables: the tables with the columns relevant to whatever business process the analyst is analysing. Granularity refers to how specific the information in the dimension tables should be. This is usually expressed with respect to time or location, because:
- Time: Years -> Months -> Days -> Hours -> Minutes -> Seconds
- Location: Continent -> Country -> State -> City -> Town/Suburb -> Street -> Address
The grain is the level of specificity decided on by the Analyst or Data Scientist.
- Decide on the business process/goal: Ticket sales, specifically in order to sell more tickets
- Decide on granularity of data:
- Time: Weekly is when most venues have events so probably the lowest necessary level of granularity.
- Location: Suburb.
- Determine the dimensions to be used: Ticket, Person, Collection, Event, Venue (All)
- Determine the facts: Total #of tickets sold. Total revenue from tickets sold. Any other facts that might be useful?
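A sketch of the resulting star schema in MySQL-style DDL, showing only the time and venue dimensions for brevity (all table and column names are assumptions for illustration):

```sql
CREATE TABLE dim_week (
    week_key INT PRIMARY KEY,
    week_no  INT,
    year     INT
);

CREATE TABLE dim_venue (
    venue_key INT PRIMARY KEY,
    vname     VARCHAR(100),
    suburb    VARCHAR(100)   -- location grain: suburb
);

CREATE TABLE fact_ticket_sales (
    week_key      INT NOT NULL,
    venue_key     INT NOT NULL,
    tickets_sold  INT,            -- additive fact
    total_revenue DECIMAL(12, 2), -- additive fact
    PRIMARY KEY (week_key, venue_key),
    FOREIGN KEY (week_key)  REFERENCES dim_week (week_key),
    FOREIGN KEY (venue_key) REFERENCES dim_venue (venue_key)
);
```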
- Capacity planning: Estimating disk space and transaction load.
- Security: Monitor threats, prevent web-app attacks such as SQL injection, and control access via DCL
- Design/modelling: Ensure DB is well designed so that data is normalised and queries execute efficiently
- Backup and recovery: Analyse and respond to types of failures, decide on types of backups to use
- Performance improvement: Get better hardware or better software, decide on index creation
- Consistency: Every read returns the most up-to-date data
- Availability: Every request sent to the DB returns a non-error response
- Partition tolerance: The DB is able to function despite some of the nodes not being available to access due to network failure.
Partitioning generally has to be tolerated as network failures are a) likely and b) out of the DBA’s control. Thus one must choose to keep either consistency or availability.
A RDBMS such as MySQL opts for consistency. A NoSQL DB such as MongoDB opts for Availability and “eventual consistency”.
No, a NoSQL database is not suitable for Tozmaster, because consistency is important: without consistency (enforced by proper application of the ACID principles), one ticket may be sold to 2 different people.
- Atomicity -> If an inconsistency occurs during the transaction, it is rolled back and the buyer is forced to re-submit their application to purchase. “All or nothing”
- Consistency -> The data submitted by the user must be in the correct format i.e. no strings where ints are required (responsibility of both DBA and Application programmer) e.g. ensure buyer credit card numbers don’t contain letters
- Isolation -> Transactions should execute as if they occurred one by one. That is, transaction B should only be able to change the data from transaction A once transaction A is completed. E.g. whoever clicks purchase first is the first transaction to complete, i.e. they get the ticket.
- Durability -> Any committed change to the data is logged so it is not lost in a DB or server failure.
State the BASE properties of NoSQL, in contrast to the ACID properties of relational databases:
- Basically available: every request gets a response, but the data may be stale or inconsistent (e.g. a not-found response for data that has not yet propagated).
- Soft state: the state of the system may change over time even without input, due to eventual consistency.
- Eventual consistency: inserted data will propagate to all nodes sooner or later; once input stops, the system eventually becomes consistent.