database icon

Introduction to Database Paradigms

database icon

Introduction to Database Paradigms

a workshop by Mahmoud Abdelrazek

Let's talk about you


Hi there 👋🏾

Thank you for joining this workshop. I hope you enjoy it.

Let's take a few minutes to talk about you.

What is your name?

This is Me

My name is Mahmoud Abdelrazek

I worked with databases here 👇🏾

and I like to read 📖, run 🏃🏾, code 💻 and sleep 💤

About this workshop

  • What is a database? And when do you need to use it?
  • What are the different types of database systems? And how to choose among them?
  • How to think about your data using database models?

The dataset

In this workshop we will use Enron's emails dataset.

The dataset and more information about are available on CMU.

Chapter one

- Bob: Maybe we should use a database? 🤔

- Alice: I don't know 😶 don't know what a database is.

You've Got Data 🎉

Once you receive a new dataset, the first step is always to inspect it or at least a sample of it visually and then conduct a quick Exploratory Data Analysis.

Some of the questions you might like to ask are:

  • What is the nature of your data? text, images, DNA sequances?
  • What is the data type of each column?
  • How is your data structured?

Inspecting the data 👀

Inspecting the data 👀

Important attributes 📢

					
						
					
				

But, why do you care about this dataset?

What questions are you trying to answer?

  • Maybe you want to find the most active employee in the company?
  • Maybe you want to find the most comman phrases in the company's communication?
  • Maybe you want to understand the connections between employees inside the company?

To Database or Not to Database ❓


  • What is the size of your data ?
    - larger than excel limit

  • How often does your data change ?
    - it gets modified every few minutes

  • Who will interact with your data ?
    - multiple researchers

Data Modeling &
Data Engineering

Before importing your data to the database we need to engineer it to match the model of the databses management system.

  • Data Model: is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. Wiki
  • Data Engineering: is the practice designing and building systems for collecting, storing, and analyzing data at scale. coursera

We have to address the elephant in the room

PostgreSQL is an open source Database Management System

Entity Relationship Model



  • An entity may be defined as a thing capable of an independent existence that can be uniquely identified.

  • A relationship captures how entities are related to one another wiki

Entity Relationship Diagram

Can you think of a different way to represent the data?

The workshop server

Our PostgreSQL is accessible here

					
						
					
				
use this password for connecting to the database
   

PG Admin interface

Find enron tables

Let's check the data

Users table

Let's check the data

Email transactions table

Let's check the data

Emails table

Find the most active day 🤸🏾

					
						
					
				

Find who recieved the highest count of emails 📬

					
						
					
				

Find who was cc'd/to'd in emails the most 📥

					
						
					
				

Find who sends the most emails and receives the least 📭

					
						
					
				

Find who sent emails about LJM the most 📮

					
						
					
				

Can you think of other questions to ask?

Chapter two

- Bob: How can we analyze all this text? 🤔

- Alice: I have an idea 💡 we can search through it? 🔎

Data organization


Normalization 🔗 is generally used with SQL databases to reduces data redundancy and eliminates structural dependency.


Denormalization 📜 is used with NoSQL database to improves read speed

Entity Relationship Model



  • An entity may be defined as a thing capable of an independent existence that can be uniquely identified.

  • A relationship captures how entities are related to one another wiki

Entity Relationship Diagram

Can you think of a different way to represent the data?

Document Data Model📃

					
						
					
				

Document Data Model📃

					
						
					
				

Document Data Model📃

					
						
					
				

Elasticsearch is a fast and scalable search and analytics engine.

The workshop server

Our Elasticsearch is accessible here

	
		
	

No user name nor password are required

Elasticsearch server 💽

Elasticsearch server 💽

Elasticsearch server 💽

Elasticsearch server 💽

Find a multi word phrase

		
			
		
	
		
			
		
	

Find a multi word phrase

		
			
		
	
		
			
		
	

Find a multi word phrase

		
			
		
	
		
			
		
	

Find a multi word phrase

		
			
		
	
		
			
		
	

Chapter three

- Alice: This looks like 🕸️ a network ?

- Bob: Who do you think is the center 🎯 of this network?

neo4j is graph database management system.

Graph Model



  • Nodes Similar to the entities, nodes are has unique conceptual identity. neo4j

  • Relationships each relationship connects two nodes, a source node and a target node neo4j

Graph Model



  • Labels represent types. each node can have one or more label. neo4j

  • Properties are name-value pairs of data that you can store on nodes or on relationships. neo4j

Graph Model





The workshop server

Our neo4j is accessible here

				
					
				
			

Graph Model vs Entity Relationship Model

				
					
				
			

One Match query

Graph Model vs Entity Relationship Model

Graph Model vs Entity Relationship Model

				
					
				
			

Two joins and one select

Centrality 🎯

Centrality is a measure of how important a node is to the graph. It can be measured using many algorithms, but here we will use the PageRank algorithm.

PageRank algorithm measures the importance of each node within the graph, based on the number incoming relationships and the importance of the corresponding source nodes. neo4j

PageRank

PageRank is introduced in the original Google paper as a function that solves the following equation:

where,

we assume that a page A has pages T1 to Tn which point to it.

d is a damping factor which can be set between 0 (inclusive) and 1 (exclusive). It is usually set to 0.85.

C(A) is defined as the number of links going out of page A.

This equation is used to iteratively update a candidate solution and arrive at an approximate solution to the same equation.

neo4j

PageRank

Create the graph

				
					
				
			

PageRank

Calculate PageRank

				
					
				
			

Chapter four

- Alice: Are there any other models to explore? 🤔

- Bob: Maybe 🤷🏾 Let's do some research 🔎

Database Paradigms📚

what do we know so far?

source for this section

Data validity 👌🏾

Atomicity: All or nothing

Consistency: One valid state to another

Isolation: Concurrently = Sequentially

Durability: Data is saved to disk

Key Value Database🔑

					
						
					
				
					
						
					
				
					
						
					
				

Wide Column Database🏜️

					
						
					
				
					
						
					
				

Document Database📄

					
						
					
				

Relational Database🔒


Graph Database🕸️


Search Engine Database🔎

					
						
					
				

Multi-Model Database📳


System Design🎨

Your system can consist of more one database model

It helps to think about the questions you are trying to answer, the data nature, the resources you have and the capabilities that each software offers

Data preparation📝

is the process of organizing the data into a structure and format that is easy to import into the database system

this can include modeling, cleaning, transforming, and enriching the data.

Data modeling

We used Entity Relationship, Document, and Graph models.

Data cleaning

This step depends entirely of the dataset, but it involves dealing with missing values and anomalies in the data


for example:

  • unknown sender or receiver
  • user has multiple emails
    							
    								kenneth.lay@enron.com, klay@enron
    							
    						
  • Data transformation

    This include extracting the attributes and exporting the data in format acceptable by the database system. we used JSON and CSV formats.

    Data enrichment

    This step is optional for our purpose. It involves extending the data with new features from external sources or by inferring relationships between the data.


    for example:

  • adding role from another dataset
  • extracting company from emails
  • Building this workshop🚧

    it is all in one command

    					
    						
    					
    				

    connect with me 🤝🏾

    You can find me as razekmh on: