Introduction to Database Paradigms

a workshop by Mahmoud Abdelrazek

Let's talk about you

Hi there 👋🏾

Thank you for joining this workshop. I hope you enjoy it.

Let's take a few minutes to talk about you.

What is your name?

This is Me

My name is Mahmoud Abdelrazek

I worked with databases here 👇🏾

and I like to read 📖, run 🏃🏾, code 💻 and sleep 💤

About this workshop

What is a database? And when do you need to use it?
What are the different types of database systems? And how to choose among them?
How to think about your data using database models?

The dataset

In this workshop we will use Enron's emails dataset.

The dataset and more information about are available on CMU.

Chapter one

- Bob: Maybe we should use a database? 🤔

- Alice: I don't know 😶 don't know what a database is.

You've Got Data 🎉

Once you receive a new dataset, the first step is always to inspect it or at least a sample of it visually and then conduct a quick Exploratory Data Analysis.

Some of the questions you might like to ask are:

What is the nature of your data? text, images, DNA sequances?
What is the data type of each column?
How is your data structured?

Inspecting the data 👀

Important attributes 📢

But, why do you care about this dataset?

What questions are you trying to answer?

Maybe you want to find the most active employee in the company?
Maybe you want to find the most comman phrases in the company's communication?
Maybe you want to understand the connections between employees inside the company?

To Database or Not to Database ❓

What is the size of your data ?
- larger than excel limit

How often does your data change ?
- it gets modified every few minutes

Who will interact with your data ?
- multiple researchers

Data Modeling &
Data Engineering

Before importing your data to the database we need to engineer it to match the model of the databses management system.

Data Model: is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities. Wiki
Data Engineering: is the practice designing and building systems for collecting, storing, and analyzing data at scale. coursera

We have to address the elephant in the room

PostgreSQL is an open source Database Management System

Entity Relationship Model

An entity may be defined as a thing capable of an independent existence that can be uniquely identified.

A relationship captures how entities are related to one another wiki

Entity Relationship Diagram

Can you think of a different way to represent the data?

The workshop server

Our PostgreSQL is accessible here

use this password for connecting to the database

PG Admin interface

Find enron tables

Let's check the data

Users table

Let's check the data

Email transactions table

Let's check the data

Emails table

Find the most active day 🤸🏾

Find who recieved the highest count of emails 📬

Find who was cc'd/to'd in emails the most 📥

Find who sends the most emails and receives the least 📭

Find who sent emails about LJM the most 📮

Can you think of other questions to ask?

Chapter two

- Bob: How can we analyze all this text? 🤔

- Alice: I have an idea 💡 we can search through it? 🔎

Data organization

Normalization 🔗 is generally used with SQL databases to reduces data redundancy and eliminates structural dependency.

Denormalization 📜 is used with NoSQL database to improves read speed

Entity Relationship Model

An entity may be defined as a thing capable of an independent existence that can be uniquely identified.

A relationship captures how entities are related to one another wiki

Entity Relationship Diagram

Can you think of a different way to represent the data?

Document Data Model📃

Elasticsearch is a fast and scalable search and analytics engine.

The workshop server

Our Elasticsearch is accessible here

No user name nor password are required

Elasticsearch server 💽

Find a multi word phrase

Chapter three

- Alice: This looks like 🕸️ a network ?

- Bob: Who do you think is the center 🎯 of this network?

neo4j is graph database management system.

Graph Model

Nodes Similar to the entities, nodes are has unique conceptual identity. neo4j

Relationships each relationship connects two nodes, a source node and a target node neo4j

Graph Model

Labels represent types. each node can have one or more label. neo4j

Properties are name-value pairs of data that you can store on nodes or on relationships. neo4j

Graph Model

The workshop server

Our neo4j is accessible here

Graph Model vs Entity Relationship Model

One Match query

Graph Model vs Entity Relationship Model

Two joins and one select

Centrality 🎯

Centrality is a measure of how important a node is to the graph. It can be measured using many algorithms, but here we will use the PageRank algorithm.

PageRank algorithm measures the importance of each node within the graph, based on the number incoming relationships and the importance of the corresponding source nodes. neo4j

PageRank

PageRank is introduced in the original Google paper as a function that solves the following equation:

where,

we assume that a page A has pages T₁ to T_n which point to it.

d is a damping factor which can be set between 0 (inclusive) and 1 (exclusive). It is usually set to 0.85.

C(A) is defined as the number of links going out of page A.

This equation is used to iteratively update a candidate solution and arrive at an approximate solution to the same equation.

neo4j

PageRank

Create the graph

PageRank

Calculate PageRank

Chapter four

- Alice: Are there any other models to explore? 🤔

- Bob: Maybe 🤷🏾 Let's do some research 🔎

Database Paradigms📚

what do we know so far?

source for this section

Data validity 👌🏾

Atomicity: All or nothing

Consistency: One valid state to another

Isolation: Concurrently = Sequentially

Durability: Data is saved to disk

Key Value Database🔑

Wide Column Database🏜️

Document Database📄

Relational Database🔒

Graph Database🕸️

Search Engine Database🔎

Multi-Model Database📳

System Design🎨

Your system can consist of more one database model

It helps to think about the questions you are trying to answer, the data nature, the resources you have and the capabilities that each software offers

Data preparation📝

is the process of organizing the data into a structure and format that is easy to import into the database system

this can include modeling, cleaning, transforming, and enriching the data.

Data modeling

We used Entity Relationship, Document, and Graph models.

Data cleaning

This step depends entirely of the dataset, but it involves dealing with missing values and anomalies in the data

for example:

unknown sender or receiver

user has multiple emails

							
								kenneth.lay@enron.com, klay@enron

Data transformation

This include extracting the attributes and exporting the data in format acceptable by the database system. we used JSON and CSV formats.

Data enrichment

This step is optional for our purpose. It involves extending the data with new features from external sources or by inferring relationships between the data.

for example:

adding role from another dataset

extracting company from emails

Building this workshop🚧

it is all in one command

connect with me 🤝🏾

You can find me as razekmh on:

Introduction to Database Paradigms

Introduction to Database Paradigms

Let's talk about you

This is Me

About this workshop

The dataset

Chapter one

You've Got Data 🎉

Inspecting the data 👀

Inspecting the data 👀

Important attributes 📢

But, why do you care about this dataset?

To Database or Not to Database ❓

Data Modeling & Data Engineering

We have to address the elephant in the room

Entity Relationship Model

Entity Relationship Diagram

The workshop server

PG Admin interface

Let's check the data

Let's check the data

Let's check the data

Find the most active day 🤸🏾

Find who recieved the highest count of emails 📬

Find who was cc'd/to'd in emails the most 📥

Find who sends the most emails and receives the least 📭

Find who sent emails about LJM the most 📮

Chapter two

Data organization

Entity Relationship Model

Entity Relationship Diagram

Document Data Model📃

Document Data Model📃

Document Data Model📃

The workshop server

Elasticsearch server 💽

Elasticsearch server 💽

Elasticsearch server 💽

Elasticsearch server 💽

Find a multi word phrase

Find a multi word phrase

Find a multi word phrase

Find a multi word phrase

Chapter three

Graph Model

Graph Model

Graph Model

The workshop server

Graph Model vs Entity Relationship Model

Graph Model vs Entity Relationship Model

Graph Model vs Entity Relationship Model

Centrality 🎯

PageRank

PageRank

PageRank

Chapter four

Database Paradigms📚

Data validity 👌🏾

Key Value Database🔑

Wide Column Database🏜️

Document Database📄

Relational Database🔒

Graph Database🕸️

Search Engine Database🔎

Multi-Model Database📳

System Design🎨

Data preparation📝

Data modeling

Data cleaning

Data transformation

Data enrichment

Building this workshop🚧

connect with me 🤝🏾

Data Modeling &
Data Engineering