Hi there 👋🏾
Thank you for joining this workshop. I hope you enjoy it.
Let's take a few minutes to talk about you.
What is your name?
My name is Mahmoud Abdelrazek
I worked with databases here 👇🏾
and I like to read 📖, run 🏃🏾, code 💻 and sleep 💤
In this workshop we will use Enron's emails dataset.
The dataset and more information about it are available from CMU.
- Bob: Maybe we should use a database? 🤔
- Alice: I don't know 😶 I don't know what a database is.
Once you receive a new dataset, the first step is always to inspect it or at least a sample of it visually and then conduct a quick Exploratory Data Analysis.
Some of the questions you might like to ask are:
What questions are you trying to answer?
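A visual inspection can start with a single message. The sketch below parses one raw email with Python's standard `email` module, assuming the raw files use the usual headers-then-body layout. The sample message, its addresses, and the `summarize` helper are made up for illustration and are not taken from the dataset.

```python
from email import message_from_string

# Hypothetical sample message, written for illustration only.
SAMPLE = """\
Message-ID: <1.example@workshop>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Meeting

Let's meet at noon.
"""

def summarize(raw: str) -> dict:
    """Return a few header fields and the body length of one raw email."""
    msg = message_from_string(raw)
    recipients = [a.strip() for a in (msg["To"] or "").split(",") if a.strip()]
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "subject": msg["Subject"],
        "body_chars": len(msg.get_payload().strip()),
    }

summary = summarize(SAMPLE)
print(summary)
```

Running the same helper over a small sample of files is one way to spot missing headers or unusual recipient lists before deciding on a model.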
Before importing your data into the database, we need to engineer it to match the model of the database management system.
PostgreSQL is an open-source database management system.
Can you think of a different way to represent the data?
Our PostgreSQL is accessible here
Use this password to connect to the database
Find enron tables
Users table
Email transactions table
Emails table
Can you think of other questions to ask?
- Bob: How can we analyze all this text? 🤔
- Alice: I have an idea 💡 we can search through it? 🔎
Normalization 🔗 is generally used with SQL databases to reduce data redundancy and eliminate structural dependency.
Denormalization 📜 is used with NoSQL databases to improve read speed.
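To make the contrast concrete, here is a minimal Python sketch that splits denormalized email documents into three normalized collections, loosely mirroring the users, emails, and email transactions tables. The field names, the sample records, and the `normalize` helper are assumptions for illustration, not the workshop's actual schema.

```python
# Denormalized: each document repeats the full sender and recipient addresses.
documents = [
    {"id": 1, "subject": "Meeting", "sender": "alice@example.com",
     "recipients": ["bob@example.com", "carol@example.com"]},
    {"id": 2, "subject": "Re: Meeting", "sender": "bob@example.com",
     "recipients": ["alice@example.com"]},
]

def normalize(docs):
    """Split documents into users, emails, and transactions (hypothetical schema)."""
    user_ids = {}       # address -> user id, so each address is stored once
    emails = []         # one row per email
    transactions = []   # one row per (email, sender, recipient) triple

    def user_id(address):
        # Assign ids in order of first appearance.
        return user_ids.setdefault(address, len(user_ids) + 1)

    for doc in docs:
        emails.append({"email_id": doc["id"], "subject": doc["subject"]})
        sender = user_id(doc["sender"])
        for address in doc["recipients"]:
            transactions.append({"email_id": doc["id"], "sender_id": sender,
                                 "recipient_id": user_id(address)})
    users = [{"user_id": uid, "address": addr} for addr, uid in user_ids.items()]
    return users, emails, transactions

users, emails, transactions = normalize(documents)
print(users)
print(transactions)
```

The normalized form stores each address once and refers to it by id, while the denormalized documents can be read back without any joins.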
Can you think of a different way to represent the data?
Elasticsearch is a fast and scalable search and analytics engine.
Our Elasticsearch is accessible here
No username or password is required
- Alice: This looks like 🕸️ a network?
- Bob: Who do you think is the center 🎯 of this network?
Neo4j is a graph database management system.
Our Neo4j is accessible here
One Match query
Two joins and one select
Centrality is a measure of how important a node is to the graph. It can be measured using many algorithms, but here we will use the PageRank algorithm.
The PageRank algorithm measures the importance of each node within the graph, based on the number of incoming relationships and the importance of the corresponding source nodes. (Source: Neo4j)
PageRank was introduced in the original Google paper as a function that solves the following equation:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where,
we assume that a page A has pages T1 to Tn which point to it.
d is a damping factor which can be set between 0 (inclusive) and 1 (exclusive). It is usually set to 0.85.
C(A) is defined as the number of links going out of page A.
This equation is used to iteratively update a candidate solution and arrive at an approximate solution to the same equation.
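The iterative update can be sketched in a few lines of Python. This is a minimal illustration of the rule PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over the pages T that point to A, not Neo4j's implementation. The toy graph, the fixed iteration count, and the `pagerank` function name are assumptions.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it points to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}  # initial candidate solution
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each source T contributes PR(T) divided by its out-degree C(T).
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# Hypothetical toy graph: who emailed whom.
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["alice"],
}
ranks = pagerank(graph)
most_central = max(ranks, key=ranks.get)
print(most_central, ranks)
```

A fixed number of iterations keeps the sketch short; a fuller version would stop once the ranks change by less than a chosen tolerance.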
Create the graph
Calculate PageRank
- Alice: Are there any other models to explore? 🤔
- Bob: Maybe 🤷🏾 Let's do some research 🔎
What do we know so far?
Atomicity: All or nothing
Consistency: One valid state to another
Isolation: Concurrently = Sequentially
Durability: Data is saved to disk
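Atomicity is the easiest of the four properties to see in code. The sketch below uses Python's built-in `sqlite3` module with an in-memory database instead of the workshop's PostgreSQL instance; the `users` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, address TEXT UNIQUE)")
conn.execute("INSERT INTO users (address) VALUES ('alice@example.com')")
conn.commit()

try:
    # `with conn` treats the block as one transaction:
    # commit if it succeeds, roll back if it raises.
    with conn:
        conn.execute("INSERT INTO users (address) VALUES ('bob@example.com')")
        # Violates the UNIQUE constraint, so the whole transaction fails.
        conn.execute("INSERT INTO users (address) VALUES ('alice@example.com')")
except sqlite3.IntegrityError:
    pass

addresses = [row[0] for row in conn.execute("SELECT address FROM users ORDER BY id")]
print(addresses)
```

The first insert inside the block is undone along with the failing one, which is the "all or nothing" behaviour.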
Your system can consist of more than one database model
It helps to think about the questions you are trying to answer, the nature of the data, the resources you have, and the capabilities that each piece of software offers
Data engineering is the process of organizing the data into a structure and format that is easy to import into the database system.
This can include modeling, cleaning, transforming, and enriching the data.
We used Entity Relationship, Document, and Graph models.
This step depends entirely on the dataset, but it involves dealing with missing values and anomalies in the data
for example:
kenneth.lay@enron.com, klay@enron
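One possible reading of this example is that the second address is missing its top-level domain. Under that assumption, a first cleaning pass could separate well-formed addresses from suspicious ones. The pattern below is deliberately simple, and the `split_addresses` helper is hypothetical.

```python
import re

# Deliberately simple pattern: a local part, "@", and a domain containing a dot.
# Real-world address validation is more involved; this is only a first filter.
ADDRESS_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def split_addresses(raw: str):
    """Split a comma-separated field into (valid, suspicious) address lists."""
    valid, suspicious = [], []
    for part in raw.split(","):
        address = part.strip().lower()
        if not address:
            continue  # skip empty entries (missing values)
        if ADDRESS_PATTERN.match(address):
            valid.append(address)
        else:
            suspicious.append(address)
    return valid, suspicious

valid, suspicious = split_addresses("kenneth.lay@enron.com, klay@enron")
print(valid, suspicious)
```

Suspicious entries are kept rather than dropped, so they can be reviewed or repaired by hand.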
This includes extracting the attributes and exporting the data in a format accepted by the database system. We used JSON and CSV formats.
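As a rough sketch of the export step, the Python below writes the same records as JSON (one object per line) and as CSV, using only the standard library. The records, field names, and helper functions are illustrative assumptions, not the files used in the workshop.

```python
import csv
import io
import json

# Hypothetical cleaned records; the field names are illustrative.
records = [
    {"email_id": 1, "sender": "alice@example.com", "subject": "Meeting"},
    {"email_id": 2, "sender": "bob@example.com", "subject": "Re: Meeting"},
]

def to_json_lines(rows):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)

def to_csv(rows):
    """Serialize records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_text = to_json_lines(records)
csv_text = to_csv(records)
print(json_text)
print(csv_text)
```

Writing to strings keeps the example self-contained; in practice the same writers would target files on disk.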
This step is optional for our purpose. It involves extending the data with new features from external sources or by inferring relationships between the data.
for example:
it is all in one command
You can find me as razekmh on: