Python Dictionary
A dictionary in Python is a data structure to store data in Key: value format. There are several similarities between python lists and dictionaries, however, they differ in how their elements are accessed. List elements are ordered and are accessed by numerical index, dictionary’s items on the other hand are unordered and are accessed by key.
Each individual data point in a dictionary is called an item. It’s separated by a comma, An example dictionary in python would look like below.
dictionary {
'key': 'value',
'key2' :'value2'
}
Key Properties of Python Dictionary
A python dictionary has the following key properties
Ordered
Dictionaries are ordered. What it means is that order of the items within a dictionary is fixed. And that order will not change. Dictionaries are ordered from python version 3.7 onwards. In python version 3.6 and prior, dictionaries are unordered.
Changeable
Dictionaries are changeable. What it means is that we can add, remove items from the dictionaries after they have been created.
Duplicates are not allowed
Dictionaries do not allow duplicate keys. Any duplicate key will be overwritten by the later key.
Melbourne = {
'Area': '9,993 sq.km',
'Population' :'4,963,349',
2021 :'impacted by covid',
2021 :'Year of coffee'
}
This will print 2021 as ‘Year of coffee”, overwriting the previous value of the key.
Creating Dictionary in Python
Empty dictionary using {}
We can create an empty dictionary in python using the {} operator.
# Creating an empty dictionary
Melbourne = {}
Dictionary with items
We can also create a dictionary by specifying any items list separated by commas.
#Creating dictionary with items
Melbourne = {
'Area': '9,993 sq.km',
'Population' :'4,963,349',
2021 :'impacted by covid'
}
Notice the if the key is text, we have to include it in quotes. If it’s a numeric value, it does not need to be in quotes.
Dictionary with dict()
We can also use dict() function in python to create a dictionary.
Melbourne = dict({
'Area': '9,993 sq.km',
'Population' :'4,963,349',
2021 :'impacted by covid'
})
Accessing Dictionary Items
Python dictionaries provide a few functions to access both keys and values of dictionary items.
Accessing All Key:Value items using .items()
dictionary.items() function gives a list of all key-value pairs in the dictionary.
Melbourne = dict({
'Area': '9,993 sq.km',
'Population' :'4,963,349',
2021 :'impacted by covid'
})
result = Melbourne.items()
print(result)
dict_items([('Area', '9,993 sq.km'), ('Population', '4,963,349'), (2021, 'impacted by covid')])
Accessing Keys Only using .keys()
.keys() function is used to get only the keys of all items. If we replace the .items() function with .keys() function in the code above. It will only return keys this time.
result = Melbourne.keys()
print(result)
dict_keys(['Area', 'Population', 2021])
Accessing Item Values
There are two main functions for accessing item values
Using []
We can access specific dictionary item by providing it’s key either explicitly or in a loop. In the example below, I am providing an explicit key.
Using .get()
Another way of accessing an item value is by using dictionary.get() function
print(Melbourne.get('Area'))
Removing items
Python provides two main function for removing either the key or the key:value pair from the dictionary.
Removing item using .pop()
dictionary.pop() function removes an item from the dictionary and returns its associated value.
#poping key Area
print( Melbourne.pop('Area') )
#resulting dictionary after pop
print(Melbourne)
The dictionary.pop() call removed the key and returned the value of the item, whereas the next line prints the remaining dictionary which is now without Area.
9,993 sq.km
{'Population': '4,963,349', 2021: 'impacted by covid'}
Removing last item using .popitem()
dictionary.popitem() removes the last key-value paid added and returns it as a tuple.
#poping key Area
print(Melbourne.popitem())
#resulting dictionary after pop
print(Melbourne)
The first function removed the item and returned the key: value item in a tuple format. When we print the dictionary after .popitem(), we see that last items is no longer there.
(2021, 'impacted by covid')
{'Area': '9,993 sq.km', 'Population': '4,963,349'}
Loading data from kaggle directly into S3 is a two step process. In first step we configure Kaggle to be able to download. And in second step, we extract data from Kaggle into S3 bucket.
Get data from Kaggle
To get data from kaggle, we setup Kaggle command line tool and then generate an API token to get the data.
Setup Kaggle Command Line
To get data from Kaggle, we will install kaggle-cli.
pip install kaggle
Create API tokens
From top right, click on your account name and then click on Account
Account page has a panel for creating new Api token.
It will download kaggle.json which we can then use as a token. Move this file to Kaggle’s Environment folder. By default its in user’s home directory /.kaggle/
Download data on local
To demonstrate, I am using this dataset from Kaggle.
From the three dots on the right side, select the Copy API command.
[su_box title=”Copy data” style=”glass” radius=”1″]kaggle datasets download -d ajaypalsinghlo/world-happiness-report-2021[/su_box]
This will download the file on your local desktop.
Copy data to S3
Setup AWS Command Line
To copy data from local desktop to AWS s3 bucket, AWS provides CLI tools. To use AWS CLI tools, we first need to generate aws access credentials. This can be done from AWS web console.
Generate S3 AWS secret
In the AWS S3 interface. Select IAM -> Users from services.
Click on Add User.
Check Programmatic access.
It will generate an access key id and secret access key.
Configure Keys in local system
To check AWS key configuration, type :
aws configure list
Use aws configure command to enter the key and secret from the previous step.
Succesful configuration will result in a configuration list that looks like the following :
Create S3 Bucket
In the AWS console, select service S3 and click on Create Bucket.
URI to access the bucket is not displayed in the console. However, it’s s3://bucket name. In our case, it will be s3://world-happiness-data
Copy data from local to S3
Now that bucket is created and our CLI is configured, we can run the copy command.
aws s3 cp world-happiness-report-2021.zip s3://world-happiness-data
verifying upload
We can verify the upload by going to Amazon AWS console > S3 > Bucket name
What is Correlation Coefficient
Correlation means a mutual relationship or connection between two or more things (variables). The correlation coefficient is a numeric measure to quantify this relationship. The coefficient describes two aspects of a relationship.
- Strength of the relationship
- Direction of the relationship.
Correlation is used for predicting the relationship between two variables. It is also used for concurrent validity ( correlation between a new measure and an established measure). One more use of correlation is in reliability testing, e.g. Test-retest reliability (measure consistent).
Strength of relationship
For the strength of the relationship, the value ranges from 0 to 1. A perfectly strong relationship between variables will be 1, whereas no relationship between two variables will be 0. Values in the vicinity of 0.5 ( middle of 0 & 1 will represent a medium correlation).
Perfect Correlation
The perfect correlation will have a value of 1. The Scatter plot below shows a perfect correlation, as one variable increases, the other variable increases as well.
Strong & Weak Correlation
A correlation will mean stronger relationship between two variables under study. There is no rule for determining what’s considered Strong, medium or week. The interpretation of coefficient depends on subject of study.
In one area of study 0.4 and above could be considered as relatively strong, whereas in another subject 0.4 can be considered relatively week, and only vales of 0.75 and up are considered as strong correlation.
The scatter below has a correlation value of 0.944. It’s direction is positive. It represent a very strong correlation indicating that as X increase, Y will strongly increase with it.
The scatterplot below has a correlation value of 0.0478. Its direction is positive, however, it’s a weak correlation.
Direction of relationship
The direction of the relationship indicates an inverse relationship between the variables. An increase in one variable is associated with a decrease in the other variable. An example of negative correlation will be Boyle’s law as the volume of a container increases, the pressure of gas decreases.
The Scatter plot below shows a perfect correlation of (-1), as one variable increases, the other variable decreases with it.
Methods of calculating Correlation Coefficient
There are multiple methods for calculating correlation coefficient e.g.
- Pearson’s method
- Kendal’s methods
- Pearson’s methods.
For this article, I will focus on Pearson’s correlation coefficient.
Pearson’s Correlation Coefficient
Let’s begin by looking at the formulae for Pearson’s Correlation coefficient
r = \frac{{}\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2(y_i - \overline{y})^2}}
Where,
x_i is the individual value of the x
\overline{y}] is the mean of all x values
Numerator in Pearson’s equation
r = \frac{{}\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})} {}
All we are doing here is finding the error by subtracting the predicted x and y values from their respective mean. For examples, (x_i - \overline{x}) finds the difference of each x value from it’s mean. We repeat it for each value of y_i and sum the resultant number. In a more statistical language, we are essentially normalizing the chart around the mean.
Denominator in Pearson’s equation
r = \frac{}
{\sqrt{\sum_{i=1}^{n} (x_i - \overline{x})^2(y_i - \overline{y})^2}}
(x_i - \overline{x})^2 is the sum of errors squared. It calculated the distance between individual predicted x values and mean of all x values. Similarly
Any relational database for example ( MYSQL, POSTGRESQL) can only support structured data in the forms of rows and columns. Snowflake, however, can support multiple types of data loads.
Snowflake Supported Formats
Supported File |
Encoding |
Delimited files (CSV, TSV, etc.) |
UTF-8 |
JSON |
UTF-8 |
Avro |
UTF-8 |
ORC |
UTF-8 |
Parquet |
UTF-8 |
XML |
|
Concept of stages
Internal Stages
Snowflake has the following stage types. Data can be uploaded to any of the internal stage type from local file system using PUT command.
Snowflake Internal Stages
Stage |
Description |
User |
A user stage is allocated to each user for storing files. This stage type is designed to store files that are staged and managed by a single user but can be loaded into multiple tables. User stages cannot be altered or dropped. |
Table |
A table stage is available for each table created in Snowflake. This stage type is designed to store files that are staged and managed by one or more users but only loaded into a single table. Table stages cannot be altered or dropped. |
Named |
A named internal stage is a database object created in a schema. This stage type can store files that are staged and managed by one or more users and loaded into one or more tables. Because named stages are database objects, the ability to create, modify, use, or drop them can be controlled using security access control privileges. Create stages using the CREATE STAGE command. |
External Stages
Snowflake supports direct data load from cloud platforms. The table below summarizes the supported platforms.
Snowflake Platforms for Data Load
Supported Platform |
Loading Stages |
Local environment |
Files are first staged in a Snowflake stage, then loaded into a table. |
Amazon S3 |
Files can be loaded directly from any user-supplied S3 bucket. |
Google Cloud Storage |
Files can be loaded directly from any user-supplied Cloud Storage container. |
Microsoft Azure |
Files can be loaded directly from any user-supplied Azure container. |
Loading data from Amazon AWS
Creating an S3 Stage
An S3(external) stage is a reference to a location in the snowflake platform where data files are stored before they are loaded into a table.
Loading all files within the folder:
If the path ends with /, all of the objects in the S3 folders are loaded in the stage
Loading data from a file
If specific file path is given, it loads data from file into the stage. The example below loads CSV data into an S3 stage.
To create a stage, Select the database > Click on Stages Tab -> Click on Create
This loads a create stage interface where we can specify file information to load data from
CREATE STAGE “ANALYTICS_DEMO_DATA_2″.”PUBLIC”.My_First_External_Stage URL = ‘s3://world-happiness-data/world-happiness-report-2021.csv’ CREDENTIALS = (AWS_KEY_ID = ‘AKIAZFJYSWSXK7Y6BAEG’ AWS_SECRET_KEY = ‘****************************************’);
Viewing Stages:
We can view available stages in our database using the following command.
Show stages;
This shows the stage we created in previous step
Copying data from Stage
Now that the data is loaded in the Snowflake stage. We can load it on our table.
copy into “ANALYTICS_DEMO_DATA_2″.”PUBLIC”.”S3DATA” from @MY_First_External_Stage;
A select statement to confirm loading shows that data is successfully loaded from the stage into our table.
Bulk vs continuous loading
External tables (Data Lake)
What is snowflake?
Snowflake is a data warehouse platform that pitches the native cloud ( or cloud-first ) approach as its primary value proposition. let’s learn about snowflake by going hands-on in few basic operations in snowflake.
Creating Database in snowflake
Creating a database in snowflake is fairly easy and standard. There are two approaches of doing it.
Graphical Interface
Clicking on the Databases tabs from the main menu loads the database panel. Click on create to create a database.
SQL Command line
The other way of creating a database in snowflake is via the command line. For that, click on the Worksheets tab. Worksheet in snowflake is an interface for viewing all data objects. Click on + sign to create a new worksheet if you prefer, or continue to work in the default worksheet.
1. The left side of worksheet lists all the database objects.
2. SQL space is where we can write SQL queries ( in this case to create a database)
3. On top right, it gives basic information about role, instance type and the current database selected.
Coming back to query for creating database.
- Write query in the worksheet.
- Results show up at the bottom pane
- Refresh the left pane and we have our database ready.
[su_box title=”Creating Database with SQL” style=”glass” radius=”1″]CREATE DATABASE “ANALYTICS_DEMO_DATA_2”;[/su_box]
Creating Schema in snowflake
A schema is a logical grouping of database objects ( tables, views, etc. ). In contrast, a database is a logical grouping of schemas. Each schema belongs to a single database.
To create schema in snowflake, go to the Databases tab. Click on the database name for which you want to create the schema. Then click on Schemas sub-menu.
- Select database for which you want to create the schema
- Click on Schemas sub-menu
- Click on Create button
- Enter schema name
SQL alternative for schema creation will be.
[su_box title=”Creating Schema with SQL” style=”glass” radius=”1″]CREATE SCHEMA “ANALYTICS_DEMO_DATA”.”DEMO_SCHEMA”;[/su_box]
Creating a table in snowflake database
To create a table in snowflake via Graphical interface.
- Select the database you want to create a table in, from the Databases menu.
- Tables sub-menu is selected by default. Click on + Create.
- Enter the table name.
- Add a column ( you need atleast 1 column to create the table ).
The SQL way of creating a table will be.
[su_box title=”Creating Table with SQL” style=”glass” radius=”1″]CREATE TABLE “ANALYTICS_DEMO_DATA_2″.”PUBLIC”.”MY_FIRST_TABLE” (“C1” STRING) COMMENT = ‘Hello World Table’;[/su_box]
Creating views in snowflake database
To create a view in snowflake via Graphical interface.
- Select the database you want to create a table in, from the Databases menu.
- Select the views sub-menu. Click on + Create icon.
- Enter the view name.
- Enter the view definition.
SQL way of creating a way in snowflake would be :
[su_box title=”Creating views with SQL” style=”glass” radius=”1″]CREATE VIEW “ANALYTICS_DEMO_DATA_2″.”PUBLIC”.My_First_view AS Select C1 from My_First_Table;[/su_box]
Previewing data in snowflake
We can preview the data in snowflake from the worksheet menu.
- From the left side database object menu, select the table to preview
- Click on the preview data button
- Data preview will show in bottom pane
I have selected a sample Citibank database and trips table in it for preview.
This was a high-level overview of getting started with snowflake. We’ll look into how to load data in snowflake and the concept of stages.