By Janvi, Data Science Enthusiast at Analytics Vidhya
Hosting Your Machine Learning Notebook in Databricks: A Step-by-Step Guide
Databricks has emerged as one of the leading platforms for building and executing machine learning (ML) notebooks at scale. It combines the power of Apache Spark with a user-friendly notebook interface, integrated data tooling, and efficient experiment tracking capabilities. Whether you’re a data scientist, student, or just starting your journey into machine learning, this guide will take you through the steps to host your ML notebook in Databricks using the Free Edition.
Understanding Databricks Plans
Before diving in, it’s important to understand the various Databricks plans:
- Free Edition:
  - Best for individuals and small projects.
  - Features include:
    - A single-user workspace
    - Access to a small compute cluster
    - Support for languages like Python, SQL, and Scala
    - MLflow integration for experiment tracking
  - Drawbacks: limited resources, timeouts after idle periods, and some enterprise features are disabled.
- Standard Plan:
  - Suitable for small teams.
  - Offers larger compute clusters and collaboration features.
- Premium Plan:
  - Introduces advanced security features and user management.
- Enterprise/Professional Plan:
  - Designed for production environments requiring advanced governance and automation.
This tutorial will focus on the Free Edition, perfect for testing and learning without a financial commitment.
Hands-On: Hosting Your ML Notebook in Databricks
Step 1: Sign Up for Databricks Free Edition
- Visit the Databricks Free Edition signup page.
- Sign up using your email, Google, or Microsoft account.
- Once signed in, a workspace is automatically created, serving as your command center for controlling notebooks and clusters.
Step 2: Create a Compute Cluster
To execute code, you’ll need to create a compute cluster:
- Navigate to Compute in the sidebar.
- Click on Create Cluster.
- Name your cluster and select a runtime (preferably the Databricks Runtime for Machine Learning).
- Click Create and wait for it to show a status of Running.
Note: Clusters may shut down after a period of inactivity in the Free Edition, but you can restart them as needed.
Step 3: Import or Create a Notebook
You can use an existing ML notebook or create a new one:
- To import a notebook:
  - Navigate to Workspace.
  - Use the dropdown beside your folder → Import → File, then upload your .ipynb or .py file.
- To create a new one:
  - Click Create → Notebook.
  - Bind the notebook to your running cluster using the dropdown at the top.
Step 4: Install Dependencies
If your notebook requires libraries like scikit-learn, pandas, or matplotlib, you can install them directly within the notebook:
%pip install scikit-learn pandas xgboost matplotlib
Tip: Databricks may restart the Python process after installing libraries, so you might need to re-run earlier cells or restart the notebook state before the updated packages take effect.
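After the environment restarts, it can be worth confirming in a fresh cell that the packages are actually visible to the current Python process. A minimal sketch using only the standard library's importlib.metadata:

```python
# Check which of the packages from the %pip install step are visible to
# the running Python process, using only the standard library.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["scikit-learn", "pandas", "xgboost", "matplotlib"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

If a package you just installed shows up as "not installed", the environment restart mentioned above has usually not happened yet.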
Step 5: Run the Notebook
You’re now ready to execute your code:
- Press Shift + Enter to run a single cell, or use Run All to execute the entire notebook.
- Outputs will appear similarly to those in Jupyter notebooks.
Step 6: Coding in Databricks
With your environment set up, let’s look at a brief example using regression modeling to predict customer satisfaction (NPS score):
- Load and Inspect Data:
# Load the dataset from the workspace files area and preview the first rows
from pathlib import Path
import pandas as pd
DATA_PATH = Path("/Workspace/Users/[email protected]/nps_data_with_missing.csv")
df = pd.read_csv(DATA_PATH)
df.head()
- Train/Test Split:
from sklearn.model_selection import train_test_split
TARGET = "NPS_Rating"
# Hold out 20% of rows for evaluation; fix the seed for reproducibility
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
- Quick Exploratory Data Analysis (EDA):
import matplotlib.pyplot as plt
import seaborn as sns
# Check how the target is distributed before modeling
sns.histplot(train_df[TARGET], bins=10, kde=True)
plt.title("Distribution of NPS Ratings")
plt.show()
- Data Preparation with Pipelines:
Set up pipelines for data preprocessing and model training, evaluate model performance, visualize the predictions, and even analyze feature importance.
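The pipeline step above can be sketched end to end with scikit-learn. The column names and the synthetic data below are illustrative stand-ins (the real nps_data_with_missing.csv is not reproduced here); the structure — imputation and scaling inside a Pipeline, wrapped around a regressor — is the pattern being described:

```python
# A minimal sketch of the preprocessing-plus-model pipeline described above.
# Column names and the synthetic frame are illustrative stand-ins for the
# real NPS dataset; the missing values are injected deliberately.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Support_Calls": rng.integers(0, 10, n).astype(float),
    "Tenure_Months": rng.integers(1, 60, n).astype(float),
    "NPS_Rating": rng.integers(0, 11, n).astype(float),
})
# Inject missing values so the imputer has work to do
df.loc[df.sample(frac=0.1, random_state=42).index, "Support_Calls"] = np.nan

TARGET = "NPS_Rating"
features = [c for c in df.columns if c != TARGET]
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Impute missing values and scale features before fitting a simple regressor
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
model = Pipeline([
    ("prep", ColumnTransformer([("num", numeric, features)])),
    ("reg", LinearRegression()),
])

model.fit(train_df[features], train_df[TARGET])
preds = model.predict(test_df[features])
print(f"MAE: {mean_absolute_error(test_df[TARGET], preds):.2f}")
```

Keeping the imputer and scaler inside the Pipeline means they are fit only on the training split, which avoids leaking test-set statistics into preprocessing.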
Step 7: Save and Share Your Work
Databricks automatically saves your notebooks. To export and share:
- Open the notebook's File menu (or the three-dot menu) and choose the export/download option to save the notebook as .ipynb, .dbc, or .html.
- You can also link to a GitHub repository for version control.
Things to Know About Free Edition
While the Free Edition is great for experimentation, keep these limitations in mind:
- Clusters shut down after idle time (approximately 2 hours).
- Storage capacity is limited.
- Some enterprise capabilities are not included.
- Not ideal for production workloads.
Nevertheless, it’s an excellent environment for learning ML and testing models.
Conclusion
Databricks simplifies the cloud execution of ML notebooks. With no local installation required, the Free Edition is a perfect entry point to develop and test your machine learning models. As your projects grow or require more collaboration, you can easily upgrade to a paid plan.
Ready to get started? Sign up for the Databricks Free Edition today and unleash the potential of your machine learning notebooks in a seamless environment.
Frequently Asked Questions
Q1: How do I start using Databricks for free?
A: Sign up for the Databricks Free Edition at databricks.com/learn/free-edition to access a single-user workspace, small compute cluster, and MLflow support.
Q2: Do I need to install anything locally to run Databricks?
A: No, the Free Edition is completely browser-based; you can create clusters and run ML code online.
Q3: How do I install Python libraries in my notebook?
A: Use %pip install library_name inside a notebook cell, or install from a requirements.txt file using %pip install -r requirements.txt.
Hi, I’m Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data started with a curiosity about how to extract meaningful insights from complex datasets.