Python Programming Language

What is Python?

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a “batteries included” language due to its comprehensive standard library.

Package Managers

FASRC clusters use Mamba. Mamba is available on the FASRC clusters as a software module, either as python/3*, Miniforge3, or Mambaforge, and is aliased to mamba. For more information on Mamba and its usage on FASRC clusters, see the Python Package Installation entry.

Best Practices

Managing Python in an HPC environment requires careful handling of dependencies, performance optimization, and resource management. Below are some best practices to keep your Python code efficient and scalable on HPC clusters.   As with any coding language in HPC, familiarize yourself with our Job Efficiency and Optimization Best Practices page.  Another great resource is Python for HPC: Community Materials.

Use Mamba with Miniforge for Environment Management

We cannot emphasize this enough!  To maintain a clean and efficient workspace, use Miniforge with Mamba for creating and managing Python environments. This ensures that your dependencies are isolated and reduces the risk of conflicts. For example, to set up an environment for data analysis with pandas and related libraries, you can use:

mamba create -n data_env python=3.9 pandas numpy matplotlib scikit-learn
mamba activate data_env

This approach ensures your Python environment is isolated, optimizing your workflows on HPC clusters.

Code Quality: “Code is read much more often than it is written”

Focus on clean, quality code. You may need support running your program or troubleshooting an issue, and readable code helps others grok it quickly. Take a look at the PEP 8 Style Guide for Python. Consider using a linter for VS Code, such as Flake8 or Ruff. Another option is Mypy for VS Code, which runs mypy on Python code cells in Jupyter notebooks. For more resources, check out Python for HPC: Community Materials.

Testing

Add tests to your code so you can catch regressions early and verify results before scaling up on the cluster.
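A minimal sketch, assuming you use pytest (any test runner works) and a hypothetical mean() function in a module called stats.py:

# stats.py -- a hypothetical module in your project
def mean(values):
    """Return the arithmetic mean of a non-empty sequence."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)

# test_stats.py -- run with `pytest` in the same directory
import pytest
from stats import mean

def test_mean_basic():
    assert mean([1, 2, 3]) == 2

def test_mean_empty_raises():
    with pytest.raises(ValueError):
        mean([])

Running the suite before submitting large jobs catches errors while they are still cheap to fix.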

Chunking: Pandas? Dask? Both?

Use Chunking! If you’re dealing with moderately large datasets, Dask can enhance Pandas by parallelizing operations. For very large datasets that exceed memory constraints, using Dask alone as a substitute for Pandas is a more effective solution.
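Before reaching for Dask, pandas itself can process a large file in fixed-size chunks; a minimal sketch (the file and column names are placeholders):

import pandas as pd

# Stream the CSV in chunks of 100,000 rows instead of loading it all at once
total = 0
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    total += chunk['value'].sum()   # 'value' is a hypothetical column
print(total)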

Using Dask with Pandas:

Dask can work seamlessly with Pandas to parallelize operations on large datasets that don’t fit into memory. You can convert a Pandas DataFrame to a Dask DataFrame using dd.from_pandas() to distribute computations across multiple cores or even nodes. This approach allows you to scale up Pandas workflows without changing much of your existing code.
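A minimal sketch of this conversion (the DataFrame contents and partition count are illustrative only):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'group': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# Split the pandas DataFrame into partitions Dask can process in parallel
ddf = dd.from_pandas(pdf, npartitions=2)

# Operations are lazy; .compute() triggers execution across the partitions
result = ddf.groupby('group')['value'].mean().compute()
print(result)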

Using Dask Instead of Pandas:

Dask can be used as a drop-in replacement for Pandas when working with larger-than-memory datasets. It provides a familiar DataFrame API that mimics Pandas but works lazily, meaning computations are broken down into smaller tasks that are executed when you call .compute(). This makes it possible to handle datasets that would be too large for Pandas alone.
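A minimal sketch, assuming a set of CSV files whose names and columns are placeholders:

import dask.dataframe as dd

# Read many CSVs as one lazy DataFrame; nothing is loaded into memory yet
ddf = dd.read_csv('data/part-*.csv')

# Build the computation with the familiar pandas-style API
daily_mean = ddf.groupby('date')['value'].mean()

# Only .compute() executes the task graph and returns a pandas object
print(daily_mean.compute())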

Manage Cluster Resources: Dask DataFrames for Parallel and Distributed Processing

For handling larger datasets or utilizing multiple cores/nodes, Dask DataFrames scale pandas operations to larger-than-memory datasets and parallelize them across your cluster; see the Dask DataFrame tutorial for a walkthrough.

Dask’s compatibility with pandas makes it a powerful tool in HPC, allowing familiar pandas operations while scaling up to handle larger data loads efficiently.
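As a minimal sketch, here is one way to start a Dask cluster on the cores allocated to a single job and point computations at it; the worker count, memory limit, and file name are placeholders and should match what your job actually requests (for multi-node scaling, the dask-jobqueue package can launch workers through the scheduler):

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Start workers on this job's cores (placeholders -- match your job request)
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='4GB')
client = Client(cluster)

ddf = dd.read_parquet('large_dataset.parquet')
print(ddf['value'].mean().compute())   # executed across the local workers

client.close()
cluster.close()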

More on Compressed Chunks

Switching from traditional formats like CSV to more efficient ones like Parquet [Project site][Python Package] can greatly enhance I/O performance, reducing load times and storage requirements. Here’s how you can work with Parquet using pandas:

import pandas as pd
# Reading from a Parquet file
df = pd.read_parquet('large_dataset.parquet')
# Writing to Parquet
df.to_parquet('output_data.parquet', compression='snappy')

Parquet’s columnar storage format is much faster and more efficient for reading and writing operations, crucial for large-scale data handling in HPC environments.

Become One With Your Data: Ydata-Profiling

“In a dark place we find ourselves, and a little more knowledge lights our way.” – Yoda

Ydata Profiling, formerly known as pandas profiling, is a powerful tool for automating exploratory data analysis (EDA) in Python, generating comprehensive reports that provide a quick overview of data types, distributions, correlations, and missing values. In an HPC setting, it accelerates data exploration by handling large datasets efficiently, saving time and computational resources for researchers and data scientists.

Want to talk about easy to use?

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.DataFrame(...)
profile = ProfileReport(df, title="Profiling Report")

Done!

The tool enhances data quality and understanding by highlighting issues like missing values and incorrect data types, allowing users to make informed decisions for preprocessing and feature selection. It integrates seamlessly with HPC workflows, supporting Python-based Pandas data manipulation and analysis, and generates HTML reports that are easy to share and document, facilitating collaboration within your lab.
Ydata-profiling’s ability to provide deep insights and optimize data preprocessing helps avoid redundant computations, making it ideal for complex simulations and high-dimensional data analysis in HPC environments. Check out the strong community support over at Stack Overflow.
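To produce the shareable HTML report mentioned above from the profile object created earlier (the file name is just an example):

profile.to_file("profiling_report.html")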

For a short tutorial, see: Learn how to use the ydata-profiling library.

Examples

Serial Python

Examples: https://github.com/fasrc/User_Codes/tree/master/Languages/Python

Parallel Computing

Take a look at the Python documentation for the multiprocessing library. We have a few examples for parallel computing using it: https://github.com/fasrc/User_Codes/tree/master/Parallel_Computing/Python
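A minimal sketch using multiprocessing.Pool; the worker function and process count are placeholders, and the process count should match the cores requested in your job:

from multiprocessing import Pool

def square(x):
    # Stand-in for real per-task work
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:   # match the cores in your job script
        results = pool.map(square, range(10))
    print(results)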
