Python Programming Language
What is Python?
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a “batteries included” language due to its comprehensive standard library.
Package Managers
FASRC clusters use Mamba. Mamba is available on the FASRC cluster as a software module, either as python/3*, Miniforge3, or Mambaforge, and is aliased to mamba. For more information on Mamba and its usage on FASRC clusters, see the Python Package Installation entry.
Best Practices
Managing Python in an HPC environment requires careful handling of dependencies, performance optimization, and resource management. Below are some best practices to keep your Python code efficient and scalable on HPC clusters. As with any coding language in HPC, familiarize yourself with our Job Efficiency and Optimization Best Practices page. Another great resource is Python for HPC: Community Materials.
Use Mamba with Miniforge for Environment Management
We cannot emphasize this enough! To maintain a clean and efficient workspace, use Miniforge with Mamba for creating and managing Python environments. This ensures that your dependencies are isolated and reduces the risk of conflicts. For example, to set up an environment for data analysis with pandas and related libraries, you can use:
mamba create -n data_env python=3.9 pandas numpy matplotlib scikit-learn
mamba activate data_env
Installing packages only inside a named environment like this keeps them separate from your other projects, so one workflow's dependencies cannot break another's.
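Once the environment is active, a quick check from Python can confirm that the interpreter and packages come from it. The sketch below is only an illustration, not FASRC tooling: the helper name `describe_environment` is hypothetical, and it assumes that `CONDA_PREFIX` is set on activation, as conda and mamba normally do.

```python
import os
import sys
from importlib import metadata


def describe_environment(packages=("pandas", "numpy")):
    """Return the interpreter path, env prefix, and selected package versions."""
    info = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        # CONDA_PREFIX is normally set by `mamba activate`; None if no env is active.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed for this interpreter
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `executable` does not point inside the environment's directory, the environment is probably not active in the current shell or job script.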
Code Quality: “Code is read much more often than it is written”
Focus on clean, quality code. You may need support running your program or troubleshooting an issue, and readable code helps others grok it quickly. Take a look at the PEP 8 Style Guide for Python. Consider using a linter for VS Code, such as Flake8 or Ruff. Another option is Mypy for VS Code, which runs mypy on Python code cells in Jupyter notebooks. For more resources, check out Python for HPC: Community Materials.
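As a small illustration of what linters and type checkers look for, the function below follows PEP 8 naming, carries a docstring, and uses type hints that mypy can check. The function `mean_runtime` is a made-up example, not code from FASRC.

```python
from typing import Sequence


def mean_runtime(runtimes_sec: Sequence[float]) -> float:
    """Return the average job runtime in seconds.

    Raises ValueError on an empty sequence instead of dividing by zero.
    """
    if not runtimes_sec:
        raise ValueError("runtimes_sec must contain at least one value")
    return sum(runtimes_sec) / len(runtimes_sec)


print(mean_runtime([10.0, 20.0, 30.0]))  # prints 20.0
```

Passing a string where the sequence of floats is expected would be flagged by mypy before the job ever reaches the queue.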
Testing
TBD. Add tests to your code.
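A minimal sketch of what a test can look like, using only the standard library's `unittest` module. The function under test (`normalize`) is a hypothetical stand-in for your own code.

```python
import unittest


def normalize(values):
    """Scale a list of numbers so that they sum to 1.0."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [v / total for v in values]


class NormalizeTests(unittest.TestCase):
    def test_result_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3, 4])), 1.0)

    def test_preserves_ratios(self):
        self.assertEqual(normalize([1, 1]), [0.5, 0.5])

    def test_zero_total_is_rejected(self):
        with self.assertRaises(ValueError):
            normalize([0, 0])
```

Saved in a file, these tests can be run with `python -m unittest`. Running them on a small input before submitting a large job is a cheap way to catch errors early.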
Chunking: Pandas? Dask? Both?
Use Chunking! If you’re dealing with moderately large datasets, Dask can enhance Pandas by parallelizing operations. For very large datasets that exceed memory constraints, using Dask alone as a substitute for Pandas is a more effective solution.
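Chunking with pandas alone might look like the sketch below: `read_csv` with `chunksize` yields one DataFrame per chunk, so only one chunk is held in memory at a time. The function name, file path, column name, and chunk size are placeholders chosen for illustration.

```python
import pandas as pd


def column_mean_in_chunks(csv_path, column, chunksize=100_000):
    """Compute the mean of one column without loading the whole file."""
    total = 0.0
    count = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    if count == 0:
        raise ValueError(f"no non-missing values found in column {column!r}")
    return total / count
```

This pattern works for aggregations that can be accumulated chunk by chunk (sums, counts, minima). Operations that need the whole dataset at once, such as a global sort, are where Dask becomes the better fit.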
Using Dask with Pandas:
Dask can work alongside Pandas to parallelize operations on large datasets. You can convert an existing Pandas DataFrame to a Dask DataFrame using dd.from_pandas() to distribute computations across multiple cores or even nodes. Because the Pandas DataFrame must already be loaded, this route suits data that fits in memory but is slow to process on a single core. This approach allows you to scale up Pandas workflows without changing much of your existing code.
Using Dask Instead of Pandas:
Dask can be used as a near drop-in replacement for Pandas when working with larger-than-memory datasets. It provides a familiar DataFrame API that mimics much of Pandas but works lazily: computations are broken down into smaller tasks that are executed only when you call .compute(). This makes it possible to handle datasets that would be too large for Pandas alone.
Manage Cluster Resources Dask DataFrame for Parallel and Distributed Processing
Here is a tutorial on handling larger datasets and using multiple cores or nodes. Dask DataFrames scale pandas operations to larger-than-memory datasets and parallelize them across your cluster.
Dask’s compatibility with pandas makes it a powerful tool in HPC, allowing familiar pandas operations while scaling up to handle larger data loads efficiently.
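To keep Dask within the resources a job was actually granted, the worker count can be derived from the scheduler's environment. The sketch below assumes the job runs under Slurm, which sets `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is requested. Both helper names are hypothetical, and the single-node `LocalCluster` is one possible setup, not a FASRC recommendation.

```python
import os


def workers_from_slurm(default=1):
    """Return the CPU count Slurm granted to this task, or a default."""
    # SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task (-c) is requested.
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return default  # variable missing or not an integer


def start_local_client():
    """Start a single-node Dask cluster sized to the job's allocation."""
    # Imported here so the helper above works even without dask.distributed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=workers_from_slurm(), threads_per_worker=1)
    return Client(cluster)
```

Sizing the cluster from the allocation avoids oversubscribing cores, which usually slows a job down rather than speeding it up.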
More on Compressed Chunks
Switching from traditional formats like CSV to more efficient ones like Parquet (project site, Python package) can greatly enhance I/O performance, reducing load times and storage requirements. Here’s how you can work with Parquet using pandas:
import pandas as pd
# Reading from a Parquet file
df = pd.read_parquet('large_dataset.parquet')
# Writing to Parquet
df.to_parquet('output_data.parquet', compression='snappy')
Parquet’s columnar storage format lets you read only the columns you need and compresses well, which typically makes reads and writes faster than with CSV. This is crucial for large-scale data handling in HPC environments.
Become One With Your Data: Ydata-Profiling
“In a dark place we find ourselves, and a little more knowledge lights our way.” – Yoda
Ydata Profiling, formerly known as pandas profiling, is a powerful tool for automating exploratory data analysis (EDA) in Python, generating comprehensive reports that provide a quick overview of data types, distributions, correlations, and missing values. In an HPC setting, it speeds up the exploration of large datasets, saving time and computational resources for researchers and data scientists.
Want to talk about easy to use?
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.DataFrame(...)
profile = ProfileReport(df, title="Profiling Report")
Done!
The tool enhances data quality and understanding by highlighting issues like missing values and incorrect data types, allowing users to make informed decisions for preprocessing and feature selection. It integrates seamlessly with HPC workflows, supporting Python-based Pandas data manipulation and analysis, and generates HTML reports that are easy to share and document, facilitating collaboration within your lab.
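Saving the HTML report mentioned above could be sketched as follows. The wrapper `write_profile` and the output file name are hypothetical. `minimal=True` is a ydata-profiling option that skips the most expensive computations, which may be a sensible default for large tables.

```python
def write_profile(df, out_path="profile_report.html", title="Profiling Report"):
    """Generate a ydata-profiling report for df and save it as HTML."""
    # Imported lazily because ydata-profiling is a heavy optional dependency.
    from ydata_profiling import ProfileReport

    # minimal=True turns off the most expensive computations (e.g. correlations).
    profile = ProfileReport(df, title=title, minimal=True)
    profile.to_file(out_path)
    return out_path
```

The resulting HTML file is self-contained, so it can be copied off the cluster and opened in any browser.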
Ydata-profiling’s ability to provide deep insights and optimize data preprocessing helps avoid redundant computations, making it ideal for complex simulations and high-dimensional data analysis in HPC environments. Check out the strong community support over at Stack Overflow.
For a short tutorial, see: Learn how to use the ydata-profiling library.
Examples
Serial Python
examples: https://github.com/fasrc/User_Codes/tree/master/Languages/Python
Parallel Computing
Take a look at the Python documentation for the multiprocessing library. We have a few examples of parallel computing with it: https://github.com/fasrc/User_Codes/tree/master/Parallel_Computing/Python
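For orientation, the basic `multiprocessing.Pool` pattern is sketched below. It is a generic illustration rather than one of the examples from the repository above, and the worker count of 2 is arbitrary.

```python
from multiprocessing import Pool


def square(x):
    """Stand-in for CPU-bound work; defined at module level so it can be pickled."""
    return x * x


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    with Pool(processes=2) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```

On a cluster, the number of processes should not exceed the number of cores requested for the job.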
References
- Official Python Documentation
- The Python Tutorial
- Software Carpentry Python Lesson
- NumPy User Guide
- SciPy Official Documentation
- Matplotlib
- Seaborn
- Pandas
- Python Multiprocessing Library
- Python Pandas Dataframe Tutorial in 10 minutes