Chemeleon: Generative AI for Crystal Structure Exploration
In this notebook, we’ll explore how to use Chemeleon for generating crystal structures. We’ll cover:
Crystal Structure Prediction (CSP) - generating structures for specific formulas
De Novo Generation (DNG) - generating novel structures without constraints
Analyzing and visualizing the generated structures
Comparing with traditional structure prediction methods
Setup and Installation
# First, let's check if we're in Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    print("Running in Google Colab")
    # Clone the CHEMELEON repository
    !git clone https://github.com/hspark1212/chemeleon-dng.git
    %cd chemeleon-dng
    !pip install -e .
else:
    print("Running locally")
    # Ensure you have the package installed locally
    # You should have already cloned and installed chemeleon-dng
Running locally
# Import required libraries
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pymatgen.core import Structure, Composition
from pymatgen.analysis.structure_matcher import StructureMatcher
from pymatgen.io.cif import CifWriter
import warnings
warnings.filterwarnings('ignore')
# Add chemeleon-dng to path if running locally
if not IN_COLAB:
    chemeleon_path = Path('/home/ryan/informatics/chemeleon-dng')
    if chemeleon_path.exists():
        sys.path.insert(0, str(chemeleon_path))
print("Libraries imported successfully!")
Libraries imported successfully!
Part 1: Crystal Structure Prediction (CSP)
CSP mode allows us to generate crystal structures for specific chemical formulas. This is particularly useful when you know the composition but want to explore possible crystal structures.
# Let's start with a simple example - generating structures for NaCl
# We'll use the command-line interface first to understand the process
import subprocess
import tempfile
# Create a temporary directory for outputs
output_dir = tempfile.mkdtemp(prefix="chemeleon_csp_")
print(f"Output directory: {output_dir}")
# Generate 5 structures for NaCl
if IN_COLAB:
    cmd = f"python scripts/sample.py --task=csp --formulas='NaCl' --num_samples=5 --output_dir='{output_dir}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='NaCl' --num_samples=5 --output_dir='{output_dir}' --device=cpu"
print(f"Running: {cmd}")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Errors:", result.stderr)
Output directory: /tmp/chemeleon_csp_la59hm5h
Running: cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='NaCl' --num_samples=5 --output_dir='/tmp/chemeleon_csp_la59hm5h' --device=cpu
Errors: /bin/sh: 1: cd: can't cd to /home/ryan/informatics/chemeleon-dng
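The error above comes from the hard-coded `cd` target not existing on the machine that ran the cell. As a minimal sketch (not part of the chemeleon-dng package), the command can be assembled as an argument list and run with `cwd=` so that a missing checkout fails early with a clear message. The helper names `build_sample_cmd` and `run_in_repo` are hypothetical; the flags are only the ones already used in this notebook.

```python
import subprocess
from pathlib import Path


def build_sample_cmd(task, output_dir, formulas=None, num_samples=5, device="cpu"):
    """Build the argument list for scripts/sample.py from the flags used in this notebook."""
    cmd = [
        "python", "scripts/sample.py",
        f"--task={task}",
        f"--num_samples={num_samples}",
        f"--output_dir={output_dir}",
        f"--device={device}",
    ]
    if formulas:
        # Multiple formulas are passed as one comma-separated value
        cmd.append(f"--formulas={','.join(formulas)}")
    return cmd


def run_in_repo(cmd, repo_dir):
    """Run cmd inside repo_dir, raising early if the checkout is missing."""
    repo_dir = Path(repo_dir)
    if not repo_dir.is_dir():
        raise FileNotFoundError(f"chemeleon-dng checkout not found: {repo_dir}")
    # cwd= replaces the 'cd ... &&' prefix, so no shell is needed
    return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
```

Because the working directory changes, `output_dir` should be an absolute path; `tempfile.mkdtemp` already returns one.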
# Let's examine the generated structures
from pathlib import Path
import glob
# Find all generated CIF files
cif_files = glob.glob(os.path.join(output_dir, "*.cif"))
print(f"Found {len(cif_files)} generated structures:")
# Load and analyze the structures
generated_structures = []
for cif_file in cif_files:
    structure = Structure.from_file(cif_file)
    generated_structures.append(structure)
    print(f"\n{Path(cif_file).name}:")
    print(f" Formula: {structure.composition.reduced_formula}")
    print(f" Space group: {structure.get_space_group_info()[0]}")
    print(f" Volume: {structure.volume:.2f} Å³")
    print(f" Density: {structure.density:.2f} g/cm³")
Found 0 generated structures:
Multiple Formulas
CHEMELEON can generate structures for multiple formulas in a single run. Let’s try some more complex examples:
# Generate structures for multiple formulas relevant to energy materials
formulas_list = ['LiCoO2', 'LiFePO4', 'Li2MnO3']
# Create output directory
output_dir_multi = tempfile.mkdtemp(prefix="chemeleon_multi_")
# Generate 3 structures for each formula
formulas_str = ','.join(formulas_list)
if IN_COLAB:
    cmd = f"python scripts/sample.py --task=csp --formulas='{formulas_str}' --num_samples=3 --output_dir='{output_dir_multi}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='{formulas_str}' --num_samples=3 --output_dir='{output_dir_multi}' --device=cpu"
print(f"Generating structures for: {formulas_list}")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
Generating structures for: ['LiCoO2', 'LiFePO4', 'Li2MnO3']
# Analyze the generated battery material structures
battery_structures = {}
cif_files_multi = glob.glob(os.path.join(output_dir_multi, "*.cif"))
for cif_file in cif_files_multi:
    structure = Structure.from_file(cif_file)
    formula = structure.composition.reduced_formula
    if formula not in battery_structures:
        battery_structures[formula] = []
    battery_structures[formula].append(structure)
# Compare structures for each formula
matcher = StructureMatcher()
for formula, structures in battery_structures.items():
    print(f"\n{formula}: Generated {len(structures)} structures")
    # Check for unique structures
    unique_structures = []
    for s in structures:
        is_unique = True
        for u in unique_structures:
            if matcher.fit(s, u):
                is_unique = False
                break
        if is_unique:
            unique_structures.append(s)
    print(f" Unique structures: {len(unique_structures)}")
    for i, s in enumerate(unique_structures):
        print(f" Structure {i+1}: SG {s.get_space_group_info()[0]}, V={s.volume:.1f} Å³")
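The uniqueness check in the loop above is a greedy deduplication: each item is kept only if it matches nothing already kept. The sketch below isolates that pattern with a generic pairwise predicate so it can be tried without pymatgen. The function name `deduplicate` and the toy volume data are illustrative assumptions; with real structures the predicate would be `matcher.fit`.

```python
def deduplicate(items, is_same):
    """Keep the first representative of each group of mutually matching items."""
    unique = []
    for item in items:
        # Compare only against representatives that were already kept
        if not any(is_same(item, kept) for kept in unique):
            unique.append(item)
    return unique


# Toy stand-in for structures: cell volumes that count as "the same" within 0.2
toy_volumes = [40.1, 40.15, 52.3, 40.05, 52.35]
representatives = deduplicate(toy_volumes, lambda a, b: abs(a - b) < 0.2)
print(representatives)  # [40.1, 52.3]
```

Since tolerance-based matching is not strictly transitive, the result can depend on input order. pymatgen's `StructureMatcher` also provides a `group_structures` method that may be a more convenient way to get the same grouping.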
Part 2: De Novo Generation (DNG)
DNG mode generates completely novel crystal structures without specifying the composition. This is useful for exploring chemical space and discovering unexpected materials.
# Generate 20 random crystal structures
output_dir_dng = tempfile.mkdtemp(prefix="chemeleon_dng_")
if IN_COLAB:
    cmd = f"python scripts/sample.py --task=dng --num_samples=20 --batch_size=10 --output_dir='{output_dir_dng}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=dng --num_samples=20 --batch_size=10 --output_dir='{output_dir_dng}' --device=cpu"
print("Generating novel crystal structures...")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
Generating novel crystal structures...
# Analyze the diversity of generated structures
dng_cif_files = glob.glob(os.path.join(output_dir_dng, "*.cif"))
print(f"Generated {len(dng_cif_files)} novel structures")
# Collect composition and structural information
compositions = []
space_groups = []
volumes = []
elements_count = {}
for cif_file in dng_cif_files:
    structure = Structure.from_file(cif_file)
    comp = structure.composition
    compositions.append(comp.reduced_formula)
    space_groups.append(structure.get_space_group_info()[0])
    volumes.append(structure.volume / len(structure))
    # Count elements
    for element in comp.elements:
        elem_str = str(element)
        elements_count[elem_str] = elements_count.get(elem_str, 0) + 1
# Display statistics
print(f"\nUnique compositions: {len(set(compositions))}")
print(f"Unique space groups: {len(set(space_groups))}")
print(f"\nMost common elements:")
sorted_elements = sorted(elements_count.items(), key=lambda x: x[1], reverse=True)[:10]
for elem, count in sorted_elements:
    print(f" {elem}: {count} structures")
Generated 0 novel structures
Unique compositions: 0
Unique space groups: 0
Most common elements:
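The element tally in the cell above counts, for each element, how many structures contain it. The same bookkeeping can be sketched with `collections.Counter` from the standard library. The element lists below are made-up placeholders standing in for `comp.elements` of generated structures, not model output.

```python
from collections import Counter

# Hypothetical per-structure element lists (placeholders, not generated data)
elements_per_structure = [
    ("Li", "O"),
    ("Li", "Fe", "P", "O"),
    ("Na", "Cl"),
]

# Each element appears at most once per structure, so this counts structures
element_counts = Counter(
    element for elements in elements_per_structure for element in elements
)

for element, count in element_counts.most_common(3):
    print(f" {element}: {count} structures")
```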
# Visualize the distribution of generated structures
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Volume distribution
axes[0].hist(volumes, bins=20, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Volume per atom (Å³)')
axes[0].set_ylabel('Count')
axes[0].set_title('Volume Distribution of Generated Structures')
# Number of elements per structure
n_elements = [len(Composition(comp).elements) for comp in compositions]
unique_counts = list(set(n_elements))
count_freq = [n_elements.count(i) for i in unique_counts]
axes[1].bar(unique_counts, count_freq, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Number of unique elements')
axes[1].set_ylabel('Count')
axes[1].set_title('Chemical Complexity Distribution')
axes[1].set_xticks(unique_counts)
plt.tight_layout()
plt.show()

Part 4: Practical Exercise - Exploring Polymorphs
One of CHEMELEON’s strengths is finding polymorphs (different crystal structures with the same composition). Let’s explore this for TiO2.
# Generate multiple TiO2 structures to find polymorphs
output_dir_tio2 = tempfile.mkdtemp(prefix="chemeleon_tio2_")
if IN_COLAB:
    cmd = f"python scripts/sample.py --task=csp --formulas='TiO2' --num_samples=10 --output_dir='{output_dir_tio2}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='TiO2' --num_samples=10 --output_dir='{output_dir_tio2}' --device=cpu"
print("Searching for TiO2 polymorphs...")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
Searching for TiO2 polymorphs...
# Analyze TiO2 polymorphs
tio2_files = glob.glob(os.path.join(output_dir_tio2, "*.cif"))
tio2_structures = []
for cif_file in tio2_files:
    structure = Structure.from_file(cif_file)
    tio2_structures.append(structure)
# Group by space group
polymorph_groups = {}
for s in tio2_structures:
    sg = s.get_space_group_info()[0]
    if sg not in polymorph_groups:
        polymorph_groups[sg] = []
    polymorph_groups[sg].append(s)
print(f"Found {len(polymorph_groups)} different space groups:")
for sg, structs in polymorph_groups.items():
    avg_density = np.mean([s.density for s in structs])
    print(f"\n{sg}: {len(structs)} structures")
    print(f" Average density: {avg_density:.2f} g/cm³")
    # Check if this matches known polymorphs
    if "P4_2/mnm" in sg or "P42/mnm" in sg:
        print(" → Likely rutile-type")
    elif "I4_1/amd" in sg or "I41/amd" in sg:
        print(" → Likely anatase-type")
    elif "Pbca" in sg:
        print(" → Likely brookite-type")
Found 0 different space groups:
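The space-group checks in the cell above can also be written as a lookup table, which is easier to extend to other systems. This is a minimal sketch using only the three symbols already listed in this notebook; the names `TIO2_PROTOTYPES` and `label_polymorph` are hypothetical. It matches the whole symbol rather than a substring and ignores the underscore so that both `P4_2/mnm` and `P42/mnm` spellings are accepted.

```python
# Space-group symbol -> TiO2 polymorph name, taken from the checks in this notebook
TIO2_PROTOTYPES = {
    "P4_2/mnm": "rutile",
    "I4_1/amd": "anatase",
    "Pbca": "brookite",
}


def label_polymorph(space_group):
    """Return the likely TiO2 polymorph for a space-group symbol, or None."""
    normalized = space_group.replace("_", "")
    for symbol, name in TIO2_PROTOTYPES.items():
        if symbol.replace("_", "") == normalized:
            return name
    return None


for sg in ["P4_2/mnm", "I41/amd", "Fm-3m"]:
    print(sg, "->", label_polymorph(sg))
```

A matching space group is only a hint: two different structures can share a space group, which is why the labels are reported as "likely". A structural comparison against a reference cell would be needed to confirm the assignment.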
Summary and Next Steps
In this notebook, we’ve explored:
Crystal Structure Prediction (CSP): Generating structures for specific formulas
De Novo Generation (DNG): Creating entirely new materials
Polymorph Discovery: Finding different structures with the same composition
Comparison with traditional methods: Understanding the advantages of generative approaches
💡 Key Takeaways:
Chemeleon can generate diverse, chemically reasonable structures
The model captures both common and rare structural motifs
Generated structures should be validated with DFT or other methods
The approach complements traditional structure prediction methods
🚶‍♂️ Next Steps:
Validate structures: Use DFT or machine-learning force fields (MLFFs) to calculate energies and check stability
Property prediction: Screen generated structures for desired properties
Targeted generation: Focus on specific chemical systems of interest
Combine approaches: Use Chemeleon with SMACT for comprehensive exploration
⏰ Exercise:
Try generating structures for a material system you’re interested in. Consider:
What compositions might have interesting properties?
How many polymorphs can you find?
Which generated structures pass chemical feasibility tests?
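One simple feasibility test for the last question is charge neutrality: can the elements take common oxidation states that sum to zero for the given stoichiometry? The sketch below is an illustrative assumption, not Chemeleon functionality. The oxidation-state table is a tiny hand-picked subset and the function name `neutral_assignments` is hypothetical. Dedicated tools such as SMACT, mentioned in the next steps, apply more complete versions of this kind of filter.

```python
from itertools import product

# Small illustrative subset of common oxidation states (an assumption, not exhaustive)
OXIDATION_STATES = {
    "Li": [1],
    "Na": [1],
    "Ti": [2, 3, 4],
    "Co": [2, 3, 4],
    "O": [-2],
    "Cl": [-1],
}


def neutral_assignments(composition, states=OXIDATION_STATES):
    """Return every oxidation-state assignment that makes the formula charge neutral.

    composition maps element symbol -> number of atoms in the formula unit.
    """
    elements = list(composition)
    found = []
    for combo in product(*(states[el] for el in elements)):
        total_charge = sum(q * composition[el] for q, el in zip(combo, elements))
        if total_charge == 0:
            found.append(dict(zip(elements, combo)))
    return found


print(neutral_assignments({"Ti": 1, "O": 2}))   # [{'Ti': 4, 'O': -2}]
print(neutral_assignments({"Na": 1, "Cl": 2}))  # []
```

An empty result flags a composition that cannot be balanced with the listed states. That is a reason to look more closely, not proof that the material cannot exist.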