Chemeleon: Generative AI for Crystal Structure Exploration

In this notebook, we’ll explore how to use Chemeleon for generating crystal structures. We’ll cover:

  1. Crystal Structure Prediction (CSP) - generating structures for specific formulas

  2. De Novo Generation (DNG) - generating novel structures without constraints

  3. Analyzing and visualizing the generated structures

  4. Comparing with traditional structure prediction methods

Setup and Installation

# First, let's check if we're in Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab")
    # Clone the CHEMELEON repository
    !git clone https://github.com/hspark1212/chemeleon-dng.git
    %cd chemeleon-dng
    !pip install -e .
else:
    print("Running locally")
    # Ensure you have the package installed locally
    # You should have already cloned and installed chemeleon-dng
Running locally
# Import required libraries
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pymatgen.core import Structure, Composition
from pymatgen.analysis.structure_matcher import StructureMatcher
from pymatgen.io.cif import CifWriter
import warnings
warnings.filterwarnings('ignore')

# Add chemeleon-dng to path if running locally
if not IN_COLAB:
    chemeleon_path = Path('/home/ryan/informatics/chemeleon-dng')
    if chemeleon_path.exists():
        sys.path.insert(0, str(chemeleon_path))

print("Libraries imported successfully!")
Libraries imported successfully!

Part 1: Crystal Structure Prediction (CSP)

CSP mode generates candidate crystal structures for a given chemical formula. It is most useful when the composition is known but the crystal structure is not.

# Let's start with a simple example - generating structures for NaCl
# We'll use the command-line interface first to understand the process

import subprocess
import tempfile

# Create a temporary directory for outputs
output_dir = tempfile.mkdtemp(prefix="chemeleon_csp_")
print(f"Output directory: {output_dir}")

# Generate 5 structures for NaCl
if IN_COLAB:
    cmd = f"python scripts/sample.py --task=csp --formulas='NaCl' --num_samples=5 --output_dir='{output_dir}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='NaCl' --num_samples=5 --output_dir='{output_dir}' --device=cpu"

print(f"Running: {cmd}")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("Errors:", result.stderr)
Output directory: /tmp/chemeleon_csp_la59hm5h
Running: cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='NaCl' --num_samples=5 --output_dir='/tmp/chemeleon_csp_la59hm5h' --device=cpu

Errors: /bin/sh: 1: cd: can't cd to /home/ryan/informatics/chemeleon-dng
# Let's examine the generated structures
from pathlib import Path
import glob

# Find all generated CIF files
cif_files = glob.glob(os.path.join(output_dir, "*.cif"))
print(f"Found {len(cif_files)} generated structures:")

# Load and analyze the structures
generated_structures = []
for cif_file in cif_files:
    structure = Structure.from_file(cif_file)
    generated_structures.append(structure)
    print(f"\n{Path(cif_file).name}:")
    print(f"  Formula: {structure.composition.reduced_formula}")
    print(f"  Space group: {structure.get_space_group_info()[0]}")
    print(f"  Volume: {structure.volume:.2f} Å³")
    print(f"  Density: {structure.density:.2f} g/cm³")
Found 0 generated structures:
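The run above reports a failed `cd` and then finds no CIF files, so the failure only shows up indirectly. As a minimal sketch (not the repository's own API), the same call can be built as an argument list and run with `cwd=`, raising as soon as the checkout is missing or the script exits with an error. The helper names `build_sample_command` and `run_in_repo` are hypothetical; the flags are the ones used in the cells above, and `sys.executable` stands in for `python`.

```python
import subprocess
import sys
from pathlib import Path


def build_sample_command(task, output_dir, num_samples, formulas=None, device="cpu"):
    """Build the argument list for scripts/sample.py using the flags shown above."""
    cmd = [
        sys.executable,
        "scripts/sample.py",
        f"--task={task}",
        f"--num_samples={num_samples}",
        f"--output_dir={output_dir}",
        f"--device={device}",
    ]
    if formulas:
        # Multiple formulas are passed as one comma-separated value
        cmd.append(f"--formulas={','.join(formulas)}")
    return cmd


def run_in_repo(cmd, repo_dir):
    """Run cmd inside repo_dir, failing loudly if the checkout is missing."""
    repo_dir = Path(repo_dir)
    if not repo_dir.is_dir():
        raise FileNotFoundError(f"Repository not found: {repo_dir}")
    result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Sampling failed (exit {result.returncode}):\n{result.stderr}"
        )
    return result.stdout


# Only build the command here; running it requires a local chemeleon-dng checkout
cmd = build_sample_command("csp", "/tmp/chemeleon_csp_demo", 5, formulas=["NaCl"])
print(cmd[1:])
```

Passing a list avoids shell quoting, and `cwd=` replaces the `cd ... &&` prefix, so a wrong path raises an exception instead of silently producing zero structures.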

Multiple Formulas

Chemeleon can generate structures for several formulas in a single run. Let’s try some more complex examples:

# Generate structures for multiple formulas relevant to energy materials
formulas_list = ['LiCoO2', 'LiFePO4', 'Li2MnO3']

# Create output directory
output_dir_multi = tempfile.mkdtemp(prefix="chemeleon_multi_")

# Generate 3 structures for each formula
formulas_str = ','.join(formulas_list)
if IN_COLAB:
    cmd = f"python scripts/sample.py --task=csp --formulas='{formulas_str}' --num_samples=3 --output_dir='{output_dir_multi}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='{formulas_str}' --num_samples=3 --output_dir='{output_dir_multi}' --device=cpu"

print(f"Generating structures for: {formulas_list}")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
Generating structures for: ['LiCoO2', 'LiFePO4', 'Li2MnO3']
# Analyze the generated battery material structures
battery_structures = {}
cif_files_multi = glob.glob(os.path.join(output_dir_multi, "*.cif"))

for cif_file in cif_files_multi:
    structure = Structure.from_file(cif_file)
    formula = structure.composition.reduced_formula
    
    if formula not in battery_structures:
        battery_structures[formula] = []
    battery_structures[formula].append(structure)

# Compare structures for each formula
matcher = StructureMatcher()

for formula, structures in battery_structures.items():
    print(f"\n{formula}: Generated {len(structures)} structures")
    
    # Check for unique structures
    unique_structures = []
    for s in structures:
        is_unique = True
        for u in unique_structures:
            if matcher.fit(s, u):
                is_unique = False
                break
        if is_unique:
            unique_structures.append(s)
    
    print(f"  Unique structures: {len(unique_structures)}")
    for i, s in enumerate(unique_structures):
        print(f"  Structure {i+1}: SG {s.get_space_group_info()[0]}, V={s.volume:.1f} Å³")
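The uniqueness check above is a greedy first-match deduplication: a structure is kept only if it matches none of the structures already kept. The sketch below isolates that logic with a pluggable predicate and toy numbers in place of structures. The name `greedy_unique` is hypothetical and not part of pymatgen.

```python
def greedy_unique(items, is_match):
    """Keep each item only if it matches none of the items already kept."""
    unique = []
    for item in items:
        if not any(is_match(item, kept) for kept in unique):
            unique.append(item)
    return unique


# Toy stand-in for structures: numbers "match" when they differ by less than 0.5
values = [1.0, 1.2, 3.0, 2.9, 5.0]
unique_values = greedy_unique(values, lambda a, b: abs(a - b) < 0.5)
print(unique_values)  # [1.0, 3.0, 5.0]
```

With pymatgen, `matcher.fit` could be passed as `is_match`. Because tolerance-based matching is not strictly transitive, the result can depend on input order. `StructureMatcher.group_structures` offers a built-in way to group equivalent structures.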

Part 2: De Novo Generation (DNG)

DNG mode generates novel crystal structures without a specified composition. This is useful for exploring chemical space and discovering unexpected materials.

# Generate 20 random crystal structures
output_dir_dng = tempfile.mkdtemp(prefix="chemeleon_dng_")

if IN_COLAB:
    cmd = f"python scripts/sample.py --task=dng --num_samples=20 --batch_size=10 --output_dir='{output_dir_dng}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=dng --num_samples=20 --batch_size=10 --output_dir='{output_dir_dng}' --device=cpu"

print("Generating novel crystal structures...")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
Generating novel crystal structures...
# Analyze the diversity of generated structures
dng_cif_files = glob.glob(os.path.join(output_dir_dng, "*.cif"))
print(f"Generated {len(dng_cif_files)} novel structures")

# Collect composition and structural information
compositions = []
space_groups = []
volumes = []
elements_count = {}

for cif_file in dng_cif_files:
    structure = Structure.from_file(cif_file)
    comp = structure.composition
    
    compositions.append(comp.reduced_formula)
    space_groups.append(structure.get_space_group_info()[0])
    volumes.append(structure.volume / len(structure))
    
    # Count elements
    for element in comp.elements:
        elem_str = str(element)
        elements_count[elem_str] = elements_count.get(elem_str, 0) + 1

# Display statistics
print(f"\nUnique compositions: {len(set(compositions))}")
print(f"Unique space groups: {len(set(space_groups))}")
print(f"\nMost common elements:")
sorted_elements = sorted(elements_count.items(), key=lambda x: x[1], reverse=True)[:10]
for elem, count in sorted_elements:
    print(f"  {elem}: {count} structures")
Generated 0 novel structures

Unique compositions: 0
Unique space groups: 0

Most common elements:
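The per-element tally in the cell above can also be written with `collections.Counter`, which handles the "increment or start at zero" step and the top-N sort. This is a sketch on made-up data: the element sets below are hypothetical stand-ins for `comp.elements` of each generated structure, not Chemeleon output.

```python
from collections import Counter

# Hypothetical element sets, one per generated structure
structures_elements = [
    {"Li", "Co", "O"},
    {"Li", "Fe", "P", "O"},
    {"Na", "Cl"},
    {"Ti", "O"},
]

element_counts = Counter()
for elements in structures_elements:
    # Each element is counted once per structure that contains it
    element_counts.update(elements)

for symbol, count in element_counts.most_common(2):
    print(f"  {symbol}: {count} structures")
```

With an empty input, as in the run above, `most_common` simply returns an empty list, so the reporting loop prints nothing.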
# Visualize the distribution of generated structures
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Volume distribution
axes[0].hist(volumes, bins=20, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Volume per atom (Å³)')
axes[0].set_ylabel('Count')
axes[0].set_title('Volume Distribution of Generated Structures')

# Number of elements per structure
n_elements = [len(Composition(comp).elements) for comp in compositions]
unique_counts = list(set(n_elements))
count_freq = [n_elements.count(i) for i in unique_counts]

axes[1].bar(unique_counts, count_freq, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Number of unique elements')
axes[1].set_ylabel('Count')
axes[1].set_title('Chemical Complexity Distribution')
axes[1].set_xticks(unique_counts)

plt.tight_layout()
plt.show()
[Figure: left, "Volume Distribution of Generated Structures" (count vs. volume per atom, Å³); right, "Chemical Complexity Distribution" (count vs. number of unique elements)]

Part 4: Practical Exercise - Exploring Polymorphs

One of Chemeleon’s strengths is finding polymorphs (different crystal structures with the same composition). Let’s explore this for TiO2.

# Generate multiple TiO2 structures to find polymorphs
output_dir_tio2 = tempfile.mkdtemp(prefix="chemeleon_tio2_")

if IN_COLAB:
    cmd = f"python scripts/sample.py --task=csp --formulas='TiO2' --num_samples=10 --output_dir='{output_dir_tio2}' --device=cpu"
else:
    cmd = f"cd /home/ryan/informatics/chemeleon-dng && python scripts/sample.py --task=csp --formulas='TiO2' --num_samples=10 --output_dir='{output_dir_tio2}' --device=cpu"

print("Searching for TiO2 polymorphs...")
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout)
Searching for TiO2 polymorphs...
# Analyze TiO2 polymorphs
tio2_files = glob.glob(os.path.join(output_dir_tio2, "*.cif"))
tio2_structures = []

for cif_file in tio2_files:
    structure = Structure.from_file(cif_file)
    tio2_structures.append(structure)

# Group by space group
polymorph_groups = {}
for s in tio2_structures:
    sg = s.get_space_group_info()[0]
    if sg not in polymorph_groups:
        polymorph_groups[sg] = []
    polymorph_groups[sg].append(s)

print(f"Found {len(polymorph_groups)} different space groups:")
for sg, structs in polymorph_groups.items():
    avg_density = np.mean([s.density for s in structs])
    print(f"\n{sg}: {len(structs)} structures")
    print(f"  Average density: {avg_density:.2f} g/cm³")
    
    # Check if this matches known polymorphs
    if "P4_2/mnm" in sg or "P42/mnm" in sg:
        print("  → Likely rutile-type")
    elif "I4_1/amd" in sg or "I41/amd" in sg:
        print("  → Likely anatase-type")
    elif "Pbca" in sg:
        print("  → Likely brookite-type")
Found 0 different space groups:
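The `if`/`elif` chain above maps space-group symbols to known TiO2 polymorphs. A lookup table is one alternative, sketched below with the same three symbols. The helper name `label_tio2_polymorph` is hypothetical. Unlike the substring checks in the cell above, this version matches the whole symbol exactly after stripping underscores and spaces.

```python
# Space-group symbols taken from the checks in the cell above
TIO2_POLYMORPH_BY_SPACE_GROUP = {
    "P42/mnm": "rutile",
    "I41/amd": "anatase",
    "Pbca": "brookite",
}


def label_tio2_polymorph(space_group_symbol):
    """Return a tentative polymorph label, or None if the symbol is not in the table."""
    # Accept both "P4_2/mnm" and "P42/mnm" spellings
    normalized = space_group_symbol.replace("_", "").replace(" ", "")
    return TIO2_POLYMORPH_BY_SPACE_GROUP.get(normalized)


for symbol in ["P4_2/mnm", "I41/amd", "Pbca", "Fm-3m"]:
    print(symbol, "->", label_tio2_polymorph(symbol))
```

A matching space group is only a hint, which is why the labels above say "Likely". Confirming a polymorph would still require comparing the structure itself, for example with `StructureMatcher` against a reference structure.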

Summary and Next Steps

In this notebook, we’ve explored:

  1. Crystal Structure Prediction (CSP): Generating structures for specific formulas

  2. De Novo Generation (DNG): Creating entirely new materials

  3. Polymorph Discovery: Finding different structures with the same composition

  4. Comparison with traditional methods: Understanding the advantages of generative approaches

💡 Key Takeaways:

  • Chemeleon can generate diverse, chemically reasonable structures

  • The model captures both common and rare structural motifs

  • Generated structures should be validated with DFT or other methods

  • The approach complements traditional structure prediction methods

🚶‍♂️ Next Steps:

  1. Validate structures: Use DFT or machine-learned force fields (MLFFs) to calculate energies and check stability

  2. Property prediction: Screen generated structures for desired properties

  3. Targeted generation: Focus on specific chemical systems of interest

  4. Combine approaches: Use Chemeleon with SMACT for comprehensive exploration

⏰ Exercise:

Try generating structures for a material system you’re interested in. Consider:

  • What compositions might have interesting properties?

  • How many polymorphs can you find?

  • Which generated structures pass chemical feasibility tests?
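For the last question, one simple feasibility filter is charge neutrality: a composition passes if some choice of one oxidation state per element sums to zero. The sketch below is a minimal stdlib-only version of that idea, the kind of check that tools such as SMACT apply more thoroughly. The oxidation-state table is illustrative and incomplete, and the function name `can_be_charge_neutral` is hypothetical.

```python
from itertools import product

# Illustrative oxidation states only; a real screen would use a curated table
COMMON_OXIDATION_STATES = {
    "Li": [1],
    "Na": [1],
    "Cl": [-1],
    "O": [-2],
    "Ti": [2, 3, 4],
    "Co": [2, 3],
}


def can_be_charge_neutral(element_counts, oxidation_states=COMMON_OXIDATION_STATES):
    """Return True if one oxidation state per element can sum to zero charge."""
    symbols = list(element_counts)
    # Raises KeyError for elements missing from the table
    choices = [oxidation_states[symbol] for symbol in symbols]
    return any(
        sum(state * element_counts[symbol] for symbol, state in zip(symbols, combo)) == 0
        for combo in product(*choices)
    )


print(can_be_charge_neutral({"Na": 1, "Cl": 1}))  # True
print(can_be_charge_neutral({"Ti": 1, "O": 2}))   # True
print(can_be_charge_neutral({"Na": 1, "Cl": 2}))  # False
```

Passing this test does not mean a structure is stable. It only removes compositions that cannot be balanced with the listed oxidation states, so energy-based validation is still needed.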