
Linux for AI: Automating Your Workflow with Bash Scripting


This article highlights how Linux Bash scripting enables powerful automation for data scientists, allowing you to clean datasets, organize files, and run entire experiment pipelines with just a few commands. Bash transforms repetitive tasks into fast, reliable, and reproducible workflows, making it an essential tool for efficient AI and data analysis.

As AI and data science projects grow, one pattern becomes obvious: most of your work is repetitive. You clean datasets using similar steps, run the same training scripts with small variations, move files across directories, and rerun the same preprocessing pipelines over and over. Instead of manually repeating these tasks, Bash scripting allows you to automate your workflow, improving speed, consistency, and reproducibility. For machine learning and data pipelines, this is essential. This article explains foundational Bash concepts and provides expanded, detailed examples tailored to AI and data-science workflows.


What Bash Scripting Is and Why It Matters

Bash is a command-line shell on Linux (and macOS/WSL) that lets you talk to the operating system using text commands like ls, cd, mkdir, or python train.py. Bash scripting is just taking those same commands and saving them in a .sh file, so they run automatically in order instead of you typing them one by one. You can add variables, loops, and conditions to that file, which turns simple terminal commands into a repeatable, automated workflow for tasks like cleaning data, preparing datasets, or running multiple training experiments.

A Bash script is simply a plain-text file containing commands you usually type in the terminal. Instead of typing them manually every day, a script runs them for you automatically.

Automation is valuable in AI development because:

  • Experiments become reproducible
  • Human errors decrease
  • Processing becomes faster
  • Batch experiments run without supervision

Here is the simplest Bash script:

#!/bin/bash
# Prints a message to the terminal
echo "Hello, AI workflow!"

This simple example captures the purpose of Bash: do exactly what you tell it, the same way, every time.

The Shebang: Making Sure the Script Uses Bash

The first line of a Bash script should be a shebang:

#!/bin/bash
# This tells the operating system to use the Bash interpreter.
# Without this, your script may run with a different shell (sh, dash, zsh).
echo "This script is running with Bash."

Make the script executable:

chmod +x test.sh
./test.sh

Using Variables to Store Information

Variables help make scripts flexible and easier to modify.

#!/bin/bash
# Store a dataset name in a variable
dataset="students.csv"
# Print using variable expansion
echo "Cleaning dataset: $dataset"
# Store directory paths
raw_dir="data/raw"
clean_dir="data/clean"
# Create directory if missing
mkdir -p "$clean_dir"
echo "Raw data directory: $raw_dir"
echo "Cleaned data will be saved to: $clean_dir"
# Create an output file using a variable path
output_file="$clean_dir/cleaned_$dataset"
echo "Output will be written to: $output_file"

Loops: Automating Repeated Tasks

Loops allow you to process many files or parameters consistently.

For Loop Example: Clean All CSV Files

A for loop is a way to repeat a command or set of commands for each item in a list. In Bash, you can think of it as: “Take each file (or value) one by one, and do something with it.”

Example: the loop goes through every .csv file in the folder, and for each one, it runs the commands between do and done.

for file in *.csv; do
    echo "Processing $file"
done

Here is a fuller version with extra diagnostics:

#!/bin/bash
echo "Starting CSV cleaning..."
mkdir -p clean
# Loop through all CSV files in the current directory
for file in *.csv; do
    echo "Cleaning file: $file"
    # Print file size before cleaning
    du -h "$file"
    # Count number of lines
    wc -l "$file"
    # The actual processing (placeholder for your workflow)
    cp "$file" "clean/$file"
    echo "Finished cleaning $file"
    echo "--------------------------"
done
echo "All files processed."

Additional commands illustrate loop behavior:

  • du -h to display file size
  • wc -l to count rows
  • cp as a placeholder for cleaning logic

While Loop Example: Run Steps Until a Condition

A while loop runs a set of commands as long as a condition remains true. You can think of it as: “Keep doing this until the condition is no longer met.”

Example: This loop keeps running while count is less than or equal to 5, increasing the counter each time. It stops automatically when the condition is no longer true.

count=1
while [ $count -le 5 ]; do
    echo "Step $count"
    count=$((count + 1))
done

A fuller version that also prints a timestamp at each step:

#!/bin/bash
count=1
# Run 5 steps
while [ $count -le 5 ]; do
    echo "Processing step $count"
    # Simulate a task
    sleep 1
    # Show current timestamp
    date
    # Increase counter
    count=$((count + 1))
done
echo "Loop completed."

Using Conditions to Make Scripts Intelligent

Conditions allow scripts to adapt to the environment.

Check if a File Exists

To check if a file exists in Bash, you use an if statement with the -f test operator, which checks whether the path exists and is a regular file.

  • -f "$file" Checks whether a file with that name exists.
  • If it exists, the script runs the commands inside the then section.
  • If it does not, it runs the else section.
file="data.csv"

if [ -f "$file" ]; then
    echo "$file exists."
else
    echo "$file does not exist."
fi

This is one of the most common Bash checks used in data workflows to make sure input files are available before processing.

#!/bin/bash
file="data.csv"
if [ -f "$file" ]; then
    echo "$file exists."
    # Show number of rows
    wc -l "$file"
    # Show first 3 lines
    head -n 3 "$file"
else
    echo "$file not found. Creating a placeholder file..."
    # Create a file with a header
    echo "name,score" > "$file"
fi
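
In many pipelines you do not want a placeholder; you want the script to stop immediately when a required input is missing. A minimal fail-fast variant of the same check:

#!/bin/bash
file="data.csv"
# Abort with a non-zero exit code if the required input file is missing
if [ ! -f "$file" ]; then
    echo "Error: $file not found. Aborting." >&2
    exit 1
fi
echo "$file found, continuing with processing..."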

Ensure a Directory Exists

Here, the script checks whether a directory exists and creates it only if it’s missing. This prevents errors and keeps your script safe and repeatable.

  • -d "$dir" Checks if the directory exists.
  • ! -d means “the directory does not exist.”
  • mkdir "$dir" creates it only when needed.
  • The script prints a clear message either way.
dir="clean"

if [ ! -d "$dir" ]; then
    echo "Directory '$dir' does not exist. Creating it now..."
    mkdir "$dir"
else
    echo "Directory '$dir' already exists."
fi

A fuller version that also shows the new directory's permissions:

#!/bin/bash
dir="clean"
if [ ! -d "$dir" ]; then
    echo "Directory '$dir' missing. Creating it now."
    mkdir "$dir"
    # Show permissions
    ls -ld "$dir"
else
    echo "Directory '$dir' already exists."
fi

Automated Dataset Cleaning Example

An Automated Dataset Cleaning script lets you clean many CSV files at once using a consistent set of rules. Instead of manually opening each file to remove empty lines, fix spacing, or correct formatting issues, the script applies the same cleaning steps to every file automatically. This ensures that all datasets become uniform, tidy, and ready for analysis with a single command. It saves time, avoids human mistakes, and keeps your data preparation process fully reproducible.

#!/bin/bash
mkdir -p clean
for file in *.csv; do
    echo "Cleaning $file..."
    # Remove empty lines
    # Remove carriage returns from Windows-formatted files
    # Normalize multiple spaces into one
    # Remove space before/after commas
    # Display before/after row count
    before_rows=$(wc -l < "$file")
    sed '/^$/d' "$file" | tr -d '\r' | tr -s ' ' | \
    sed 's/ ,/,/g' | sed 's/, /,/g' > "clean/$file"
    after_rows=$(wc -l < "clean/$file")
    echo "Rows before: $before_rows"
    echo "Rows after:  $after_rows"
done
echo "Cleaning completed for all files."

Automating Multiple Training Runs

Automatically training multiple models means running several machine-learning experiments with a Bash script, without manually changing parameters each time. Instead of running your training script repeatedly with different model names, a loop iterates over a list of models and executes the training command for each one. This approach saves time, ensures consistent experimental settings, and makes it easier to compare model performance because every run follows the same structure.

#!/bin/bash
models=("lstm" "cnn" "transformer")
for model in "${models[@]}"; do
    echo "Running experiment for model: $model"
    # Log file for tracking results
    log_file="logs/train_${model}.log"
    # Ensure logs exist
    mkdir -p logs
    # Training command
    python train.py \
        --epochs 10 \
        --lr 0.001 \
        --model "$model" \
        --batch-size 32 \
        | tee "$log_file"
    echo "Training for $model completed."
    echo "Logs saved to $log_file"
done

Added features:

  • Logging
  • More arguments
  • Organized output folder
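
Because these batch runs are meant to execute unattended, it is also worth checking whether each training command actually succeeded. A small sketch, assuming the same train.py interface as above, that reports failures instead of silently continuing:

#!/bin/bash
# With pipefail, a pipeline fails if any command in it fails (not just the last one)
set -o pipefail
models=("lstm" "cnn" "transformer")
mkdir -p logs
for model in "${models[@]}"; do
    echo "Running experiment for model: $model"
    # The if-test checks the exit status of the whole training pipeline
    if python train.py --epochs 10 --model "$model" | tee "logs/train_${model}.log"; then
        echo "Training for $model completed."
    else
        echo "Training for $model FAILED, see logs/train_${model}.log" >&2
    fi
done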

Sweep Over Learning Rates

#!/bin/bash
learning_rates=(0.1 0.01 0.001)
for lr in "${learning_rates[@]}"; do
    echo "Testing learning rate: $lr"
    python train.py \
        --epochs 20 \
        --lr "$lr" \
        --dropout 0.3 \
        --optimizer "adam"
    echo "Run completed for lr=$lr"
done
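
The two sweeps combine naturally: nesting the loops runs every model with every learning rate. A short sketch, again assuming the same train.py arguments used above:

#!/bin/bash
models=("lstm" "cnn" "transformer")
learning_rates=(0.1 0.01 0.001)
mkdir -p logs
# Nested loops: every model is trained with every learning rate
for model in "${models[@]}"; do
    for lr in "${learning_rates[@]}"; do
        echo "Model: $model | Learning rate: $lr"
        python train.py \
            --epochs 20 \
            --model "$model" \
            --lr "$lr" \
            | tee "logs/train_${model}_lr${lr}.log"
    done
done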


Creating Global Tools

Suppose you have a small cleanup script that you would like to run from any project directory, not just the folder where it lives:

#!/bin/bash
echo "Cleaning project workspace..."
# Remove Python cache
rm -rf __pycache__
# Remove model checkpoints
rm -rf checkpoints/
# Remove temporary files
rm -f *.tmp *.log *.bak
# Remove empty directories
find . -type d -empty -delete
echo "Cleanup completed."

Local vs Global vs System-wide

For a new learner, the word “global” can be confusing. There are three common levels:

1. Local to a folder

  • Script only runs with ./script.sh
  • Only works when you are in that directory

2. Global for your user (what we usually mean in tutorials)

  • Script is in ~/bin or ~/.local/bin
  • Only your user account can run it
  • No admin rights needed

3. System-wide global

  • Script is in /usr/local/bin or /usr/bin
  • All users on the system can run it
  • Usually needs sudo to install or modify

For learning and for your own tools, it is safer and simpler to use user-level global (like ~/bin), not system-wide.

Why is installing globally useful?

  • You avoid repeating full paths like bash scripts/etl/cleanup.sh.
  • Your custom tools feel like “real” commands.
  • You keep your scripts organized (for example, all in ~/bin).
  • It is very convenient when working on many projects.

How Bash finds commands: the PATH

When you type a command in the terminal, like:

cleanup

Bash does not magically know where cleanup is. It searches through a list of directories called $PATH.

You can see your PATH with:

echo $PATH

It will show something like:

/home/yourname/.local/bin:/usr/local/bin:/usr/bin:/bin:...

Bash looks in these directories, in order, to find a program or script with that name.
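
You can ask Bash where it actually finds a command (the paths will differ on your machine):

# List every location in $PATH where this command name is found
type -a python3

# Print the single path Bash would use for a command
command -v git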

So to “install globally” for your user, you usually:

  • Put your script (for example cleanup.sh) in one of the directories already in $PATH, or
  • Add a directory (like ~/bin) to your $PATH, and then put your script there.

After that, you can run it from any folder just by typing:

cleanup.sh

Or, if you name the file simply cleanup:

cleanup

# Create a minimal cleanup.sh script
# (the second echo writes an echo command into the new script file)
echo '#!/bin/bash' > cleanup.sh
echo 'echo "Cleaning project..."' >> cleanup.sh
# Create a personal bin directory if it doesn't exist
mkdir -p ~/bin

# Move your script into the bin directory so it becomes a "global" command for your user
mv cleanup.sh ~/bin/

# Make the script executable so the system can run it like a normal command
chmod +x ~/bin/cleanup.sh

# Add ~/bin to your PATH so Bash knows where to find your script
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc

# Reload your Bash configuration so the PATH update takes effect immediately
source ~/.bashrc

Run:

cleanup.sh

What Is .bashrc, and Why Do We Add the PATH There?

When you open a terminal on Linux or WSL, Bash (the command-line program) needs to know how to set up your working environment. The file .bashrc is simply a startup file that Bash reads every time you open a new terminal window. You can think of .bashrc as a list of instructions that say:

  • “When the terminal starts, set things up this way.”
  • “Use these shortcuts.”
  • “Look for programs in these folders.”

One of the most essential things we set in .bashrc is the PATH. The PATH is a list of folders that the system searches when you type a command. For example, when you type python or git, the system checks the folders in your PATH to find those programs. When you create your own script, such as a cleanup.sh tool, and save it in a folder like ~/bin, the system will not automatically know where it is. To make it easy to run this script from any location, we add this line to .bashrc:

export PATH="$HOME/bin:$PATH"

This line tells the terminal, “Also look in my ~/bin folder when I type a command.” Because this line is stored in .bashrc, it is applied automatically every time you open a terminal. That means your custom scripts become available like any other command, without extra steps.

In short, .bashrc is where you save your terminal’s “startup settings,” and adding your script folder to the PATH there makes your own tools easy to use, every time you open the terminal.

Minimal .bashrc for Data Scientists (Basic & Practical)

# ~/.bashrc — Minimal Data Science Version

# Load system-wide bash settings if available
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# Simple prompt showing user, machine, and current directory
export PS1="\u@\h:\w$ "

# Add local bin folders (for custom scripts and pip installs)
export PATH="$HOME/bin:$HOME/.local/bin:$PATH"

# ------------------------------
# Useful aliases for daily work
# ------------------------------
alias ll="ls -lah"
alias py="python3"
alias jn="jupyter notebook"
alias jl="jupyter lab"
alias act="source venv/bin/activate"  # activate virtual environment
alias ..="cd .."
alias gst="git status"

# Make delete operations safer
alias rm="rm -i"

# ------------------------------
# Basic Python cleanup function
# ------------------------------
clean_python() {
    find . -name "__pycache__" -type d -exec rm -rf {} +
    find . -name "*.pyc" -delete
    echo "Python cache cleaned."
}

# Startup message
echo "Data Science Bash environment loaded."


Example 1: Small Dataset Cleaning Pipeline

Input files: students_a.csv, students_b.csv, students_c.csv

Cleaning Script

This example processes every CSV file in the folder and applies a full cleaning pipeline to each. It removes empty lines, fixes unwanted Windows characters, normalizes spacing, corrects extra spaces around commas, converts the header to lowercase, and sorts the rows alphabetically. After processing, it saves each cleaned file in a separate clean/ directory and prints the number of rows in each cleaned file. The result is a set of consistent, tidy, analysis-ready CSV files generated automatically by a single script.

#!/bin/bash
# The line above tells Linux to use the Bash interpreter to run this script.
# It must be the very first line of the file so the system knows how to execute it.

# Create a directory named "clean". 
# -p → prevents errors if the directory already exists.
# This is where all cleaned CSV files will be stored.
mkdir -p clean

# Start a loop that processes every file ending with .csv in the current folder.
# The variable $file will hold the name of each CSV file during each iteration.
for file in *.csv; do

    # Print a message to show which file is being cleaned.
    echo "Cleaning $file..."

    # Begin a text-processing pipeline that transforms the file step by step.
    # The data flows through each command using pipes (|), each modifying the text.

    # STEP 1: sed '/^$/d' "$file"
    # - sed is a stream editor for text.
    # - /^$/ → matches empty lines.
    # - d → delete.
    # Result: All blank lines are removed.
    sed '/^$/d' "$file" |

    # STEP 2: tr -d '\r'
    # - tr (translate) modifies characters.
    # - -d deletes characters.
    # - '\r' is a carriage return used in Windows files.
    # Result: Removes Windows CR characters so the file behaves like Linux format.
    tr -d '\r' |

    # STEP 3: tr -s ' '
    # - -s squeezes repeated characters into one.
    # - ' ' means collapse multiple spaces into a single space.
    # Result: "Alice    A  Math" → "Alice A Math"
    tr -s ' ' |

    # STEP 4: sed 's/ ,/,/g'
    # Fix spacing BEFORE a comma.
    # Example: "Alice , A" → "Alice,A"
    sed 's/ ,/,/g' |

    # STEP 5: sed 's/, /,/g'
    # Fix spacing AFTER a comma.
    # Example: "Alice, A" → "Alice,A"
    sed 's/, /,/g' |

    # STEP 6: awk 'NR==1 {print tolower($0)} NR>1 {print}'
    # - awk is used for structured text processing.
    # - NR==1 → applies only to the first line (header).
    # - tolower($0) converts the whole header row to lowercase.
    # - NR>1 prints all other lines unchanged.
    # Result: "Name,Grade,Major" → "name,grade,major"
    awk 'NR==1 {print tolower($0)} NR>1 {print}' |

    # STEP 7: sort
    # - Sort the data rows alphabetically (default is lexicographical sorting).
    # - Caution: a plain "sort" would also reorder the header line, so the header
    #   is read and printed first, and only the remaining rows are sorted.
    # Result: Ensures consistent row ordering across all files.
    { IFS= read -r header; printf '%s\n' "$header"; sort; } > "clean/$file"
    # Output is redirected (>) into clean/<filename>

    # Count and display how many rows the cleaned file contains.
    wc -l "clean/$file"

# End of loop: this closes the "for file in *.csv" block.
done

# Final message after all files are processed.
echo "All files cleaned."

Quick Analysis with AWK

This example reads all the cleaned CSV files, processes them one after another, and uses AWK to calculate two summaries:

  • How many students received each grade?
  • How many students belong to each major?

It skips the header rows, counts grades and majors across all files, and then prints a clean summary showing the distribution of grades and majors. The output gives you an immediate overview of the dataset after cleaning.

#!/bin/bash
# The shebang above tells the system to run this script with the Bash interpreter.

# Print a header message before showing any analysis results
echo "Grade summary across all datasets:"

# Process every cleaned CSV file in the 'clean' directory.
# The file names are passed directly to awk (on the closing line of the script)
# instead of being piped through cat, so that FNR restarts at 1 for each file
# and every header row can be skipped, not just the first one.

# Use AWK for text analysis.
# -F',' → sets the field delimiter to a comma, so $1=name, $2=grade, $3=major
awk -F',' '

    # FNR>1 → skip the header of each file (FNR is the line number within the current file)
    FNR>1 {
        # Extract the grade from column 2
        grade = $2
        
        # Increase the count for this grade
        count[grade]++
        
        # Extract the major from column 3 and count majors as well
        major_count[$3]++
    }

    # When all lines are processed, print the summary
    END {
        # First section: print grade distribution
        print "Grade distribution:"
        
        # Loop over all grade keys and print "Grade : Count"
        for (g in count)
            printf "%s : %d\n", g, count[g]

        # Second section: print major distribution
        print "\nMajor distribution:"
        
        # Loop over all majors found and print "Major : Count"
        for (m in major_count)
            printf "%s : %d\n", m, major_count[m]
    }

' clean/*.csv

# The summary is printed directly to the terminal, giving an immediate overview
# of the cleaned data. Add a redirection after the file list (for example
# "> summary.txt") if you want to save the output instead of displaying it.


Example 2: Automated Data Preparation

This example takes several small CSV files containing student scores, merges them into a single dataset, shuffles the rows, and then splits the data into training and test sets (70% training, 30% test). It also creates organized folders, writes summary information (how many rows go into each split), and saves all output files into a clean, structured directory.

#!/bin/bash
# The shebang above tells the system to run this script with the Bash interpreter.

# Print a message so the user knows the process has started
echo "Starting dataset preparation..."

# Create the directory structure for the prepared dataset.
# -p means "create parent folders if they do not exist"
mkdir -p prepared/train prepared/test

# Create the output file that will hold all merged data.
# Write the header manually to avoid header duplication later.
echo "name,score" > prepared/all_scores.csv

# Loop through all CSV files that match scores_part*.csv
# Each file has the same format: name,score
for file in scores_part*.csv; do
    
    # Print which file is currently being merged
    echo "Merging $file"

    # tail -n +2 skips the first line (header)
    # >> appends the remaining lines to the combined file
    tail -n +2 "$file" >> prepared/all_scores.csv
done

# Shuffle the merged data for randomness, keeping the header as the first line.
# The header is copied over first, then only the data rows (tail -n +2) are shuffled.
# --random-source=<(yes 42) feeds shuf a fixed byte stream, so the shuffle is reproducible.
head -n 1 prepared/all_scores.csv > prepared/shuffled_scores.csv
tail -n +2 prepared/all_scores.csv | shuf --random-source=<(yes 42) >> prepared/shuffled_scores.csv

# Count the total number of lines in the shuffled file
# wc -l prints only the number of lines when used with < redirection
lines=$(wc -l < prepared/shuffled_scores.csv)

# Calculate how many lines should go into the training set.
# (lines - 1) removes the header before calculating percentage
# 70% of the data goes into training
train_lines=$(( (lines - 1) * 7 / 10 ))

# Split the shuffled data into train and test sets.
# head: take header + N training rows
head -n $((train_lines + 1)) prepared/shuffled_scores.csv > prepared/train/train.csv

# tail: remaining rows go to test set
# +2 ensures we skip header and continue from the test portion
tail -n +$((train_lines + 2)) prepared/shuffled_scores.csv > prepared/test/test.csv

# Print summary information for the user
echo "Total records: $((lines - 1))"
echo "Training size: $train_lines"
echo "Test size: $((lines - 1 - train_lines))"

Feature Engineering with AWK

This script adds a new column that automatically labels students as High or Low performers based on their score, and saves the enriched dataset to a new CSV file.

# Use awk to process the CSV file. 
# -F',' sets the field separator to a comma so $1=name, $2=score.
awk -F',' '

    # NR==1 → this block runs only for the first line (the header row)
    NR==1 {
        # Add a new column name "label" to the header
        print $0",label"
        
        # Skip to the next line so the rest of the script does not process the header as data
        next
    }

    # This block runs for every data row (NR > 1)
    {
        # Create a new variable "label"
        # If score (column $2) is >= 85, label = "High", else label = "Low"
        label = ($2 >= 85 ? "High" : "Low")

        # Print the original fields plus the new label column
        # $1 = name, $2 = score
        print $1","$2","label
    }

# End of awk script. Now specify which file to read and where to write the output.
' prepared/train/train.csv > prepared/train/train_labeled.csv

What This Script Does

  • Reads the file prepared/train/train.csv.
  • Keeps the header, but adds a new column called label.
  • For every row:
      • If the score is 85 or higher, it assigns High.
      • Otherwise, it assigns Low.
  • Writes the enriched dataset to prepared/train/train_labeled.csv.
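
As a quick check, here is a tiny end-to-end run of the same awk command on made-up sample data (the names and scores below are purely illustrative):

#!/bin/bash
mkdir -p prepared/train
# Create a small, purely illustrative input file
printf 'name,score\nalice,91\nbob,72\ncarol,85\n' > prepared/train/train.csv

# Apply the same labeling rule as above
awk -F',' 'NR==1 {print $0",label"; next} {print $1","$2","($2 >= 85 ? "High" : "Low")}' \
    prepared/train/train.csv > prepared/train/train_labeled.csv

# Inspect the result: alice and carol should be labeled High, bob should be Low
cat prepared/train/train_labeled.csv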

In summary, Bash scripting is a powerful tool for automating AI and data-science workflows. With Bash, repetitive tasks become single commands, experiments become consistent, and your workflow becomes scalable.

Read the full article here: https://medium.com/@ryassminh/linux-for-ai-automating-your-workflow-with-bash-scripting-6d44020d42c7