Abstract
In this short article, we'll see how to store the plays of William Shakespeare in XML format in SingleStoreDB. We'll then connect from a client application and run some XPath expressions.
Introduction
In this article, we'll take 37 plays by William Shakespeare in an XML format and store them in SingleStoreDB after some data cleanup. Using Python, SQLAlchemy and an XML library, we'll connect to SingleStoreDB and run several example XPath expressions over the data.
Create a Deepnote account
We'll create a free account on the Deepnote website. Once logged in, we'll create a new Deepnote project to give us a new notebook.
Create a SingleStoreDB Cloud account
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use XML Demo Group as our Workspace Group Name and xml-demo as our Workspace Name. We'll make a note of our password and host name.
Create a Database and Table
In our SingleStore Cloud account, let's use the SQL Editor to create a new database and table:
CREATE DATABASE IF NOT EXISTS xml_db;
USE xml_db;
CREATE TABLE plays (
id INT PRIMARY KEY NOT NULL,
play LONGTEXT NOT NULL
);
Data Download, Formatting and Loading
Data Download
We can obtain all the plays of William Shakespeare in XML format in a single zip file, courtesy of Jon Bosak of Sun Microsystems. After unpacking the zip file, there is important information on copyright and distribution in an htm file. All the xml files are well-formed.
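A quick way to confirm well-formedness is simply to try parsing each file. Here's a small sketch using Python's standard library; the directory path is a placeholder:

```python
from pathlib import Path
from xml.etree import ElementTree

def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as XML."""
    try:
        ElementTree.fromstring(xml_text)
        return True
    except ElementTree.ParseError:
        return False

# Check every .xml file in the unpacked directory (placeholder path)
for path in Path("/path/to/input_dir").glob("*.xml"):
    print(path.name, is_well_formed(path.read_text()))
```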
Data Formatting
In each xml file, we'll remove all hidden control characters and the meta information at the beginning of the file. The following bash script will achieve this and write the combined output into a single file called all_plays.csv:
#!/bin/bash

# Set the input and output directories
input_dir="/path/to/input_dir"
output_dir="/path/to/output_dir"

# Check that input_dir and output_dir are not the same
if [ "$input_dir" -ef "$output_dir" ]; then
    echo "Error: input_dir and output_dir cannot be the same directory"
    exit 1
fi

# Initialise the loop counter
counter=1

# Loop over all files in the input directory
for file in "$input_dir"/*.xml
do
    # Check if the file is a regular file
    if [[ -f "$file" ]]; then
        # Extract the file name without the extension
        filename=$(basename "$file" .xml)

        # Remove control and newline characters using tr
        tr -dc '[:print:]' < "$file" > "$output_dir/$filename.csv.tmp"

        # Remove the XML declaration and DOCTYPE declaration using sed
        sed -e '1s/<?xml version="1.0"?>//' \
            -e '1s/<!DOCTYPE PLAY SYSTEM "play.dtd">//' \
            < "$output_dir/$filename.csv.tmp" > "$output_dir/$filename.csv"

        # Prepend the loop counter and the | delimiter to the output file
        sed -i "1s/^/$counter|/" "$output_dir/$filename.csv"

        # Remove the temporary file
        rm "$output_dir/$filename.csv.tmp"

        # Increment the loop counter
        counter=$((counter+1))
    fi
done

# Concatenate all output files into a single file
for file in "$output_dir"/*.csv
do
    if [[ -f "$file" ]] && [[ "$file" != "$output_dir/all_plays.csv" ]]; then
        cat "$file" >> "$output_dir/all_plays.csv"
        echo >> "$output_dir/all_plays.csv"
        rm "$file"
    fi
done
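For readers who prefer Python, the same cleanup could be sketched roughly as follows. The paths are placeholders, and `str.isprintable` only approximates `tr -dc '[:print:]'`:

```python
from pathlib import Path

input_dir = Path("/path/to/input_dir")                    # placeholder
output_file = Path("/path/to/output_dir") / "all_plays.csv"  # placeholder

def clean_play(text: str, counter: int) -> str:
    # Keep printable characters only (approximates tr -dc '[:print:]')
    text = "".join(ch for ch in text if ch.isprintable())
    # Drop the XML declaration and DOCTYPE declaration
    text = text.replace('<?xml version="1.0"?>', "", 1)
    text = text.replace('<!DOCTYPE PLAY SYSTEM "play.dtd">', "", 1)
    # Prepend the row id and the | delimiter
    return f"{counter}|{text}"

if input_dir.is_dir():
    with output_file.open("w") as out:
        for counter, path in enumerate(sorted(input_dir.glob("*.xml")), start=1):
            out.write(clean_play(path.read_text(), counter) + "\n")
```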
We'll replace /path/to/ with the actual path.
Data Loading
From output_dir, we'll launch a MySQL CLI client:
mysql -u admin -h <host> -P 3306 -p<password> --local-infile
We'll replace <host> and <password> with the values from our SingleStoreDB Cloud account.
We'll load the data, as follows:
USE xml_db;
LOAD DATA LOCAL INFILE 'all_plays.csv'
INTO TABLE plays
COLUMNS TERMINATED BY '|';
This will load 37 rows, one per play, into the plays table.
Deepnote notebook
First, we'll install the package:
!pip install sqlalchemy-singlestoredb
Deepnote will prompt us to add this to the requirements.txt file.
Next, we'll create a connection:
import sqlalchemy as s2

eng = s2.create_engine(
    "singlestoredb://admin:<password>@<host>:3306/xml_db"
)
We'll replace <password> and <host> with the values from our SingleStoreDB Cloud account.
We'll now get one of the plays:
with eng.connect() as conn:
    query = s2.text("SELECT play FROM plays WHERE play LIKE '%Romeo and Juliet%'")
    res = conn.execute(query)
    row = res.fetchone()
    play = row.play
We'll use the lxml library to get the root element of the document:
from lxml import etree
root = etree.fromstring(play)
Next, let's get the title of the play:
xpath_q1 = "/PLAY/TITLE"
play_title = root.xpath(xpath_q1)[0]
print(play_title.text)
The output should be as follows:
The Tragedy of Romeo and Juliet
Now, let's get the personae:
xpath_q2 = "/PLAY/PERSONAE/PERSONA"
personae = root.xpath(xpath_q2)
for persona in personae:
    print(persona.text)
The result should be as follows:
ESCALUS, prince of Verona.
PARIS, a young nobleman, kinsman to the prince.
An old man, cousin to Capulet.
ROMEO, son to Montague.
MERCUTIO, kinsman to the prince, and friend to Romeo.
BENVOLIO, nephew to Montague, and friend to Romeo.
TYBALT, nephew to Lady Capulet.
BALTHASAR, servant to Romeo.
PETER, servant to Juliet's nurse.
ABRAHAM, servant to Montague.
An Apothecary.
Three Musicians.
Page to Paris; another Page; an officer.
LADY MONTAGUE, wife to Montague.
LADY CAPULET, wife to Capulet.
JULIET, daughter to Capulet.
Nurse to Juliet.
Citizens of Verona; several Men and Women, relations to both houses; Maskers, Guards, Watchmen, and Attendants.
Let's find the number of times that Juliet is the speaker:
xpath_q3 = "count(//SPEECH[SPEAKER='JULIET'])"
juliet_count = root.xpath(xpath_q3)
print(juliet_count)
The result should be as follows:
118.0
We could also adapt an answer to a Stack Overflow question to find the number of times that Juliet speaks immediately after Romeo:
xpath_q4 = "count(//SPEECH[SPEAKER='JULIET' and preceding-sibling::SPEECH[1][SPEAKER='ROMEO']])"
romeo_and_juliet_count = root.xpath(xpath_q4)
print(romeo_and_juliet_count)
The result should be as follows:
37.0
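To make the preceding-sibling logic more concrete, the same check can be written in plain Python by walking sibling speeches in document order. This is a sketch over a tiny inline document, not the full play:

```python
from xml.etree.ElementTree import fromstring

play = fromstring(
    "<PLAY><ACT><SCENE>"
    "<SPEECH><SPEAKER>ROMEO</SPEAKER></SPEECH>"
    "<SPEECH><SPEAKER>JULIET</SPEAKER></SPEECH>"
    "<SPEECH><SPEAKER>JULIET</SPEAKER></SPEECH>"
    "</SCENE></ACT></PLAY>"
)

count = 0
# Compare each SPEECH with its immediately preceding sibling, scene by scene
for scene in play.iter("SCENE"):
    speeches = scene.findall("SPEECH")
    for prev, cur in zip(speeches, speeches[1:]):
        if prev.findtext("SPEAKER") == "ROMEO" and cur.findtext("SPEAKER") == "JULIET":
            count += 1
print(count)  # 1
```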
The lxml library provides many more capabilities, such as streaming large documents.
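For instance, documents too large to hold in memory can be processed incrementally with iterparse. Here's a minimal sketch using the standard library's ElementTree, whose iterparse API lxml mirrors:

```python
import io
from xml.etree.ElementTree import iterparse  # lxml.etree offers the same API

# A small in-memory stand-in for a large file opened in binary mode
xml_stream = io.BytesIO(
    b"<PLAY><SPEECH><SPEAKER>JULIET</SPEAKER></SPEECH>"
    b"<SPEECH><SPEAKER>ROMEO</SPEAKER></SPEECH></PLAY>"
)

count = 0
for event, elem in iterparse(xml_stream, events=("end",)):
    if elem.tag == "SPEECH":
        count += 1
        elem.clear()  # free the element's children as we go -- the point of streaming
print(count)  # prints 2
```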
Summary
In this short article, we have seen one method of storing XML data in SingleStoreDB. We have also seen how to successfully connect to our database using SQLAlchemy and how to use XPath expressions to query our data.