Optimizing MySQL insertions at scale (Python 3.7)

Subhayan Ghosh
Published in Analytics Vidhya · 4 min read · Jan 30, 2021


MySQL transactions can get pretty expensive at scale (say, inserting 2.5M rows into a table). Although the largest packet that can be transmitted to or from a MySQL 8.0 server or client is 1GB, sending one giant statement is rarely efficient. As my Data Structures and Algorithms professor put it:

As Engineers/Computer Scientists we are not only interested in finding out a possible solution to a problem (as Mathematicians do), but also in developing an efficient and optimized solution to the problem.

In this blog, we are going to learn how to optimize expensive MySQL transactions at scale in an efficient way. Before diving in, I would like you to read my previous blog on — Using AWS Lambda environment variables from local .env file.

Let’s assume we have a text file containing 2.5 million names and contact numbers of random people around the world (for simplicity we will consider only one entity i.e. “name” throughout our example) which we want to transfer into a MySQL table — random.student (random is the DB name). We can insert them into the table in three possible ways:

  • Inserting one at a time — This transaction will be too expensive in terms of time complexity but still feasible.
  • Inserting all at once — This is not feasible since building the insert query for such a heavy transaction will be too expensive.
  • Inserting in small packets — This is efficient as well as feasible in terms of both time complexity and execution.
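To make the third option concrete, here is a minimal sketch (with made-up sample names standing in for the real 2.5M) of how a list splits into fixed-size batches:

```python
BATCH_SIZE = 250

# made-up sample data standing in for the 2.5M real names
names = [f"student{i}" for i in range(1000)]

# slice the list into consecutive batches of at most BATCH_SIZE rows
batches = [names[i:i + BATCH_SIZE] for i in range(0, len(names), BATCH_SIZE)]

print(len(batches))     # 4 batches
print(len(batches[0]))  # 250 rows in the first batch
```

Each batch becomes one INSERT statement, so we trade 2.5M round trips (option one) for about ten thousand.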

Let’s begin by creating a Python file and writing a function to connect to our remote or local MySQL instance (I will show the example with a remote AWS RDS instance). I have my DB credentials stored in a separate config file which I will fetch using the Python dotenv module. You can do the same or hardcode your credentials, though the former is the better practice.

import os

import pymysql
from dotenv import load_dotenv

# load DB credentials from the .env file into environment variables
load_dotenv()
hostName = os.environ.get('RDS_HOST')
name = os.environ.get('NAME')
password = os.environ.get('PASSWORD')
port = os.environ.get('PORT')

def connectDB():
    try:
        conn = pymysql.connect(host=hostName, port=int(port),
                               user=name, passwd=password)
        print("SUCCESS: Connection to RDS MySQL instance succeeded")
        return conn
    except Exception:
        print("ERROR: Unexpected error: Could not connect to MySQL instance.")
        raise  # re-raise so the caller sees the original error and traceback

Now we have to read the student names from the text file and insert them into the table in small packets, i.e. 250 rows at a time. We will first read the names from the text file and store them in a list, which we will later iterate over in small batches to create the insert query.

names = []
with open('studentFile.txt') as f:
    lines = f.readlines()
    for data in lines:
        name = data.rstrip('\n')  # strip the trailing newline
        names.append(name)
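A side note: readlines() pulls the entire 2.5M-line file into memory at once, which works here but may not for larger files. Iterating the file object directly reads one line at a time instead. A sketch under that assumption (read_names is my own helper name, not from the original code):

```python
def read_names(path):
    """Yield one stripped name per line without loading the whole file."""
    with open(path) as f:
        for line in f:
            name = line.rstrip('\n')
            if name:  # skip blank lines
                yield name
```

You could then feed the generator into the batching loop directly instead of building the full list first.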

Once we have the list of names, we will create two functions: insertNames() and buildQuery(). The insertNames() function calculates the total number of iterations required to insert all the names based on our BATCH_SIZE and runs a for loop over them. In each iteration it calls buildQuery(), passing the slice of names for that batch, to build the VALUES clause of the INSERT query.

import math

def insertNames():
    BATCH_SIZE = 250
    # ceil, not round: round() can leave the final partial batch behind
    iterationNumber = math.ceil(len(names) / BATCH_SIZE)
    conn = connectDB()
    cursor = conn.cursor()
    print(f"Total iterations to make: {iterationNumber}")
    for i in range(iterationNumber):
        try:
            namesToUse = names[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
            insertQuery = buildQuery(namesToUse)
            # run sql query
            query = f"INSERT INTO random.student(name) VALUES {insertQuery}"
            cursor.execute(query)
            conn.commit()
            print(f"MySQL insertion {i} completed")
        except Exception as e:
            print(f"Exception while inserting batch {i} - ignore. Error: {e}")

buildQuery() returns the VALUES clause of the INSERT query, which insertNames() uses to make the MySQL insertion:

def buildQuery(namesToUse):  # avoid shadowing the built-in range()
    insertQuery = ""
    for name in namesToUse:
        insertQuery = insertQuery + f"('{name}'), "
    insertQuery = insertQuery[0:-2]  # drop the trailing ", "
    return insertQuery
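A quick sanity check of what buildQuery() produces (the function is reproduced here so the snippet stands alone):

```python
def buildQuery(namesToUse):
    insertQuery = ""
    for name in namesToUse:
        insertQuery = insertQuery + f"('{name}'), "
    return insertQuery[0:-2]  # drop the trailing ", "

values = buildQuery(["Alice", "Bob", "Carol"])
print(values)  # ('Alice'), ('Bob'), ('Carol')
```

Appended to `INSERT INTO random.student(name) VALUES `, this gives one multi-row statement per batch.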

We just need to call insertNames() now to carry out the insertion, and we are done.
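One caveat worth knowing: building the VALUES string with f-strings breaks on names containing quotes (e.g. O'Brien) and is open to SQL injection. pymysql cursors also support executemany(), which lets the driver escape each value itself. A sketch of that alternative, using a stub cursor in place of a live connection so it runs anywhere:

```python
def insert_batch(cursor, batch):
    # parameterized query: the driver escapes each value safely
    cursor.executemany(
        "INSERT INTO random.student(name) VALUES (%s)",
        [(name,) for name in batch],
    )

# stub cursor so the sketch runs without a live MySQL connection
class FakeCursor:
    def __init__(self):
        self.calls = []

    def executemany(self, sql, params):
        self.calls.append((sql, list(params)))

cur = FakeCursor()
insert_batch(cur, ["Alice", "O'Brien"])
print(cur.calls[0][1])  # [('Alice',), ("O'Brien",)]
```

With a real pymysql cursor, executemany() batches the rows into a multi-row INSERT for you, so you keep the batching benefit without hand-building the query string.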

To summarize:

  1. We read the data from a text file and stored it in a list.
  2. We developed a function, buildQuery(), to iterate over a batch of the list and build the VALUES clause of our insertion query.
  3. We calculated the total number of iterations required to transfer the whole dataset in small batches, and called buildQuery() in each iteration to make the insertion.

Additional tip: You can profile each section of the code using Python’s time module to check its efficiency and keep track of every step.
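For example, time.perf_counter() brackets a section and reports its wall-clock duration (here timing the query build for one sample batch):

```python
import time

start = time.perf_counter()
# section under test: build one batch's VALUES string from sample data
values = ", ".join(f"('student{i}')" for i in range(250))
elapsed = time.perf_counter() - start
print(f"Built {values.count('(')} value tuples in {elapsed:.6f}s")
```

Wrapping connectDB(), the file read, and each cursor.execute() the same way quickly shows where the time actually goes.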

Thanks for reading through. I will highly appreciate your reviews and suggestions. Do let me know in the comments if you have some other optimization techniques in mind.

If you found it useful, please give me a clap and share it with your fellow devs. Don’t hardcode, code hard!

Subhayan Ghosh

LinkedIn | Twitter | Instagram
