A lot has been said on this topic, mostly against. However, I find threading to be an unfortunately overlooked feature that is useful if you have the right use case. Among its advantages: a number of things, such as logging, are thread-safe when they are not safe across multiple processes.
Threading got easier with multiprocessing.dummy, which wraps Python threads in the multiprocessing API. That syntax is a lot simpler than working with the threading module directly.
A couple of arguments in favor of threads are reduced memory usage, since threads share one process space instead of running identical copies of the process, and, as a consequence, shared access to objects in that process, such as variables. One must take care to watch for actions that are not thread-safe. Built-in data types mostly are, but there are some operations that are not.
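For instance, a single append to a shared list is fine, but a read-modify-write such as incrementing a counter can lose updates when two threads interleave. Here's a minimal sketch of guarding that with a lock (the counts dict and record() function are made up for illustration):

import threading
from multiprocessing.dummy import Pool

counts = {"ok": 0, "fail": 0}   # shared by every thread in the pool
lock = threading.Lock()

def record(result):
    # counts[result] += 1 is a read-modify-write; without the lock two
    # threads can interleave and one increment can be lost.
    with lock:
        counts[result] += 1

with Pool(20) as mypool:
    mypool.map(record, ["ok", "fail", "ok", "ok"])

print(counts)   # {'ok': 3, 'fail': 1}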
One example I did this year was a port check across multiple servers. It's not a real-world one, as the correct answer is "get a monitoring system," but it was an interview coding challenge, and threading was an easy way to hit the "execute quickly for 100, 1,000, or 10,000 hosts" target. In that case, though, my function didn't return anything. Instead, I used a global dict with the hosts as keys, and the threaded function updated the values directly (be very careful with your input if you do that! I had de-duplicated mine).
Use case: Read the inventory API fast!
A prior job had a new in-house service inventory system and ran Zabbix for monitoring. Updating Zabbix had long been a completely manual process. I had recently been moved from an Ops Automation team to the Zabbix team, but I was also planning to leave in the coming year, so I focused on what we could accomplish in automation.
The goal was to be able to run the job at least once per hour. That meant it needed to execute in at most two minutes, preferably a lot less. When done serially, the inventory retrieval alone took three minutes. Understand here that monitoring, while vital, wasn't near the front of the queue for resources. We were running on some of the oldest hardware still in that server room!
The inventory retrieval process was to connect to an API endpoint that returned the IDs for every customer, then, for each of those, connect to an endpoint that returned the information about that customer. The latter was where we could take advantage of parallel execution, and pool.map() was a good solution.
pool.map() takes two required arguments: a function to call and an iterable of the data to pass to that function, one item per call.
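As a toy example (the square() function and the input list are just for illustration):

from multiprocessing.dummy import Pool

def square(n):
    return n * n

with Pool(4) as mypool:
    print(mypool.map(square, [1, 2, 3, 4]))   # prints [1, 4, 9, 16]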
In a fancy mixture of Python and pseudocode, here’s what the final result was:
from multiprocessing.dummy import Pool

threads = 20

def retrieve_customer_list():
    # [ retrieve customer ID list JSON and clean into a list ]
    return custid_list  # will be empty if we got nothing

def retrieve_each_customer(custid):
    # [ here goes code to retrieve the JSON and process the result ]
    # [ create a dict with key of custid and result as value ]
    return this_custid_dict

customer_list = retrieve_customer_list()
mypool = Pool(processes=threads)  # creates a pool of X threads
all_custid_results = mypool.map(retrieve_each_customer, customer_list)
mypool.close()  # won't be reusing this pool
mypool.join()   # block until all workers have finished
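If you're curious what the bracketed per-customer step might look like in practice, here's a sketch using requests; the endpoint URL, field handling, and error handling are all assumptions, since the real ones depended on the in-house API:

import requests

API_BASE = "https://inventory.example.com/api"   # hypothetical endpoint

def retrieve_each_customer(custid):
    # Fetch one customer's record and return it keyed by its ID, so that
    # map() hands back a list of one-entry dicts.
    try:
        resp = requests.get(f"{API_BASE}/customers/{custid}", timeout=10)
        resp.raise_for_status()
        return {custid: resp.json()}
    except requests.RequestException as err:
        return {custid: {"error": str(err)}}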
The result is a list of dictionaries, which is easily consolidated into a single dict:

all_custid_results_dict = {}
for this_dict in all_custid_results:
    all_custid_results_dict.update(this_dict)
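An equivalent one-liner, if you prefer comprehensions:

all_custid_results_dict = {custid: result
                           for d in all_custid_results
                           for custid, result in d.items()}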
Notes: 20 is my standard thread count. I've tried a variety of values on various configurations, from 8 cores and 32 GB of RAM down to 1 core and 2 GB, and found that the best runtime was at 19-20 threads.
Use case: port monitor.
No, this isn’t something that should ever be in production. The right answer is “use a proper monitoring service.” However, it was an interview coding challenge.
The goal here is to check thousands of ports rapidly and consolidate the results. For this I took a shortcut: my data structure was a dictionary where the keys were the hostnames and the values were lists storing the last 5 check results (a host was considered "down" after 5 consecutive failures).
from multiprocessing.dummy import Pool
import time

threads = 20
repeat_seconds = 300

def parse_input(hostlist):
    # [ takes the list of hosts provided on the console, de-duplicates them,
    #   and builds the dictionary ]
    return this_dictionary, deduped_hosts

def do_checks(host):
    # [ check the port
    #   append the result to the list for this host
    #   purge old results if the list contains more than 5 ]
    return  # nothing to return as it was updated in a global dictionary

def report():
    # [ walks the results dictionary, checking each value for failures
    #   to report as either flapping or down ]
    return  # nothing as it's console output

results_dict, deduped_hosts = parse_input(cli_hosts)
mypool = Pool(processes=threads)  # creates a pool of X threads

while True:
    mypool.map(do_checks, deduped_hosts)
    report()
    time.sleep(repeat_seconds)
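For the bracketed check itself, a plain TCP connect with a timeout covers "is the port open." The port number, timeout, and five-result cap below are assumptions to keep the sketch self-contained; results_dict is the global dictionary built by parse_input(). Since each host goes to exactly one worker per map() pass, no two threads touch the same host's list at once:

import socket

PORT = 22            # assumed port; the challenge specified which to check
MAX_RESULTS = 5      # only the last five results matter

def do_checks(host):
    # Try a TCP connect and record True/False in the shared results dict.
    try:
        with socket.create_connection((host, PORT), timeout=2):
            result = True
    except OSError:
        result = False
    results_dict[host].append(result)
    # Keep only the most recent MAX_RESULTS entries for this host.
    del results_dict[host][:-MAX_RESULTS]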
I tested on my desktop, an ASUS VivoMini VC66, using hosts file entries and nginx to mock the hosts and ports. It was able to do 10,000 checks in 15-second iterations, and it got me to the final round of interviews.