In a previous article, How to build a scaleable crawler to crawl million pages, I wrote about building a scalable crawler with Docker Compose. Now imagine a scenario where you have lots of servers around the world: do you still need to install the requirements, configure the system, and run the script on each one, one by one? Docker and Docker Compose alone can't help you out here, and Docker Swarm and Kubernetes seem too complicated for such a project.
So which software should we choose to accomplish it quickly?
As mentioned in the last paragraph of the previous post, it's Fabric.
Fabric, a tool to automate administration tasks and deployments
Fabric is a Python (2.5–2.7) library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
It provides a basic suite of operations for executing local or remote shell commands (normally or via sudo) and uploading/downloading files, as well as auxiliary functionality such as prompting the running user for input, or aborting execution.
As the official description above suggests, it's a tool that can be used in a wide range of scenarios.
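For example, a minimal fabfile that runs a single command on every configured server might look like this (the host below is just a placeholder):
from fabric.api import env, run

env.hosts = ['10.211.55.12']  # placeholder; replace with your own servers
env.user = "root"

def uptime():
    # Runs "uptime" over SSH on every host in env.hosts; invoke with: fab uptime
    run("uptime")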
Most of the code in test_celery is the same as in the last post, with slight changes that I will talk about later.
.
├── test_celery
| ├── __init__.py
| ├── celeryapp.py
| ├── run_tasks.py
| └── tasks.py
└── fabfile.py
Structure of the project
How to write the fabfile
Note: Before writing the fabfile, install Fabric with pip on your machine.
pip install fabric
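Note that this post uses the Fabric 1.x API (fabric.api and env). Newer Fabric releases (2.x and later) removed that module, so if the import fails you may need to pin an older version:
pip install "fabric<2"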
First, configure the servers
Fabric uses env to store server information such as IP address, port, username, and password. If all the servers share the same username and password, you only need to set env.user and env.password once; a non-default SSH port (the default is 22) can be appended to each host entry, as below:
from fabric.api import *

env.hosts = [
    '10.211.55.12:32772',
    '10.211.55.12:32773',
    '10.211.55.12:32774',
]

# Set the username
env.user = "root"

# Set the password [NOT RECOMMENDED]
env.password = "test123"
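If the servers do not all share the same credentials, Fabric 1.x also supports per-host passwords through env.passwords (keyed by the full user@host:port string), or SSH keys via env.key_filename. A minimal sketch with made-up hosts and passwords:
from fabric.api import env

env.hosts = [
    'root@10.211.55.12:32772',
    'deploy@10.211.55.13:22',
]

# Per-host passwords, keyed by the full "user@host:port" string
env.passwords = {
    'root@10.211.55.12:32772': 'test123',
    'deploy@10.211.55.13:22': 'another-password',
}

# Or authenticate with a private key instead of passwords
# env.key_filename = '~/.ssh/id_rsa'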
Second, write the function
@parallel
def install():
    # Command that starts the Celery workers on the remote machine
    cmd = "celery -A test_celery worker --app=test_celery.celeryapp:app --concurrency=10 --loglevel=debug"

    # Install dtach, pip, and the Python dependencies on the remote server
    run("apt-get install dtach && apt-get install -y python-pip && pip install celery && pip install requests && pip install pymongo")

    # warn_only lets the task continue even if the command fails
    with settings(warn_only=True):
        runResult = run("mkdir -p /root/code/test_celery")
        if runResult.return_code == 1:
            print 'file exists'

    # Copy the project files to the remote server
    put("./test_celery/celeryapp.py", "/root/code/test_celery")
    put("./test_celery/run_tasks.py", "/root/code/test_celery")
    put("./test_celery/tasks.py", "/root/code/test_celery")
    put("./test_celery/ua.txt", "/root/code/test_celery")
    put("./test_celery/__init__.py", "/root/code/test_celery")

    # Start the workers in the background with dtach
    with cd("/root/code"):
        result = run("ls -l")
        run('dtach -n `mktemp -u /tmp/%s.XXXX` %s' % ('dtach', cmd))
    return result
Something to note here is @parallel: without it, the servers will run these commands one by one, which is very time-consuming.
· The command stored in cmd is executed in the background with dtach once the installation and file transfers are done.
· The shell command in the first run() call installs the required software on the remote server.
· With settings(warn_only=True), if the command in runResult = run("mkdir -p /root/code/test_celery") fails, the task keeps going and you can catch the error by checking runResult.return_code == 1.
· put() copies all the project files into test_celery, and cd("/root/code") enters the directory from which the worker command should be run.
Some backgrounding tools like nohup don't work well with Fabric, most likely because Fabric tears down the SSH session (and its pseudo-terminal) as soon as the task returns, so if you want to run a command in the background, dtach or screen is a better choice. We chose dtach for this case.
Note: I haven't found a good solution for saving the stdout to a local file; if you have a better one, please let me know. Thanks.
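One possible workaround (a rough sketch I haven't battle-tested; the paths are just examples) is to have the worker write its output to a log file on the remote machine via Celery's --logfile option, then pull that file back with Fabric's get():
from fabric.api import run, get

def start_worker_with_log():
    # Start the worker with its output going to a remote log file
    cmd = ("celery -A test_celery worker --app=test_celery.celeryapp:app "
           "--concurrency=10 --loglevel=debug --logfile=/root/code/worker.log")
    run('dtach -n `mktemp -u /tmp/dtach.XXXX` %s' % cmd)

def fetch_log():
    # Download the remote log; %(host)s keeps one local file per server
    get("/root/code/worker.log", "./logs/%(host)s-worker.log")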
Finally, let's run it.
from fabric.api import *

env.hosts = [
    '10.211.55.12:32772',
    '10.211.55.12:32773',
    '10.211.55.12:32774',
]

# Set the username
env.user = "root"

# Set the password [NOT RECOMMENDED]
env.password = "test123"

@parallel
def install():
    cmd = "celery -A test_celery worker --app=test_celery.celeryapp:app --concurrency=10 --loglevel=debug"
    run("apt-get install dtach && apt-get install -y python-pip && pip install celery && pip install requests && pip install pymongo")
    with settings(warn_only=True):
        runResult = run("mkdir -p /root/code/test_celery")
        if runResult.return_code == 1:
            print 'file exists'
    put("./test_celery/celeryapp.py", "/root/code/test_celery")
    put("./test_celery/run_tasks.py", "/root/code/test_celery")
    put("./test_celery/tasks.py", "/root/code/test_celery")
    put("./test_celery/ua.txt", "/root/code/test_celery")
    put("./test_celery/__init__.py", "/root/code/test_celery")
    with cd("/root/code"):
        result = run("ls -l")
        run('dtach -n `mktemp -u /tmp/%s.XXXX` %s' % ('dtach', cmd))
    return result
With all the configuration finished, the complete fabfile should look like the above. Then go back to the directory where fabfile.py is located and run:
fab install
When the command is executed, you will see output in your terminal like the following, with all three servers executing the task in parallel.
tony@tony-MBP:~/Desktop/Test$ fab install
[root@10.211.55.12:32772] Executing task 'install'
[root@10.211.55.12:32773] Executing task 'install'
[root@10.211.55.12:32774] Executing task 'install'
[root@10.211.55.12:32774] run: apt-get install dtach && apt-get install -y python-pip && pip install celery && pip install requests && pip install pymongo
[root@10.211.55.12:32773] run: apt-get install dtach && apt-get install -y python-pip && pip install celery && pip install requests && pip install pymongo
[root@10.211.55.12:32772] run: apt-get install dtach && apt-get install -y python-pip && pip install celery && pip install requests && pip install pymongo
If you haven't added the @parallel decorator to the install function, the command fab -P -z 3 install will execute the task in parallel as well: -P enables parallel execution and -z limits the size of the process pool (see Fabric's documentation on parallel execution for details).
Now, if you run ps -ef | grep celery on each of the three remote machines, you will find 10 Celery worker processes running in the background.
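Instead of logging in to every machine by hand, you could also add a small (hypothetical) task like this to the same fabfile and run fab status to check all hosts at once:
@parallel
def status():
    # List the Celery worker processes on every host;
    # the [c]elery pattern keeps grep itself out of the output
    with settings(warn_only=True):
        run("ps -ef | grep [c]elery")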
All the workers are ready to go!
python -m test_celery.run_tasks
Let's send some tasks to these workers with the command above. Once it has executed, you will find something interesting happening in the message broker, RabbitMQ.
If you log into the RabbitMQ server and run the command below, the details of every queue show up. The celery queue holds 9,164 messages, of which 9,004 are ready for consumers and 160 are unacknowledged (currently being processed), and the result queue at the bottom holds 816 messages for tasks the workers have already finished.
root@rabbit:/# rabbitmqctl -q list_queues name messages messages_ready messages_unacknowledged
celery 9164 9004 160
celeryev.a38b8fc2-7cc9-4cea-acc1-50cf6120601d 0 0 0
celeryev.fdc57883-aa62-4623-95b5-29b713ff51f4 0 0 0
celery@33a8b4dd4e57.celery.pidbox 0 0 0
celery@ee6b8257af81.celery.pidbox 0 0 0
celeryev.0a6a638a-9f94-4767-a4e0-a708467a5363 0 0 0
celeryev.be44ef35-a82f-4461-aea4-8f82a5363d9b 0 0 0
celery@99d2501f2d9e.celery.pidbox 0 0 0
37df9fdf-2d62-3106-8157-8800695cb9d4 816 816 0
Conclusion
With Celery, RabbitMQ, and Fabric, a distributed crawler system running on machines all over the world becomes much simpler to develop and deploy. If you want it to work faster, just deploy more workers with Fabric. Nothing in the code needs to change, and everything scales.
In the next tutorials of this series, I will talk more about advanced usage of Celery, RabbitMQ, Fabric, and Docker. If you have any questions or experiences to share about them or about distributed systems in general, you are welcome to discuss them here!