程式扎記

Preface
Learn how to publish your own python packages

In this article, we will look at how to develop a Python package and publish it to PyPI for distribution. The article touches on some advanced concepts, but each one is introduced in detail, with a walkthrough of every step. Python programmers routinely use packages such as Pandas and NumPy in their applications to make them more robust and to build on the rich functionality these libraries provide. To use such a package in your code, you first install it on your machine and then import it.

The idea will seem familiar if you have used pip, the well-known package management tool for Python. To install a package, we run pip install <package_name> and the package is installed on our machine. These packages are hosted in a central repository known as PyPI, the Python Package Index, which is the official third-party software repository for Python packages. Whenever you run pip install, pip looks the package up in this repository, then downloads and installs it on your machine or into your virtual environment.

What is a Python Package
Before developing packages in Python, you should know what a module and a package are. Any code that you write in a .py file is known as a module. Modules can be imported into other modules. A collection of modules that serve a specific purpose can be grouped together to form a package. A package can also contain code organized into directories and subdirectories. You can read more about Python modules and packages in the official documentation.
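To make the distinction concrete, here is a minimal sketch that builds a throwaway package on disk and imports a module from it. The names demo_pkg and greetings are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: demo_pkg/ containing __init__.py and one module.
# "demo_pkg" and "greetings" are hypothetical names used only for this sketch.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # marks the directory as a regular package
(pkg_dir / "greetings.py").write_text(
    "def hello(name):\n"
    "    return f'Hello, {name}!'\n"
)

# Make the parent directory importable, then import the module from the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
greetings = importlib.import_module("demo_pkg.greetings")

message = greetings.hello("PyPI")
print(message)
```

Here greetings.py is the module and the demo_pkg directory is the package that contains it.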

The steps to publish a Python package are fairly simple:

1. Write your python module and save it under a directory.
2. Create the setup.py file with the necessary information.
3. Choose a LICENSE and provide a README file for your project.
4. Generate the distribution archives on the local machine.
5. Try installing the package on your local machine.
6. Publish the package to the TestPyPI repository to check if all works well.
7. Finally, publish the package to the PyPI repository.

Let us now perform each of the above steps one by one to publish our package to the PyPI repository.

Create the python package and the directory structure and other files.
You should first decide on the name of your package and then create a directory with that name. Let us assume we are going to publish a package named “ml_data_gen”, so the directory should have the same name. Place a data_gen.py file under it.

Additionally, you should include a setup.py file, a README.md and a LICENSE file under the root directory of your project. You can read more about licensing from the GitHub link. At this point, your project structure should look something like this:

Now, let us start editing the setup.py file. You can use the following snippet to update your setup file:

import setuptools  
  
with open("README.md", "r") as fh:  
    long_description = fh.read()  
  
setuptools.setup(  
    name="ml_data_gen",                     # This is the name of the package  
    version="0.0.1",                        # The initial release version  
    author="John Lee",                      # Full name of the author  
    description="ml_data_gen Package for generating testing data",  
    long_description=long_description,      # Long description read from the README file  
    long_description_content_type="text/markdown",  
    packages=setuptools.find_packages(),    # List of all python modules to be installed  
    classifiers=[  
        "Programming Language :: Python :: 3",  
        "License :: OSI Approved :: MIT License",  
        "Operating System :: OS Independent",  
    ],                                      # Information to filter the project on PyPi website  
    python_requires='>=3.6',                # Minimum version requirement of the package  
    py_modules=["ml_data_gen"],             # Name of the python package  
    package_dir={'':'ml_data_gen/src'},     # Directory of the source code of the package  
    install_requires=[]                     # Install other dependencies if any  
)  

Once the setup file is ready, the final step is to update the README.md file. It is just a Markdown file that documents your package, both on its PyPI page once deployed and on GitHub.

## <font color='darkblue'>Introduction</font>  
This repo is used to generate testing data used in ML for quick evaluation.  

Generate the distribution archives on local machine.
Now that the code for the python package is almost complete, you can start building the distribution archives. Archives are compressed files that help your package to be deployed across multiple platforms and also make it platform independent. In order to generate the distribution archives, run the following command from your terminal:

# pip install --upgrade setuptools wheel

This upgrades setuptools and wheel on your machine to their latest versions. After this, run the following command from the root directory of your package to generate the distribution files.

# python setup.py sdist bdist_wheel

Once you run the above command, the distribution archives are written to the newly created dist directory, alongside a build directory that holds intermediate files. You will also see that egg-info metadata has been generated in the project source tree.
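If you are curious about what the build produced, the archives can be opened with the standard library: a wheel (.whl) is a zip archive and the source distribution is a gzipped tarball. The helper below is a small sketch; list_archive_members is a hypothetical name, and the script assumes the build has already populated ./dist.

```python
import tarfile
import zipfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the file names stored in a wheel (.whl) or source archive (.tar.gz)."""
    path = Path(archive_path)
    if path.suffix == ".whl":
        # A wheel is a zip archive with a standardized layout.
        with zipfile.ZipFile(path) as wheel:
            return wheel.namelist()
    if path.name.endswith(".tar.gz"):
        with tarfile.open(path, "r:gz") as sdist:
            return sdist.getnames()
    raise ValueError(f"Unrecognized distribution archive: {path.name}")


if __name__ == "__main__":
    # Assumes the build command has already been run in this directory.
    for archive in sorted(Path("dist").glob("*")):
        print(archive.name)
        for member in list_archive_members(archive):
            print("   ", member)
```

Checking the member list is a quick way to confirm that data_gen.py actually made it into the archives before you upload anything.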

Install the package on local machine.
Now that we have our distribution files ready, we can try installing and importing the package to test that it works. To install the package on your local machine, create and activate a virtual environment, then run the install command from the root directory:

# virtualenv env
# source env/bin/activate

// -e, --editable <path/url>: Install a project in editable mode (i.e. setuptools "develop mode") from a local project path or a VCS url.
# pip install -e .
Obtaining file:///root/Github/ml_data_gen
Installing collected packages: ml-data-gen
Running setup.py develop for ml-data-gen
Successfully installed ml-data-gen

Next, let's see how to use the installed package:

>>> from ml_data_gen.data_gen import *
>>> hmm_data = HMMData(['a', 'b'], [[0.2, 0.8], [0.7, 0.3]])
>>> list(hmm_data.get_gen(10))
['a', 'a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a']
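The source of data_gen.py is not shown in this article, so the following is only a guess at what a class with this interface could look like: a plain first-order Markov chain that takes a list of states and a transition matrix. Everything in it, including the seed parameter, is an assumption made for illustration and not the package's actual implementation.

```python
import random


class HMMData:
    """Hypothetical stand-in for the class used above: a first-order Markov chain."""

    def __init__(self, states, transitions, seed=None):
        if len(transitions) != len(states):
            raise ValueError("need one transition row per state")
        self.states = list(states)
        self.transitions = [list(row) for row in transitions]
        self._rng = random.Random(seed)  # seed is an assumed, added parameter

    def get_gen(self, count):
        """Yield `count` states, starting from a uniformly chosen state."""
        current = self._rng.randrange(len(self.states))
        for _ in range(count):
            yield self.states[current]
            # Pick the next state, using the current state's row as weights.
            current = self._rng.choices(
                range(len(self.states)), weights=self.transitions[current]
            )[0]


hmm_data = HMMData(["a", "b"], [[0.2, 0.8], [0.7, 0.3]], seed=0)
sample = list(hmm_data.get_gen(10))
print(sample)
```

Because get_gen is a generator, wrapping it in list() as the session above does is what actually produces the ten values.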

Publish the package to TestPyPI
Once the package is installed locally and works fine, it is ready to be shipped to the TestPyPI repository. This is a separate test repository where you can check that the package uploads and installs without issues. It keeps experiments isolated from the official PyPI repository, so that only thoroughly tested packages are deployed to production.

Navigate to https://test.pypi.org/ and register yourself as a user:

Once you are registered, open your terminal and run the following command. This will install a package called “twine” on your machine that will help ship the python package to the repositories:

# pip install --upgrade twine

You can read the official documentation about packaging Python applications, and about twine, here. After twine is installed, configure your login credentials for PyPI and TestPyPI:

# vi ~/.pypirc
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = <account_name>
password = <account_credential>

[testpypi]
repository = https://test.pypi.org/legacy/
username = <account_name>
password = <account_credential>

Then run the following command to ship the code to TestPyPI first. If twine does not find credentials in ~/.pypirc, it will prompt for the ones you registered with in the previous step:

# python3 -m twine upload --repository testpypi dist/*
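If the upload fails with a configuration error, it can help to sanity-check ~/.pypirc first. The file is INI-formatted, so the standard configparser module can read it. The sketch below works on an inline sample and uses a hypothetical helper, missing_sections. The sample uses __token__ as the username, which is the convention when authenticating with a PyPI API token instead of an account password.

```python
import configparser

# Same shape as the ~/.pypirc shown above, with placeholder credentials.
SAMPLE_PYPIRC = """
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository = https://upload.pypi.org/legacy/
username = __token__
password = <account_credential>

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = <account_credential>
"""


def missing_sections(pypirc_text):
    """Return index servers listed under [distutils] that lack their own section."""
    # Interpolation is disabled so a '%' inside a credential is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(pypirc_text)
    servers = parser.get("distutils", "index-servers").split()
    return [name for name in servers if not parser.has_section(name)]


print(missing_sections(SAMPLE_PYPIRC))
```

An empty list means every server named under index-servers has a matching section. To check your real file, read it with pathlib.Path.home() / ".pypirc" and pass its text to the same function.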

Back to your TestPyPi Page:

The package has now been shipped to the TestPyPI repository and is listed on your TestPyPI page. To install the package from the test repository, first uninstall the existing local package and then install it again from TestPyPI:

# pip uninstall ml-data-gen
# pip install -i https://test.pypi.org/simple/ ml-data-gen
# pip freeze | grep ml
ml-data-gen==0.0.1

This will install the package on the local system from the TestPyPi repository.
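The same check that pip freeze performs can be done from Python with importlib.metadata (available from Python 3.8). This is a small sketch; installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# "ml-data-gen" is the distribution name used in this walkthrough.
print(installed_version("ml-data-gen"))
```

In an environment where the install above succeeded, this should report the same version that pip freeze lists. It prints None when the package is not installed.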

Publish the package to the PyPI repository
Now that everything works well with our package, it is time to publish it to the official PyPI repository. Register an account on PyPI in the same way, then run the following command to ship the package to the official repository:

# python3 -m twine upload dist/*

Congratulations! You have created and published your first Python package!

Preface
This article shows how to apply Python multiprocessing to your long-running functions.

What is multiprocessing
Basically, multiprocessing means running two or more tasks in parallel. In Python, we can use the built-in multiprocessing module to achieve that. Imagine you have a function that takes ten seconds to run, and you need to run it ten times. Run sequentially, that takes a hundred seconds. That is where multiprocessing comes in: by splitting the ten calls across ten sub-processes, you can complete them all in roughly ten seconds, provided enough CPU cores are available.
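The arithmetic behind that claim can be written down as a best-case estimate. The function below is an idealized sketch, not a measurement: ideal_wall_time is a hypothetical name, and it ignores process start-up cost and assumes one free CPU core per worker.

```python
import math


def ideal_wall_time(task_seconds, task_count, workers):
    """Best-case elapsed time when identical tasks are spread over `workers` processes.

    Ignores process start-up cost and assumes one free CPU core per worker.
    """
    rounds = math.ceil(task_count / workers)  # batches of tasks run back to back
    return rounds * task_seconds


print(ideal_wall_time(10, 10, workers=1))   # sequential: 100
print(ideal_wall_time(10, 10, workers=10))  # one process per task: 10
print(ideal_wall_time(10, 10, workers=4))   # only four workers: 30
```

The last line shows why the speed-up depends on the hardware: with fewer workers than tasks, the tasks run in several rounds.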

Difference between multiprocessing and multithreading
You may wonder why we use multiprocessing instead of multithreading. Multithreading can work for the example above, but if your function needs more processing power and memory, multiprocessing is the better fit: each sub-process runs its own Python interpreter with its own memory space and can be scheduled on a separate CPU core. Threads within a single CPython process, by contrast, are constrained by the Global Interpreter Lock (GIL), which lets only one thread execute Python bytecode at a time.
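One way to see the difference for yourself is to run the same CPU-bound function through a thread pool and a process pool and compare the timings. The sketch below uses concurrent.futures from the standard library. The helper names and the workload size are arbitrary choices for illustration, and the timings you get will depend on your machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_divisors(number):
    """CPU-bound work: count the divisors of `number` by trial division."""
    return sum(1 for i in range(1, number + 1) if number % i == 0)


def timed_map(executor_class, numbers, workers=4):
    """Run count_divisors over `numbers` with the given executor and time it."""
    start = time.perf_counter()
    with executor_class(max_workers=workers) as executor:
        results = list(executor.map(count_divisors, numbers))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    numbers = [2_000_000] * 4  # arbitrary workload chosen for this sketch
    for executor_class in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, elapsed = timed_map(executor_class, numbers)
        print(f"{executor_class.__name__}: {elapsed:.3f} seconds")
```

On a standard CPython build with several cores, the process pool would be expected to finish noticeably faster for this kind of work. For I/O-bound work the gap usually narrows or disappears.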

Let’s see multiprocessing in action
Imagine this is your long-running function:

def factorize(number):  
    for i in range(1, number + 1):  
        if number % i == 0:  
            yield i  

If you want to run this function ten times without using multiprocessing or multithreading it will look something like this:

from time import time  
  
numbers = [8402868, 2295738, 5938342, 7925426, 98761244, 87129945, 14789235, 66543218, 53218950, 33218765]  
start = time()  
for number in numbers:  
    list(factorize(number))  
end = time()  
print ('Took %.3f seconds' % (end - start))  

Output:

Took 17.259 seconds

Let’s see how to apply multiprocessing to this simple example. First, import Python’s multiprocessing module (aliased as mp, which the snippets below rely on):

import multiprocessing as mp

Then create a Process object, passing the target function and its arguments, if any. For example:

def print_factorize(num, q):  
    q.put((num, list(factorize(num))))  
  
q = mp.Queue()  
process = mp.Process(target=print_factorize, args=(8402868, q, ))  

Now we can call the process's start method to begin executing print_factorize, wait for it to finish with join, and then read the result from the queue:

process.start()  
process.join()  
  
while not q.empty():  
    print(q.get())  

Output:

(8402868, [1, 2, 3, 4, 6, 9, 12, 18, 36, 41, 82, 123, 164, 246, 369, 492, 738, 1476, 5693, 11386, 17079, 22772, 34158, 51237, 68316, 102474, 204948, 233413, 466826, 700239, 933652, 1400478, 2100717, 2800956, 4201434, 8402868])
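One caveat with the join-then-read pattern: the Python documentation's multiprocessing guidelines warn that joining a process before its queued items have been consumed can deadlock when the results are large. The results in this article are small, so the pattern works here, but a more defensive variant reads one result per process first and joins afterwards. The sketch below illustrates that ordering. The names put_factors and factorize_in_processes are hypothetical.

```python
import multiprocessing as mp


def factorize(number):
    """Yield every divisor of `number` in ascending order."""
    for i in range(1, number + 1):
        if number % i == 0:
            yield i


def put_factors(num, q):
    """Worker: push (number, divisors) onto the shared queue."""
    q.put((num, list(factorize(num))))


def factorize_in_processes(numbers):
    """Run one process per number and return {number: divisors}."""
    q = mp.Queue()
    processes = [mp.Process(target=put_factors, args=(n, q)) for n in numbers]
    for p in processes:
        p.start()
    # Read exactly one result per process *before* joining, so a child is
    # never left blocked while flushing a large result into the queue.
    results = dict(q.get() for _ in processes)
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    print(factorize_in_processes([36, 49]))
```

Because the results are collected into a dict keyed by number, duplicate inputs would collapse into a single entry. Return a list instead if that matters for your use case.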

With one process per number, the full script looks like this:

import multiprocessing as mp  
from time import time  
  
def factorize(number):  
    for i in range(1, number + 1):  
        if number % i == 0:  
            yield i  
  
def print_factorize(num, q):  
    start = time()  
    ans = list(factorize(num))  
    end = time()  
    q.put((num, ans, end - start))  
  
if __name__ == "__main__":  # needed where child processes re-import this script (Windows, macOS)
    start = time()

    numbers = [8402868, 2295738, 5938342, 7925426, 98761244, 87129945, 14789235, 66543218, 53218950, 33218765]
    plist = []
    q = mp.Queue()
    for n in numbers:
        process = mp.Process(target=print_factorize, args=(n, q))
        plist.append(process)
        process.start()

    for p in plist:
        p.join()

    # Results come off the queue in completion order, not input order
    while not q.empty():
        num, flist, et = q.get()
        print(f"{num} took {et} seconds!")

    end = time()
    print('Total took %.3f seconds' % (end - start))

Execution result:

2295738 took 0.3090219497680664 seconds!
5938342 took 0.5050191879272461 seconds!
7925426 took 0.776221513748169 seconds!
8402868 took 0.9802758693695068 seconds!
14789235 took 1.1664249897003174 seconds!
33218765 took 2.0890510082244873 seconds!
53218950 took 3.24381947517395 seconds!
66543218 took 3.6708388328552246 seconds!
87129945 took 4.638180255889893 seconds!
98761244 took 4.760260343551636 seconds!
Total took 4.795 seconds

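Had the calculations run one after another, the per-number times above would add up: 0.309 + 0.505 + ... + 4.638 + 4.760 ≈ 22.1 seconds (the sequential loop shown earlier took 17.259 seconds), far more than the 4.795 seconds measured here.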
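Managing Process objects and a Queue by hand is not the only option. The multiprocessing module also provides Pool, which keeps a fixed set of worker processes and hands tasks to them. The sketch below is an alternative to the script above, not a rewrite of it: count_factors is a hypothetical helper that returns only the number of divisors, and only the first four inputs are used to keep the run short.

```python
import multiprocessing as mp
from time import time


def count_factors(number):
    """Return (number, how many divisors it has) using the same trial division."""
    return number, sum(1 for i in range(1, number + 1) if number % i == 0)


def main():
    numbers = [8402868, 2295738, 5938342, 7925426]  # first four inputs from above
    start = time()
    # Pool() defaults to one worker per CPU core and reuses workers across tasks.
    with mp.Pool() as pool:
        results = pool.map(count_factors, numbers)
    for number, factor_count in results:  # map() preserves the input order
        print(f"{number} has {factor_count} divisors")
    print("Total took %.3f seconds" % (time() - start))


if __name__ == "__main__":
    main()
```

Compared with one process per input, a pool caps the number of processes at the number of cores, which matters once the input list grows beyond a handful of items.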

Sunday, January 31, 2021

[ Python article collection ] How to publish a Python Package to PyPi

Saturday, January 30, 2021

[ Python article collection ] Python Multiprocessing For Beginners