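An approach to using gzip files as a database

1. Compressing and decompressing gzip in Python

gzip is a widely used compression format. On text files it typically shrinks the data by 60% to 70%, and both HTTP requests and responses can enable gzip compression to reduce the amount of data transferred.

Compressing a file with gzip in Python takes only a few lines of code: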

import gzip

# Compress the data
content = b'I like python'
with gzip.open('test.gz', 'wb') as f:
    f.write(content)

# Decompress the data
with gzip.open('test.gz', 'rb') as f:
    print(f.read())     # b'I like python'

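gzip.open takes a compresslevel parameter. It defaults to 9, the highest compression level. The higher the level, the better the compression ratio and the more CPU time it costs.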
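The trade-off between compression level and CPU time can be checked with a small experiment. The sketch below is only illustrative: the sample data and the measure helper are made up for this example, and the actual sizes and timings will depend on your data and machine.

```python
import gzip
import time

# Hypothetical sample: repetitive HTML-like text, similar in spirit to web pages.
sample = b'<div class="item"><a href="/post">I like python</a></div>\n' * 20000


def measure(level):
    """Compress the sample at the given level; return (compressed size, seconds)."""
    started = time.perf_counter()
    packed = gzip.compress(sample, compresslevel=level)
    elapsed = time.perf_counter() - started
    return len(packed), elapsed


# Level 1 is the fastest, level 9 compresses the hardest.
results = {level: measure(level) for level in (1, 6, 9)}
for level, (size, elapsed) in results.items():
    print(f'level {level}: {size} bytes from {len(sample)} bytes, {elapsed:.4f}s')
```

If write speed matters more than disk space, a lower level may be a reasonable choice. Reading is unaffected, because decompression does not need to know which level was used.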

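2. Using gzip as a database

Using gzip as a database is not a gimmick. In the right scenario, combined with suitable techniques, it can provide very stable and efficient service.

The idea is this. Suppose you have tens of millions of web pages to store. Which database would give you stable, fast indexed lookups? A web page contains many repeated tags, so compressing it would save a lot of space. A lookup only needs the page's identifier, which makes this a typical key-value storage pattern.

Here are three URLs: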

https://rushter.com/blog/gzip-indexing/
https://docs.python.org/zh-cn/3/library/gzip.html
http://www.coolpython.net/

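The first is an article on using gzip as a database for storing data, the second is the official Python documentation for gzip, and the third is my blog's home page.

The code below downloads the content of these three URLs and stores it: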

import requests

url_lst = [
    'https://rushter.com/blog/gzip-indexing/',
    'https://docs.python.org/zh-cn/3/library/gzip.html',
    'http://www.coolpython.net/'
]

# 'wb' overwrites the file, so rerunning the script does not append duplicate pages
with open('html.txt', 'wb') as f:
    for url in url_lst:
        res = requests.get(url)
        print(len(res.content))
        f.write(res.content)

The three pages total 123KB. Compressing them into a single gz file saves a lot of space:

import gzip
import requests

url_lst = [
    'https://rushter.com/blog/gzip-indexing/',
    'https://docs.python.org/zh-cn/3/library/gzip.html',
    'http://www.coolpython.net/'
]

with gzip.open('html.gz', 'wb') as f:
    for url in url_lst:
        res = requests.get(url)
        f.write(res.content)

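After compression the file is only 24.7KB, about one fifth of the uncompressed size. The file has indeed shrunk, but how do you read one specific page back out of the compressed file?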
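To make the question concrete, here is a small sketch of what happens when several pages go into one gzip stream. The page contents are hypothetical stand-ins for downloaded HTML, and an in-memory buffer replaces html.gz so the example runs without network or disk access.

```python
import gzip
import io

# Hypothetical stand-ins for three downloaded pages.
pages = [
    b'<html>page one</html>',
    b'<html>page two</html>',
    b'<html>page three</html>',
]

# Write all pages through a single gzip stream, as the code above does.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode='wb') as f:
    for page in pages:
        f.write(page)

compressed = buffer.getvalue()

# Decompressing yields one undivided byte string: the page boundaries are gone.
restored = gzip.decompress(compressed)
boundaries_lost = restored == b''.join(pages)
print(boundaries_lost)
```

Getting the second page back from this stream means decompressing everything before it, and even then nothing marks where one page ends and the next begins.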

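The contents of all three pages are written to html.gz as binary data. If each page is compressed on its own before it is written, then recording each page's start and end byte positions in html.gz is enough: seek to the start offset and read only that page's bytes.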

import gzip
import requests
from io import BytesIO

url_lst = [
    'https://rushter.com/blog/gzip-indexing/',
    'https://docs.python.org/zh-cn/3/library/gzip.html',
    'http://www.coolpython.net/'
]

start = 0
end = 0
url_dict = {}

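# 'wb' starts from an empty file, so the offsets recorded below really begin at 0
with open('html.gz', 'wb') as f:
    for url in url_lst:
        res = requests.get(url)
        # Compress this page on its own, into an in-memory buffer
        text = BytesIO()
        with gzip.open(text, 'wb') as b_file:
            b_file.write(res.content)

        b_text = text.getvalue()
        end = start + len(b_text)
        url_dict[url] = {'start': start, 'end': end}    # where this URL's page starts and ends in html.gz
        start = end
        f.write(b_text)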

print(url_dict)

The positions recorded for the three URLs are:

url_dict = {
        'https://rushter.com/blog/gzip-indexing/': {'start': 0, 'end': 9072},
        'https://docs.python.org/zh-cn/3/library/gzip.html': {'start': 9072, 'end': 16419},
        'http://www.coolpython.net/': {'start': 16419, 'end': 25624}
}

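Open html.gz in binary mode, seek to the recorded position, read the recorded number of bytes, and wrap them in a BytesIO object. You can think of a BytesIO object as the file object returned by open, except that it lives in memory, so gzip.open can open it and work with it.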

import gzip
from io import BytesIO

url_dict = {
        'https://rushter.com/blog/gzip-indexing/': {'start': 0, 'end': 9072},
        'https://docs.python.org/zh-cn/3/library/gzip.html': {'start': 9072, 'end': 16419},
        'http://www.coolpython.net/': {'start': 16419, 'end': 25624}
}


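def get_html(url):
    start = url_dict[url]['start']
    end = url_dict[url]['end']

    # Read-only binary mode is enough here; nothing is written
    with open('html.gz', 'rb') as f:
        f.seek(start)
        data = f.read(end - start)

    # Wrap the compressed bytes in an in-memory file and decompress them
    text = BytesIO(data)
    with gzip.open(text, 'rb') as f:
        return f.read()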


html = get_html('http://www.coolpython.net/')
print(html)
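In the examples above the index lives in a Python dict and has to be pasted back into the script. One possible way to make the store self-contained is to save the index next to the data file. The sketch below assumes a JSON file for the index. The names build_store, read_page, pages.gz and pages.index.json are hypothetical, and the page contents are stand-ins so that it runs without network access.

```python
import gzip
import json
import os
import tempfile


def build_store(pages, data_path, index_path):
    """Compress each page on its own and record its byte range in a JSON index."""
    index = {}
    offset = 0
    with open(data_path, 'wb') as f:
        for key, content in pages.items():
            member = gzip.compress(content)
            f.write(member)
            index[key] = {'start': offset, 'end': offset + len(member)}
            offset += len(member)
    with open(index_path, 'w', encoding='utf-8') as f:
        json.dump(index, f)
    return index


def read_page(key, data_path, index_path):
    """Look up the byte range for key, read only those bytes, and decompress them."""
    with open(index_path, encoding='utf-8') as f:
        entry = json.load(f)[key]
    with open(data_path, 'rb') as f:
        f.seek(entry['start'])
        member = f.read(entry['end'] - entry['start'])
    return gzip.decompress(member)


# Hypothetical pages standing in for downloaded HTML.
pages = {
    'page-1': b'<html><body>first page</body></html>',
    'page-2': b'<html><body>second page</body></html>',
    'page-3': b'<html><body>third page</body></html>',
}

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, 'pages.gz')
index_path = os.path.join(workdir, 'pages.index.json')

build_store(pages, data_path, index_path)
page_two = read_page('page-2', data_path, index_path)
print(page_two)
```

Here gzip.decompress replaces the BytesIO plus gzip.open combination, which is an equivalent shortcut when the compressed bytes are already in memory. With tens of millions of pages, reloading the whole JSON index on every lookup would be slow, so a real deployment would probably keep the index in memory or in a small key-value store.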
