Reading PDFs with Python

Do you have an old accounting system, or one that is hard to integrate with?
Below is code for reading the contents of a PDF. It will not work as-is for your setup, but it may give you some inspiration.

If you need help, my contact details are on this page or on LinkedIn.

#Created by Karl Sjökvist

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

import os
import glob
import xml.etree.ElementTree as ET


def convert_pdf(path, format='xml', codec='utf-8', password=''):
    """Convert the PDF at path to text, HTML or XML and return it as a string."""
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    laparams = LAParams()
    if format == 'text':
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'html':
        device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    elif format == 'xml':
        device = XMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    else:
        raise ValueError('provide format, either text, html or xml!')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # The output is read before device.close() is called, which is presumably
    # why the XML lacks its closing </pages> tag; the caller appends it.
    text = retstr.getvalue().decode(codec)
    device.close()
    retstr.close()
    return text

#Analyzes the files and moves them to "klara" (done) once an XML file has been created from their contents

PATH = "C:/Users/karl.sjokvist/Desktop/read pdf/"

#The processed PDFs are moved to PATH + "klara/", so the folder is created there
os.makedirs(PATH + "klara", exist_ok=True)

pdffiles = glob.glob(PATH + "*.pdf") #Lists all PDF files
for pdf in pdffiles:
    text = convert_pdf(pdf) #The call to the conversion
    filename = os.path.splitext(os.path.basename(pdf))[0] #File name without folder and extension
    print ("starting analysis of " + filename)
    text = text.replace("utf-8", "ISO-8859-1")
    text+="</pages>" #Had to add this because the converted XML lacks its closing tag
    with open(filename + ".xml", "w") as f:
        f.write(text)
    
    os.replace(pdf, PATH+"klara/" + filename + ".pdf") #Moves the file

    #XML query to extract the data
    tree = ET.parse(filename + ".xml")
    root = tree.getroot()

    raknadesidor = 0
    sidor = 1

    #Counts the pages in the PDF
    for value in root.iter('page'):
        raknadesidor+=1

    #Goes through each page individually
    while sidor <= raknadesidor:
        print ("Page " + str(sidor))
        for value in root.iter('page'):
            if value.attrib['id'] == str(sidor): #The pages of the PDF
    
                #To get all cost rows
                #Article rows
                kostnadsrader = []
                for subvalue in value.iter('textline'):
                    if subvalue.attrib['bbox'].startswith("53.150"): #The coordinates in the PDF
                        textstring = ""
                        for text in subvalue.iter('text'): #Joins individual characters into rows
                            textstring+=text.text
                        textstring = textstring[:-1] #Removes the trailing newline
                        kostnadsrader.append(textstring)
                firstindex = kostnadsrader.index("Artikel") #Index of the header row
                secondindex = [i for i in kostnadsrader if i.startswith('Moms')]
                secondindex = kostnadsrader.index(secondindex[0])
                kostnadsrader = kostnadsrader[firstindex + 1:secondindex] #Drops everything up to the header and from the "Moms" row onward

                print (kostnadsrader)

                #Finds the index of the first row that both starts and ends with a digit
                posoffirstnum = None
                countrows = 0
                for i in kostnadsrader:
                    if i[0].isnumeric():
                        if i[-1].isnumeric():
                            if posoffirstnum is None:
                                posoffirstnum = countrows
                    countrows+=1

                #Description (the "Benämning" column)
                benamningsrader = []
                for subvalue in value.iter('textline'):
                    if subvalue.attrib['bbox'].startswith("130.400"): #The coordinates in the PDF
                        textstring = ""
                        for text in subvalue.iter('text'): #Joins individual characters into rows
                            textstring+=text.text
                        textstring = textstring[:-1] #Removes the trailing newline
                        benamningsrader.append(textstring)
                firstindex = benamningsrader.index("Benämning") #Index of the header row
                benamningsrader = benamningsrader[firstindex + 1:] #Drops everything up to and including the header

                print (benamningsrader)
            
                #Quantity
                antalrader = []
                for subvalue in value.iter('textline'):
                    if subvalue.attrib['bbox'].startswith(("360.200", "360.650", "365.150", "364.700")): #The coordinates in the PDF
                        textstring = ""
                        for text in subvalue.iter('text'): #Joins individual characters into rows
                            textstring+=text.text
                        textstring = textstring[:-1] #Removes the trailing newline
                        textstring = textstring.replace(" St","")
                        textstring = textstring.replace(" ","")
                        antalrader.append(textstring)

                print (antalrader)
            
                #Unit
                enheter = []
                for subvalue in value.iter('textline'):
                    if subvalue.attrib['bbox'].startswith(("365.150", "391.400", "360.650", "360.200")): #The coordinates in the PDF
                        textstring = ""
                        for text in subvalue.iter('text'): #Joins individual characters into rows
                            textstring+=text.text
                        textstring = textstring[:-1] #Removes the trailing newline
                        enheter.append(textstring)
                
                #Cleans the units: keeps only the unit name and drops rows without a known unit
                known_units = ("st", "tim", "skif", "St")
                cleaned = []
                for enhet in enheter:
                    for unit in known_units:
                        if unit in enhet:
                            cleaned.append(unit)
                            break
                enheter = cleaned

                print (enheter)
            
                #Price
                prisrader = []
                for subvalue in value.iter('textline'):
                    if subvalue.attrib['bbox'].startswith(("417.150", "421.650", "426.150")): #The coordinates in the PDF
                        textstring = ""
                        for text in subvalue.iter('text'): #Joins individual characters into rows
                            textstring+=text.text
                        textstring = textstring[:-1] #Removes the trailing newline
                        prisrader.append(textstring)
                print (prisrader)

                #Print the rows
                rader = 0
                offset = 0
                print ("Printing the rows")
                while rader < len(kostnadsrader):
                    rad = kostnadsrader[rader]
                    if posoffirstnum is not None and rader >= posoffirstnum:
                        if offset < len(benamningsrader):
                            rad = rad + " " + benamningsrader[offset] + " " + antalrader[offset] + " " + enheter[offset] + " " + prisrader[offset]
                            offset+=1
                    rader+=1
                    print (rad)
                print ("analysis done\n")
        sidor+=1
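
The script above repeats the same pattern for every column: pick the textline elements whose bbox starts with a known x-coordinate, then join their characters into a row. Here is a minimal sketch of that pattern as a single helper. The name lines_at_x and the small inline XML sample are assumptions made for illustration; they are not part of the original script, and the sample only mimics the shape of the XML the script reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the XML the script reads: one <text> per character,
# with a final <text> holding the newline.
SAMPLE = """<pages>
<page id="1">
<textline bbox="53.150,700.000,90.000,710.000"><text>A</text><text>1</text><text>
</text></textline>
<textline bbox="130.400,700.000,200.000,710.000"><text>B</text><text>o</text><text>x</text><text>
</text></textline>
<textline bbox="53.150,680.000,90.000,690.000"><text>A</text><text>2</text><text>
</text></textline>
</page>
</pages>"""


def lines_at_x(page, x_prefixes):
    """Return the text of every textline whose bbox starts with one of x_prefixes."""
    rows = []
    for textline in page.iter("textline"):
        if textline.attrib["bbox"].startswith(tuple(x_prefixes)):
            # Join the per-character <text> elements and drop the trailing newline.
            joined = "".join(t.text or "" for t in textline.iter("text"))
            rows.append(joined.rstrip("\n"))
    return rows


root = ET.fromstring(SAMPLE)
page = next(root.iter("page"))
print(lines_at_x(page, ["53.150"]))   # the article column in the sample
print(lines_at_x(page, ["130.400"]))  # the description column in the sample
```

With a helper like this, each column would become a single call with its own list of x-coordinates. The coordinates still have to be read off your own PDF.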
        

Raspberry Pi digital photo frame

Below is my spaghetti code in Python for displaying the pictures stored in /boot/foto/ on an HDMI screen.

import sys, os
if sys.version_info[0] == 2:
    import Tkinter
    tkinter = Tkinter
else:
    import tkinter
from PIL import Image, ImageTk
import time
import datetime
import touchphat
touchphat.all_off()

if os.environ.get('DISPLAY','') == '':
    print('no display found. Using :0.0')
    os.environ['DISPLAY'] = ':0.0'

root = tkinter.Tk()
w, h = root.winfo_screenwidth(), root.winfo_screenheight()
root.overrideredirect(1)
root.geometry("%dx%d+0+0" % (w, h))
root.focus_set()
canvas = tkinter.Canvas(root,width=w,height=h)
canvas.pack()
canvas.configure(background='black')

textid = 0
imageid = 0
add24h = 0
add1h = 0
add10m = 0
add1m = 0
then = 0

@touchphat.on_touch("A")
def handle_touch():
    print("Button A pressed!")
    global add24h
    add24h = add24h + 1

@touchphat.on_touch("B")
def handle_touch():
    print("Button B pressed!")
    global add1h
    add1h = add1h + 1

@touchphat.on_touch("C")
def handle_touch():
    print("Button C pressed!")
    global add10m
    add10m = add10m + 1

@touchphat.on_touch("D")
def handle_touch():
    print("Button D pressed!")
    global add1m
    add1m = add1m + 1

def showPIL(pilImage):
    global textid
    global imageid
    global add24h
    global add1h
    global add10m
    global add1m
    if imageid != 0:
        print ("removing old picture")
        canvas.delete(imageid)
    else:
        print ("not removing old picture, no old picture")

    print ("Calculating screen for photo")

    imgWidth, imgHeight = pilImage.size
    # resize photo to fit the screen, keeping the aspect ratio
    ratio = min(float(w)/imgWidth, float(h)/imgHeight)
    imgWidth = int(imgWidth*ratio)
    imgHeight = int(imgHeight*ratio)
    pilImage = pilImage.resize((imgWidth,imgHeight), Image.LANCZOS) # ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
    image = ImageTk.PhotoImage(pilImage)
    canvas.image = image # keep a reference so the photo is not garbage-collected
    imagesprite = canvas.create_image((w/2)+35,h/2,image=image)

    imageid = imagesprite

    print ("Sprite nr",imagesprite)

    t0 = time.time() # now (in seconds)
    t1 = t0 + 60*60*10*add24h  # button A: +10 hours per press (despite the name add24h)
    t1 = t1 + 60*60*add1h  # button B: +1 hour per press
    t1 = t1 + 60*10*add10m # button C: +10 minutes per press
    t1 = t1 + 60*add1m # button D: +1 minute per press

    print ("Calculated time")

    current_time = time.strftime("%H:%M",time.localtime(t1))
    if textid != 0:
        print ("removing old clock")
        canvas.delete(textid)
    else:
        print ("not removing old time, no old time")
    textid = canvas.create_text(w-180, h-40, font=("Purisa", 80), text=current_time, fill="red")
    print ("updating screen")
    root.update_idletasks()
    root.update()
#   root.bind("<Escape>", lambda e: (e.widget.withdraw(), e.widget.quit()))

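names = os.listdir("/boot/foto/")
i = 0
while names: # loops forever as long as the folder is not empty
    now = time.time()
    if now >= then+10: # move on to the next file after 10 seconds
        if names[i].endswith(".jpg"):
            print(i)
            print (len(names))
            print (names[i])
            file=Image.open("/boot/foto/"+names[i])
            showPIL(file)
            then = time.time()
        i += 1
        if i == len(names):
            i = 0
    else:
        time.sleep(0.1) # avoid spinning the CPU while waiting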
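
The two calculations inside showPIL are easier to check when pulled out as small pure functions. The sketch below assumes Python 3. The names fit_size and clock_offset_seconds are hypothetical and not in the script above; the multipliers are copied from it.

```python
def fit_size(img_w, img_h, screen_w, screen_h):
    """Scale (img_w, img_h) to fit inside the screen while keeping the aspect ratio."""
    ratio = min(screen_w / img_w, screen_h / img_h)
    return int(img_w * ratio), int(img_h * ratio)


def clock_offset_seconds(add24h, add1h, add10m, add1m):
    """Seconds added to the clock, using the same multipliers as the script."""
    return 60 * 60 * 10 * add24h + 60 * 60 * add1h + 60 * 10 * add10m + 60 * add1m


# A wide photo is limited by the screen width, a tall one by the screen height.
print(fit_size(2000, 1000, 1000, 800))
print(fit_size(800, 3200, 1000, 800))
# One press of B, two of C and three of D.
print(clock_offset_seconds(0, 1, 2, 3))
```

Keeping the arithmetic separate from the Tkinter calls means it can be tested without a display attached.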

Certbot

Today I found something interesting. If you run a web server privately, you can use a free tool called Certbot to automatically obtain a certificate issued by Let’s Encrypt.

To quote their own website:

Certbot is a free, open source software tool for automatically using Let’s Encrypt certificates on manually-administrated websites to enable HTTPS.

Certbot is made by the Electronic Frontier Foundation (EFF), a 501(c)3 nonprofit based in San Francisco, CA, that defends digital privacy, free speech, and innovation.

Look it up at https://certbot.eff.org/

Getting XML data from site to Zabbix

This is a horrible problem that I solved in a very dirty way.

And here is how!

First I get the required information from the website and store it in a file.
This is done automatically with crontab on Linux, which you edit with the command sudo crontab -e
I will deconstruct the bash line I wrote:
*/1 * * * *  – The crontab schedule that makes the bash command run every minute

curl -s "http://10.10.20.2/index.html" emerson | tac | tac – Gets the data from the site and pipes it through tac twice. tac reads all of its input before it prints anything, so the complete response has arrived before grep runs its regex over the XML table, and reversing the lines twice leaves their order unchanged. At this point the data looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="/simon.xsl" ?>
<icom:SimpleMonitoring schemaVersion="1.01" appVersion="PA 104037" fileVersion="(null)" xmlns:icom="http://www.emersonnetworkpower.com/icom" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.emersonnetworkpower.com/icom icom-1-02.xsd">
<Item id="354" name="SinglState">
<Label>Unit Status</Label>
<Value valueType="4" offset="0.000" gain="1.000" precision="0">Unit On</Value>
<Unit></Unit>
</Item>
<Item id="361" name="LocTemp">
<Label>Return Air Temperature</Label>
<Value valueType="6" offset="0.000" gain="1.000" precision="1">26.8</Value>
<Unit>&#176;C</Unit>
</Item>
<Item id="379" name="Std. Sensor Humidity">
<Label>Return Air Humidity</Label>
<Value valueType="6" offset="0.000" gain="1.000" precision="1">40.1</Value>
<Unit>%rH</Unit>
</Item>
<Item id="380" name="Supply Air Temperature">
<Label>Supply Air Temperature</Label>
<Value valueType="6" offset="0.000" gain="1.000" precision="1">---</Value>
<Unit></Unit>

grep -oP '(?<=\">).*?(?![a-zA-Z]|\d|[.]|\s)'  – Runs a regex over the curl output. After this the data looks like this:
Unit On
26.8
40.1
23.0
24
18.0
On
On
Off
Off
Off
Off
Off
Off

We then store this data in a file where the Zabbix agent can easily get it:
> /home/zabbix/zabbixdata/emerson.txt

The complete bash line looks like this:
*/1 * * * * curl -s "http://10.10.20.2/index.html" emerson | tac | tac | grep -oP '(?<=\">).*?(?![a-zA-Z]|\d|[.]|\s)' > /home/zabbix/zabbixdata/emerson.txt
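
The grep regex works for this particular page, but it depends on the exact markup. As an alternative sketch, which is not what the cron line above does, the same values could be read with Python's xml.etree.ElementTree. The function name item_values and the shortened sample document are assumptions for illustration. The sample keeps the element names from the device output shown earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the device's XML.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<icom:SimpleMonitoring xmlns:icom="http://www.emersonnetworkpower.com/icom">
<Item id="354" name="SinglState">
<Label>Unit Status</Label>
<Value valueType="4">Unit On</Value>
<Unit></Unit>
</Item>
<Item id="361" name="LocTemp">
<Label>Return Air Temperature</Label>
<Value valueType="6">26.8</Value>
<Unit>&#176;C</Unit>
</Item>
</icom:SimpleMonitoring>"""


def item_values(xml_text):
    """Return the text of the <Value> element of every <Item>, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [item.findtext("Value", default="") for item in root.iter("Item")]


# One value per line, the same layout the grep version writes to emerson.txt.
print("\n".join(item_values(SAMPLE)))
```

A parser keeps working if attributes are added or reordered. The price is that the cron job has to run a script instead of a one-liner.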

Now we need to configure the Zabbix agent's user-defined parameters. They are located in the /etc/zabbix/zabbix_agentd.d directory.
There I created a file with nano,
command: nano /etc/zabbix/zabbix_agentd.d/userparameter_emerson.conf
Here I wrote the following user parameters. I use head and tail to pick out the line that holds each value.
UserParameter=emerson.unit.status,head -1 /home/zabbix/zabbixdata/emerson.txt
UserParameter=emerson.unit.returntemp,head -2 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.returhum,head -3 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.supplytemp,head -4 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.returntempsetpoint,head -5 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.returnhumsetpoint,head -6 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.supplytempsetpoint,head -7 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.fanstatus,head -8 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.coolingstatus,head -9 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.freecoolingstatus,head -10 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.electricalheatingstatus,head -11 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.hotwaterstatus,head -12 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.dehumstatus,head -13 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.humstatus,head -14 /home/zabbix/zabbixdata/emerson.txt | tail -1
UserParameter=emerson.unit.maintstatus,head -15 /home/zabbix/zabbixdata/emerson.txt | tail -1

The syntax is: UserParameter=key,command
The key needs to correspond to the Zabbix item I’m going to create, and the command is the bash command that reads the correct value from the file we recreate every minute.
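
To make the head and tail trick concrete: head -N keeps the first N lines and tail -1 keeps the last of those, so together they return line N. Here is a minimal Python sketch of the same lookup. The function name nth_line and the temporary sample file are assumptions for illustration only.

```python
import os
import tempfile


def nth_line(path, n):
    """Return line n (1-based) of the file, or "" if the file has fewer lines."""
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if number == n:
                return line.rstrip("\n")
    return ""


# Build a small stand-in for emerson.txt and read its second line.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "emerson.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Unit On\n26.8\n40.1\n")
    print(nth_line(sample, 2))
```

One difference worth knowing: if the file is shorter than N lines, head -N | tail -1 returns the last line that exists, while this sketch returns an empty string.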

Now we need to restart the agent so it picks up the user parameters: sudo service zabbix-agent restart

Then, on the Zabbix server, we add items with the corresponding keys.

DONE!

VM problem

A couple of years ago I was up late with my heart pounding. After a VM had been moved, it didn't boot up correctly. After some research I found out that the vmdk file (the virtual hard drive) was corrupt. The horror was that this server was the domain controller.

Early the next morning I got hold of my boss, the company's CTO, and told him what had happened. He was calm, told me "that can happen" and told me to load the backup.

Backup!

30 minutes later the machine was restored. That day I learned two things.

1 – His calm transferred to me. He led by example. That is the kind of boss I want to be.

2 – Backups are invaluable. If you think that you won't need any, or that they are a costly investment, think again. There are systems like Veeam that are very easy to use and that save time and money.

Radio project – power supply

I want everything to sit neatly inside the radio's case, so the whole power supply has to be built in. For that I need a transformer from 220 V AC to 5 V DC to power the Raspberry Pi.

To get one, I took apart an old phone charger.

This was the inside of the transformer.

On the back there were two solder pads, to which I soldered a power cable that I then soldered to the radio's power input.

With that, the power supply for the Raspberry Pi was done.