Love.Law.Robots. by Ang Hou Fu



This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

It’s that time of the year, when I tackle the unenviable task of compressing an entire year into a few paragraphs. If you are an ardent fan of data protection here, you will already have several events in mind. Here are just a few: the decision on the SingHealth data breach, the continuing spate of data breaches in Singapore, the new NRIC guidelines, or the flurry of proposals from the PDPC ranging from AI to data portability.

You can obviously get all that from reading a newspaper. For this blog, I would like to dig a little deeper, which brings us to my ongoing data project analyzing data protection enforcement and practice in Singapore.

houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore (GitHub)

For now, I have been trying to reproduce what the DPEX survey does (but with automation and no interns). See the results for today’s post below:

You are now looking at every decision that the PDPC has published to date.
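For the curious, the chart boils down to two monthly aggregates: how many decisions were released each month, and how long they are. Here is a minimal sketch of how you might compute them with pandas; note that the file name and column names are my own stand-ins for illustration, not my project's actual schema.

import pandas as pd

# Hypothetical input: one row per decision, with its date and full text.
df = pd.read_csv('decisions.csv', parse_dates=['decision_date'])
df['length'] = df['text'].str.len()  # decision length in characters

# Aggregate by calendar month.
monthly = df.groupby(pd.Grouper(key='decision_date', freq='M')).agg(
    decisions=('text', 'count'),      # the red line: number of decisions
    mean_length=('length', 'mean'),   # the blue crosses: mean length
)
print(monthly)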

You can see from the decision marked “SingHealth” that this is where 2019 starts. Here are three observations I can draw from the data.

A. SingHealth was an outlier; the real change came in June 2019

You can see from the chart above that the mean length in January 2019 (the blue crosses) was nearly 50,000 characters. The SingHealth decision was very long, but it has proven to be an outlier: no decision since then has been afforded that level of detail.

Instead, one finds that in June 2019, there was a deluge of decisions, hitting a new high of 14 in that month (follow the red lines). Sudden peaks aren’t exactly rare. There are peaks in April 2016 (when the PDPC first released their decisions) and again in May 2018.

What did change in 2019 is that decisions now follow a regular monthly schedule. Prior to this, the release schedule was punctuated with months without any decisions, and save for the aforementioned peaks, no month had more than five decisions. A steady monthly release now appears to be the norm.

B. The decisions since June 2019 have been pretty short.

Coupled with the sudden deluge of decisions, the data shows that the decisions have been relatively short. That being said, they appear in line with the previous mean lengths per month. Follow the blue crosses, and you find that many of the numbers are well below 20,000 characters.

Statistically, length alone does not show a big change. However, coupled with the volume of decisions during the same period, the experience of following them is different: it is easy to follow one decision a month, but quite another thing to follow several.

I also noticed, though this is not shown in the data, that the PDPC has more regularly used short case summaries these days (see this example).

What do I think? The ground is shifting. A release schedule of one decision per month is great for studying particular decisions and jurisprudential concepts. A more regular schedule with several decisions is less conducive to study, but it shows that enforcement has become a key focus.

C. More evidence of a shift in enforcement priorities by the PDPC?

In a previous post, I mused over whether the PDPC was now focusing on punishing companies which do not have data protection officers or policies. I even wondered whether there is a going rate of penalties for not appointing a DPO.

Now I wonder whether the PDPC feels that its role in educating the public about personal data concepts through its published decisions is done, and that it is now moving to publicise data breaches to educate the public on enforcement instead. Simply put: no more Mr Nice PDPC. It’s a side effect of the debacle caused by SingHealth and recent public sector breaches. You can cynically put it as trying to show the public that there are no angels in the private sector either. My own belief is that more needs to be done to show the public that bad data practices are prevalent, and that more work remains. This is hardly a problem limited to Singapore, but it now feels real here.

For the record, I do not think that the PDPC’s role in publishing decisions to educate the public is done. Many grey areas persist with regard to what “reasonable” means, or what exactly the PDPC requires of organisations for third-party due diligence.

A new challenge for 2020

What do these budding trends mean? If the PDPC is indeed moving away from exploring the concepts of personal data through published decisions, and issuing short notes on enforcement instead, it becomes even more important to be able to suss out trends through analysis. Data Protection Officers have a lot of materials courtesy of the PDPC and the ever-expanding courses on offer. However, the ability to analyze the messages coming out of enforcement, and to prioritise which actions a DPO should focus on to protect their organization, may remain elusive.

I guess to some extent I have provided a raison d’être for my data project. By using automated means, hopefully, I can make this sustainable. What are you seeing out there? What kind of information would you be curious about? Feel free to let me know!

Check out the updated version of the graph:

Presenting: The PDPC Decision Star Map (Version 2): Networks are one of the most straightforward ways to analyse judgements and cases. We can establish relationships between cases and transform them into data. Computers crunch data. Computers produce a beautiful graph. (Love.Law.Robots.)

#PDPC-Decisions #Singapore



This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

The life of a budding data science enthusiast.

You need data to work on, so you look all around you for something that has meaning to you. Anyone who reads the latest 10 posts of this blog knows I have a keen interest in data protection in Singapore. So naturally, I wanted to process the PDPC’s enforcement cases. Fortunately for me, the listing of published cases is complete, and they are not exactly hindered by things like stare decisis. We begin by scraping the website.

The Problem

Using the scraping method I devised, you will now get a directory filled with PDFs of the cases. Half the battle done, right? If you thought so, then you have not looked hard enough at the product.

It’s right there in the name: they are PDFs. Notwithstanding the name, PDFs do not actually represent text well. They represent pages or documents, which means you get what you see, not what you read.

I used PDFMiner to extract the text from the PDF, and this is a sample of the output I get:

The operator, in the mistaken belief that the three statements belonged to the same individual, removed the envelope from the reject bin and moved it to the main bin. Further, the operator completed the QC form in a way that showed that the number of “successful” and rejected Page 3 of 7 B. (i) (ii) 9. (iii) envelopes tallied with the expected total from the run. As the envelope was no longer in the reject bin, the second and third layers of checks were by-passed, and the envelope was sent out without anyone realising that it contained two extra statements. The manual completion of the QC form by the operator to show that the number of successful and rejected envelopes tallied allowed this to go undetected.

Notice the following:

  • There are line breaks in the middle of sentences. This is where a sentence broke across new lines in the document. The computer would read “The operator, in the mistaken belief that the three statements belonged” and then go “What? What happened?”
  • Page footers and headers appear in the document. They make sense when you are viewing a page, but are meaningless in plain text.
  • Orphan bullet and paragraph numbers. They used to belong to some text, but now nobody knows which. Table contents are also seriously borked.
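For reference, the kind of extraction call that produced the sample above can be sketched with pdfminer.six’s high-level API. This is a minimal sketch; my project’s actual invocation may differ:

from pdfminer.high_level import extract_text

# Extract raw text from a decision PDF. The output retains the hard
# line breaks, headers, footers and orphan numbers described above.
raw_text = extract_text('decision.pdf')
lines = raw_text.split('\n')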

If you fed this to your computer for training, you would get rubbish. The next step, which is very common in data science but particularly troublesome in natural language processing, is preprocessing. We have to fix the errors ourselves before letting the computer do more with the text.

I used to think that I could manually edit the files and fix the issues one by one, but it turned out to be very time-consuming. (Duh!) My computer had to do the work for me. Somehow!

The Preprocessing Solution

Although I decided I would let the computer do the correcting for me, I still had to do some work on my own. Instead of looking for errors, this time I was looking for patterns. This involved comparing the output of the PDF-to-text converter against the actual document. Once I figured out how these errors came about, I could get down to fixing them.

Deciding what to take out, what to leave behind

Not so fast. Unfortunately, correcting errors is not the only decision I had to make. Like many legal reports, PDPC decisions have paragraph numbers. These numbers are important: they are used for cross-referencing. In the raw output, the paragraph numbers are present as plain numbers in the text. They may be useful to a human reader who knows they are meta-information, but to a computer they are probably just noise.

I decided to take them out. I don’t doubt that one day we can engineer a solution that makes them useful, but for now, they are just distractions.

Put Regular Expressions to work!

As mentioned earlier, we look for patterns in the text so that the computer can find and correct them. I found regular expressions to be a very powerful way to express such patterns in a form the computer can search for. A regular expression is sort of like a language for defining a search pattern.

For example, the code below looks for form feeds in the text. (A form feed marks a page break in the extracted text; its cousin, the carriage return, is what happens when you press ‘Enter’ on your keyboard, and is much cooler on a typewriter.)

def remove_feed_carriage(source):
    return [x for x in source if not re.search(r'\f', x)]

This Python code tells the computer to drop any line of text which contains a form feed. This eliminates the multiple blank lines created by the PDF converter (usually blank space in the PDF).

A well-crafted regular expression can find a lot of things. For example, the expression below looks for citations (such as “[2016] SGPDPC 15”) and removes them.

def remove_citations(source):
    return [x for x in source if not re.search(r'^\[\d{4}\]\s+(?:\d\s+)?[A-Z|()]+\s+\d+[\s.]?$', x)]
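A quick sanity check, with a made-up sample, shows the citation being filtered out while ordinary prose is kept:

sample = ['[2016] SGPDPC 15', 'The operator removed the envelope.']
print(remove_citations(sample))
# Prints: ['The operator removed the envelope.']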

Figuring out the “language” of regular expressions does take some time, but it pays dividends. To help the journey, I test my expressions on freely available websites that offer testers and references for regular expressions. For Python, this is one of the sites I used.
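The paragraph numbers I decided to remove earlier yield to the same treatment: a line consisting of nothing but a number (with an optional trailing dot) can be dropped. A sketch, with a function name of my own invention rather than my project’s actual code:

def remove_paragraph_numbers(source):
    # Drop lines that contain nothing but a paragraph number, e.g. "9." or "12".
    return [x for x in source if not re.search(r'^\s*\d+\.?\s*$', x)]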

Getting some results!

Besides removing citations, form feeds and paragraph numbers, I also tried to join broken sentences together. In all, the code manages to remove around 90% of the extra line breaks. Most of the paragraphs in the text now read like proper sentences, and I feel much more confident training a model on this text.
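The line-joining step itself can be sketched roughly as follows: keep accumulating lines until one ends with sentence-final punctuation. This is my own simplification of the idea; the actual rules in my project are more involved.

def join_broken_lines(source):
    # Accumulate physical lines until one ends a sentence.
    paragraphs, buffer = [], ''
    for line in source:
        buffer = (buffer + ' ' + line.strip()).strip()
        if buffer.endswith(('.', '!', '?', ':')):
            paragraphs.append(buffer)
            buffer = ''
    if buffer:  # keep any trailing fragment
        paragraphs.append(buffer)
    return paragraphs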

It ain’t perfect, of course. The text gets really mushed up once a table is involved, and the headers and footers are not always removed. But as Voltaire said, “Perfect is the enemy of the good”. For now, this will do.

Concluding remarks

Hopefully this quick rundown of how I preprocessed the pdpc-decisions will give you some idea of what to do in your own projects. Now that I have the text, I have got to find something to use it for! :O Are there any other ways to improve the code to catch even more errors? Feel free to comment and let me know!

houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore (GitHub)

#PDPC-Decisions #PDFMiner #Python #Programming



This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote this post, the structure of the PDPC’s website was drastically altered. The concepts and ideas in this post haven’t changed, but the examples are outdated. This gives me a chance to rewrite this post. If I ever get round to it, I’ll provide a link.

Regular readers will already know that I maintain a GitHub repository that compiles all personal data protection decisions in Singapore. Decisions are useful resources teeming with information: they yield statistics, give insights into what factors are relevant in decision-making, and show that data protection enforcement is active in Singapore. Even basic statistics about decisions make newspaper stories here. It would be great if there were a way to mine all that information!

houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore (GitHub)

Unfortunately, using the Personal Data Protection Commission in Singapore’s website to download judgements can be painful.

This is our target web page today. (Note: the website has since been transformed.)

As you can see, you can only view five decisions at a time. As the first decision dates back to 2016, you will have to go through several pages to grab everything (actually, just 23). I am sure you can do all that in one night, right? Right?

If you are not inclined to do that, get your computer to do it. Using Selenium, I wrote a Python script to automate the whole process of finding all the decisions available on the website. What could have been a tedious night’s work was accomplished in 34 seconds.

Check out the script here.

What follows is a step-by-step write-up of how I did it. So hang on tight!

Section 1: Observe your quarry

Before setting your computer loose on a web page, it pays to understand the structure and inner workings of that page. Open the developer tools in your favourite browser: in Chrome, this is Developer Tools, and in Firefox, this is Web Developer. You will be looking for a tab called Sources, which shows you the HTML code of the web page.

Play with the structure of the web page by hovering over various elements of the web page with your mouse. You can then look for the exact elements you need to perform your task:

  • In order to see a new page, you have to click on a page number in the pagination. The pagination sits under a section (a CSS class) called group__pages, and each page number is under another CSS class called page-number.
  • Each decision has its own section (a CSS class) named press-item. The download link, which points to either a text file or a PDF file, is found inside each press-item.
  • Notice too that each press-item carries other metadata about the decision. For now, we are interested in the date of the decision and the respondent.

Section 2: Decide on a strategy

Having figured out the website, you can decide on how to achieve your goal. In this case, it would be pretty similar to what you would have done manually.

  1. Start on a page
  2. Click on a link to download
  3. Go to the next link until there are no more links
  4. Move on to the next page
  5. Keep repeating steps 1 to 4 until there are no more pages
  6. Profit!

Since we did notice the metadata, let’s use it. If you don’t use what is already in front of you, you will have to read the decision itself to extract such information. In fact, we are going to use the metadata to name our downloaded decisions.

Section 3: Get your selenium on it!

Selenium drives a web browser. It mimics user interactions in the browser, so our strategy from Section 2 is straightforward to implement: instead of moving our mouse as we ordinarily would, we tell the web driver what to do.

WebDriver: Documentation for Selenium

Let’s translate our strategy to actual code.

Step 1: Start on a page

We are going to need to start our web driver and get it to run on our web page.

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

PDPC_decisions_site = "https://www.pdpc.gov.sg/Commissions-Decisions/Data-Protection-Enforcement-Cases"

# Set up the web driver
options = Options()
# Uncomment the next three lines for a headless Chrome
# options.add_argument('--headless')
# options.add_argument('--disable-gpu')
# options.add_argument('--window-size=1920,1080')
driver = Chrome(options=options)
driver.get(PDPC_decisions_site)

Step 2: Download the file

Now that you have loaded the page, let’s drill down to the individual decisions. As we figured out earlier, each decision is found in a section named press-item. Get Selenium to collect all the decisions on the page:

judgements = driver.find_elements_by_class_name('press-item')

Recall that we are not just going to download the file; we will also use the date of the decision and the respondent to name it. For the date, I found that under each press-item there is a press_date element which gives us the text of the decision date; we can easily convert this to a Python datetime so we can format it any way we like.

from datetime import datetime
from selenium.webdriver.remote.webelement import WebElement

def get_date(item: WebElement):
    item_date = datetime.strptime(item.find_element_by_class_name('press_date').text, "%d %b %Y")
    return item_date.strftime("%Y-%m-%d")

For the respondent, the heading (which is written in a fixed format and, happily, also happens to be the link to the download) already gives you the information. Use a regular expression on the text of the link to suss it out. (One of the decisions does not follow the format of “Breach … by respondent”, so the alternative is also catered for.)

import re

def get_respondent(item):
    text = item.text
    # Pass re.I as flags; the third positional argument of re.split is maxsplit.
    return re.split(r"\s+[bB]y|[Aa]gainst\s+", text, flags=re.I)[1]

You are now ready to download a file! Using the metadata and the link you just found, you can come up with meaningful names for your files. Naming your own files also helps you avoid the idiosyncratic ways the PDPC names its own downloads.

Note that some of the files are not PDF downloads but short texts on web pages. Using the earlier strategies, you can figure out what information you need. This time, I used BeautifulSoup to get the information, as I did not want to use Selenium for any unnecessary navigation. Treat PDFs and web pages differently.

def download_file(item, file_date, file_respondent):
    url = item.get_property('href')
    print("Downloading a File: ", url)
    print("Date of Decision: ", file_date)
    print("Respondent: ", file_respondent)
    if url[-3:] == 'pdf':
        dest = SOURCE_FILE_PATH + file_date + ' ' + file_respondent + '.pdf'
        wget.download(url, out=dest)
    else:
        with open(SOURCE_FILE_PATH + file_date + ' ' + file_respondent + '.txt', "w") as f:
            from bs4 import BeautifulSoup
            from urllib.request import urlopen
            soup = BeautifulSoup(urlopen(url), 'html5lib')
            text = soup.find('div', class_='rte').getText()
            lines = re.split(r"\n\s+", text)
            f.writelines([line + '\n' for line in lines if line != ""])

Steps 3 to 5: Download every item on every page

The next steps follow a simple idiom: for every page, and for every item on each page, download a file.

for page_count in range(len(pages)):
    pages[page_count].click()
    print("Now at Page ", page_count)
    pages = refresh_pages(driver)
    judgements = driver.find_elements_by_class_name('press-item')
    for judgement in judgements:
        date = get_date(judgement)
        link = judgement.find_element_by_tag_name('a')
        respondent = get_respondent(link)
        download_file(link, date, respondent)

Unfortunately, once Selenium changes a page, its element references go stale and need to be refreshed. We are going to need fresh group__pages and page-number elements in order to continue accessing the pagination, so I wrote a function to “refresh” the variables I am using to access these sections.

def refresh_pages(webdriver: Chrome):
    group_pages = webdriver.find_element_by_class_name('group__pages')
    return group_pages.find_elements_by_class_name('page-number')

. . .

pages = refresh_pages(driver)

Conclusion

Once you have got your web driver to be thorough, you are done! In my last pass, 115 decisions were downloaded in 34 seconds. The best part is that you can repeat this any time there are new decisions. Data acquisition made easy! At least until the PDPC breaks its website.

Postscript: Is this… Illegal?

I’m listening…

Web scraping has always been quite controversial, and the stakes can be quite high: copyright infringement, the Computer Misuse Act and trespass, to name a few. Funnily enough, downloading manually may be less illegal than getting a robot to do it. The PDPC’s own terms of use are not exactly on point here.

(Update 15 Mar 2021: OK, I think I was being fairly obtuse about this. There is a paragraph that states you can’t use robots or spiders to monitor their website. That might have made sense in the past, when data transfers were expensive, but I don’t think this kind of activity at my scale can crash a server.)

Personally, I feel this particular activity is fairly harmless: it counts as “own personal, non-commercial use” to me. I will likely continue with it for as long as I want my own collection of decisions, or until they provide better programmatic access to their resources.

In 2021, the Copyright Act in Singapore was amended to support data analysis, such as web scraping. I wrote a follow-up post on this: “Ready to mine free online legal materials in Singapore? Not so fast!”, surveying government websites to find out how friendly they are to robots. (Love.Law.Robots.)

#PDPC-Decisions #Programming #Python #tutorial #Updated
