Clojure after a month


(4 weekends, to be precise)

“A programmer should learn at least one new programming language each year.”

Clojure is a functional programming language. I never particularly liked that term. Discussions of functional programming always feel like a bunch of nerds speaking a weird language that has nothing to do with real life: monads, laziness… all the toy stuff that highlights the intellectual superiority of the cult. Most importantly, I want readability. Code is the means to express yourself to the machine and to fellow programmers; the more expressive and intuitive it is, the better. This is why I liked Python so much over C++ and Java. My friends did Scheme before, and it seemed like a luxurious brain exercise, while I had plenty of other things to spend my brain on.

The main reason I tried Clojure was the way it handles concurrency. At work I’ve seen a couple of enterprise software packages that let people drag and drop components into a diagram that does data processing. It’s easy to convert those diagrams into serial code, but not the other way round. This is partly why we cannot simply have a compiler that translates serial code into concurrent programs. We need a programming language that also expresses the dependencies between the tasks in a program – the intent of the programmer.

From this perspective, syntax is the first thing that struck me. I don’t like the prefix order – it’s unfamiliar – but it’s not too bad, as it’s very quick to learn, except for mathematical formulas. It turns out this syntax allows for a lot of flexibility: you can even write code that writes code, and you can visualize code as diagrams.

Looping is the next thing that struck me. I cannot write loops the way I did in C, Java or Python; instead I needed tail recursion to replicate them. I ended up avoiding loops, as they don’t work well with the way my brain reasons about functions. Instead, map and a bunch of other functions that operate on collections are preferred. Usually this means cleaner code, and sometimes code that can be parallelized. It’s embarrassingly hard to write this kind of code compared to the old way of simply looping, because you are now communicating your intention and organization, not only your instructions.
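Since I keep comparing against Python, here is roughly what that shift looks like in Python itself (a hypothetical example of mine, not code from this post):

```python
# Imperative style: the loop spells out the instructions step by step.
def squares_of_evens_loop(nums):
    result = []
    for n in nums:
        if n % 2 == 0:
            result.append(n * n)
    return result

# Collection style: filter and map state the intent directly,
# and each stage stands on its own.
def squares_of_evens_map(nums):
    return list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, nums)))
```

Both return the same result; the second version is the one that reads like a description of the task rather than a recipe.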

Along with looping came immutable persistent data structures. They make it harder to write code, and the code is less efficient, though Clojure does a great job reducing the inefficiency to a constant factor with that “persistent” property. With these data structures one can really talk about functional programming: data is now a value, and a pure function will always give the same output for the same input. As a result, memoization is easy to support (a plus for me, since I have to implement dynamic programming algorithms now and then).
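In Python terms (my own illustration), memoizing a pure function is a one-liner precisely because the same input is guaranteed to give the same output:

```python
from functools import lru_cache

# A pure function always maps the same input to the same output,
# so its results can be cached safely.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Without the cache this naive recursion is exponential; with it, each `fib(k)` is computed exactly once.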

Because you now communicate with the compiler in terms of values and functions rather than instructions, it is easier to visualize your code. You can write the code on one side of the screen while the other side automatically updates the values produced by each function (kudos to LightTable). At the same time, you tend to write smaller functions, because the program becomes visually ugly as it grows past a certain size. The combined effect is a new programming style: you write small functions, unit-test each one instantly by typing in different values, and observe how the end result is updated on the output screen. Small functions compose naturally into other functions, as you don’t need to worry about side effects.


After a few days of familiarizing myself with Clojure, my old Python code looks ugly to me. My old way of writing code was clearly harder to test, and the dependencies between code blocks weren’t clear.

That was one way Clojure changed my thinking about code. It’s not always desirable – for example, when you want to write some quick-and-dirty patchwork. But for a long-term project, I believe that laying down that code organisation, and reasoning in terms of functions and values, are things you should always strive for. Not that it’s always possible, though, and I like that Clojure has flexibility built into its design that lets you leave the functional world gracefully.

Another great idea in Clojure is its take on abstraction. It provides a bunch of tools to generate new classes and new objects with different properties. It took me another few days to get the hang of how Clojure does OOP. Just as you have different kinds of loops for different intentions (foreach loop, map-reduce loop, iteration loop), you also have different sorts of inheritance and polymorphism for different intentions. You can add functions to old types, and you can add new types that support old functions. Sometimes you can get away with a new instance rather than having to create a new class/interface. A legacy class/object becomes something alive that can be broken down, combined, or taught new tricks.

If Perl’s way is “there are different ways to do things” and Python’s way is “there is one preferred way to do things”, then Clojure’s way is probably “there is an ideal way and a practical way to do things, and we’ll tell you when you need to make a choice”.

I haven’t discussed the most prominent use of Clojure – how it handles concurrency. I haven’t mastered it yet, but it is definitely a good topic for another time.

Now I’ll discuss how the implementation details change when you port an algorithm to Clojure. The task: find all the primes under a threshold.

Notice the main body of the code:

(primes-from 11 wheel)

It takes a function primes-from, starts from 11, and uses the wheel, an infinite sequence that dictates the steps to take to find the next prime.
primes-from performs a primality test as follows:

(some #(zero? (rem n %))
      (take-while #(<= (* % %) n) primes))

… where #(zero? (rem n %)) is the divisibility test, and (take-while #(<= (* % %) n) primes) is the list of potential factors.
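A rough Python translation of that test (my own sketch, not the original code) reads almost word for word:

```python
from itertools import takewhile

def is_prime(n, primes):
    """Trial division: n is composite iff some known prime p with p*p <= n divides it.

    `primes` is assumed to contain all primes up to sqrt(n), in increasing order.
    """
    candidates = takewhile(lambda p: p * p <= n, primes)
    return not any(n % p == 0 for p in candidates)
```

For example, `is_prime(29, [2, 3, 5])` is true, while `is_prime(25, [2, 3, 5])` is false: 5*5 <= 25 and 5 divides 25.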

A typical implementation of this algorithm in a non-functional programming language would be the Sieve of Eratosthenes:

isPrime = [True] * n
isPrime[0] = isPrime[1] = False
for i in range(2, n):
    if isPrime[i]:
        j = i * i          # multiples below i*i were already crossed out
        while j < n:
            isPrime[j] = False
            j += i

I believe the latter code is faster to write and to run. However, its meaning is only clear if one has a rough idea of the algorithm it’s implementing. The former code can be broken down into independent pieces, and the reader can understand each piece before combining them. Of course, you can also implement the latter in Clojure; it just doesn’t feel natural. When you leave the functional programming realm, you know right away.


Personal content filter


The internet is a leap of human intellect: it lets individuals learn from the collective, and the collective learn from individuals.

Google was the first accelerator of that leap. Without it, people could not effectively manage the huge number of websites. Before it, personal memory and one’s network of colleagues were the main search engines.

Web 2.0 is the next accelerator: Facebook, Twitter, Reddit…

These are all tools to obtain better content. Google gives you what you search for. Facebook gives you what your friends search for. Twitter and Reddit give you what like-minded people search for. Quora is also a good one that assists you where Google fails.

They share this trait because, as the cost of sharing information goes down, the cost of obtaining information goes up: “Information is abundant, but attention is scarce.” To be precise, it’s not the cost of obtaining information, but the cost of filtering for the right amount of information that matches YOUR attention. Each day, for example, you have 1-2 hours to read news, but the internet provides you with thousands of hours’ worth of news. Not surprisingly, working adults (stereotypically) lose interest in many things: their free time is significantly shorter, and there is no effective way to scale down the time spent on each interest – either you do it, or you don’t.

Of course, this is a good problem to have. The solution is very simple: each of us gets a personal filter that keeps track of what we have read and what we prefer, and filters the right information for us according to our available attention. I would want to see friends’ updates, then major news, then minor development tips, then active technical discussions… depending on how much time I have that day or week.
If it’s so simple, why has no one implemented it?

Because it would kill the internet.

The internet is free on the premise that you pay for it with your attention. Google is powered by ads. Facebook is powered by ads. Anything you don’t pay for is powered by ads.

Of course, individually, you don’t have to care about the TOS: just go ahead and implement your kick-ass filtering algorithm, whichever way you like it. Oh, and don’t forget to block those ad domains, too!

But looking to the future, if you want to fix this once and for all, I think we need a micropayment platform.

User U joins platform P by paying an up-front cost of $5. Platform P then provides user U with the content filters they want: ads blocked, an API, programmable filters… User U then visits sites S1, S2, S3, S4… Site S1 then charges P, for all its users U1, U2,…, the amount it would have gotten from advertisers.
The system is engineered such that no real transaction is actually micro: user U pays in chunks of $5, and platform P also pays in chunks of $5.
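A toy sketch of that accounting (all names, prices, and the $5 chunk size are my own illustration of the idea, not a real design):

```python
class Platform:
    CHUNK = 5.0  # every real money transfer happens in $5 chunks

    def __init__(self):
        self.balances = {}   # user -> prepaid credit
        self.owed = {}       # site -> accumulated micro-charges

    def join(self, user):
        # The user pays the up-front cost; this is a normal transaction.
        self.balances[user] = self.CHUNK

    def visit(self, user, site, price):
        # Each page view is a micro-charge, recorded but never settled alone.
        self.balances[user] -= price
        self.owed[site] = self.owed.get(site, 0.0) + price

    def settle(self, site):
        # The site bills the platform only once its total crosses a chunk,
        # so no payment that actually moves money is ever "micro".
        if self.owed.get(site, 0.0) >= self.CHUNK:
            self.owed[site] -= self.CHUNK
            return self.CHUNK
        return 0.0
```

The point of the sketch is the aggregation: individual visits only shuffle bookkeeping entries, and money moves in $5 units at both ends.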

It turns out this system is not easy to implement: platform P would have to deal with an enormous number of websites. But that’s exactly the problem Google ran into a decade ago. This time it’s a little harder: not only does content matter, but presentation matters, too.



From the top: Calibri, Teen, Times New Roman, Arial. Of course the best font for this would be none of them. But which one works better? What does each font tell you?

My (digital) life is broken!


tl;dr: I’m looking for processes/apps that may improve my digital life dramatically. The key is to minimize cognitive cost while maximizing searchability.

Since I started working a few months ago, the amount of information I have to handle has gone up while my free time has gone down. Besides, I just got my first smartphone. In this post I’ll go through my digital life and make some decisions to fix it up along the way.

Digital life is all about information. It lets you handle information more effectively: you can search, automate, and re-search stuff. It operates on 4 platforms: PC, remote server (S3/EC2/VPS…), mobile, and the websites you participate in. The last platform is the one you have no control over.

Below are the activities within the information flow:

1. Find
+ Pull: newspapers, personal blogs, reddit/Stack Overflow/Hacker News/TechCrunch/IEEE Spectrum
+ Push: friends, Google search subscriptions, Quora
Traditional newspapers are too driven by agendas. Personal blogs are good. Hacker News is a bit overwhelming. The Facebook feed is not very customizable. Quora is surprisingly good. Technology-wise, I think I’m well covered. But for local news in Vietnam and Singapore, or the Southeast Asia region in general, I can’t find a reliable, neutral news source with good discussion contributed by readers. Yahoo! is rubbish. Science-wise, maybe Science and Nature would do.

2. Access
I rely on RSS heavily. Recently, however, I tend to read on the Kindle. An automatic way to send news articles to the Kindle would be nice, but for now RSS still serves, as some links are not suitable for the Kindle.

SeenBefore is a good Chrome app that lets you archive and search through what you read. As it only works on Chrome, I’m now considering moving my feed reading to Chrome, or waiting until SeenBefore is ported to Firefox.

3. Note taking
There’s a note.txt on my desktop and a Note document in Google Docs. There should be a way to combine the two. Also, there should be a quick way to take notes on the go with my phone. Evernote seems to be a very good choice.

4. Writing
Vim is a good editor. The Google API lets you talk to its services programmatically, so there are command-line options that let you use Vim to write to Google Docs. Besides, Sublime Text has been receiving some buzz recently.

I code mostly in Python. IPython is a tool that lets you save a whole interactive session. Besides, I need a git repo, for both private and public access. There are several non-trivial options: GitHub, Bitbucket, Dropbox, S3. Sometimes I want to view the code; sometimes I want to run it remotely. Sometimes the code runs with data/libraries that would be costly to maintain online. Since it’s personal, I think a reasonable approach is Dropbox: I’d be the only one editing any piece of code at any given time.

5. Share
Note sharing can be done with Google Docs. Blog posts I already share automatically on Twitter and Facebook. Code I don’t often share, but if I do, GitHub would be fine.

6. Backup
I’m contemplating Amazon Glacier. With Ubuntu’s Deja Dup, that’s a potent combination. However, the pricing model of Amazon Glacier still needs to be ironed out.

All the activities in 1-6 need to be logged so that I can archive and search them later on. I’m exploring GNOME Activity Journal.

Besides, there are new options on the mobile platform, too. Calls, SMSes, and GPS should also be logged.

As you can see, there is much to be done. It makes me feel like a weirdo now :P. I guess most people never care about any of this…

Fortune 500


I compiled the full list of Fortune 500 companies over the years 1955-2012. I needed a list of prominent companies for some text analysis of news articles, and also a dataset for simple visualization. It may be of interest to you, so I’m sharing it here. Enjoy!

World news in an image


You can tell when I’m stressed: that’s when I don’t post on this blog. My first job was kind to me, but still, I almost stressed myself to the point of burning out. Leaving that aside…

Last week I had an idea: instead of a fixed wallpaper, I’d generate a new wallpaper every day from world news. After some thought, I decided to take world news from 4 sources: BBC, the New York Times, Reddit, and Google News. Their RSS feeds are parsed, and the URLs are visited. Some effort goes into avoiding non-informative images such as banners and buttons. Then the code picks a few pictures (automatically) and lays them out on a single wallpaper (in a 3-column layout). It looks like this:
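The banner-avoidance and layout steps boil down to simple heuristics. Here is a sketch of the idea (the function names and thresholds are my own illustration, not the actual code):

```python
def looks_informative(width, height, min_area=40000, max_aspect=3.0):
    """Heuristic filter: banners and buttons tend to be tiny or very elongated."""
    if width * height < min_area:              # buttons, icons, tracking pixels
        return False
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect                # banners are long and thin

def pick_columns(images, n_cols=3):
    """Distribute the picked images round-robin into a 3-column layout."""
    cols = [[] for _ in range(n_cols)]
    for i, img in enumerate(images):
        cols[i % n_cols].append(img)
    return cols
```

A classic 468x60 ad banner fails both size and shape checks, while a typical news photo passes.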

Of course you’d wonder what each image means, and possibly which article it is from. The code therefore generates an HTML page with relevant summaries and references. Sometimes I enjoy the HTML page even more. That’s when you realize you’re constantly bombarded with so much text and so many irrelevant images online.

The code is available for your tweaking, here. See run.bat for running instructions.

Introduction to Bayesian Statistics, William M. Bolstad


While reading introductory texts in machine learning, I found that my statistics background was not sufficient. Classical statistics is not often used in machine learning nowadays: hypothesis tests, sampling distributions, and p-values are rarely mentioned. Instead, people talk about priors and posteriors. This book was recommended, so I gave it a try during the winter vacation.

Overall, the book was a good experience. It’s targeted at people with some exposure to statistics – those who have had to deal with data and wanted to make sense of it at some point. It delivered what it promised, and the material is very accessible.

The first half of the book is the part worth reading. The second half is not as well written, possibly because more sophisticated Bayesian methods had not yet been developed, and the author prefers analytic solutions over numerical computation.

Given those caveats, this is a MUST-read for those who have experienced only the frequentist or only the Bayesian approach. The book compares the two while solving several classical statistical questions in a clear and concise manner.

Will I be converted to Bayesian statistics after reading this book? Yes and no.

Bayesian statistics models our beliefs explicitly. It is suitable for machine learning tasks, and also for thorough statistical studies and surveys. However, it’s hard to come up with a reasonable prior. For an amateur statistician, it makes more sense to take an off-the-shelf procedure that gives p-values and whatever measures have been curated in publications; it’s just not worth the hassle. And the frequentist approach gives formulas that are easier to explain: you prefer k/n to (k+1)/(n+2), because the former is simpler and easier to understand for people with no statistical background.
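Concretely, for k successes in n trials, the frequentist estimate is the raw frequency, while (k+1)/(n+2) is the posterior mean under a uniform prior (Laplace’s rule of succession). Both are one line to compute:

```python
from fractions import Fraction

def freq_estimate(k, n):
    # Maximum-likelihood estimate: just the observed frequency k/n.
    return Fraction(k, n)

def bayes_estimate(k, n):
    # Posterior mean under a uniform (Beta(1,1)) prior: (k+1)/(n+2).
    return Fraction(k + 1, n + 2)
```

For 7 successes in 10 trials, the frequentist says 7/10; the Bayesian says 8/12 = 2/3, slightly shrunk toward 1/2 by the prior.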

When forced to compare the frequentist and Bayesian approaches, one is constantly reminded of one’s philosophy. It seems that, after all, humans are always subjective. As long as you are only calculating numbers, you are on the safe side of mathematics; but once you try to make any sense of those numbers, you start to impose assumptions. Reasonable assumptions yield reasonable conclusions, and vice versa. For those hiding their heads in the sand, thinking numbers can give them the definitive answer – well, they are wrong. To sum up, I’d like to say:

Lies, damned lies, and statistics
