Getting users via Reddit

It’s tempting to focus purely on the engineering or research of a project. Hmm tempting isn’t the right word… it’s the default approach. In a typical software engineering or research job, you’re trained to leave other aspects of the project to marketing/business/etc.

However for side projects the entire solution is your responsibility. That means not just making a great product but also acquiring customers or users. And even making a great product involves graphic design, user feedback, support, and other areas outside of a research-and-development role.

So last week I tried plugging Over 9000 on Reddit to grow the user base and learn more about achieving traction.

What I did

I made a thread in /r/dataisbeautiful and /r/anime with a link and short description tailored for each subreddit. /r/dataisbeautiful has 3.4 million subscribers and /r/anime has 288k. After making the two threads I emailed some friends and asked them to upvote. My goal was to get enough upvotes to get onto the default/hot list for each subreddit for a few minutes. Or at least to stay on the first page of the new list for a while. I monitored comments for a few hours and replied to any questions quickly.

Feedback

There were unfortunately no comments on /r/dataisbeautiful. I was hoping someone would have a good idea of how to make better use of my data.

The comments on /r/anime were useful and the biggest bug report is that shows are sometimes split into two. I’m working on a solution for that.

Data

I’m using Google Analytics and it shows some interesting trends. First up are page views for June 2015:

Page views by day, peaking on 6/23

I made the Reddit threads on 6/23. Before that I’d been getting 0-80 pageviews per day. The Reddit threads produced 533 views on 6/23 and 131 on 6/24. Since then it’s back to regular levels, but there are no longer any days with zero views.

Demographics before Reddit

One of the difficulties in using Google Analytics is that much of the traffic is fake: 14% of it comes from fake referrals (see this).

Looking over the behavior flow, 98% of visitors view the front page and then leave.

Geographically, most of the traffic is from the United States, with some from China, Japan, Germany, and Brazil and a little from other countries. Location is not set for 24% of sessions, and those sessions have an average duration of 0 seconds (probably web bots). Here’s the raw data.

Map of the page views before Reddit

99% of visits are from desktop over this time period. The only 3 visits from mobile were all from my own phone.

Demographics after Reddit

Fake referrals make up only 4% now!

96% of visits view the front page and leave (down from 98%). That means more users are visiting the Winter 2015 or About pages (but only a few).

Geographically, only 7% of sessions have an unset location (vs 24% before). The United States accounts for 51% of sessions (vs 34%). English-speaking countries follow the US: Canada, the UK, and Australia. Maybe Reddit skews toward English-speaking countries? Raw data link.

Map of page views after Reddit

The device demographics from Reddit are different: only 72% of visits came from desktop (vs 99%), with 23% from mobile and 5% from tablet. I’m glad the site is responsive to device size!

In browsing the data I saw that Google Analytics tracks page load speed:

  • main page: 2.45 sec (that’s horrible!)
  • winter2015: 0.45 sec
  • about: 0.00 sec

When I look at the pre-Reddit numbers they’re all 0 sec. The slow load of the front page is due to external CSS/JS, which is fast for repeat users and slow for new users. Although the resources are hosted on a CDN, Materialize isn’t popular enough that new visitors already have it cached. Likewise my own CSS won’t be in the cache for new visitors. In the future I’ll try inlining these.

If I could do it over again

  • I’d track how many referrals came from each subreddit. If I remember correctly you can add a parameter like utm_source=blah to the URL you post to do that (see the sketch after this list).
  • I’d solicit more friends in advance to keep the threads trending for a little longer and try to get them in sync to upvote at around the same time. Once the thread is off the first page it’s basically dead.
  • I’d look up more ways to get a post onto a subreddit front page.
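
For my own reference, here’s a minimal sketch of that kind of tagging in Python; the URL and parameter values are placeholders, not something I actually used.

# Minimal sketch of tagging a shared link with UTM parameters so Google
# Analytics can attribute traffic per subreddit. The URL and parameter
# values here are placeholders, not what I actually used.
from urllib.parse import urlencode

def tag_url(base_url, source, medium="social", campaign="launch"):
    """Append utm_* query parameters to the URL you post."""
    params = urlencode({
        "utm_source": source,      # e.g. "reddit_anime"
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + params

print(tag_url("http://example.com/", "reddit_anime"))
# e.g. http://example.com/?utm_source=reddit_anime&utm_medium=social&utm_campaign=launch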

REST API design tips

For the server side of Pollable we iterated on the REST API several times and learned a lot the hard way. This excellent article also has great design tips.

We’re using RESTful calls with Json only and https only, so much of this post deals with the design of the Json objects.

Know your API user

Of course you should understand the client’s use case. But more than that: You should understand what libraries they’re using.

For example, GSON is commonly used on Android but doesn’t cope well with optional fields: does 0 mean the value is missing or actually set to 0? (A developer using GSON would need to use Integer rather than int to tell the difference.)

Other client-side factors to consider, in no particular order:

  • Will your Json field names bleed over into model objects in Java or Objective-C? Will that lead to awkward conflicts with variable names in those languages? GSON has field renaming policies but you have to consider whether a typical API user knows about that or wants to write the code.
  • Will the returned Json objects be dumped into a SQL database like an Android ContentProvider? If so, you may want to avoid hierarchical objects so the data maps more easily to a flat table.
  • Optional fields/values cause tons of trouble.
    • One of our pain points was a field for your vote on a poll. At first we didn’t include it for unvoted polls or related polls, but that led to excess client code to merge updates into existing client data rather than simply overwrite it (sketched below).
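
To make the pain point concrete, here’s a rough sketch of the two payload shapes; the field names are made up for this example.

# Hypothetical poll payloads illustrating the optional-field pain point;
# field names are made up for this example.

# Omitting the vote when the user hasn't voted forces the client to merge:
poll_missing_vote = {"id": 42, "question": "Cats or dogs?"}               # no "my_vote"
poll_with_vote    = {"id": 42, "question": "Cats or dogs?", "my_vote": 1}

def merge_update(local, incoming):
    # The client must update field-by-field so an omitted "my_vote" doesn't
    # accidentally wipe out the locally cached value.
    for key, value in incoming.items():
        local[key] = value
    return local

# Always sending the field (null/None when unvoted) lets the client overwrite:
poll_always_has_vote = {"id": 42, "question": "Cats or dogs?", "my_vote": None}

def overwrite_update(local, incoming):
    # Safe to replace the whole local copy.
    return dict(incoming)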

Consistency is king

The more consistent your API, the less code the client has to write (and probably fewer bugs).

One issue is error handling. It’s nice and clean if you can use HTTP error codes. Suppose an endpoint like PUT /poll returns a 400 error for invalid data.

But what about a version of the endpoint that adds 100 polls as a batch? Do you return HTTP 400 if one poll is bad? In that case, do you accept all the others or reject them all? Or do you return 200 with failed status codes/messages for each individual poll? We do the latter and have an inconsistent API. :(

Possible options, none ideal:

  • Eliminate singleton endpoints, forcing clients to wrap data in a singleton array. This forces code reuse but it’s not user-friendly.
  • Return a 400 status for any bad data at all and reject the entire array input. For some apps this is exactly what’s necessary – maybe the objects in your import don’t make sense without each other.
  • Wrap responses in an envelope to convey success/failure, and use the same envelope for singleton and array data (sketched below). This can lead to confusion because you’re returning HTTP 200 OK, but it seems best in the long run.
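
Here’s a rough sketch of what such an envelope might look like, written as Python dicts before Json encoding; the field names are hypothetical rather than our actual API.

# Hypothetical response envelopes, shown as Python dicts before Json encoding.
# The HTTP status is 200 in both cases; per-item success lives in the envelope.

# Response for a singleton PUT /poll
single_response = {
    "status": "ok",
    "data": {"id": 42, "question": "Cats or dogs?"},
    "errors": [],
}

# Response for a batch of three polls where the third was rejected
batch_response = {
    "status": "partial",
    "data": [{"id": 43}, {"id": 44}],
    "errors": [
        {"index": 2, "message": "question is required"},
    ],
}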

Fail early

When you’re prototyping a Python/Tornado server with a MongoDB backend, it’s tempting to interpret the Json from clients, process it a little, and insert it straight into MongoDB. But Json in Python decodes to a plain dict, and MongoDB doesn’t enforce a schema.

So you might be inserting bad data into MongoDB, and a completely separate endpoint may fail the next day. Debugging gets harder the further the error propagates from its root cause.

In the case of Python with MongoDB, basic Json validation is very helpful. Fields you process right away will generate errors if they’re missing or incorrect; validation catches problems in the fields that would otherwise go straight into the database.

But when you’re starting out you really can’t afford 500 lines of validation code that you update twice a day. We went with jsonvalidator for its simplicity: provide an example Json object and it’ll check that all the fields are present and have the right types (the sketch below shows the general idea). We also tried jsonschema in conjunction with orderly, but it was verbose and didn’t deal as well with required fields.
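
To show what that buys you, here’s a hand-rolled sketch of example-based validation; it illustrates the approach rather than jsonvalidator’s actual API, and the example object is made up.

# Hand-rolled sketch of example-based validation (an illustration of the
# approach, not jsonvalidator's actual API): check that every field in the
# example object is present in the incoming data with the same type.
EXAMPLE_POLL = {"question": "Cats or dogs?", "choices": ["cats", "dogs"], "owner_id": 7}

def validate_against_example(data, example):
    errors = []
    for field, example_value in example.items():
        if field not in data:
            errors.append("missing field: %s" % field)
        elif not isinstance(data[field], type(example_value)):
            errors.append("wrong type for %s: expected %s"
                          % (field, type(example_value).__name__))
    return errors

# In a request handler, fail before anything touches MongoDB:
errors = validate_against_example({"question": "Cats or dogs?"}, EXAMPLE_POLL)
if errors:
    print(errors)   # e.g. respond with HTTP 400 and the messages instead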

Documentation helps even on a team of three, but save yourself some pain and catch errors from typos early on with validation.

General consideration: Should the client do more work or the server?

We started off with one client implementation, so at the time I felt it didn’t matter; we’d only have internal clients anyway. It seemed like a debate about who’d spend their time on the work, really.

Since then I’ve changed my tune: eliminate as much client code as you can. This is partly due to iOS: even bug-fix releases can take two weeks to make it through app approval, but you can update the server much faster.

Quick and dirty mongodb backups

I’m using MongoLab via Heroku for data storage in my current projects. It’s very easy to set up, but backups aren’t free. When you’re still exploring ideas you don’t necessarily want to spend $15/mo per Mongo instance for backups, and to restore a backup you have to download it and run a bunch of commands manually anyway. So I wasn’t doing backups.

But then I accidentally cleared out the tables once or twice, and that was inconvenient for testers. I wanted a solution that’s cheaper or free; it should run most days, and it’s fine if it degrades performance a bit.

So I have a cronjob that runs this bash script once a day from my laptop. It saves the POLLS and USERS collections from the dev and prod instances into folders named by date.

#!/bin/bash
# Daily export of the POLLS and USERS collections from the dev and prod databases.

source ~/.bashrc

SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
BACKUP_ROOT="${SCRIPT_DIR}/../backups"
DATE=$(date +%Y-%m-%d)

function backup_mongo {
    local HOST=$1
    local DB=$2
    local USER=$3
    local PASS=$4
    local LABEL=$5

    # One dated folder per instance, e.g. backups/dev/<date>
    local OUTDIR="${BACKUP_ROOT}/${LABEL}/${DATE}"
    mkdir -p "${OUTDIR}"

    # Export each collection as Json (one document per line).
    for COLLECTION in "POLLS" "USERS"
    do
        mongoexport -h "${HOST}" -d "${DB}" -c "${COLLECTION}" \
            -u "${USER}" -p "${PASS}" -o "${OUTDIR}/${COLLECTION}.json"
    done
}

backup_mongo "dev_host:dev_port" dev_db_name dev_backup_username dev_backup_password "dev"
backup_mongo "prod_host:prod_port" prod_db_name prod_backup_username prod_backup_password "prod"

If my laptop isn’t on, it doesn’t run, but most days I’ll get a backup. The documentation clearly doesn’t want me to use mongoexport, but its caveats don’t affect the kind of data I’m storing. And there are side benefits to a Json copy: it’s easy to open in an editor or even in Python.
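
For example, mongoexport writes one Json document per line by default, so poking at a backup from Python takes only a few lines (the path below is just an example):

# Quick look at a backup: mongoexport writes one Json document per line by
# default, so each line parses independently. The path is just an example.
import json

polls = []
with open("backups/prod/2015-06-28/POLLS.json") as f:
    for line in f:
        line = line.strip()
        if line:
            polls.append(json.loads(line))

print("%d polls in the backup" % len(polls))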

I also made read-only users for doing backups for a tiny bit of added safety.

This isn’t a big triumph, but it saves me $30/mo (two databases) and gives me easier access to the data.

In a nutshell, the documentation advises against this for good reasons: Json doesn’t perfectly preserve MongoDB’s BSON types, an export isn’t guaranteed to capture a consistent state of your database, and a full collection scan puts heavy load on the server. But when things are just starting you want something just barely good enough so you can move on to the important stuff.

Software engineering estimates and League of Legends

This might only make sense to players of MOBA-style games.

There’s been renewed debate lately over time estimates in software engineering such as the no-estimates debate. It’s incredibly hard to estimate how long a new feature or bug fix will take.

And it baffles people when you don’t know how long it’ll take. I prefer to say “It’ll take 3-4 hours if things go according to plan, but it could take 1-2 weeks depending on how things break.” I’ve had to explain this over and over, which probably means my explanation isn’t good enough. This time I’ll try an analogy with League of Legends.

“I need a ward at their red buff asap”

In some ways, making a software change is like trying to get a deep ward in the enemy jungle. There are many possible outcomes:

  • (Best case) You run there, drop the ward, and run out. It cost you maybe a minute plus 100g.
  • You run by a stealth ward in the process. Your ward is cleared out when you leave and you need to do it all over again. And you lost time and gold that you needed for other tasks.
  • You get killed by the enemy jungler. So it costs you extra time and feeds gold. It also increases risk for the rest of the team.
  • You get caught by enemy jungler and try to run. You barely make it to a teammate and then 2 more enemies pop out and you both die. You lose even more time, transfer even more gold, and give up more map pressure.
  • And plenty of worse cases…

In all the bad cases you need to try again, and there’s a chance of the second attempt failing as well.

So you can’t say exactly how much it’ll cost you to ward. The cost of the best case is very clear. But there isn’t a clear bound on just how badly it could go wrong.

Warding isn’t always this dangerous and similarly software isn’t always so unreliable.

Reducing risk in League of Legends

You reduce risk in League by gathering information. If you can see all five enemies on the map, you can make a good guess about whether it’s safe to ward. If you watch people moving in and out of lane, you can guess where the stealth wards are. And so on. Your understanding makes the outcome more predictable.

In League there are maybe a few dozen ways your task can go wrong, but you can eliminate many of them by gathering information and reasoning about what’s possible.

Reducing risk in software engineering

Unlike League, you generally have vision of (access to) all the source code. In theory you could have a perfect understanding of how the software works, but that’s rarely the case.

Usually I’ve worked at the intersection of complex systems. They may span hundreds of thousands of lines of code, cover multiple programming languages, and have been written by 50-100 people, many of whom are gone. Although you have visibility into the code, you may not have a complete understanding of it. For instance, you may assume an index variable starts from 0 when it actually starts from 1, or assume that a function has no side effects only to find that it sometimes does.

The better you understand the system, the tighter your estimates can be. It’s unlikely that you’ll ever have 100% understanding of the full stack so improving estimates is a lifelong process.

Unlike League, there are probably hundreds of ways a seemingly simple task can go wrong, because the tasks themselves are more complex.

Back to software estimates

Rough estimates are often necessary. But it’s problematic when an estimate is turned into a hard client commitment. It’s better when both sides of the table understand that estimates are really a range and that there are almost countless ways that software can break and take more time.

Using Chromecast with Coursera

I’m taking the Machine Learning class on Coursera with a friend and we wanted to watch the lectures together on the TV.  Recently he got a Chromecast, which is compatible with many Google apps, Netflix, and Hulu.  But the Coursera lectures aren’t on YouTube so it’s not easy to use the Chromecast.  Someday the Chromecast will support more third-party apps and Coursera will have an app, but in the meantime…

Solution 1: Batch download videos, upload to YouTube, use Chromecast

This is the first solution we tried.  It takes a couple of steps:

  1. Install coursera-dl (requires Python 2.7 and pip)
  2. Download all lecture materials for your course to your local disk
    Note:  The first time I tried, I mixed up the order of the destination folder and the course code, and it gave funny error messages like it couldn’t find the course “.”, which was entirely my fault but confusing.  Also, the course code is in the URL of the course webpage.
  3. (Windows) Search for *.mp4 in the folder you specified for download.  Other operating systems have an equivalent.
  4. Upload to YouTube:
    1. Sign in to YouTube
    2. Click Upload at the top (or you can navigate to your video manager and go from there)
    3. Set the default privacy (Private)
    4. Drag and drop the files from your folder search
  5. If you haven’t verified your YouTube account, it’ll lock any videos over 15 minutes long.  If that’s the case, navigate to the video and there should be a link to verify your account via SMS.  Then it’ll unlock the longer videos.
  6. Create playlists from the video manager and make sure that they’re in-order so you can play them smoothly.

The problem is that the Chromecast can’t play Private videos.  To get around this, go into YouTube’s video manager, select all, and change privacy to Unlisted.  Now you’ll be able to watch on Chromecast!  We just stream from our phones with this method.  (YouTube/Chromecast isn’t actually streaming from the phone so it doesn’t drain your battery.)

The main drawback is that it takes a bit of work.  You also lose the features of playing back via Coursera webpage, such as speeding up the video, automatic pause during quizzes, and good subtitles.  Otherwise, this approach is great.  Playback is perfectly smooth.

Also note

  • Don’t forget to change the videos back to Private after you’re all done!
  • This should also work for watching lectures on Playstation 3, Wii, Xbox, smart Blu-ray players, smart TVs, etc.  The hardest part of streaming on those devices will probably be navigating to your playlists.

Solution 2: Use Chrome on a laptop, stream to Chromecast

You can also stream a tab from Chrome on your computer to the Chromecast.  This is more straightforward and requires much less setup than the other approach.  But these are the drawbacks we’ve experienced:

  • You have to keep the laptop on and nearby so you can do the quizzes, which drains your battery.
  • Chrome streaming doesn’t work nearly as well.  In about an hour of videos, the connection failed twice and we had to reconnect.  In contrast, the connection didn’t drop at all for about 2 hours with the YouTube method.

Thoughts

The YouTube method is hacky, but it works.  I’m looking forward to being able to stream directly, perhaps from Coursera Companion or an official app when it’s available.  I wonder what the speed and quiz experience will be like?  Video playback on the TV and quiz on the phone?  Quiz as just a pause like the mp4 files?

Also, taking a course with a friend is much better than taking it alone!

latex + url = ???

Learn something new every day.

When I put a url in \url{} in LaTeX, it looked clickable in the output but didn’t link anywhere because I left out the http://. I’m so used to web browsers that I expected it to default to http when no protocol is specified, but it doesn’t. Without the http:// it’s just a dead link that doesn’t do anything.

Note:  For all I know, the behavior could be specific to the PDF reader you use.

GmailTeX

I came across this post on GmailTeX today, which is a browser plugin to render LaTeX in Gmail.  It looks like it may take up extra space on the screen from what I can tell, but I can imagine that it’d simplify life a lot for some people.

If only there were a good, simple version for OS X Mail.