Daniel Harrison's Personal Blog

Personal blog for Daniel Harrison

Brisbane YOW Talk October November 3, 2011

Filed under: business,development,Uncategorized — danielharrison @ 3:17 am

I went to the latest Brisbane YOW talk in October which had a focus on cloud computing and analytics.

First was Readify and MYOB to .NET Cloud. This was interesting as I haven’t come across the .NET cloud in practice, due to it being comparatively late on the scene and most distributed computing projects I’ve come across tending not to be .NET. I’ve tended to favour an environment where there’s more control, à la EC2 and Java/Scala/C based solutions, even over App Engine. It seems like a competent solution that has adopted common standard practices: SQL hosting or a Bigtable-like data store, workers and worker queues, etc. It’s been a few years since I’ve shipped products in .NET, but it gave me confidence that if I was stuck without any alternative, the same patterns and practices I’ve been using could be brought across pretty easily.

One point of interest was single tenant vs multi-tenant data hosting. My experience is that multi-tenancy is much, much harder architecturally, particularly when managing upgrades and concurrent version support. While it’s the holy grail for potential efficiencies, it seems to have lost the impetus it once had as the shining light on the hill. The pattern I seem to be seeing is that multi-tenancy is losing to virtualised single tenancy stacks. Given the speed and cost effectiveness of spinning up on-demand instances, the ease of backup, and tools like Chef and Puppet that make provisioning much easier, single tenancy is becoming the default. My theory is that it’s become _so_ cheap to run virtualised stacks in a public cloud provider that the cost of architecting and developing multi-tenant solutions isn’t justified for most classes of problems.

One thing I don’t think we got a really clear answer on was the legal implications of the cloud and offshore hosting. From my understanding, even if data is encrypted, requirements like PCI-DSS and others make it almost impossible to use an offshore cloud for data persistence. In Australia this rules out most public cloud providers, but I strongly suspect most companies are rolling this out at a dev level and not really concerning themselves with the legal implications. I was really hoping that we’d get an Amazon cloud here, but it seems like Singapore will be the local hub for the generic cloud providers. Given the sheer size of the government market I can see a few providers being lured onshore, but with fibre being expensive until the NBN really gets cranking, it doesn’t seem like it would be very cost effective for them.

Dave Thomas’s was the presentation I was looking forward to most. It focussed on end user computing for what he terms thinkers (analysts, data scientists, economists, etc). This is a topic dear to my heart, with my original degree being in economics, and for a new project I’m looking at kicking off I’ll be working with some exceptional analysts that we’ll need to empower. I’ve been thinking a lot about how to harvest and collect data and, with some kind of cooperative process, build a toolchain for experimental and then deployable models. This is an area that is awash with hype and money at the moment due to the promise of what it can deliver. It really feels like the early days of Hari Seldon. The main takeaway I had was that to empower these users effectively, the tools we write as engineers need to cross the boundaries from high performance computing to language design and, most importantly, usability, with the analyst at the centre. These are all individually hard problems to solve as it is, and we’re in very early days. It explains why companies such as Palantir et al are growing so fast and getting a lot of serious attention and money. If you get a good solution, I think it’s very easy to see that it will revolutionise business data processing as the database did before it.

The tool he demoed would be particularly useful as a generic data analysis tool, and seemed to me a general purpose tool for starting to understand the data, visualising it, etc, with a view to determining a specific answer. It was a very brief glimpse but it seemed oriented towards solving segmentation queries, eg tracking down a subset of a larger population given various tracers, patterns etc. It seemed pretty effective and gave analysts the ability to mine large amounts of data and segment down to some subset of interest in what seemed close to realtime. Part of what I see as an excellent data modelling habit is to get down and play with the dirty, dirty, dirty data. You need to understand its characteristics and this tool would fit the bill. It’s weird when you think about it in a way: these very expensive tools are processing terabytes and petabytes of data to produce formats where an analyst can apply their superior pattern recognition ability to solve the problem and draw often non-intuitive deductions. It’s all about getting it into a format our highly fallible brains can work on. Both this tool and, from what I’ve seen, the new breed of tools such as Palantir mean you can process massive amounts of data to identify, visualise and segment interesting items, something that only years ago was simply too slow to respond to in any meaningful way. You can do lots of experiments, visualise the data and then go on to discover more interesting trends and pointers in realtime, so I really see these tools changing the face of the analytics profession. In uni we would run through data and get some dodgy little black and white line graph that was next to unintelligible, and you’d have to kill -9 if you were overly ambitious in your data usage. It’s changed so much in a decade.

With this ability to record everything everywhere, and now analyse it quickly and get initial results in near real time, businesses and government can be much more responsive in dealing with everything from planning to breaking emergencies. While I think this is a boon for social research and for faster, improved responsiveness from governments, I strongly suspect it’s really going to be most used in finance and in getting us to buy more, faster ;)

It did get me thinking though, and spurred a few conversations with a few colleagues doing big, big data analytics. From my experience in economic modelling and some peripheral fraud detection, getting an answer is the /start/ of the job; the next step is to build a tunable predictive model and hook it up to some actions. My feeling is that /most/ of the time you are trying to build a model that then learns (in a constrained manner) and reacts on its own. It will be customised and monitored by less analytical staff, who tweak parameters based on current trends and observations. Identifying the trend is obviously the first half of this model, but you need to do something with it, and I wonder where tools will take us. I guess in engineering parlance, instead of returning a value I’m returning a function that changes based on the inputs. How do we develop tools that allow the building of dynamic models we can use as filters, event drivers and adapters in the systems we ship today? Things that are not static but, given core parameters and a stream of information to eat, adjust in a predictable manner. Will we ever have a scenario where an analyst models and analyses data and outputs a compiled artifact that we slot into our systems as a core observer and action-initiating blob? It seems to me like we’re heading towards some kind of model which is part rule system, part integration code and part tunable analysis system. My previous role at Oracle was leading development of a high performance rule modelling system for policy experts. I think that, coupled with a dynamic and probabilistic model, it would be possible to put something together that operates this way over large, real time data sets and streams.
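
To make the "returning a function instead of a value" idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names make_drift_detector, alpha and tolerance are hypothetical, and the "model" is just an exponentially weighted running average, not anything shown in the talk. The point is that the analyst hands over a function with a couple of tunable parameters, and that function adjusts itself, within limits, as data streams through it.

```python
from typing import Callable, Optional


def make_drift_detector(alpha: float = 0.2, tolerance: float = 0.5) -> Callable[[float], bool]:
    """Return a function that flags observations far from a running average.

    alpha     -- how quickly the baseline adapts (hypothetical tuning knob)
    tolerance -- allowed relative deviation from the baseline before flagging
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    baseline: Optional[float] = None  # learned state, hidden inside the closure

    def observe(value: float) -> bool:
        nonlocal baseline
        if baseline is None:
            baseline = value  # the first observation seeds the baseline
            return False
        flagged = abs(value - baseline) > tolerance * abs(baseline)
        if not flagged:
            # Constrained learning: only unflagged values move the baseline,
            # so a single outlier cannot drag the model along with it.
            baseline = (1 - alpha) * baseline + alpha * value
        return flagged

    return observe


detector = make_drift_detector(alpha=0.2, tolerance=0.5)
results = [detector(v) for v in [100.0, 104.0, 98.0, 300.0, 101.0]]
print(results)
```

The returned detector is the artifact that would be slotted into a system as a filter or event trigger; operations staff would only ever touch alpha and tolerance.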

Overall the YOW nights are excellent and this was no exception. It’s great that these high quality speakers are coming to Oz now and I’m really looking forward to the conference in December.

 

Things to watch out for in HTML5 IndexedDB as at 21 June 2011 June 21, 2011

Filed under: development,internet,javascript,web — danielharrison @ 6:59 am

I’m between contracts at the moment, so I’m taking the opportunity to play with some bleeding edge technology. With everyone seemingly jumping on the HTML5 bandwagon, even Microsoft with Windows 8, it seemed like a good opportunity to restart my side project playing with the latest web tech.

So there are a few things to note if you pick up IndexedDB. It’s bleeding edge so this is to be expected, but here are my experiences over the last week.

IndexedDB is in WebKit (Chrome) and Firefox but not yet in Safari. The database visualisation in the WebKit developer tools isn’t linked in yet, so you can’t manage the database that way. You can’t delete a database programmatically yet in either Chrome or Firefox. If you’re writing unit tests this is going to be a bit of a pain ;). Also, you can’t yet access IndexedDB from web workers. At this stage it’s attached to the window. One of the things I’m playing with is a stemming and text sorting index which was all running via web workers. There’s an easy workaround: you take the results from the web workers and, at a convenient time, merge and store them from the main page instead of storing directly from the worker. Still, it will be cool when this works.

The other thing I’ve noticed is that it feels very different from other data stores, even other key-value stores such as Cassandra. It really is a JavaScript data store. The feeling I get is that the asynchronous model is the preferred interaction method, which again feels different from other APIs. I’m still getting the feel, but it feels right for client side JavaScript. In my opinion, if I had to choose between the SQLite model and this, I’d choose this as the better technology direction for browser based client structured storage. SQLite would have just recreated the SQL feeling of datastores and I don’t think it would have felt quite right for JavaScript in the long term.

I’m sure these will be addressed pretty shortly. I’m running the Chrome alpha and dev channels and Firefox 5, and will post back when I notice a change.

 

Services and Contracts May 19, 2011

Filed under: development,web — danielharrison @ 10:10 am

I’ve been playing a bit this week with WADL. The service I’m playing with is a JSON REST service whose contract is represented in WADL. It’s got me thinking about the role of service descriptions in a post-WSDL world. Fundamentally, if I release a service to the world, must there be a method of specifying A) the endpoints exposed by the service and B) the data format it accepts, in a published standard? So really the question is: if I release a service, what’s the best way of helping people write a client that will interact with it? Or maybe: if the integrators are knife-wielding coding maniacs, what should I do?

First, my thoughts on WADL. WADL works well with XML based services. The data can be typed with XSD and, in combination with the param path (an XPath to the node of interest), there’s sufficient information to generate an implementation that interacts with the service. It breaks down when trying to represent JSON (or alternate formats): path is undefined, and as JSON doesn’t have an official schema definition, there’s no way of specifying the contract of complex JSON types or the path into the payload. There are near-standard schemas of course, such as JSON Schema and a number of notable others. So it’s possible, with a hodgepodge of almost-standards, to fully specify a JSON REST service in a way that could theoretically be used for code generation. The major impediment is tool support. At the moment, with a JSON service you end up with multiple points of truth: the WADL, and then the documentation around the JSON payload covering what the parameters mean and their business logic, plus or minus sample integrations.
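
As a rough illustration of what "specifying the contract of a JSON payload" could look like without any official schema language, here is a small Python sketch. Everything in it is an assumption for the sake of example: the ORDER_CONTRACT fields and the contract_errors helper are hypothetical, and this is a hand-rolled stand-in, not JSON Schema or anything WADL provides.

```python
import json

# Hypothetical contract for an "order" payload: field name -> expected Python type.
ORDER_CONTRACT = {
    "id": int,
    "customer": str,
    "items": list,
}


def contract_errors(payload: dict, contract: dict) -> list:
    """Return human-readable problems; an empty list means the payload conforms."""
    errors = []
    for field, expected in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


good = json.loads('{"id": 7, "customer": "acme", "items": ["widget"]}')
bad = json.loads('{"id": "7", "items": []}')

good_errors = contract_errors(good, ORDER_CONTRACT)
bad_errors = contract_errors(bad, ORDER_CONTRACT)
print(good_errors)
print(bad_errors)
```

Even something this small is a second point of truth alongside the prose documentation, which is exactly the duplication problem described above.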

So if WADL by itself isn’t sufficient for JSON, the question is: how do you hand over a JSON service to an integration team and get them to use it effectively? At the moment the answer seems to be: here’s REST, learn it; here’s JSON, it’s simple; here’s our documentation about the data we expect. As an experienced developer it’s easy to expect that these technologies are mastered and that it’s a simple half-day task to get up and running. However, having shipped and supported a product that external developers have had to write an integration to, the lesson I’ve learned is that it’s never simple enough! Other developers will not bother to understand the technologies, will not read your documentation and will consider it your problem. JSON and REST are fundamentally simple building blocks … once you’ve mastered a number of other technologies and building blocks. My experience is that people writing integrations against your API (mainly in the enterprise space) are time pressed and are often the most inexperienced developers. So how do you cater for them?

The main benefit of WSDL in my experience is code generation. Integration developers don’t need to understand SOAP; they point at the WSDL (WS-I compliant of course) and boom, get a client with a mostly understandable business object model. Put your app on top, populate it with data and let the generated code take care of the rest. Does this need to exist for JSON REST services? The immediate answer is YES, of course! But it runs a bit deeper. The downside of the WSDL approach is that it’s a lot of magic. When things work, they work well. As soon as a problem crops up, it can be an almost impossible task to understand what isn’t working and where. When you haven’t had to fundamentally understand the technology stack and rely on the generated magic, the masked complexity becomes an insurmountable problem; complexity always escapes. The WSDL stack is actually quite deep and complex, and the solution to that complexity has been code generation, wizards and, well, magic. JSON REST in my view is a fundamental shift in how the problem is solved. It’s not ‘let’s specify this more completely with another standard to ensure better interoperability, and generate software where the complexity is masked’, but a re-orientation where, with a bit of basic knowledge, the problem is sufficiently simple not to require that additional complexity. If JSON REST services get complex enough to require the overhead of complex integration specifications and code generation then to some extent they’ve failed; the technology stack has failed to be sufficiently simple!

So here are my conclusions. Like most complex problems in software development, it’s a human problem. I think the choice to use JSON is a choice about the users you want and expect to use your service. WSDL may be a more appropriate solution, particularly in the enterprise space. If shipping a JSON REST service: ship documentation, an example stub program in each of the languages you want to officially support, and JSON samples (drop the WADL). In the best of both worlds, you ship BOTH and allow users to self select. Most of the core concepts are identical, and with a little clever architecture in your product’s service it’s pretty easy to do.

 

Options and Tradeoffs for Rich Text Editing in HTML5ish Technologies March 11, 2011

Filed under: development,internet — danielharrison @ 9:22 pm

There are a number of options for adding rich text editing to your website. All have tradeoffs, and the right choice will be guided by the amount of control you need.

Content Editable

Content editable is the default solution for text editing on the web. Originating from Microsoft’s pioneering work in the 4.0 browsers, the basic API is now supported by all browsers. It’s the technology behind most rich editors: TinyMCE, YUI editor, CKEditor. The problem though is that the technology is quite old in internet time and the API doesn’t smell quite right in 2010. The API isn’t one that will feel familiar to developers used to JavaScript, jQuery and DOM manipulation. It lives at a higher abstraction via document.execCommand. If you apply the bold command to a set of text it doesn’t return a selection, the new element or a set of elements, and it doesn’t really care about the DOM at that level. If you do want to take a DOM centric approach you’ll need to attach listeners for node operations and get a bit clever about understanding what changed. Most frameworks mean you don’t really need to care, and abstract it away sufficiently that it’s easy to have a competent, performant solution ready in a couple of hours. The contentEditable technology does address some of the complexity around formatting that you’d have to solve yourself if you took an ownership position. For example, applying bold or converting to a list works on nested content and gets it right enough. It doesn’t produce what would be considered the cleanest HTML, eg every paragraph is <p><br></p> (<div><br></div> in WebKit based browsers). It’s the good enough solution: if you’re happy to make it a desktop browser based experience and want a quick solution, this is the easiest. You also get things like spell checking for free (most browsers now support this by default).

One extension to contentEditable is to use the selection API. It has facilities to surround content, insert elements at the start of the selection and so on, and to manipulate HTML based on user input. In some ways the selection API is easier to use as it has a DOM based view of the world, which makes it much easier to integrate with bleeding edge technologies like HTML5 history.

I’ve been keenly monitoring the ADC for news of when content editable will be supported on the iPad with Mobile Safari, but it doesn’t seem like this is a near term priority. It’s still not supported in the latest iOS 4.3 release. So contentEditable is ruled out if you’re targeting the iPad; other tablets I’m not so sure of. To some extent this is not surprising, as getting the experience right for tablet devices is going to take some thinking, given it certainly wasn’t envisaged with tablets in mind.

Bind to an element, monitor keystrokes, insert into DOM.

The you-bought-it-you-own-it solution. The advantage over contentEditable is you can make it work on the iPad and other devices that don’t support content editable. I believe this is the solution that Google now uses in its Docs experience. If text editing is a core competency you need to own, and you’re developing a custom solution, then this is a feasible option. It’s a lot of work, but owning everything gives you great power, and it uses standard DOM operations so is well supported by the browsers you’ll care about. If you’ve got a product where you’re using OT or causal trees to synchronise changes in a collaborative environment, this works well, as you likely already have that information to send to the server to synchronise user edits anyway.

Canvas

Canvas is the newest technology you can implement text editing with. This is another solution where you need to own the whole stack, monitor keystrokes and insert glyphs. Canvas is fast, very fast, which makes things like displaying graphics a very fluid experience in modern browsers. It has a pixel coordinate system which gives you fine grained control over everything, even more so than any HTML generating approach. My early prototypes did raise a blocker that ruled it out for me though. The canvas API uses methods like fillText to write text and measureText to determine the space it’s going to take. One of the core features of a text editor is that it overlays a cursor to indicate the position of active editing. The problem is that measureText only works reliably for this on fixed width (monospace) fonts. This is why it works in programming environments like Bespin/SkyWriter, which use code oriented monospaced fonts. measureText gives you the width in pixels. With a proportional font this width will not be consistent, due to anti-aliasing and the proportional algorithms that make it look pretty on your screen. Take the term ‘cat’ for example. Measuring ‘cat’ will give you the width of the whole word. If you want to shift the cursor to between the a and the t, you’ll need to know how much of the whole word’s space ‘ca’ takes. Due to the calculation (particularly if you start worrying about bold and italics) the measureText of ‘ca’ will include a few extra pixels to account for the fact that a is now the end letter of a word. So for measureText it’s the total space to print out ‘ca’ as a word, including all styles applied to the font and padding at the end letter. If you wanted to overlay a cursor next to the ‘a’ in ‘cat’ using measureText to calculate where the a ended, then by default you’d end up with the cursor sitting in the ‘t’ somewhere. Obviously being off by a few pixels matters in the UI.

As the calculation for proportional fonts is quite complex and goes into low level technology, more information is needed than is currently available in order to determine a feasible cursor position. In proportional fonts, particularly when dealing with italics, letters technically overlap. Eg in /la/ the l actually pushes into the top space over the a, depending on the font, so where should the cursor go? At the end of the l, or at the beginning of the a (which is on top of some of the l)? The obvious solution would be to add this information to the API so that it can report where letters start and end and their general dimensions. That said, given the non-accessibility of canvas and the fact it’s not meant to be a text editing environment, there are good reasons why the API designers probably don’t want to facilitate this madness. There are hacks of course to figure this out. I played with writing the text to a white background, getting the written text as an image and then using pixel sampling to determine where the letter really started, yuck! It’s a lot of work, and when you care more about the input than absolute control over display, contentEditable or rolling your own direct DOM manipulation solution are the quickest and easiest paths.

 

Playing with Cassandra Again September 30, 2010

Filed under: cassandra,development,internet,Uncategorized — danielharrison @ 12:59 am

I’ve recently been playing with the latest version of Cassandra again. One new thing going in a direction I like is that it seems to be growing into a more enterprise keystore model rather than something solving only a specific high volume website’s requirements. To me it felt like there’d been a lot of work in beefing up the management and making it solve a more generic problem. The programmatic ad hoc schema editing was a good improvement and based on the direction, 1.0 is shaping up to be really good.

My previous access code was using the Thrift API directly. For this prototype I tried out a couple of libraries: Pelops and Hector. Both seemed to still be Thrift focused and I’m not sure how this works with the change to Avro. Thrift always felt clumsy to me. Technologies like Thrift and Avro, where you’re expressing a language independent communication protocol that various languages need to talk in, can’t in my view help bleeding those idioms and generality up to the client. It means client access code often feels, well, slightly awkward. It feels a bit like the good old days of IIOP/CORBA and EJB communication. My personal preference is targeted, hand coded adapters which feel like a good fit for the language, but the downside of course is that the clients can lag and not always be available for your choice of language. So it’s a tradeoff as always. Hector seems like it’s actively trying to avoid this but still has wrappers where it feels a bit thrifty, eg HKSDef instead of KSDef used to create a keyspace. If you are trying out and evaluating these libraries, I would highly recommend you bite the bullet, get the source for Cassandra and your targeted library, and build it yourself. Due to the fast moving nature of the project it looks like the current releases are out of date, and to get it working you really need the latest trunk versions of everything. For example, I don’t think beta2 of Cassandra 0.7 is available as a package but it seems to be required by the current versions of Pelops and Hector. Pelops is source only on GitHub, so you’ll likely be building things yourself anyway. I was impressed by both; it feels like there’s a lot of room for future improvement and both seem to be shaping up as strong client access libraries.

Another good thing is that there seem to be some valuable resources coming through. At the moment it’s a lot of Google and reading the forums to nut out problems. I bought the ‘Cassandra: The Definitive Guide’ Rough Cuts book from O’Reilly and it seems to have taken a lot of that information, focused it and made it a good source for explanation of idioms and general wisdom. So my recommendation would be to buy it, as it seems like it’s going to be an invaluable reference.

My biggest problem with using Cassandra at the moment is support for multi-tenancy. The problem I have in mind requires text indexing and content that is private per account. With a model like Cassandra’s you need to know what you will be searching for first, and basically you build column families representing those indexes. Now in my case I have users, accounts (many users), objects (storing text) and various indices around that text that drive my application. Think a little bit like an RDF store with accounts and users. Now in a traditional database model I would probably store this as a separate database for each account. This may mean each running datastore instance has tens to thousands of databases. With Cassandra and the way it is structured this would not be advisable. Each keyspace maintains its own memory etc, and to take advantage of its replication model it’s more advisable to have fewer keyspaces. Now one of the easy wins in the database server world of having separate databases per account is that you’re guaranteed not to see other accounts’ data; you’re connecting to the datastore for that client, which makes it very easy to guarantee and maintain security. Under Cassandra this is an application concern at the moment. For my prototype I wasn’t happy with the extent to which this was invading my code, and it required extra indices to make it all work, all of which increased the cognitive load of developing the application. There’s work afoot around multi-tenancy requirements, but until that’s addressed, for me at least, it rules Cassandra out. The Cassandra team are working on it and there are some interesting proposals (the namespace one seems interesting), and I’m sure once it’s complete it will really make Cassandra the first choice for an enterprise keystore.
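
To show what "tenant isolation as an application concern" tends to look like, here is a minimal Python sketch. It is not Cassandra client code: a plain dict stands in for a single shared keyspace, and TenantScopedStore with its account-prefix scheme is a hypothetical helper of my own, not something Pelops, Hector or Cassandra provide.

```python
class TenantScopedStore:
    """Wraps a shared key-value store so every read and write is confined to one account."""

    def __init__(self, backing: dict, account_id: str):
        if ":" in account_id:
            # Keep the separator unambiguous so one account's prefix can never
            # be a prefix of another account's keys.
            raise ValueError("account_id must not contain ':'")
        self._backing = backing
        self._prefix = f"{account_id}:"

    def put(self, key: str, value: str) -> None:
        self._backing[self._prefix + key] = value

    def get(self, key: str):
        return self._backing.get(self._prefix + key)

    def keys(self) -> list:
        # Only keys belonging to this account are ever visible.
        return sorted(
            k[len(self._prefix):] for k in self._backing if k.startswith(self._prefix)
        )


shared = {}  # stand-in for one shared keyspace holding every account's rows
acme = TenantScopedStore(shared, "acme")
globex = TenantScopedStore(shared, "globex")
acme.put("doc1", "quarterly report")
globex.put("doc1", "launch plan")
print(acme.get("doc1"), "|", globex.get("doc1"))
```

The isolation here is only as good as the discipline of always going through the wrapper, which is the kind of invasion into application code I wasn’t happy with; with a database per account the server enforces it for you.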

 

Congratulations to Bitbucket

Filed under: business,development,mercurial,startups — danielharrison @ 12:01 am

I saw that Bitbucket has been acquired by Aussie company Atlassian. I was a pro user as I had a few private repositories (hg didn’t originally support sub repositories). I was always impressed by the customer service at Bitbucket, and from my dealings I got the impression they were good guys who put the customer’s interests first. I changed credit cards and PayPal subscriptions stopped working for me, and rather than make a big deal out of it, Jesper basically stopped charging me money. I got it working again eventually, but it’s that kind of attitude that convinced me they had my interests as a customer first and that I’d made a good choice over competitors or doing it myself. I know this experience is why I recommended them, and as an early stage startup it’s an experience that I’ll remember when I’ve (hopefully) got paying customers ;)

So I saw my billing had been cancelled, and now it looks like with my current usage I won’t have to pay anything. It also looks like there have been a few UX changes around teams etc. I like the strategy of having it rebranded and working at the same time as announcing it. I previously introduced Atlassian’s suite into my former workplace (Confluence, Bamboo, JIRA, GreenHopper, Crowd, … ) running over Subversion, and it always seemed that not having an SCCM system was a weak point relative to their competitors, so this seems like a good strategic investment. When evaluating tools, the competitors for the most part seemed to be SCCM companies with a layer on top. The reason I chose Atlassian was that the integrated layer on top, with Confluence, Bamboo, JIRA etc, gave our internationally distributed team the focal point for development that we needed. It will be interesting to see if this is offered for on-premises installation, as Atlassian tools are Java based and Bitbucket with hg I suspect is Python based. I looked at running hg with Jython when it first came out but it had a few native modules which would have had to be ported from C to get it running. Maybe Python is OK though. My experience is that the people who look after and maintain these systems tend to be biased towards a particular model, eg Java or .NET. Python might be OK for Unix guys, but for Windows I’m not sure. Asking either to play outside their comfort area was playing with fire in terms of support, at least in my previous company; that’s why we maintained ‘native’ versions with some neat technologies that were baked in house.

So congratulations Bitbucket, and I’m looking forward to seeing where it goes from here.

 

Wave Good Bye August 5, 2010

Filed under: collaboration,development,internet,Uncategorized — danielharrison @ 3:44 am

It looks like Google Wave’s been sent to the knackers. It was an ambitious product trying to change the technology we use to collaborate. I’m sure we’ll see it come back in various products, but as a standalone product it looks like it won’t be around any more. I remember when it first came out, the general consensus, at least in the office I was working in, was: neat technology, but what problems can it help me solve, and is this really that much better than email? There have been a lot of casualties in the groupware space and I guess Google Wave is another victim in the war on collaboration. The current email communication hegemony seems like ripe pickings for disruption: a technology stack from another era, and massive implications and cost savings if you can make people more productive.

My startup knowtu operates in the enterprise collaboration and communication market, which Wave kind of did too, and the lessons I think I see are:

  • Email still rules and will for the foreseeable future.
  • Technology is important but by itself doesn’t solve problems.
  • Good enough wins.

While I think Wave had its issues, it’s disappointing to see it end, particularly as it seemed to have a local Australian connection with a large contingent in Google’s Sydney office. I always felt that over time the tech platform meant neat stuff would be built on top and slowly it would succeed. A lot of the tech is open source, so maybe it will come back at some point in the future. I guess we’ll just have to wait and see.