JoelNothman.com

25 March, 2012

Reporting data with datatemplate

Filed under: Open software,Research,Technology by Joel @ 10:30 pm, 25 March 2012.

You have a lot of data, but you only need to show a little here, a little there. The data also might change, and you want to easily update a table in LaTeX or HTML. You could just format it by hand, but I see a lot of copy-paste, regex or shell hacking in your future… and you risk forgetting to update your table to match the changed raw data.

datatemplate intends to be a generic tool for loading, extracting and formatting the necessary data. It draws on three tools:

  • Django templates
  • An easy Python or command-line interface for loading common data sources (CSV, JSON, etc.) into the template context
  • New template tags for using SQL directly in Django templates, plus the ability to import comma-delimited or tab-delimited data as a SQLite table for random access

Django templates are a fairly powerful way to include custom formatting and template control flow. Datatemplate is not limited to producing tables either, and might also be used to generate LaTeX \newcommand definitions from data.
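As a rough illustration of the \newcommand idea, here is a minimal sketch in plain Python. It does not use datatemplate or Django at all; the function name newcommands and the example macro names are made up for illustration.

```python
import re


def newcommands(values):
    """Render a dict of name -> value as LaTeX \\newcommand lines.

    Hypothetical helper for illustration; not part of datatemplate.
    """
    lines = []
    for name, value in values.items():
        # \newcommand defines control words, whose names are letters only.
        if not re.fullmatch(r"[A-Za-z]+", name):
            raise ValueError("command names may only contain letters: %r" % name)
        lines.append("\\newcommand{\\%s}{%s}" % (name, value))
    return "\n".join(lines)


# Example macro names and values are invented for this sketch.
print(newcommands({"bestPrecision": "0.8", "numExperiments": 5}))
```

The paper can then refer to \bestPrecision in running text, and regenerating the definitions keeps the prose in step with the data.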

See also the readme at the datatemplate github page.

Example

Say you have experiments.csv containing:

experiment,variable,precision,recall
e1,A,0.5,0.4
e1,B,0.6,0.7
e2,A,0.5,0.6
e2,B,0.4,0.6
e3,A,0.5,0.3
e3,B,0.8,0.7
e4,A,0.2,0.4
e4,B,0.5,0.6
e5,A,0.6,0.6
e5,B,0.4,0.5

You want to output a LaTeX table of F1 scores, with rows for experiments e1, e3 and e5, and columns for variables A and B.

Write experiments.tpl:

\begin{tabular}{l|*{ {{columns|length}} }{r}}
\hline
Experiment {% for variable in columns %}& {{variable}}{% endfor %} \\
\hline
\hline
{% for row_label, experiment in rows %}
  {{row_label|texescape}}
  {% for variable in columns %}{% select 2 * precision * recall / (precision + recall) AS f1 FROM data WHERE experiment = "{{experiment}}" AND variable = "{{variable}}" %}
    & {{f1|floatformat}}
  {% endselect %}{% endfor %} \\
{% endfor %}
\hline
\end{tabular}

This depends on:

  • an SQLite database with a table called data loaded with the CSV content
  • rows, a sequence of (row label, selected experiment value) tuples
  • columns, a sequence of selected variable values

The {% select ... %} statement selects a single row from the database and places its results in context. {% forselect ... %}, which iterates over SQL query results, could have also been used. In loading the CSV data, columns are recognised as FLOAT or INT as necessary, so you can perform numerical select queries, or use Django filters such as floatformat.
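To make the type recognition concrete, here is a minimal standalone sketch of the same idea in Python: load CSV text into an in-memory SQLite table, guessing INT or FLOAT per column, then run the F1 query. It uses only the standard csv and sqlite3 modules. The helper names (guess_type, load_csv, f1) are mine for illustration, and this is not datatemplate's actual code or API.

```python
import csv
import io
import sqlite3

# A few rows from experiments.csv above.
CSV_TEXT = """experiment,variable,precision,recall
e1,A,0.5,0.4
e1,B,0.6,0.7
e3,A,0.5,0.3
"""


def guess_type(values):
    """Return 'INT', 'FLOAT' or 'TEXT' for a column of strings (hypothetical helper)."""
    for sql_type, cast in (("INT", int), ("FLOAT", float)):
        try:
            for value in values:
                cast(value)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def load_csv(conn, table, text):
    """Create `table` from CSV text with a header row and at least one data row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    types = [guess_type(column) for column in zip(*body)]
    columns = ", ".join('"%s" %s' % (n, t) for n, t in zip(header, types))
    conn.execute('CREATE TABLE "%s" (%s)' % (table, columns))
    marks = ", ".join("?" * len(header))
    # SQLite's column affinity converts the inserted strings to numbers.
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), body)


def f1(conn, experiment, variable):
    """Run the same F1 query as the template, for one table cell."""
    row = conn.execute(
        "SELECT 2 * precision * recall / (precision + recall) FROM data "
        "WHERE experiment = ? AND variable = ?",
        (experiment, variable),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
load_csv(conn, "data", CSV_TEXT)
print(f1(conn, "e1", "A"))
```

Because precision and recall are declared as FLOAT, the arithmetic in the query is done on numbers rather than strings, which is what makes both the SQL expression and a filter like floatformat behave sensibly.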

We may then execute:

datatemplate --csvsql data=experiments.csv \
  --json-var rows='[["Baseline", "e1"], ["Improved", "e3"], ["Best", "e5"]]' \
  --json-var columns='["A", "B"]' \
  < experiments.tpl

This generates the LaTeX code for the following table:

[Image: sample output table]

[Apologies, the data is nonsense.]

Related tools

23 August, 2009

Facebook frustrations

Filed under: General,Technology by Joel @ 12:41 pm, 23 August 2009.

Many things have annoyed me about Facebook lately. I ranted at their representative at ACL the other week, but his job was natural language processing, not bug-fixing.

Things have become especially frustrating when dealing with two of their most under-baked utilities: Pages (what are they anyway?!) and Events (been there since forever and still don’t work) in order for my Page, Barefoot, to advertise its concert series.

Pages have confused people since their inception. They might be better titled Organisations. They are a bit like groups, but they have fans instead of members, and I think Facebook didn’t want them created quite so freely as groups are (were?). They’re also a bit like personal Profiles in that they have a wall and apps and everything, and they can be fans of other pages. And they’re publicly viewable (and crawlable) on the web, so they act as web sites, and can help bring in FB’s bread. Basically, they’re what groups should have been, but never were.

Events are very popular, but they’re impossible for something like a concert series. You can only state one start and end time, so people get confused by our concert being over 6 days long, or say they can’t come because they’re unavailable on the first evening.

In the intersection between Pages and Events is an abyss of madness. One of the neatest features Facebook ever added to Events was the ability to message groups of people, depending on whether they were attending, not replied, etc. But if you create the event through a Page, you can’t do that:

if an event is hosted by a Page, the Page admin will not see the option to send a message to event guests. Individuals may be added as event admins in order to have this option. (Facebook FAQ)

Oh, I just need to make myself an event admin! But I can’t because Page admins can’t be added as event admins if the Page hosts the events.

Yet, if I remove another Barefoot member from being a Page admin, I can add them as an Event admin. And indeed, then I can add them back as a Page admin.

So somehow, very dodgily, now a person is both a Page and Event admin and can (yay!) send messages to event guests.

Another frustration: after finally finding this hole in Facebook’s foolishness, I could no longer send to people who’ve replied Not Attending. I can understand that those not attending are probably not interested in hassle messages. But when people reply Not Attending because Facebook makes it seem like we only have one concert, not five, I’d like a way to send them a clarification… :(

End rant.

10 May, 2009

Finally, a zemirot wiki

Filed under: Chazanut,Music,Siddur,Technology by Joel @ 4:07 pm, 10 May 2009.

Of sorts. One project I no longer need to do because someone else has. I don’t know how long zemirotdatabase.org has been around, but I’ve long intended to create a site where people can share Jewish tunes with each other. And break down a monopoly of tunes from the Virtual Cantor, who is being over-used now that taped chazanut is no longer as popular.

Of course (in my way of doing things), my idea was somewhat more ambitious. Which is why it never got done. I’d like to see:

  • More annotation of the origin of lyrics and tunes
  • Links between tunes which are applied to different prayers

Essentially this means that the tune and the words are separated, and each of them could be annotated with Hebrew, transcription, translation, authorship/variant notes… and somewhere in the intersection people would upload recordings. Maybe I can ask Mendy and Gabe to work on it. Or maybe it was just too much to ever make a site out of and they’ve got it right.

Either way, I’ll need to find some time to record some tunes. (Because most of their voices are terrible…)

9 November, 2008

Giving birth and being reborn

Filed under: Computational linguistics,Wikipedia by Joel @ 4:45 pm, 9 November 2008.

After 6 years as an undergraduate student, I have finally handed in my honours thesis:

Words                          24,000+
Pieces of paper                62
Thesis pages                   82
Front matter pages             9
Back matter pages              24
Chapters                       8
Sections                       33
Appendices                     3
References                     116
Footnotes                      56
Tables                         47 (or 67)
Figures                        16 (or 22)
Project time in months         8
Days since starting to write   110

I pity my markers.

And here it is, in case anyone cares: Learning Named Entity Recognition from Wikipedia.

And now, I am reborn. What to do with myself? So much to do with myself. But at least I have time to work it out… =)

28 September, 2008

Hebrew-English online translation

Filed under: Hebrew,Technology by Joel @ 12:22 pm, 28 September 2008.

It seems Google Translate has finally added Hebrew to its canon of translated languages (along with another 35). They don’t appear to have enabled translation from web search yet, but you can play with it (translate Dutch to Hebrew, for instance) at Google Translate.

I borrow the example text used in one reporting blog:

משטרת גרמניה עצרה שני צעירים בחשד שהתכוונו לבצע פיגוע במטוס של חברת התעופה ההולנדית קיי-אל-אם. כוחות משטרת גרמניה פשטו על המטוס שחנה בשדה התעופה בקלן, זמן קצר לפני שהמריא בחזרה להולנד והוציאו ממנו את שני הצעירים, אזרח גרמני יליד סומליה בן 24 ואזרח סומליה בן 23.

Google Translate says:

German police arrested two youths suspected Shaatcwano an attack on the plane of Dutch airline Kay – to – if. German police forces raided the plane parked at the airport Cologne, shortly before Smria Leclnde back and took him to the two young men, a German citizen born in Somalia 24 Uezarh Somalia age 23.

There are a number of interesting things here:

Assuming something is a proper name if it can’t otherwise be understood is quite a normal approach. But it’s unusual that Google has particular trouble with “שהתכוונו”, “שהמריא” and “ואזרח”, which I don’t consider particularly uncommon words. These, and the messed up “להולנד” all have the common feature of attached prefixes (proclitics), and Google gets it right for all but “המריא” when these are removed. Obviously their word segmentation systems could be improved, or could be adjusted so that if the end system resorts to considering it a proper noun, it might go back and check whether there were some proclitics it failed to lop off. In practice, implementing such a feedback loop may not be worthwhile if the system wants to be fast.
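The fallback suggested above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the lexicon, the glosses and the function name analyse are all made up, and nothing here reflects how Google's system actually works. Before giving up and treating a token as a proper name, it tries lopping off up to two one-letter proclitics and looking the remainder up again.

```python
# Toy lexicon standing in for the words a translation system knows (hypothetical).
LEXICON = {
    "התכוונו": "intended",
    "המריא": "took off",
    "אזרח": "citizen",
    "הולנד": "Holland",
}

# Common one-letter Hebrew proclitics with rough English glosses.
PROCLITICS = {
    "ו": "and",
    "ש": "that",
    "ל": "to",
    "ב": "in",
    "מ": "from",
    "כ": "as",
    "ה": "the",
}


def analyse(token, lexicon=LEXICON, max_prefixes=2):
    """Return (prefix glosses, word gloss), or None if the token stays unknown."""
    prefixes = []
    rest = token
    for _ in range(max_prefixes + 1):
        if rest in lexicon:
            return prefixes, lexicon[rest]
        # Stop if the next letter cannot be a proclitic or too little would remain.
        if len(rest) < 3 or rest[0] not in PROCLITICS:
            break
        prefixes.append(PROCLITICS[rest[0]])
        rest = rest[1:]
    return None  # the caller may now fall back to treating it as a proper name


for word in ["שהתכוונו", "ואזרח", "להולנד", "קלן"]:
    print(word, analyse(word))
```

The extra lookups only happen for tokens that would otherwise be unknown, so the cost is limited to the hard cases, though a real system would also have to weigh false splits such as a name that happens to begin with one of these letters.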

Go take a look at the proper names it forms. It puts some funny letters in there, transliterating:

  • ה ([h]) as nothing (which a lot of Israelis do, but I’m guessing that the system is being hugely biased by the silent הs at the ends of many female names);
  • ו ([v]) as “w”, maybe because “w” always translates to Hebrew in names as ו, but it makes Google look very academic (or Iraqi/Yemenite) to transliterate the vavs in words as waws.
  • כ ([k]) becomes “c”, but so does some non-existent letter in להולנד! What’s going on there?
  • ח (usu. [x]) becomes “h” (rather than “ch” or “kh”), but I guess it is only ever found when transliterating Arabic names, and Ahmed is more common than Achmed.
  • The vowels are also interesting. Especially the spurious “e” on the end of להולנד, but it’s already clear that it’s done a strange job on that one.

Kay – to – if (KLM) is obviously entertaining, but there’s not really much to say about it (except that apparently they split tokens on hyphens).

The most interesting phrase translation is “and took him to the two young men” from “והוציאו ממנו את שני הצעירים”. It would appear as if they took the ו on the end of והוציאו as referring to the object (והוציאוֹ) rather than the subject (והוציאוּ), but seeing as the former is quite rare in contemporary written Hebrew, this may mean they have a wide variety of texts from various ages. And then ממנו seems to disappear altogether. So maybe I’ve just misinterpreted how the system makes a mistake. At the end of the day, the system is all numbers, so no one can really be certain how it made the mistake…

One of the few other online Hebrew-English translation services is Reverso:

A police of Germany stopped two young on suspicion that meant to execute an attack in the airplane of the Dutch airline KAY but them. Forces a police of Germany spreaded on the airplane that parked in the airfield Bkln, a short time before took Off back/in return to Holland and withdrew from him you two the young, German born citizen Somalia ben 24 and citizen Somalia ben23.

Compared with Google’s output, Reverso generally does a better job of splitting off proclitics and so makes fewer glaring mistakes. But its grammar is certainly much poorer, both in English and in Hebrew, thinking for instance that “צעירים” should be understood as an adjective rather than a noun; and that one makes an attack in a plane rather than on it; or that the singular משטרת should be translated “a police”; or that “את” is better translated “you” than as a direct-object marker. Compare also Google’s handling of the compound noun phrase “כוחות משטרת גרמניה” as “German police forces” rather than “Forces a police of Germany”. Also interesting is Reverso’s offering of a choice for בחזרה as “back/in return”.

Overall, while Reverso handles word segmentation somewhat better, Google has a much more fluid grammar and chooses more appropriate words in translation.

I haven’t tried translating the other direction (English to Hebrew) yet, or any other combination of languages where I would be under-qualified. I leave that as an exercise to the reader.

And no, they don’t do Yiddish yet. Real Soon Now.

Yes, it’s been a long time. Yes, I won’t be talking much here till November. Shana tova anyway! Enjoy translating your New Year cards from strange Israeli rellies…

22 June, 2008

Wikipedia categories ≠ ontology

Filed under: Wikipedia by Joel @ 2:07 pm, 22 June 2008.

I think I’m probably stating the obvious here. If we take a single trace of an article such as Tom Cruise through the category hierarchy in Wikipedia, we find out that he is merely a theory…

Tom Cruise → 1962 births → 1960s births → 20th century births → Births by year → People → Humans → Apes → Primates → Mammals → Vertebrates → Chordates → Animals → Eukaryotes → Organisms → Life → Core issues in ethics → Ethics → Branches of philosophy → Philosophy → Belief → Spirituality → Human behaviour → Behaviour → Branches of psychology → Psychology → Interdisciplinary fields → Academic disciplines → Academia → Education → Personal development → Personal life → Self → Metaphysics → Reality → Philosophical concepts → Philosophical terminology → Terminology → Vocabulary → Language → Communication → Social psychology → Social philosophy → Philosophical movements → Movements → Ideologies → Epistemology → Philosophy of science → Analytic philosophy → 20th century philosophy → 20th century → 2nd millennium → Millennia → Years → Chronology → Measurement → Scientific observation → Data collection → Data management → Computer data → Computer storage → Computer memory → Digital media → Digital technology → Electronics → Electromagnetism → Special relativity → Relativity → Theoretical physics → Theories → …

And yes, this isn’t completely irrelevant. It relates to my honours research work. It means that the Wikipedia category hierarchy is only useful as a folksonomy, or perhaps only for a very small hierarchical depth beneath each article…
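One way to act on that observation is to only trust categories within a few steps of the article. Here is a minimal sketch in Python, assuming the category graph is available as a dict from each page to its parent categories. The PARENTS fragment is copied from the start of the trace above, and the function name ancestors_within is made up; real articles sit in several categories at once, which the breadth-first walk allows for.

```python
from collections import deque

# Toy fragment of the hierarchy, taken from the start of the trace above.
PARENTS = {
    "Tom Cruise": ["1962 births"],
    "1962 births": ["1960s births"],
    "1960s births": ["20th century births"],
    "20th century births": ["Births by year"],
    "Births by year": ["People"],
    "People": ["Humans"],
}


def ancestors_within(parents, start, max_depth):
    """Breadth-first walk up the hierarchy, returning {category: depth}."""
    seen = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb any further from this node
        for parent in parents.get(node, []):
            if parent not in seen:
                seen[parent] = depth + 1
                queue.append((parent, depth + 1))
    return seen


print(ancestors_within(PARENTS, "Tom Cruise", 2))
```

With a small max_depth the walk stops at the birth-year categories, well before it can wander off into philosophy and physics.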

29 October, 2007

Regular expressions for Mishnaic tractates

Filed under: Hebrew,Judaism,Technology by Joel @ 4:37 pm, 29 October 2007.

Various transliteration conventions (or a lack thereof) and dialectal differences make it very difficult at times to gather all possible variations for transcribing Hebrew words into English characters. This can make using search engines to find Hebrew terms in English sources very difficult, or could make it hard for a piece of software to identify what someone is referring to when they enter a string of text. For example, biblical book names each have a number of ways of being written, and my BibRef solves this by simply storing a list of alternative names and abbreviations.

Another way of identifying an entered string with one of many options is with regular expressions. As such, I have attempted below to devise regular expressions to match all expected spellings for each tractate (masechet, masekhet, maseches, meseches, etc.) of the Mishnah. Please note that this is only a draft: I expect to improve the regular expressions, and feedback is much appreciated.
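As a small taste of the approach, here is a sketch in Python of a single expression covering the four spellings of the word for "tractate" mentioned above. The pattern is my own illustration rather than one of the draft expressions from this post, and it shows the overgeneration risk too, since it also accepts combinations nobody writes.

```python
import re

# m + a/e + s or ss + e + ch/kh/k + e + t/s/th
MASECHET = re.compile(r"m[ae]s{1,2}e(?:ch|kh|k)e(?:th|t|s)", re.IGNORECASE)


def is_masechet(word):
    """True if the whole word is a recognised spelling of 'masechet'."""
    return MASECHET.fullmatch(word) is not None


for spelling in ["masechet", "masekhet", "maseches", "meseches", "mishnah"]:
    print(spelling, is_masechet(spelling))
```

Each alternation encodes one source of variation: the first vowel, doubled consonants, how to write khaf, and the Ashkenazi "s" versus Sephardi "t" for the final tav.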

Using this as a background study, it may be possible to automate the building of regular expressions for Hebrew words (with vowels given), although many of the expressions below also cover a number of irregularities that would be hard to incorporate into such a builder. Consequently, one could also build a list of all possible alternative spellings for a word, which could then be used with a search engine to make searches of these Hebrew words comprehensive. (Edit: the current expressions below overgenerate way too much and would probably be inappropriate for that task.)
(more…)

6 October, 2007

Apps for Facebook groups

Filed under: Online communities by Joel @ 9:32 pm, 6 October 2007.

Facebook’s applications platform/API has made it a much more versatile world of activity. Many, or most, of the applications are basically useless, but the idea of third-party extensibility in general has allowed Facebook’s uses to multiply (and has given developers an easy development and deployment framework).

But Facebook groups (or other features) could do with the same versatility being available. Applications could make groups a powerful framework for tasks like:

  • charting fundraising by or for the group
  • publishing regular event times
  • better-than-forum planning and discussion tools
  • polls, voting or surveys
  • rostering
  • game tournaments
  • friend wheels to show how group members are connected
  • countdowns, countups and counters (e.g. how many of my yeargroup have got married)
  • hundreds of thousands of other things only other Facebook users could come up with.

There’s a good chance Zuckerberg and his team have thought of this already, but the privacy arrangements would have to be quite complicated: at the moment individual users consent to individual applications having access to their personal information. Just because a user is a member of a group with an app, that doesn’t mean they consent to the app knowing about them. Will users have to consent to a group’s apps when they join it, or each time the group admin adds another app? That’s potentially a lot of bureaucracy.

Basically, this could get messy. But the future tells of bright and endless possibilities.

6 September, 2007

Opera alpha very exciting

Filed under: Opera by Joel @ 12:43 am, 6 September 2007.

I had to write about it some time, and it’s just too hard to avoid now. The Opera web browser, which I have been using dedicatedly since 2001, on Tuesday released an alpha version of their upcoming version 9.50, codenamed Kestrel. I’m very excited.

(more…)

23 July, 2007

My wheel of friends

Filed under: Online communities by Joel @ 2:04 am, 23 July 2007.

After hesitantly accepting a few and later removing them, I’ve generally avoided the craze of Facebook applications. While I could imagine great potential for them, without more centralisation, most are highly redundant and plain annoying.

Nonetheless, I have for a long time wanted to know the relationships between my friends on Facebook. I.e., who of my friends know each other? More so, who is a common friend of a lot of my friends, but is not listed as my own? (more…)
