Turkish kinship vocabulary: Amca, Dayı, Enişte, Hala, Teyze, Yenge

Turkish has three commonly used words for “uncle”: Amca, Dayı, Enişte; and three for “aunt”: Hala, Teyze, Yenge. Here is how to use them.

Turkish words for father's (“Baba”) and mother's (“Anne”) siblings and their spouses.

Father's brother is “Amca”, mother's brother is “Dayı”. An uncle not related by blood, such as an aunt's husband, is “Enişte”. Father's sister is “Hala”, mother's sister is “Teyze”. An aunt not related by blood, such as an uncle's wife, is “Yenge”.

In all but formal circumstances, it is common to address considerably older strangers as “Amca” (uncle) or “Teyze” (aunt). However, I was once referred to as “Enişte” — probably because I'm a foreigner and therefore belong to the large family of all Turks by marriage, not by blood.

Tagging CS journals and conferences with arXiv subject areas

I have recently launched DBLPfeeds, a simple service providing RSS feeds with the latest papers from over a thousand DBLP-indexed computer science journals and conferences (one RSS feed per journal/conference). I had some positive feedback, which is of course very nice. Paul Groth suggested grouping feeds into categories, e.g. journals and conferences related to AI. I liked the idea (it would be a great feature), but I lacked the required data, i.e., the tags (e.g. “Information Processing and Management is about Digital Libraries”, or “ACM Transactions on (Office) Information Systems is about Information Retrieval”). Here is how I solved the problem.

Data at hand

DBLPfeeds are generated using XML dumps of DBLP, which are available under the ODC-BY 1.0 license. For each article indexed in DBLP I have its title, authors, a link to the full text, the publication year, and the publication venue (conference or journal). This is exactly what I need to generate RSS feeds for each venue.

There is another valuable source of information: arXiv. Many computer scientists deposit their preprints there before the papers appear in journals or conference proceedings. Articles deposited in arXiv are classified using tags such as cs.AI (for Artificial Intelligence), cs.CC (for Computational Complexity), or cs.DL (for Digital Libraries). For a comprehensive description of the tags go here. Everyone has convenient access to the metadata of articles deposited in arXiv via the OAI-PMH protocol.

To sum up, there are two openly accessible sources of data (DBLP and arXiv), which – combined – contain the information I need.

Merging

I harvested arXiv using OAI-PMH (metadataPrefix = arXiv, set = cs), which produced approx. 58,000 records. From each record I took the title and the categories starting with “cs.” Next, I combined that with the title and venue fields of approx. 2,100,000 records taken from DBLP.
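
For illustration, here is a rough sketch in R of such a harvest. The endpoint and the OAI-PMH parameters are the real ones; everything else is illustrative, not the script actually used (that one is code/tags.sh, linked at the end of this post):

    # Illustrative sketch -- harvest arXiv metadata for the "cs" set via
    # OAI-PMH, following resumptionTokens until the list is exhausted.
    library(httr)
    library(xml2)

    oai <- "http://export.arxiv.org/oai2"
    url <- paste0(oai, "?verb=ListRecords&metadataPrefix=arXiv&set=cs")
    pages <- list()
    repeat {
      doc <- read_xml(content(GET(url), as = "text", encoding = "UTF-8"))
      pages[[length(pages) + 1]] <- doc
      # xml_ns() binds the default OAI-PMH namespace to the prefix "d1"
      tok <- xml_text(xml_find_first(doc, "//d1:resumptionToken", xml_ns(doc)))
      if (is.na(tok) || tok == "") break
      url <- paste0(oai, "?verb=ListRecords&resumptionToken=",
                    URLencode(tok, reserved = TRUE))
      Sys.sleep(20)  # be polite to arXiv between requests
    }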

To match the titles, I lowercased the strings and removed all non-alphanumeric characters (including whitespace). Thus, for example, “Proof-Pattern Recognition in ACL2” became “proofpatternrecognitioninacl2”. Then I counted how many times each venue co-occurs with each arXiv category. The table is available at figshare (CC-0 license).
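
In R the normalization boils down to a single line; a sketch, not the actual code from tags.sh:

    # Lowercase, then drop everything that is not a letter or a digit
    # (this removes punctuation and whitespace in one go).
    normalize <- function(title) gsub("[^a-z0-9]", "", tolower(title))

    normalize("Proof-Pattern Recognition in ACL2")
    ## [1] "proofpatternrecognitioninacl2"

    # Given a hypothetical data frame `merged` of records joined on the
    # normalized title, the co-occurrence table is a cross-tabulation:
    # counts <- table(merged$venue, merged$category)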

Finally, I had to select the most representative venues (journals, conferences) for a given tag. I arbitrarily chose the following criterion: a given tag will be assigned to a given venue if at least 30% and at least 5 of the papers at the venue have the tag.
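
Expressed in R, with a hypothetical venue-by-category count matrix counts (venues in rows, arXiv categories in columns), the rule might look like this:

    # A tag is assigned to a venue if it covers at least 30% of the venue's
    # matched papers and at least 5 papers in absolute terms.
    assign_tags <- function(counts, min_share = 0.3, min_papers = 5) {
      share <- counts / rowSums(counts)  # per-venue fraction of papers per tag
      which(share >= min_share & counts >= min_papers, arr.ind = TRUE)
    }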

Summary

Firstly, a trivial observation: Open Access is great. By combining two publicly available data sources I was able to add a nice feature to DBLPfeeds.

Now about methodology: I took a quick-and-dirty approach, which leaves a lot of room for improvement. The papers are joined by looking at titles only (author names are ignored), so one can easily imagine both false positives and false negatives. I used totally arbitrary criteria for assigning tags, but the complete data set is there, so feel encouraged to find a better heuristic.

One more obvious shortcoming: if a journal does not permit self-archiving, preprints of its papers will not appear on arXiv and, consequently, the journal will not be tagged with arXiv subject area codes. Oh well, that's another small reason to go green, I guess ;)

Open code, open data

The code is publicly available on GitHub under the BSD license (take a look at code/tags.sh), while the table of co-occurrences is publicly available on figshare under the CC-0 license.

Package intergraph goes 2.0

Yesterday I submitted a new version (marked 2.0-0) of package ‘intergraph’ to CRAN. There are some major changes and bug fixes. Here is a summary:

  • The package now supports only “igraph” objects created with igraph version 0.6-0 or newer (vertex indexing starting from 1, not 0)!
  • Main functions for converting network data between object classes “igraph” and “network” are now called asIgraph and asNetwork.
  • There is a generic function asDF that converts a network object to a list of two data frames containing (1) the edge list with edge attributes and (2) the vertex database with vertex attributes.
  • Functions asNetwork and asIgraph also allow creating network objects from data frames (edge lists with edge attributes and vertex databases with vertex attributes); see the sketch below.
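
A minimal sketch of the new interface (the ring graph is just a stand-in for real data):

    library(igraph)      # version 0.6-0 or newer
    library(intergraph)  # version 2.0-0

    g   <- graph.ring(5)   # an "igraph" object
    net <- asNetwork(g)    # convert it to a "network" object
    g2  <- asIgraph(net)   # ...and back again
    d   <- asDF(net)       # list of two data frames: edge list and vertex database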

I have written a short tutorial on using the package. It is available on the package home page on R-Forge. Here is the direct link.

Usage experiences and bug reports are more than welcome.

Assorted links

Some assorted links collected this week:

There are discussions in various places about the merits, pitfalls, and misunderstandings related to the buzzwords “bigdata” and “data science” (what a useless term it is…), analyses being “data-driven” or “evidence-based”, and so on. Perhaps I will write a separate post on that at some point… For now:

Writing research papers can be a tiny bit easier

Recently I came across two useful web services that make it a bit easier to write research papers: Netspeak and Detexify².

As a non-native English speaker, I often have problems choosing the right words, and I used to ask Google for help. For example, I would formulate the query "our research * that", look for the most frequent words in the search results, and then issue additional queries like "our research indicates that" and "our research shows that" to count hits.
With Netspeak it is easier: I simply write our research ? that and instantly get the most popular phrases with their counts. Netspeak can also find the most popular synonyms of a given word in a given context, or the most frequent order of given words.



Detexify² solves another small inconvenience: when I didn't remember the LaTeX instruction for a less common math symbol, I needed to consult looong lists of symbols and corresponding instructions. Now I can simply draw the symbol and Detexify² will tell me the instruction and the package I need to use!


Interestingly, the back-end is written in Haskell, and its source code is available on GitHub.

Correction to intergraph update

It turned out that I wrote the last post on the “intergraph” package too hastily. After some feedback from CRAN maintainers and some deliberation, I decided to release the updated version of the “intergraph” package under the original name (so no new package “intergraph0”) with version number 1.2. This version relies on the legacy “igraph” version 0.5, which is now called “igraph0”. Package “intergraph” 1.2 is now available on CRAN.

Meanwhile, I’m working on a new version of “intergraph”, scheduled to be ver. 1.3, which will rely on the new version 0.6 of “igraph”.

I am sorry for the mess.

Updates to package ‘intergraph’

On June 17 a new version (0.6) of the package “igraph” was released. This new version abandoned the old way of indexing graph vertices with consecutive numbers starting from 0: vertices are now numbered starting from 1, which is more consistent with the general R convention of indexing vectors, matrices, etc. Because this change is not backward-compatible, there is now a separate package called “igraph0” which still uses the old 0-convention.
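
A tiny example of the difference:

    library(igraph)    # 0.6 or newer
    g <- graph.ring(3)
    V(g)               # vertex sequence 1, 2, 3
    # under igraph 0.5 (now "igraph0") the same graph has vertices 0, 1, 2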

These changes affect the “intergraph” package.

A new version of “intergraph” (ver. 1.3) is being developed to be compatible with the new “igraph” 0.6. Until it is ready, “intergraph” version 1.2 is available on CRAN; it still uses the old 0-convention and relies on the legacy version of “igraph” (version 0.5, now called “igraph0” on CRAN).

To sum up:

  • If you have code that still uses the old version of “igraph” (earlier than 0.6), you should load the package “igraph0” instead of “igraph”, and use “intergraph” version 1.2, as in the sketch after this list.
  • If you have already started using the new version of “igraph” (0.6 or later), unfortunately you have to wait until the new version of “intergraph” (1.3) is released.
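
In other words, the legacy setup is loaded like this:

    # Legacy setup: old 0-based vertex indexing
    library(igraph0)      # igraph 0.5, renamed on CRAN
    library(intergraph)   # version 1.2, which works with igraph0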

Edit

As I wrote in the next post, in the end there is no package “intergraph0”, just the new version 1.2. Consequently, I have edited the description above.