Author Archives: Łukasz Bolikowski

Turkish kinship vocabulary: Amca, Dayı, Enişte, Hala, Teyze, Yenge

Turkish has three commonly used words for “uncle”: Amca, Dayı, Enişte; and three for “aunt”: Hala, Teyze, Yenge. Here is how to use them.
Turkish words for father’s (“Baba”) and mother’s (“Anne”) siblings and their spouses.

Father’s brother is “Amca”, mother’s brother is “Dayı”. An uncle not related by blood is “Enişte”. Father’s sister is “Hala”, mother’s sister is “Teyze”. An aunt not related by blood is “Yenge”.
In all but formal circumstances, it is common to address considerably older strangers as “Amca” (uncle) or “Teyze” (aunt). However, I was once referred to as “Enişte”, probably because, as a foreigner, I belong to the large family of all Turks by marriage rather than by blood.
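For reference, the mapping can be written down as a small lookup table (a quick Python sketch of my own; the tuple keys are an ad-hoc encoding, not standard terminology):

```python
# Turkish terms for parents' siblings and their spouses,
# keyed by (parent's side, relation).
KINSHIP = {
    ("father", "brother"): "Amca",
    ("mother", "brother"): "Dayı",
    ("father", "sister"): "Hala",
    ("mother", "sister"): "Teyze",
    # Relatives by marriage rather than by blood:
    ("any", "uncle-by-marriage"): "Enişte",
    ("any", "aunt-by-marriage"): "Yenge",
}

print(KINSHIP[("mother", "brother")])  # Dayı
```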


Tagging CS journals and conferences with arXiv subject areas

I have recently launched DBLPfeeds, a simple service providing RSS feeds with the latest papers from over a thousand DBLP-indexed computer science journals and conferences (one RSS feed per journal/conference). I had some positive feedback, which is of course very nice. Paul Groth suggested grouping feeds into categories, e.g. journals and conferences related to AI. I liked this idea (a great feature), but I lacked the required data, i.e. the tags (e.g. “Information Processing and Management is about Digital Libraries”, or “ACM Transactions on (Office) Information Systems is about Information Retrieval”). Here is how I solved the problem.

Data at hand

DBLPfeeds are generated using XML dumps of DBLP, which are available under the ODC-BY 1.0 license. For each article indexed in DBLP I have its title, authors, link to full text, publication year and publication venue (conference or journal). This is exactly what I need to generate an RSS feed for each venue.
There is another valuable source of information: arXiv. Many computer scientists deposit their preprints there before the papers appear in journals or conference proceedings. Articles deposited in arXiv are classified using tags such as cs.AI (Artificial Intelligence), cs.CC (Computational Complexity), or cs.DL (Digital Libraries); arXiv publishes a comprehensive description of the tags. Everyone has convenient access to the metadata of articles deposited in arXiv via the OAI-PMH protocol.
To sum up, there are two openly accessible sources of data (DBLP and arXiv), which – combined – contain the information I need.

Merging

I harvested arXiv using OAI-PMH (metadataPrefix = arXiv, set = cs), which produced approx. 58,000 records. From each record I took the title and the categories starting with “cs.”. Next, I combined that with the title and venue fields from approx. 2,100,000 records taken from DBLP.
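A minimal sketch of how such a harvesting request can be built (my own Python illustration; the endpoint URL and the resumption-token mechanism follow the OAI-PMH spec, but check arXiv’s documentation before relying on the details):

```python
from urllib.parse import urlencode

# arXiv's OAI-PMH endpoint (assumed; see arXiv's OAI documentation).
OAI_ENDPOINT = "http://export.arxiv.org/oai2"

def list_records_url(metadata_prefix="arXiv", set_spec="cs", token=None):
    """Build a ListRecords request URL. Each response may carry a
    resumptionToken; pass it back here to fetch the next batch."""
    if token:
        # Per the OAI-PMH spec, a resumed request carries only the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords",
                  "metadataPrefix": metadata_prefix,
                  "set": set_spec}
    return OAI_ENDPOINT + "?" + urlencode(params)

print(list_records_url())
```

A real harvester would fetch each URL, parse the XML, and loop until no resumptionToken is returned.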
To match the titles, I lowercased the strings and removed all non-alphanumeric characters, including whitespace. For example, “Proof-Pattern Recognition in ACL2” became “proofpatternrecognitioninacl2”. I then counted how many times each venue co-occurs with each arXiv category. The table is available at figshare (CC-0 license).
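The normalization step can be sketched in a few lines of Python (my own illustration; note that this variant also drops non-ASCII letters, which is good enough for a quick and dirty join):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase and strip everything that is not an ASCII letter or
    digit, so small formatting differences don't break the join."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

print(normalize_title("Proof-Pattern Recognition in ACL2"))
# → proofpatternrecognitioninacl2
```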
Finally, I had to select the most representative venues (journals, conferences) for a given tag. I arbitrarily chose the following criterion: a given tag is assigned to a given venue if at least 30% of the matched papers at the venue, and at least 5 papers in absolute terms, have the tag.
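The criterion can be sketched as follows (a minimal Python illustration with made-up counts; `counts` maps each arXiv tag to its co-occurrence count for one venue):

```python
def assign_tags(counts, total_papers, min_share=0.30, min_count=5):
    """Keep tags covering at least 30% of the venue's matched papers
    and at least 5 papers in absolute terms."""
    return {tag for tag, n in counts.items()
            if n >= min_count and n / total_papers >= min_share}

# Hypothetical venue with 20 matched papers:
print(sorted(assign_tags({"cs.DL": 12, "cs.AI": 3, "cs.IR": 6}, 20)))
# → ['cs.DL', 'cs.IR']
```

The absolute threshold filters out small venues where a couple of matched preprints would otherwise dominate the percentage.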

Summary

Firstly, a trivial observation: Open Access is great. By combining two publicly available data sources I was able to add a nice feature to DBLPfeeds.
Now about methodology: I took a quick and dirty approach, which leaves a lot of room for improvement. The papers are joined by looking at titles only (author names are ignored), so one can easily imagine both false positives and false negatives. I used totally arbitrary criteria for assigning tags, but the complete data set is there, so feel encouraged to find a better heuristic.
One more obvious shortcoming: if a journal does not permit self-archiving, preprints of its papers will not appear on arXiv and, consequently, the journal will not be tagged with arXiv subject area codes. Oh well, that’s another small reason to go green, I guess ;)

Open code, open data

The code is publicly available on GitHub under a BSD license (take a look at code/tags.sh), while the table of co-occurrences is publicly available at figshare under a CC-0 license.


Writing research papers can be a tiny bit easier

Recently I came across two useful web services that make it a bit easier to write research papers: Netspeak and Detexify².

As a non-native English speaker, I often have problems with choosing the right words and I used to ask Google to help me.  For example, I would formulate a query “our research * that”, look for the most frequent words in the search results, and issue additional queries like “our research indicates that” and “our research shows that” to count hits.
With Netspeak it is easier: I simply write “our research ? that” and instantly get the most popular phrases with their counts. Netspeak can also find the most popular synonyms of a given word in a given context, or the most frequent order of given words.

Detexify² solves another small inconvenience: when I didn’t remember the LaTeX command for a less common math symbol, I had to consult looong lists of symbols and corresponding commands. Now I can simply draw the symbol and Detexify² will tell me the command and the package I need to use!

Interestingly, the back-end is written in Haskell, and its source code is available on GitHub.


Typoglycemia in Haskell

Can you decipher the following sentences? All hmuan biegns are bron fere and euqal in dgiinty and rgiths. Tehy are ednwoed wtih raeosn and cnocseicne and sohlud act tworads one aonhter in a sipirt of borhtreohod. It’s the first article of the Universal D…
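The scrambling itself is easy to sketch. Here is a minimal Python version of my own (the post itself does it in Haskell) that keeps each word’s first and last letters and shuffles the rest; it ignores punctuation, which a real version would keep in place:

```python
import random

def scramble_word(word, rng):
    """Keep the first and last letters; shuffle the interior."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def typoglycemia(text, seed=0):
    """Scramble every word in the text, deterministically via seed."""
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng) for w in text.split())

print(typoglycemia("All human beings are born free and equal"))
```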


Assorted curiosities: Geography

Fun facts learned while clicking through Wikipedia: Treasure Island in Ontario, Canada is probably the largest island in a lake on an island in a lake. Liechtenstein and Uzbekistan are doubly landlocked countries, i.e., all the neighbouring countries are…
