Wednesday, December 12

How to upgrade R version without losing your existing installed packages


R is a language & environment for statistical computing and graphics. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. Currently it is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories.

Here are the steps I performed to reuse the already-downloaded libraries (it saves the pain of reinstalling each library) when I upgraded R from 3.5.0 to 3.5.1.

Before any update of R, start R or RStudio and check where all your packages are installed by typing .libPaths()
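For example, on a typical Windows installation the output looks something like this (the paths below are only an illustration; yours will depend on your system and R version):

.libPaths()
# [1] "C:/Users/<you>/Documents/R/win-library/3.5"   <- user library (example path only)
# [2] "C:/Program Files/R/R-3.5.0/library"           <- system library (example path only)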

1. Before you upgrade, save a list of all your currently installed packages to a temporary file.



tmp <- installed.packages()                                   # matrix describing every package currently installed
installedpkgs <- as.vector(tmp[is.na(tmp[,"Priority"]), 1])   # keep only add-on packages (base/recommended ones have a Priority)
save(installedpkgs, file="installed_old.rda")                 # save the package list so it can be reloaded after the upgrade
 
2. Install the new version of R (as of Oct 2018 the latest version is R 3.5.1).
3. Once you’ve got the new version up and running, reload the saved package list and re-install any missing packages from CRAN.




load("installed_old.rda")                                         # restores the installedpkgs vector saved before the upgrade
tmp <- installed.packages()                                       # packages present in the new installation
installedpkgs.new <- as.vector(tmp[is.na(tmp[,"Priority"]), 1])   # again, only the add-on packages
missing <- setdiff(installedpkgs, installedpkgs.new)              # packages the new installation does not have yet
install.packages(missing)                                         # re-install them from CRAN
update.packages()                                                 # then bring everything up to the latest versions

Thursday, November 15

Why is Data Science or Datalogy important?


Have you ever sat in a meeting where various points of view were discussed but nothing moved forward, because the participants could not convince the team about a suggested action without data to substantiate their argument? The right way for an enterprise to make a decision is to analyze data and base the decision on data points. Data analytics helps us analyze data and make informed decisions.

Data Science is the combination of
  1. Statistics / Mathematics skills
  2. Coding skills
  3. Domain Knowledge / Business Knowledge
Data is about numbers, and when you are working with numbers you are going to use statistical and mathematical concepts. Coding skills are required because the data you will work with is often hard to access, broken, messy, or riddled with missing values, and code lets you fix these issues once and for all. Finally, domain knowledge and business thinking are as essential as statistics and coding: if you don’t have the business knowledge, you won’t be able to evaluate whether your data makes a difference or not.
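As a small illustration of that second point, here is a minimal R sketch of the kind of clean-up work coding makes repeatable. The data frame and column names are invented purely for this example.

# Invented example data: a label with a gap and numbers that arrived as text
sales <- data.frame(
  region  = c("North", "South", NA, "East"),
  revenue = c("1200", "950", "NA", "1100"),
  stringsAsFactors = FALSE
)

sales$revenue <- suppressWarnings(as.numeric(sales$revenue))  # fix the broken type; the literal "NA" becomes a real NA
sales$region[is.na(sales$region)] <- "Unknown"                # fill the missing label
sales <- sales[!is.na(sales$revenue), ]                       # drop rows without a usable revenue value
summary(sales$revenue)                                        # quick sanity check before any analysis

Once a script like this exists, the same messy extract can be cleaned the same way every time it arrives.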

For a data scientist, the languages that can be useful are SQL, Python, Bash & R.
SQL is a simple query language; it is well structured and easy to interpret.
Python is also easy to interpret and easy to learn, but much more complex than SQL. Python is better suited to certain data tasks and SQL to others.

What is the origin of Data Science?
Over the years, data science has become an integral part of many industries, such as agriculture, marketing optimization, risk management, fraud detection, marketing analytics and public policy, among others. By using data preparation, statistics, predictive modeling and machine learning, data science tries to resolve many issues within individual sectors and the economy at large.

Data science emphasizes the use of general methods that do not change with the application, irrespective of the domain. This approach is different from traditional statistics, which tends to focus on providing solutions that are specific to particular sectors or domains.

Traditional methods depend on providing sectors with solutions that are tailored to each problem rather than applying a standard solution.

Today, data science has far-reaching implications in many fields: academic and applied research domains such as machine translation, speech recognition and the digital economy on one hand, and fields such as healthcare, social science and medical informatics on the other.

It affects the growth and development of brands by providing a wealth of intelligence about consumers and campaigns through techniques like data mining and data analysis.

The history of data science can be traced back to the 1960s. Peter Naur, a Danish computer science pioneer and Turing Award winner, disliked the very term "computer science" and suggested the field be called "datalogy" or "data science". In 1974, Naur published Concise Survey of Computer Methods, where he used the term data science in its survey of contemporary data processing methods.

These methods were then used in a number of applications. Some twenty-two years later, in 1996, the members of the International Federation of Classification Societies met in Kobe for their biennial conference, where the term data science was used for the first time in the title of a conference, Data Science, Classification and Related Methods. In 1997, C. F. Jeff Wu gave an inaugural lecture on the topic in which he spoke about statistics being a form of data science.

Later, in 2001, William S. Cleveland introduced data science as an independent discipline. In his article Data Science: An Action Plan for Expanding the Technical Areas of Statistics, published in the International Statistical Review, he incorporated advances in computing with data. In this report, Cleveland mentions six areas which he thought formed the base of data science: multidisciplinary investigations, models and methods for data, pedagogy, computing with data, theory and tool evaluation.

The International Council for Science's Committee on Data for Science and Technology started publishing the Data Science Journal in 2002. The journal focuses on topics related to data science such as the description of data systems, their publication on the internet, applications and legal issues. Columbia University also began publishing the Journal of Data Science, a platform for data workers to share their opinions and exchange ideas about the use and benefits of data science. Devoted to the application of statistical methods and qualitative research, it gave data workers a voice of their own in the field of data science.

In 2005, the National Science Board published Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century.

This report defined data scientists as the information and computer scientists, database and software programmers, disciplinary experts, curators, expert annotators and librarians who are extremely important for the successful management of a digital data collection.

The primary activity of a data scientist is to conduct creative inquiry and analysis so that data can be used properly and effectively by organizations across all sectors.

Saturday, November 3

Why did #Gartner advise #GoSlow or #SayNo to blockchain? Happy that Gartner agrees with me about #blockchain

According to the updated definition by #Gartner, a blockchain is an expanding list of cryptographically signed, irrevocable transactional records shared by all participants in a network. Each record contains a time stamp and reference links to previous transactions. With this information, anyone with access rights can trace back a transactional event, at any point in its history, belonging to any participant.

For those who are new to blockchain technology, a blockchain is one architectural design of the broader concept of distributed ledgers. While most people are giving a thumbs-up to blockchain, it is interesting that #Gartner has advised #GoSlow, or rather #SayNo, to those intending to implement #blockchain in the near future.

There is nothing wrong in changing your view about a technology; Gartner researchers have changed their views and gone against majority opinion about a technology hype in the past. It is interesting that Gartner researchers now have enough data and case studies to advise the world to #SayNo to investment in blockchain. I am happy #Gartner thinks so and would love to know the data points that have made them change their opinion, but I am not surprised.




Note - You can read what #Gartner has to say at this link (Just say no to blockchain (for now), advises Gartner).

Like many other experts, I have always maintained that:
  1. Blockchain is a great idea, but as of 2018 it is premature to implement blockchain for the enterprise.
  2. The distributed ledger in blockchain technology is a great concept, but its implementation is also its biggest challenge.
  3. The 'concept of a ledger' is not new, and an 'immutable ledger' has been implemented using various technologies, but no one has a perfectly secure, distributed and cost-effective solution for a truly distributed ledger.
  4. More research and thought needs to go into the technologies used to implement blockchain before throwing traditional database ledgers out of the window.
As I have posted earlier, the concept of a distributed ledger, a cryptographically signed, irrevocable transactional record, is a great idea, and for all we know it may have been implemented in various ways in the past. What I am not sure about is whether blockchain is the 'ONLY WAY' to implement a 'signed, immutable, irrevocable transactional record'. My concern is particularly the blockchain environment/technology required to support a blockchain solution; I feel it may not be feasible to sustain blockchain for a long time. Of course it is a new idea, there is a huge hype, and someone (no, not the inventor!) is going to make lots of money if and when blockchain finally takes off.
If you want to know why I am excited about the immutable ledger yet skeptical about blockchain, you can go through my past posts on #blockchain -

 

 
